1
Viñas R, Azevedo T, Gamazon ER, Liò P. Deep Learning Enables Fast and Accurate Imputation of Gene Expression. Front Genet 2021; 12:624128. [PMID: 33927746 PMCID: PMC8076954 DOI: 10.3389/fgene.2021.624128] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/30/2020] [Accepted: 03/12/2021] [Indexed: 11/26/2022]
Abstract
A question of fundamental biological significance is to what extent the expression of a subset of genes can be used to recover the full transcriptome, with important implications for biological discovery and clinical application. To address this challenge, we propose two novel deep learning methods, PMI and GAIN-GTEx, for gene expression imputation. In order to increase the applicability of our approach, we leverage data from GTEx v8, a reference resource that has generated a comprehensive collection of transcriptomes from a diverse set of human tissues. We show that our approaches compare favorably to several standard and state-of-the-art imputation methods in terms of predictive performance and runtime in two case studies and two imputation scenarios. In comparison conducted on the protein-coding genes, PMI attains the highest performance in inductive imputation whereas GAIN-GTEx outperforms the other methods in in-place imputation. Furthermore, our results indicate strong generalization on RNA-Seq data from 3 cancer types across varying levels of missingness. Our work can facilitate a cost-effective integration of large-scale RNA biorepositories into genomic studies of disease, with high applicability across diverse tissue types.
Affiliation(s)
- Ramon Viñas
  - Department of Computer Science and Technology, University of Cambridge, Cambridge, United Kingdom
- Tiago Azevedo
  - Department of Computer Science and Technology, University of Cambridge, Cambridge, United Kingdom
- Eric R Gamazon
  - Vanderbilt Genetics Institute and Data Science Institute, VUMC, Nashville, TN, United States; MRC Epidemiology Unit, University of Cambridge, Cambridge, United Kingdom; Clare Hall, University of Cambridge, Cambridge, United Kingdom
- Pietro Liò
  - Department of Computer Science and Technology, University of Cambridge, Cambridge, United Kingdom

2
Ugidos M, Tarazona S, Prats-Montalbán JM, Ferrer A, Conesa A. MultiBaC: A strategy to remove batch effects between different omic data types. Stat Methods Med Res 2020; 29:2851-2864. [PMID: 32131696 DOI: 10.1177/0962280220907365] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 11/15/2022]
Abstract
The diversity of omic technologies has expanded in recent years, together with the number of omic data integration strategies. However, multiomic data generation is costly, and many research groups cannot afford research projects where many different omic techniques are generated, at least at the same time. As most researchers share their data in public repositories, different omic datasets of the same biological system obtained at different labs can be combined to construct a multiomic study. However, data obtained at different labs or moments in time are typically subject to batch effects that need to be removed for successful data integration. While there are methods to correct batch effects on the same data types obtained in different studies, they cannot be applied to correct lab or batch effects across omics. This impairs multiomic meta-analysis. Fortunately, in many cases, at least one omics platform (i.e., gene expression) is repeatedly measured across labs, together with the additional omic modalities that are specific to each study. This creates an opportunity for batch analysis. We have developed MultiBaC (Multiomics Batch-effect Correction), a strategy to correct batch effects from multiomic datasets distributed across different labs or data acquisition events. Our strategy is based on the existence of at least one shared data type, which allows data prediction across omics. We validate this approach both on simulated data and on a case where the multiomic design is fully shared by two labs, so that batch effect correction within the same omic modality using traditional methods can be compared with the MultiBaC correction across data types. Finally, we apply MultiBaC to a true multiomic data integration problem to show that we are able to improve the detection of meaningful biological effects.
Affiliation(s)
- Manuel Ugidos
  - Gene expression and RNA Metabolism Laboratory, Instituto de Biomedicina de Valencia, Consejo Superior de Investigaciones Científicas (CSIC), Valencia, Spain
- Sonia Tarazona
  - Multivariate Statistical Engineering Group, Department of Applied Statistics, Operations Research and Quality, Universitat Politècnica de València, Valencia, Spain
- José M Prats-Montalbán
  - Multivariate Statistical Engineering Group, Department of Applied Statistics, Operations Research and Quality, Universitat Politècnica de València, Valencia, Spain
- Alberto Ferrer
  - Multivariate Statistical Engineering Group, Department of Applied Statistics, Operations Research and Quality, Universitat Politècnica de València, Valencia, Spain
- Ana Conesa
  - Microbiology and Cell Science Department, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, USA

3
Shukla R, Oh H, Sibille E. Molecular and Cellular Evidence for Age by Disease Interactions: Updates and Path Forward. Am J Geriatr Psychiatry 2020; 28:237-247. [PMID: 31285153 DOI: 10.1016/j.jagp.2019.06.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 02/26/2019] [Revised: 05/14/2019] [Accepted: 06/01/2019] [Indexed: 12/31/2022]
Abstract
Characterization of age-associated gene expression changes shows that the brain engages a specific set of genes and biologic pathways along a continuous life-long trajectory and that these genes and pathways overlap with those associated with brain-related disorders. Based on this correlative observation, we have suggested a model of age-by-disease interaction by which brain ageing promotes biologic changes associated with diseases and where deviations from expected age-related trajectories, due to biologic and environmental factors, contribute to defining disease risk or resiliency. In this review, we first evaluate various biomarkers that can be used to study age-by-disease interactions and then focus on transcriptome analysis (i.e., the set of all expressed genes) as a useful tool to explore this interaction. Using the specific example of brain-derived neurotrophic factor and brain-derived neurotrophic factor-associated genes, we then describe molecular events and mechanisms potentially contributing to age-by-disease interactions. Finally, we suggest that long-term biologic adaptations within distinct cellular components of cortical microcircuits, as determined by transcriptome analysis, may integrate and mediate the effects of ageing and diseases. Moving forward, we suggest that analysis of transcriptome similarities between ageing and small molecule-induced system perturbations may lead to novel therapeutics discovery.
Affiliation(s)
- Rammohan Shukla
  - Campbell Family Mental Health Research Institute of CAMH, Toronto, Canada; Department of Psychiatry, University of Toronto, Toronto, Canada
- Hyunjung Oh
  - Campbell Family Mental Health Research Institute of CAMH, Toronto, Canada; Department of Psychiatry, University of Toronto, Toronto, Canada
- Etienne Sibille
  - Campbell Family Mental Health Research Institute of CAMH, Toronto, Canada; Department of Psychiatry, University of Toronto, Toronto, Canada; Department of Pharmacology and Toxicology, University of Toronto, Toronto, Canada; Institute of Medical Science, University of Toronto, Toronto, Ontario, Canada

4
Keenan AB, Wojciechowicz ML, Wang Z, Jagodnik KM, Jenkins SL, Lachmann A, Ma'ayan A. Connectivity Mapping: Methods and Applications. Annu Rev Biomed Data Sci 2019. [DOI: 10.1146/annurev-biodatasci-072018-021211] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Indexed: 11/09/2022]
Abstract
Connectivity mapping resources consist of signatures representing changes in cellular state following systematic small-molecule, disease, gene, or other form of perturbations. Such resources enable the characterization of signatures from novel perturbations based on similarity; provide a global view of the space of many themed perturbations; and allow the ability to predict cellular, tissue, and organismal phenotypes for perturbagens. A signature search engine enables hypothesis generation by finding connections between query signatures and the database of signatures. This framework has been used to identify connections between small molecules and their targets, to discover cell-specific responses to perturbations and ways to reverse disease expression states with small molecules, and to predict small-molecule mimickers for existing drugs. This review provides a historical perspective and the current state of connectivity mapping resources with a focus on both methodology and community implementations.
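The signature search idea described in this abstract can be illustrated with a minimal sketch. It assumes each signature is a fixed-length vector of expression changes and uses cosine similarity as the scoring function; the abstract does not name a specific metric, and the function name, drug names, and toy values below are hypothetical.

```python
import numpy as np

def rank_signatures(query, database, names):
    """Rank database signatures by cosine similarity to a query signature.

    Cosine similarity is an assumed scoring choice for illustration only.
    """
    query = np.asarray(query, dtype=float)
    database = np.asarray(database, dtype=float)
    # Cosine similarity between the query and every row of the database.
    sims = database @ query / (
        np.linalg.norm(database, axis=1) * np.linalg.norm(query)
    )
    order = np.argsort(-sims)  # most similar first
    return [(names[i], float(sims[i])) for i in order]

# Toy database: three hypothetical perturbation signatures over four genes.
database = np.array([
    [1.0, -1.0, 0.5, 0.0],
    [-1.0, 1.0, -0.5, 0.0],
    [0.0, 0.2, 0.1, 1.0],
])
names = ["drug_A", "drug_B", "drug_C"]
query = np.array([2.0, -2.0, 1.0, 0.0])

ranking = rank_signatures(query, database, names)
```

Under this scoring, signatures at the top of the ranking mimic the query, while those at the bottom point in the opposite direction, which is the kind of connection one would look for when seeking to reverse a disease expression state.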
Affiliation(s)
- Alexandra B. Keenan
  - Department of Pharmacological Sciences and Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Megan L. Wojciechowicz
  - Department of Pharmacological Sciences and Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Zichen Wang
  - Department of Pharmacological Sciences and Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Kathleen M. Jagodnik
  - Department of Pharmacological Sciences and Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Sherry L. Jenkins
  - Department of Pharmacological Sciences and Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Alexander Lachmann
  - Department of Pharmacological Sciences and Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Avi Ma'ayan
  - Department of Pharmacological Sciences and Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA

5
Abstract
Background: Time series expression data of genes contain relations among different genes, which are difficult to model precisely. Slime-forming bacteria are one of the three major harmful bacteria types in industrial circulating cooling water systems. Objective: This study aimed at constructing a gene regulation network (GRN) for slime-forming bacteria to understand the microbial fouling mechanism. Methods: For this purpose, an Adaptive Elman Neural Network (AENN) is proposed to reveal the relationships among genes using gene expression time series. The parameters of the Elman neural network were optimized adaptively by a Genetic Algorithm (GA), and a Pearson correlation analysis was applied to discover the relationships among genes. In addition, the gene expression data of slime-forming bacteria obtained by transcriptome sequencing were presented. Results: To evaluate our proposed method, we compared it with several alternative data-driven approaches, including a Neural Fuzzy Recurrent Network (NFRN), a basic Elman Neural Network (ENN), and an ensemble network. The experimental results on simulated and real datasets demonstrate that the proposed approach has a promising performance for modeling GRNs. We also applied the proposed method to GRN construction for slime-forming bacteria, and a GRN for 6 genes was constructed. Conclusion: The proposed GRN construction method can effectively extract the regulations among genes. This is also the first report to construct the GRN for slime-forming bacteria.
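The Pearson correlation step mentioned in the Methods can be sketched as follows. This is only an illustration of correlation-based relationship discovery on expression time series; the abstract does not say which quantities the authors correlate, and the function name, gene labels, threshold of 0.8, and toy data are assumptions.

```python
import numpy as np

def correlated_gene_pairs(expr, gene_names, threshold=0.8):
    """Return gene pairs whose absolute Pearson correlation meets a threshold.

    expr has one row per gene and one column per time point.
    The threshold value is an arbitrary choice for illustration.
    """
    expr = np.asarray(expr, dtype=float)
    corr = np.corrcoef(expr)  # rows are treated as variables (genes)
    pairs = []
    n_genes = len(gene_names)
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            if abs(corr[i, j]) >= threshold:
                pairs.append((gene_names[i], gene_names[j], float(corr[i, j])))
    return pairs

# Toy time series: g2 rises with g1, g3 falls as g1 rises, g4 oscillates.
t = np.arange(6, dtype=float)
expr = np.vstack([
    t,
    2.0 * t + 1.0,
    -t,
    np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0]),
])
gene_names = ["g1", "g2", "g3", "g4"]

pairs = correlated_gene_pairs(expr, gene_names)
```

A positive coefficient here would suggest co-expression and a negative one an inverse relationship; correlation alone does not establish the direction of regulation.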
Affiliation(s)
- Shengxian Cao
  - School of Automation Engineering, Northeast Electric Power University, Jilin, China
- Yu Wang
  - School of Automation Engineering, Northeast Electric Power University, Jilin, China
- Zhenhao Tang
  - School of Automation Engineering, Northeast Electric Power University, Jilin, China

6
Lee YS, Krishnan A, Oughtred R, Rust J, Chang CS, Ryu J, Kristensen VN, Dolinski K, Theesfeld CL, Troyanskaya OG. A Computational Framework for Genome-wide Characterization of the Human Disease Landscape. Cell Syst 2019; 8:152-162.e6. [PMID: 30685436 PMCID: PMC7374759 DOI: 10.1016/j.cels.2018.12.010] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 03/08/2018] [Revised: 10/16/2018] [Accepted: 12/20/2018] [Indexed: 01/21/2023]
Abstract
A key challenge for the diagnosis and treatment of complex human diseases is identifying their molecular basis. Here, we developed a unified computational framework, URSAHD (Unveiling RNA Sample Annotation for Human Diseases), that leverages machine learning and the hierarchy of anatomical relationships present among diseases to integrate thousands of clinical gene expression profiles and identify molecular characteristics specific to each of the hundreds of complex diseases. URSAHD can distinguish between closely related diseases more accurately than literature-validated genes or traditional differential-expression-based computational approaches and is applicable to any disease, including rare and understudied ones. We demonstrate the utility of URSAHD in classifying related nervous system cancers and experimentally verifying novel neuroblastoma-associated genes identified by URSAHD. We highlight the applications for potential targeted drug-repurposing and for quantitatively assessing the molecular response to clinical therapies. URSAHD is freely available for public use, including the use of underlying models, at ursahd.princeton.edu.
Affiliation(s)
- Young-Suk Lee
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA; Department of Computer Science, Princeton University, Princeton, NJ, USA; School of Biological Sciences, Seoul National University, Seoul, South Korea
- Arjun Krishnan
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA; Departments of Computational Mathematics, Science, and Engineering and Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA
- Rose Oughtred
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
- Jennifer Rust
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
- Christie S Chang
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
- Joseph Ryu
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
- Vessela N Kristensen
  - Department of Genetics, Institute of Cancer Research, Oslo University Hospital, Radiumhospitalet, Oslo, Norway; Institute of Clinical Medicine, Faculty of Medicine, University of Oslo, Oslo, Norway; Department of Clinical Molecular Biology (EpiGen), Division of Medicine, Akershus University Hospital, Lørenskog, Norway
- Kara Dolinski
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
- Chandra L Theesfeld
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
- Olga G Troyanskaya
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA; Department of Computer Science, Princeton University, Princeton, NJ, USA; Flatiron Institute, Simons Foundation, New York, NY, USA

7
Musa A, Ghoraie LS, Zhang SD, Glazko G, Yli-Harja O, Dehmer M, Haibe-Kains B, Emmert-Streib F. A review of connectivity map and computational approaches in pharmacogenomics. Brief Bioinform 2018; 19:506-523. [PMID: 28069634 PMCID: PMC5952941 DOI: 10.1093/bib/bbw112] [Citation(s) in RCA: 93] [Impact Index Per Article: 15.5] [Indexed: 12/21/2022]
Abstract
Large-scale perturbation databases, such as Connectivity Map (CMap) or Library of Integrated Network-based Cellular Signatures (LINCS), provide enormous opportunities for computational pharmacogenomics and drug design. A reason for this is that in contrast to classical pharmacology focusing at one target at a time, the transcriptomics profiles provided by CMap and LINCS open the door for systems biology approaches on the pathway and network level. In this article, we provide a review of recent developments in computational pharmacogenomics with respect to CMap and LINCS and related applications.
Affiliation(s)
- Aliyu Musa
  - Predictive Medicine and Analytics Lab, Department of Signal Processing, Tampere University of Technology, Tampere, Finland
- Laleh Soltan Ghoraie
  - Bioinformatics and Computational Genomics Laboratory, Princess Margaret Cancer Center, University Health Network, Toronto, ON, Canada
- Shu-Dong Zhang
  - Northern Ireland Centre for Stratified Medicine, Biomedical Sciences Research Institute, University of Ulster, C-TRIC Building, Altnagelvin Area Hospital, Glenshane Road, Derry/Londonderry, Northern Ireland, UK
- Galina Glazko
  - University of Rochester Department of Biostatistics and Computational Biology, Rochester, New York, USA
- Olli Yli-Harja
  - Computational Systems Biology, Department of Signal Processing, Tampere University of Technology, Tampere, Finland
- Matthias Dehmer
  - Institute for Bioinformatics and Translational Research, UMIT - The Health and Life Sciences University, Eduard Wallnoefer Zentrum 1, Hall in Tyrol, Austria
- Benjamin Haibe-Kains
  - Bioinformatics and Computational Genomics Laboratory, Princess Margaret Cancer Center, University Health Network, Toronto, ON, Canada
  - Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
  - Ontario Institute of Cancer Research, Toronto, ON, Canada
- Frank Emmert-Streib
  - Predictive Medicine and Analytics Lab, Department of Signal Processing, Tampere University of Technology, Tampere, Finland

9
Ahmed AA, Abedalthagafi M. Cancer diagnostics: The journey from histomorphology to molecular profiling. Oncotarget 2016; 7:58696-58708. [PMID: 27509178 PMCID: PMC5295463 DOI: 10.18632/oncotarget.11061] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 05/10/2016] [Accepted: 07/19/2016] [Indexed: 12/15/2022]
Abstract
Although histomorphology has made significant advances into the understanding of cancer etiology, classification and pathogenesis, it is sometimes complicated by morphologic ambiguities, and other shortcomings that necessitate the development of ancillary tests to complement its diagnostic value. A new approach to cancer patient management consists of targeting specific molecules or gene mutations in the cancer genome by inhibitory therapy. Molecular diagnostic tests and genomic profiling methods are increasingly being developed to identify tumor targeted molecular profile that is the basis of targeted therapy. Novel targeted therapy has revolutionized the treatment of gastrointestinal stromal tumor, renal cell carcinoma and other cancers that were previously difficult to treat with standard chemotherapy. In this review, we discuss the role of histomorphology in cancer diagnosis and management and the rising role of molecular profiling in targeted therapy. Molecular profiling in certain diagnostic and therapeutic difficulties may provide a practical and useful complement to histomorphology and opens new avenues for targeted therapy and alternative methods of cancer patient management.
Affiliation(s)
- Atif A Ahmed
  - Department of Pathology and Laboratory Medicine, Children's Mercy Hospital, Kansas City, Missouri, USA
- Malak Abedalthagafi
  - Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, USA; The Saudi Human Genome Laboratory, Department of Pathology, King Fahad Medical City, King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia

10
Yu KH, Snyder M. Omics Profiling in Precision Oncology. Mol Cell Proteomics 2016; 15:2525-36. [PMID: 27099341 PMCID: PMC4974334 DOI: 10.1074/mcp.o116.059253] [Citation(s) in RCA: 70] [Impact Index Per Article: 8.8] [Received: 02/29/2016] [Revised: 04/15/2016] [Indexed: 12/11/2022]
Abstract
Cancer causes significant morbidity and mortality worldwide, and is the area most targeted in precision medicine. Recent development of high-throughput methods enables detailed omics analysis of the molecular mechanisms underpinning tumor biology. These studies have identified clinically actionable mutations, gene and protein expression patterns associated with prognosis, and provided further insights into the molecular mechanisms indicative of cancer biology and new therapeutics strategies such as immunotherapy. In this review, we summarize the techniques used for tumor omics analysis, recapitulate the key findings in cancer omics studies, and point to areas requiring further research on precision oncology.
Affiliation(s)
- Kun-Hsing Yu
  - Department of Genetics, Stanford University School of Medicine, Stanford, California; Biomedical Informatics Program, Stanford University School of Medicine, Stanford, California
- Michael Snyder
  - Department of Genetics, Stanford University School of Medicine, Stanford, California

11
Sokolov A, Paull EO, Stuart JM. One-class detection of cell states in tumor subtypes. Pac Symp Biocomput 2016; 21:405-16. [PMID: 26776204 PMCID: PMC4856035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2023]
Abstract
The cellular composition of a tumor greatly influences the growth, spread, immune activity, drug response, and other aspects of the disease. Tumor cells are usually comprised of a heterogeneous mixture of subclones, each of which could contain their own distinct character. The presence of minor subclones poses a serious health risk for patients as any one of them could harbor a fitness advantage with respect to the current treatment regimen, fueling resistance. It is therefore vital to accurately assess the make-up of cell states within a tumor biopsy. Transcriptome-wide assays from RNA sequencing provide key data from which cell state signatures can be detected. However, the challenge is to find them within samples containing mixtures of cell types of unknown proportions. We propose a novel one-class method based on logistic regression and show that its performance is competitive to two established SVM-based methods for this detection task. We demonstrate that one-class models are able to identify specific cell types in heterogeneous cell populations better than their binary predictor counterparts. We derive one-class predictors for the major breast and bladder subtypes and reaffirm the connection between these two tissues. In addition, we use a one-class predictor to quantitatively associate an embryonic stem cell signature with an aggressive breast cancer subtype that reveals shared stemness pathways potentially important for treatment.
Affiliation(s)
- Artem Sokolov
  - Department of Biomolecular Engineering, University of California Santa Cruz
- Evan O. Paull
  - Department of Biomolecular Engineering, University of California Santa Cruz
- Joshua M. Stuart
  - Department of Biomolecular Engineering, University of California Santa Cruz

12
Amar D, Hait T, Izraeli S, Shamir R. Integrated analysis of numerous heterogeneous gene expression profiles for detecting robust disease-specific biomarkers and proposing drug targets. Nucleic Acids Res 2015; 43:7779-89. [PMID: 26261215 PMCID: PMC4652780 DOI: 10.1093/nar/gkv810] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Received: 03/22/2015] [Revised: 07/23/2015] [Accepted: 07/29/2015] [Indexed: 12/18/2022]
Abstract
Genome-wide expression profiling has revolutionized biomedical research; vast amounts of expression data from numerous studies of many diseases are now available. Making the best use of this resource in order to better understand disease processes and treatment remains an open challenge. In particular, disease biomarkers detected in case-control studies suffer from low reliability and are only weakly reproducible. Here, we present a systematic integrative analysis methodology to overcome these shortcomings. We assembled and manually curated more than 14,000 expression profiles spanning 48 diseases and 18 expression platforms. We show that when studying a particular disease, judicious utilization of profiles from other diseases and information on disease hierarchy improves classification quality, avoids overoptimistic evaluation of that quality, and enhances disease-specific biomarker discovery. This approach yielded specific biomarkers for 24 of the analyzed diseases. We demonstrate how to combine these biomarkers with large-scale interaction, mutation and drug target data, forming a highly valuable disease summary that suggests novel directions in disease understanding and drug repurposing. Our analysis also estimates the number of samples required to reach a desired level of biomarker stability. This methodology can greatly improve the exploitation of the mountain of expression profiles for better disease analysis.
Affiliation(s)
- David Amar
  - The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel
- Tom Hait
  - The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel
- Shai Izraeli
  - Department of Pediatric Hematology-Oncology, Safra Children's Hospital, Sheba Medical Center, Tel Hashomer, Ramat Gan 52620, Israel; Sackler School of Medicine, Tel-Aviv University, Tel Aviv 69978, Israel
- Ron Shamir
  - The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel

13
Ji Z, Vokes SA, Dang CV, Ji H. Turning publicly available gene expression data into discoveries using gene set context analysis. Nucleic Acids Res 2015; 44:e8. [PMID: 26350211 PMCID: PMC4705686 DOI: 10.1093/nar/gkv873] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 02/23/2015] [Accepted: 08/20/2015] [Indexed: 12/17/2022]
Abstract
Gene Set Context Analysis (GSCA) is an open source software package to help researchers use massive amounts of publicly available gene expression data (PED) to make discoveries. Users can interactively visualize and explore gene and gene set activities in 25,000+ consistently normalized human and mouse gene expression samples representing diverse biological contexts (e.g. different cells, tissues and disease types, etc.). By providing one or multiple genes or gene sets as input and specifying a gene set activity pattern of interest, users can query the expression compendium to systematically identify biological contexts associated with the specified gene set activity pattern. In this way, researchers with new gene sets from their own experiments may discover previously unknown contexts of gene set functions and hence increase the value of their experiments. GSCA has a graphical user interface (GUI). The GUI makes the analysis convenient and customizable. Analysis results can be conveniently exported as publication quality figures and tables. GSCA is available at https://github.com/zji90/GSCA. This software significantly lowers the bar for biomedical investigators to use PED in their daily research for generating and screening hypotheses, which was previously difficult because of the complexity, heterogeneity and size of the data.
Affiliation(s)
- Zhicheng Ji
- Department of Biostatistics, Johns Hopkins University Bloomberg School of Public Health, 615 North Wolfe Street, Baltimore, MD 21205, USA
| | - Steven A Vokes
- Department of Molecular Biosciences, The University of Texas at Austin, 2500 Speedway Stop A4800, Austin, TX 78712, USA Institute for Cellular and Molecular Biology, The University of Texas at Austin, 2500 Speedway Stop A4800, Austin, TX 78712, USA
| | - Chi V Dang
- Abramson Cancer Center, University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104, USA
| | - Hongkai Ji
- Department of Biostatistics, Johns Hopkins University Bloomberg School of Public Health, 615 North Wolfe Street, Baltimore, MD 21205, USA
|
14
|
Yazdani A, Dunson DB. A hybrid bayesian approach for genome-wide association studies on related individuals. Bioinformatics 2015; 31:3890-6. [PMID: 26323717 DOI: 10.1093/bioinformatics/btv496] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/14/2015] [Accepted: 08/18/2015] [Indexed: 01/09/2023] Open
Abstract
MOTIVATION Both single marker and simultaneous analysis face challenges in GWAS due to the large number of markers genotyped for a small number of subjects. This large p small n problem is particularly challenging when the trait under investigation has low heritability. METHOD In this article, we propose a two-stage approach that is a hybrid method of single and simultaneous analysis designed to improve genomic prediction of complex traits. In the first stage, we use a Bayesian independent screening method to select the most promising SNPs. In the second stage, we rely on a hierarchical model to analyze the joint impact of the selected markers. The model is designed to take into account familial dependence in the different subjects, while using local-global shrinkage priors on the marker effects. RESULTS We evaluate the performance in simulation studies, and consider an application to animal breeding data. The illustrative data analysis reveals an encouraging result in terms of prediction performance and computational cost.
Affiliation(s)
- A Yazdani
- Human Genetic Center, University of Texas at Houston Health Science Center, Houston, USA
| | - D B Dunson
- Department of Statistical Science, Duke University, Durham, North Carolina, USA
|
15
|
Uziela K, Honkela A. Probe Region Expression Estimation for RNA-Seq Data for Improved Microarray Comparability. PLoS One 2015; 10:e0126545. [PMID: 25966034 PMCID: PMC4429080 DOI: 10.1371/journal.pone.0126545] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2014] [Accepted: 04/03/2015] [Indexed: 01/25/2023] Open
Abstract
Rapidly growing public gene expression databases contain a wealth of data for building an unprecedentedly detailed picture of human biology and disease. This data comes from many diverse measurement platforms that make integrating it all difficult. Although RNA-sequencing (RNA-seq) is attracting the most attention, at present, the rate of new microarray studies submitted to public databases far exceeds the rate of new RNA-seq studies. There is clearly a need for methods that make it easier to combine data from different technologies. In this paper, we propose a new method for processing RNA-seq data that yields gene expression estimates that are much more similar to corresponding estimates from microarray data, hence greatly improving cross-platform comparability. The method we call PREBS is based on estimating the expression from RNA-seq reads overlapping the microarray probe regions, and processing these estimates with standard microarray summarisation algorithms. Using paired microarray and RNA-seq samples from TCGA LAML data set we show that PREBS expression estimates derived from RNA-seq are more similar to microarray-based expression estimates than those from other RNA-seq processing methods. In an experiment to retrieve paired microarray samples from a database using an RNA-seq query sample, gene signatures defined based on PREBS expression estimates were found to be much more accurate than those from other methods. PREBS also allows new ways of using RNA-seq data, such as expression estimation for microarray probe sets. An implementation of the proposed method is available in the Bioconductor package “prebs.”
Affiliation(s)
- Karolis Uziela
- Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland
- Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, 17121 Solna, Sweden
| | - Antti Honkela
- Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland
|
16
|
Han F, Sun W, Ling QH. A novel strategy for gene selection of microarray data based on gene-to-class sensitivity information. PLoS One 2014; 9:e97530. [PMID: 24844313 PMCID: PMC4028211 DOI: 10.1371/journal.pone.0097530] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/06/2013] [Accepted: 04/21/2014] [Indexed: 11/19/2022] Open
Abstract
To obtain predictive genes with lower redundancy and better interpretability, a hybrid gene selection method encoding prior information is proposed in this paper. To begin with, the prior information referred to as gene-to-class sensitivity (GCS) of all genes from microarray data is exploited by a single-hidden-layer feedforward neural network (SLFN). Then, to select more representative and less redundant genes, all genes are grouped into clusters by the K-means method, and genes with low sensitivity are filtered out according to their GCS values. Finally, a modified binary particle swarm optimization (BPSO) encoding the GCS information is proposed to perform further gene selection from the remaining genes. By taking the GCS information into account, the proposed method selects genes highly correlated with sample classes. Thus, the low-redundancy gene subsets obtained by the proposed method also help improve classification accuracy on microarray data. Experimental results on several open microarray datasets verify the effectiveness and efficiency of the proposed approach.
Affiliation(s)
- Fei Han
- School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, China
| | - Wei Sun
- School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, China
| | - Qing-Hua Ling
- School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, China
- School of Computer Science and Engineering, Jiangsu University of Science and Technology, Zhenjiang, China
|
17
|
Abstract
Transcriptomics meta-analysis aims at re-using existing data to derive novel biological hypotheses, and is motivated by the public availability of a large number of independent studies. Current methods are based on breaking down studies into multiple comparisons between phenotypes (e.g. disease vs. healthy), based on the studies' experimental designs, followed by computing the overlap between the resulting differential expression signatures. While useful, in this methodology each study yields multiple independent phenotype comparisons, and connections are established not between studies, but rather between subsets of the studies corresponding to phenotype comparisons. We propose a rank-based statistical meta-analysis framework that establishes global connections between transcriptomics studies without breaking down studies into sets of phenotype comparisons. By using a rank product method, our framework extracts global features from each study, corresponding to genes that are consistently among the most expressed or differentially expressed genes in that study. Those features are then statistically modelled via a term-frequency inverse-document frequency (TF-IDF) model, which is then used for connecting studies. Our framework is fast and parameter-free; when applied to large collections of Homo sapiens and Streptococcus pneumoniae transcriptomics studies, it performs better than similarity-based approaches in retrieving related studies, using a Medical Subject Headings gold standard. Finally, we highlight via case studies how the framework can be used to derive novel biological hypotheses regarding related studies and the genes that drive those connections. Our proposed statistical framework shows that it is possible to perform a meta-analysis of transcriptomics studies with arbitrary experimental designs by deriving global expression features rather than decomposing studies into multiple phenotype comparisons.
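The TF-IDF step described in this abstract can be illustrated with a minimal sketch. It assumes each study has already been reduced to a list of feature genes (the rank product step is not reproduced here), and it uses the classic TF-IDF weighting with cosine similarity, which may differ from the authors' exact statistical model. The study identifiers and gene lists are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(study_features):
    """Map each study to a sparse TF-IDF vector over its feature genes.

    study_features: dict of study id -> list of gene symbols (hypothetical input).
    """
    n_studies = len(study_features)
    # Document frequency: number of studies in which each gene appears.
    doc_freq = Counter(g for genes in study_features.values() for g in set(genes))
    vectors = {}
    for study_id, genes in study_features.items():
        term_freq = Counter(genes)
        vectors[study_id] = {
            g: (count / len(genes)) * math.log(n_studies / doc_freq[g])
            for g, count in term_freq.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical feature genes per study; ACTB occurs in every study.
studies = {
    "A": ["TP53", "MYC", "ACTB"],
    "B": ["TP53", "BRCA1", "ACTB"],
    "C": ["INS", "GCG", "ACTB"],
}
vectors = tfidf_vectors(studies)
sim_ab = cosine(vectors["A"], vectors["B"])
sim_ac = cosine(vectors["A"], vectors["C"])
```

In this toy example, ACTB receives zero weight because it appears in every study, so studies A and C, which share only ACTB, are not connected, while A and B are connected through TP53.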
|
18
|
100% classification accuracy considered harmful: the normalized information transfer factor explains the accuracy paradox. PLoS One 2014; 9:e84217. [PMID: 24427282 PMCID: PMC3888391 DOI: 10.1371/journal.pone.0084217] [Citation(s) in RCA: 52] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/22/2013] [Accepted: 11/13/2013] [Indexed: 11/20/2022] Open
Abstract
The most widely spread measure of performance, accuracy, suffers from a paradox: predictive models with a given level of accuracy may have greater predictive power than models with higher accuracy. Despite optimizing classification error rate, high accuracy models may fail to capture crucial information transfer in the classification task. We present evidence of this behavior by means of a combinatorial analysis where every possible contingency matrix of 2-, 3- and 4-class classifiers is depicted on the entropy triangle, a more reliable information-theoretic tool for classification assessment. Motivated by this, we develop from first principles a measure of classification performance that takes into consideration the information learned by classifiers. We are then able to obtain the entropy-modulated accuracy (EMA), a pessimistic estimate of the expected accuracy with the influence of the input distribution factored out, and the normalized information transfer factor (NIT), a measure of how efficient the transmission of information from the input to the output set of classes is. The EMA is a more natural measure of classification performance than accuracy when the heuristic to maximize is the transfer of information through the classifier instead of classification error count. The NIT factor measures the effectiveness of the learning process in classifiers and also makes it harder for them to “cheat” using techniques like specialization, while also promoting the interpretability of results. Their use is demonstrated in a mind reading task competition that aims at decoding the identity of a video stimulus based on magnetoencephalography recordings. We show how the EMA and the NIT factor reject rankings based on accuracy, choosing more meaningful and interpretable classifiers.
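The accuracy paradox mentioned in this abstract can be illustrated with a small sketch. It does not compute the paper's EMA or NIT measures; as a simpler stand-in, it compares accuracy with the mutual information between true and predicted labels. The two confusion matrices are hypothetical.

```python
import math


def accuracy(cm):
    """Fraction of correctly classified samples (rows = true, columns = predicted)."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def mutual_information(cm):
    """Mutual information in bits between true and predicted labels."""
    total = sum(sum(row) for row in cm)
    row_sums = [sum(row) for row in cm]
    col_sums = [sum(row[j] for row in cm) for j in range(len(cm[0]))]
    mi = 0.0
    for i, row in enumerate(cm):
        for j, n in enumerate(row):
            if n > 0:
                # p(i, j) * log2( p(i, j) / (p(i) * p(j)) )
                mi += (n / total) * math.log2(n * total / (row_sums[i] * col_sums[j]))
    return mi


# Hypothetical confusion matrices on 100 samples (90 of class 0, 10 of class 1).
majority = [[90, 0], [10, 0]]   # always predicts class 0
informed = [[70, 20], [0, 10]]  # lower accuracy, but predictions depend on the input

acc_majority, mi_majority = accuracy(majority), mutual_information(majority)
acc_informed, mi_informed = accuracy(informed), mutual_information(informed)
```

The majority-class predictor has the higher accuracy but transfers no information about the true class, whereas the second classifier is less accurate yet its predictions carry information about the input.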
|
19
|
Lagunin AA, Goel RK, Gawande DY, Pahwa P, Gloriozova TA, Dmitriev AV, Ivanov SM, Rudik AV, Konova VI, Pogodin PV, Druzhilovsky DS, Poroikov VV. Chemo- and bioinformatics resources for in silico drug discovery from medicinal plants beyond their traditional use: a critical review. Nat Prod Rep 2014; 31:1585-611. [DOI: 10.1039/c4np00068d] [Citation(s) in RCA: 87] [Impact Index Per Article: 8.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023]
Abstract
An overview of databases and in silico tools for discovery of the hidden therapeutic potential of medicinal plants.
Affiliation(s)
- Alexey A. Lagunin
- Orekhovich Institute of Biomedical Chemistry of Rus. Acad. Med. Sci
- Moscow, Russia
- Russian National Research Medical University
- Medico-Biologic Faculty
- Moscow, Russia
| | - Rajesh K. Goel
- Department of Pharmaceutical Sciences and Drug Research
- Punjabi University
- Patiala-147002, India
| | - Dinesh Y. Gawande
- Department of Pharmaceutical Sciences and Drug Research
- Punjabi University
- Patiala-147002, India
| | - Priynka Pahwa
- Department of Pharmaceutical Sciences and Drug Research
- Punjabi University
- Patiala-147002, India
| | | | | | - Sergey M. Ivanov
- Orekhovich Institute of Biomedical Chemistry of Rus. Acad. Med. Sci
- Moscow, Russia
| | - Anastassia V. Rudik
- Orekhovich Institute of Biomedical Chemistry of Rus. Acad. Med. Sci
- Moscow, Russia
| | - Varvara I. Konova
- Orekhovich Institute of Biomedical Chemistry of Rus. Acad. Med. Sci
- Moscow, Russia
| | - Pavel V. Pogodin
- Orekhovich Institute of Biomedical Chemistry of Rus. Acad. Med. Sci
- Moscow, Russia
- Russian National Research Medical University
- Medico-Biologic Faculty
- Moscow, Russia
| | | | - Vladimir V. Poroikov
- Orekhovich Institute of Biomedical Chemistry of Rus. Acad. Med. Sci
- Moscow, Russia
- Russian National Research Medical University
- Medico-Biologic Faculty
- Moscow, Russia
|
20
|
Using gene expression programming to infer gene regulatory networks from time-series data. Comput Biol Chem 2013; 47:198-206. [DOI: 10.1016/j.compbiolchem.2013.09.004] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2013] [Revised: 09/19/2013] [Accepted: 09/21/2013] [Indexed: 11/22/2022]
|
21
|
Emmert-Streib F, Dehmer M. Enhancing systems medicine beyond genotype data by dynamic patient signatures: having information and using it too. Front Genet 2013; 4:241. [PMID: 24312119 PMCID: PMC3832803 DOI: 10.3389/fgene.2013.00241] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/11/2013] [Accepted: 10/24/2013] [Indexed: 01/08/2023] Open
Abstract
To establish systems medicine, in which results and insights from basic biological research are applied to medical and clinical patient care, it is essential to measure patient-based data that represent the molecular and cellular state of the patient's pathology. In this paper, we discuss potential limitations of the sole usage of static genotype data, e.g., from next-generation sequencing, for translational research. The hypothesis advocated in this paper is that dynOmics data, i.e., high-throughput data that are capable of capturing dynamic aspects of the activity of samples from patients, are important for enabling personalized medicine by complementing genotype data.
Affiliation(s)
- Frank Emmert-Streib
- Computational Biology and Machine Learning Laboratory, Faculty of Medicine, Health and Life Sciences, Center for Cancer Research and Cell Biology, School of Medicine, Dentistry and Biomedical Sciences, Queen's University Belfast, Belfast, UK
| | - Matthias Dehmer
- Institute for Bioinformatics and Translational Research, UMIT, Hall in Tyrol, Austria
|
22
|
Lee YS, Krishnan A, Zhu Q, Troyanskaya OG. Ontology-aware classification of tissue and cell-type signals in gene expression profiles across platforms and technologies. Bioinformatics 2013; 29:3036-44. [PMID: 24037214 PMCID: PMC3834796 DOI: 10.1093/bioinformatics/btt529] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
Abstract
Motivation: Leveraging gene expression data through large-scale integrative analyses for multicellular organisms is challenging because most samples are not fully annotated to their tissue/cell-type of origin. A computational method to classify samples using their entire gene expression profiles is needed. Such a method must be applicable across thousands of independent studies, hundreds of gene expression technologies and hundreds of diverse human tissues and cell-types. Results: We present Unveiling RNA Sample Annotation (URSA) that leverages the complex tissue/cell-type relationships and simultaneously estimates the probabilities associated with hundreds of tissues/cell-types for any given gene expression profile. URSA provides accurate and intuitive probability values for expression profiles across independent studies and outperforms other methods, irrespective of data preprocessing techniques. Moreover, without re-training, URSA can be used to classify samples from diverse microarray platforms and even from next-generation sequencing technology. Finally, we provide a molecular interpretation for the tissue and cell-type models as the biological basis for URSA’s classifications. Availability and implementation: An interactive web interface for using URSA for gene expression analysis is available at: ursa.princeton.edu. The source code is available at https://bitbucket.org/youngl/ursa_backend. Contact: ogt@cs.princeton.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Young-suk Lee
- Department of Computer Science, Princeton University, Princeton, NJ 08544, USA and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA
|
23
|
Wu G, Ji H. ChIPXpress: using publicly available gene expression data to improve ChIP-seq and ChIP-chip target gene ranking. BMC Bioinformatics 2013; 14:188. [PMID: 23758851 PMCID: PMC3684512 DOI: 10.1186/1471-2105-14-188] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/05/2012] [Accepted: 06/04/2013] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND ChIPx (i.e., ChIP-seq and ChIP-chip) is increasingly used to map genome-wide transcription factor (TF) binding sites. A single ChIPx experiment can identify thousands of TF bound genes, but typically only a fraction of these genes are functional targets that respond transcriptionally to perturbations of TF expression. To identify promising functional target genes for follow-up studies, researchers usually collect gene expression data from TF perturbation experiments to determine which of the TF targets respond transcriptionally to binding. Unfortunately, approximately 40% of ChIPx studies do not have accompanying gene expression data from TF perturbation experiments. For these studies, genes are often prioritized solely based on the binding strengths of ChIPx signals in order to choose follow-up candidates. ChIPXpress is a novel method that improves upon this ChIPx-only ranking approach by integrating ChIPx data with large amounts of Publicly available gene Expression Data (PED). RESULTS We demonstrate that PED does contain useful information to identify functional TF target genes despite its inherent heterogeneity. A truncated absolute correlation measure is developed to better capture the regulatory relationships between TFs and their target genes in PED. By integrating the information from ChIPx and PED, ChIPXpress can significantly increase the chance of finding functional target genes responsive to TF perturbation among the top ranked genes. ChIPXpress is implemented as an easy-to-use R/Bioconductor package. We evaluate ChIPXpress using 10 different ChIPx datasets in mouse and human and find that ChIPXpress rankings are more accurate than rankings based solely on ChIPx data and may result in substantial improvement in prediction accuracy, irrespective of which peak calling algorithm is used to analyze the ChIPx data. 
CONCLUSIONS ChIPXpress provides a new tool to better prioritize TF bound genes from ChIPx experiments for follow-up studies when investigators do not have their own gene expression data. It demonstrates that the regulatory information from PED can be used to boost ChIPx data analyses. It also represents an important step towards more fully utilizing the valuable, but highly heterogeneous data contained in public gene expression databases.
Affiliation(s)
- George Wu
- Department of Biostatistics, Johns Hopkins University Bloomberg School of Public Health, 615 North Wolfe Street, Baltimore, MD 21205, USA
|
24
|
Lagunin A, Ivanov S, Rudik A, Filimonov D, Poroikov V. DIGEP-Pred: web service for in silico prediction of drug-induced gene expression profiles based on structural formula. Bioinformatics 2013; 29:2062-3. [PMID: 23740741 DOI: 10.1093/bioinformatics/btt322] [Citation(s) in RCA: 67] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Abstract
SUMMARY Experimentally found gene expression profiles are used to solve different problems in pharmaceutical studies, such as drug repositioning, resistance, toxicity and drug-drug interactions. A special web service, DIGEP-Pred, for prediction of drug-induced changes of gene expression profiles based on structural formulae of chemicals has been developed. Structure-activity relationships for prediction of drug-induced gene expression profiles were determined by Prediction of Activity Spectra for Substances (PASS) software. Comparative Toxicogenomics Database with data on the known drug-induced gene expression profiles of chemicals was used to create mRNA- and protein-based training sets. An average prediction accuracy for the training sets (ROC AUC) calculated by leave-one-out cross-validation on the basis of mRNA data (1385 compounds, 952 genes, 500 up- and 475 down-regulations) and protein data (1451 compounds, 139 genes, 93 up- and 55 down-regulations) exceeded 0.85. AVAILABILITY Freely available on the web at http://www.way2drug.com/GE.
Affiliation(s)
- Alexey Lagunin
- Laboratory for Structure-Function Based Drug Design, Orekhovich Institute of Biomedical Chemistry of the Russian Academy of Medical Sciences, Moscow, Russia.
|
25
|
Csermely P, Korcsmáros T, Kiss HJM, London G, Nussinov R. Structure and dynamics of molecular networks: a novel paradigm of drug discovery: a comprehensive review. Pharmacol Ther 2013; 138:333-408. [PMID: 23384594 PMCID: PMC3647006 DOI: 10.1016/j.pharmthera.2013.01.016] [Citation(s) in RCA: 512] [Impact Index Per Article: 46.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2013] [Accepted: 01/22/2013] [Indexed: 02/02/2023]
Abstract
Despite considerable progress in genome- and proteome-based high-throughput screening methods and in rational drug design, the increase in approved drugs in the past decade did not match the increase of drug development costs. Network description and analysis not only give a systems-level understanding of drug action and disease complexity, but can also help to improve the efficiency of drug design. We give a comprehensive assessment of the analytical tools of network topology and dynamics. The state-of-the-art use of chemical similarity, protein structure, protein-protein interaction, signaling, genetic interaction and metabolic networks in the discovery of drug targets is summarized. We propose that network targeting follows two basic strategies. The "central hit strategy" selectively targets central nodes/edges of the flexible networks of infectious agents or cancer cells to kill them. The "network influence strategy" works against other diseases, where an efficient reconfiguration of rigid networks needs to be achieved by targeting the neighbors of central nodes/edges. It is shown how network techniques can help in the identification of single-target, edgetic, multi-target and allo-network drug target candidates. We review the recent boom in network methods helping hit identification, lead selection optimizing drug efficacy, as well as minimizing side-effects and drug toxicity. Successful network-based drug development strategies are shown through the examples of infections, cancer, metabolic diseases, neurodegenerative diseases and aging. Summarizing >1200 references we suggest an optimized protocol of network-aided drug development, and provide a list of systems-level hallmarks of drug quality. Finally, we highlight network-related drug development trends helping to achieve these hallmarks by a cohesive, global approach.
Affiliation(s)
- Peter Csermely
- Department of Medical Chemistry, Semmelweis University, P.O. Box 260, H-1444 Budapest 8, Hungary.
|
26
|
Wu G, Yustein JT, McCall MN, Zilliox M, Irizarry RA, Zeller K, Dang CV, Ji H. ChIP-PED enhances the analysis of ChIP-seq and ChIP-chip data. Bioinformatics 2013; 29:1182-9. [PMID: 23457041 PMCID: PMC3658457 DOI: 10.1093/bioinformatics/btt108] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/31/2022]
Abstract
Motivation: Although chromatin immunoprecipitation coupled with high-throughput sequencing (ChIP-seq) or tiling array hybridization (ChIP-chip) is increasingly used to map genome-wide binding sites of transcription factors (TFs), it still remains difficult to generate a quality ChIPx (i.e. ChIP-seq or ChIP-chip) dataset because of the tremendous amount of effort required to develop effective antibodies and efficient protocols. Moreover, most laboratories are unable to easily obtain ChIPx data for one or more TF(s) in more than a handful of biological contexts. Thus, standard ChIPx analyses primarily focus on analyzing data from one experiment, and the discoveries are restricted to a specific biological context. Results: We propose to enrich this existing data analysis paradigm by developing a novel approach, ChIP-PED, which superimposes ChIPx data on large amounts of publicly available human and mouse gene expression data containing a diverse collection of cell types, tissues and disease conditions to discover new biological contexts with potential TF regulatory activities. We demonstrate ChIP-PED using a number of examples, including a novel discovery that MYC, a human TF, plays an important functional role in pediatric Ewing sarcoma cell lines. These examples show that ChIP-PED increases the value of ChIPx data by allowing one to expand the scope of possible discoveries made from a ChIPx experiment. Availability: http://www.biostat.jhsph.edu/∼gewu/ChIPPED/ Contact: hji@jhsph.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- George Wu
- Department of Biostatistics, Johns Hopkins University Bloomberg School of Public Health, Baltimore, MD 21205, USA
|
27
|
Goldstein TC, Paull EO, Ellis MJ, Stuart JM. Molecular pathways: extracting medical knowledge from high-throughput genomic data. Clin Cancer Res 2013; 19:3114-20. [PMID: 23430023 DOI: 10.1158/1078-0432.ccr-12-2093] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/22/2022]
Abstract
High-throughput genomic data that measures RNA expression, DNA copy number, mutation status, and protein levels provide us with insights into the molecular pathway structure of cancer. Genomic lesions (amplifications, deletions, mutations) and epigenetic modifications disrupt biochemical cellular pathways. Although the number of possible lesions is vast, different genomic alterations may result in concordant expression and pathway activities, producing common tumor subtypes that share similar phenotypic outcomes. How can these data be translated into medical knowledge that provides prognostic and predictive information? First-generation mRNA expression signatures such as Genomic Health's Oncotype DX already provide prognostic information, but do not provide therapeutic guidance beyond the current standard of care, which is often inadequate in high-risk patients. Rather than building molecular signatures based on gene expression levels, evidence is growing that signatures based on higher-level quantities such as from genetic pathways may provide important prognostic and diagnostic cues. We provide examples of how activities for molecular entities can be predicted from pathway analysis and how the composite of all such activities, referred to here as the "activitome," helps connect genomic events to clinical factors to predict the drivers of poor outcome.
Affiliation(s)
- Theodore C Goldstein
- Department of Biomolecular Engineering, University of California, Santa Cruz, California 95064, USA
|
28
|
Abstract
Our understanding of gene expression has changed dramatically over the past decade, largely catalysed by technological developments. High-throughput experiments - microarrays and next-generation sequencing - have generated large amounts of genome-wide gene expression data that are collected in public archives. Added-value databases process, analyse and annotate these data further to make them accessible to every biologist. In this Review, we discuss the utility of the gene expression data that are in the public domain and how researchers are making use of these data. Reuse of public data can be very powerful, but there are many obstacles in data preparation and analysis and in the interpretation of the results. We will discuss these challenges and provide recommendations that we believe can improve the utility of such data.
Affiliation(s)
- Johan Rung
- EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
|
29
|
Qu XA, Rajpal DK. Applications of Connectivity Map in drug discovery and development. Drug Discov Today 2012; 17:1289-98. [DOI: 10.1016/j.drudis.2012.07.017] [Citation(s) in RCA: 175] [Impact Index Per Article: 14.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2012] [Revised: 06/01/2012] [Accepted: 07/13/2012] [Indexed: 11/17/2022]
|
30
|
Coletta A, Molter C, Duqué R, Steenhoff D, Taminau J, de Schaetzen V, Meganck S, Lazar C, Venet D, Detours V, Nowé A, Bersini H, Weiss Solís DY. InSilico DB genomic datasets hub: an efficient starting point for analyzing genome-wide studies in GenePattern, Integrative Genomics Viewer, and R/Bioconductor. Genome Biol 2012; 13:R104. [PMID: 23158523 PMCID: PMC3580496 DOI: 10.1186/gb-2012-13-11-r104] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2012] [Accepted: 11/18/2012] [Indexed: 12/18/2022] Open
Abstract
Genomics datasets are increasingly useful for gaining biomedical insights, with adoption in the clinic underway. However, multiple hurdles related to data management stand in the way of their efficient large-scale utilization. The solution proposed is a web-based data storage hub. Having clear focus, flexibility and adaptability, InSilico DB seamlessly connects genomics dataset repositories to state-of-the-art and free GUI and command-line data analysis tools. The InSilico DB platform is a powerful collaborative environment, with advanced capabilities for biocuration, dataset sharing, and dataset subsetting and combination. InSilico DB is available from https://insilicodb.org.
|
31
|
Georgii E, Salojärvi J, Brosché M, Kangasjärvi J, Kaski S. Targeted retrieval of gene expression measurements using regulatory models. Bioinformatics 2012; 28:2349-56. [DOI: 10.1093/bioinformatics/bts361] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Indexed: 01/21/2023] Open
|
32
|
Caldas J, Gehlenborg N, Kettunen E, Faisal A, Rönty M, Nicholson AG, Knuutila S, Brazma A, Kaski S. Data-driven information retrieval in heterogeneous collections of transcriptomics data links SIM2s to malignant pleural mesothelioma. Bioinformatics 2011; 28:246-53. [PMID: 22106335 PMCID: PMC3259436 DOI: 10.1093/bioinformatics/btr634] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Indexed: 01/21/2023]
Abstract
Motivation: Genome-wide measurement of transcript levels is a ubiquitous tool in biomedical research. As experimental data continues to be deposited in public databases, it is becoming important to develop search engines that enable the retrieval of relevant studies given a query study. While retrieval systems based on meta-data already exist, data-driven approaches that retrieve studies based on similarities in the expression data itself have a greater potential of uncovering novel biological insights. Results: We propose an information retrieval method based on differential expression. Our method deals with arbitrary experimental designs and performs competitively with alternative approaches, while making the search results interpretable in terms of differential expression patterns. We show that our model yields meaningful connections between biological conditions from different studies. Finally, we validate a previously unknown connection between malignant pleural mesothelioma and SIM2s suggested by our method, via real-time polymerase chain reaction in an independent set of mesothelioma samples. Availability: Supplementary data and source code are available from http://www.ebi.ac.uk/fg/research/rex. Contact: samuel.kaski@aalto.fi. Supplementary Information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- José Caldas
- Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Helsinki, Finland
|
33
|
Inferring gene-phenotype associations via global protein complex network propagation. PLoS One 2011; 6:e21502. [PMID: 21799737 PMCID: PMC3143124 DOI: 10.1371/journal.pone.0021502] [Citation(s) in RCA: 64] [Impact Index Per Article: 4.9] [Received: 04/20/2011] [Accepted: 05/30/2011] [Indexed: 12/05/2022] Open
Abstract
Background: Phenotypically similar diseases have been found to be caused by functionally related genes, suggesting a modular organization of the genetic landscape of human diseases that mirrors the modularity observed in biological interaction networks. Protein complexes, as molecular machines that integrate multiple gene products to perform biological functions, express the underlying modular organization of protein-protein interaction networks. As such, protein complexes can be useful for interrogating the networks of phenome and interactome to elucidate gene-phenotype associations of diseases. Methodology/Principal Findings: We proposed a technique called RWPCN (Random Walker on Protein Complex Network) for predicting and prioritizing disease genes. The basis of RWPCN is a protein complex network constructed using existing human protein complexes and a protein interaction network. To prioritize candidate disease genes for the query disease phenotypes, we compute the associations between the protein complexes and the query phenotypes in their respective protein complex and phenotype networks. We tested RWPCN on predicting gene-phenotype associations using leave-one-out cross-validation; our method was observed to outperform existing approaches. We also applied RWPCN to predict novel disease genes for two representative diseases, namely, Breast Cancer and Diabetes. Conclusions/Significance: Guilt-by-association prediction and prioritization of disease genes can be enhanced by fully exploiting the underlying modular organizations of both the disease phenome and the protein interactome. Our RWPCN uses a novel protein complex network as a basis for interrogating the human phenome-interactome network. As the protein complex network can capture the underlying modularity in the biological interaction networks better than simple protein interaction networks, RWPCN was found to be able to detect and prioritize disease genes better than traditional approaches that used only protein-phenotype associations.
|
34
|
Engreitz JM, Morgan AA, Dudley JT, Chen R, Thathoo R, Altman RB, Butte AJ. Content-based microarray search using differential expression profiles. BMC Bioinformatics 2010; 11:603. [PMID: 21172034 PMCID: PMC3022631 DOI: 10.1186/1471-2105-11-603] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.6] [Received: 06/30/2010] [Accepted: 12/21/2010] [Indexed: 12/20/2022] Open
Abstract
Background: With the expansion of public repositories such as the Gene Expression Omnibus (GEO), we are rapidly cataloging cellular transcriptional responses to diverse experimental conditions. Methods that query these repositories based on gene expression content, rather than textual annotations, may enable more effective experiment retrieval as well as the discovery of novel associations between drugs, diseases, and other perturbations. Results: We develop methods to retrieve gene expression experiments that differentially express the same transcriptional programs as a query experiment. Avoiding thresholds, we generate differential expression profiles that include a score for each gene measured in an experiment. We use existing and novel dimension reduction and correlation measures to rank relevant experiments in an entirely data-driven manner, allowing emergent features of the data to drive the results. A combination of matrix decomposition and p-weighted Pearson correlation proves the most suitable for comparing differential expression profiles. We apply this method to index all GEO DataSets, and demonstrate the utility of our approach by identifying pathways and conditions relevant to transcription factors Nanog and FoxO3. Conclusions: Content-based gene expression search generates relevant hypotheses for biological inquiry. Experiments across platforms, tissue types, and protocols inform the analysis of new datasets.
Affiliation(s)
- Jesse M Engreitz
- Department of Bioengineering, Stanford University School of Medicine, CA, USA
|
35
|
Ma S, Funk CC, Price ND. Systems approaches to molecular cancer diagnostics. Discovery Medicine 2010; 10:531-542. [PMID: 21189224 PMCID: PMC3155470] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/30/2023]
Abstract
The search for improved molecular cancer diagnostics is a challenge for which systems approaches show great promise. As is becoming increasingly clear, cancer is a perpetually-evolving, highly multi-factorial disease. With next generation sequencing providing an ever-increasing amount of high-throughput data, the need for analytical tools that can provide meaningful context is critical. Systems approaches have demonstrated an ability to separate meaningful signal from noise that arises from population heterogeneity, heterogeneity within and across tumors, and multiple sources of technical variation when sufficient sample sizes are obtained and standardized measurement technologies are used. The ability to develop clinically useful molecular cancer diagnostics will be predicated on advancements on two major fronts: 1) more comprehensive and accurate measurements of multiple endpoints, and 2) more sophisticated analytical tools that synthesize high-throughput data into meaningful reflections of cellular states. To this end, systems approaches that have integrated transcriptomic data onto biomolecular networks have shown promise in their ability to classify tumor subtypes, predict clinical progression, and inform treatment options. Ultimately, the success of systems approaches will be measured by their ability to develop molecular cancer diagnostics through distilling complex, systems-wide information into actionable information in the clinic.
|
36
|
Barrett T, Troup DB, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Muertter RN, Holko M, Ayanbule O, Yefanov A, Soboleva A. NCBI GEO: archive for functional genomics data sets--10 years on. Nucleic Acids Res 2010; 39:D1005-10. [PMID: 21097893 PMCID: PMC3013736 DOI: 10.1093/nar/gkq1184] [Citation(s) in RCA: 813] [Impact Index Per Article: 58.1] [Indexed: 12/19/2022] Open
Abstract
A decade ago, the Gene Expression Omnibus (GEO) database was established at the National Center for Biotechnology Information (NCBI). The original objective of GEO was to serve as a public repository for high-throughput gene expression data generated mostly by microarray technology. However, the research community quickly applied microarrays to non-gene-expression studies, including examination of genome copy number variation and genome-wide profiling of DNA-binding proteins. Because the GEO database was designed with a flexible structure, it was possible to quickly adapt the repository to store these data types. More recently, as the microarray community switches to next-generation sequencing technologies, GEO has again adapted to host these data sets. Today, GEO stores over 20,000 microarray- and sequence-based functional genomics studies, and continues to handle the majority of direct high-throughput data submissions from the research community. Multiple mechanisms are provided to help users effectively search, browse, download and visualize the data at the level of individual genes or entire studies. This paper describes recent database enhancements, including new search and data representation tools, as well as a brief review of how the community uses GEO data. GEO is freely accessible at http://www.ncbi.nlm.nih.gov/geo/.
Affiliation(s)
- Tanya Barrett
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 45 Center Drive, Bethesda, MD 20892, USA.
|