101
Harrington LX, Way GP, Doherty JA, Greene CS. Functional network community detection can disaggregate and filter multiple underlying pathways in enrichment analyses. Pac Symp Biocomput 2018; 23:157-167. [PMID: 29218878 PMCID: PMC5760988] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
Differential expression experiments or other analyses often end in a list of genes. Pathway enrichment analysis is one method to discern important biological signals and patterns from noisy expression data. However, pathway enrichment analysis may perform suboptimally in situations where there are multiple implicated pathways - such as in the case of genes that define subtypes of complex diseases. Our simulation study shows that in this setting, standard overrepresentation analysis identifies many false positive pathways along with the true positives. These false positives hamper investigators' attempts to glean biological insights from enrichment analysis. We develop and evaluate an approach that combines community detection over functional networks with pathway enrichment to reduce false positives. Our simulation study demonstrates that a large reduction in false positives can be obtained with a small decrease in power. Though we hypothesized that multiple communities might underlie previously described subtypes of high-grade serous ovarian cancer and applied this approach, our results do not support this hypothesis. In summary, applying community detection before enrichment analysis may ease interpretation for complex gene sets that represent multiple distinct pathways.
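Overrepresentation analysis of the kind discussed in this abstract is commonly computed as an upper-tail hypergeometric test, applied here once per detected community rather than once to the whole gene list. The following is a minimal sketch under that assumption, not the authors' implementation: the function names, the gene identifiers, and the precomputed `communities` input are hypothetical, and the community detection step is assumed to have been run elsewhere.

```python
from math import comb


def overrepresentation_pvalue(gene_list, pathway, universe_size):
    """Upper-tail hypergeometric p-value P(X >= k) for the pathway overlap.

    Assumes every gene in gene_list and pathway belongs to a background
    universe of universe_size genes (an assumption of this sketch).
    """
    genes, members = set(gene_list), set(pathway)
    n, pathway_size = len(genes), len(members)
    k = len(genes & members)  # observed overlap
    total = comb(universe_size, n)
    tail = sum(
        comb(pathway_size, i) * comb(universe_size - pathway_size, n - i)
        for i in range(k, min(pathway_size, n) + 1)
    )
    return tail / total


def enrich_by_community(communities, pathways, universe_size):
    """Test each (hypothetical, precomputed) community against each pathway."""
    return {
        (community_id, pathway_id): overrepresentation_pvalue(
            genes, members, universe_size
        )
        for community_id, genes in communities.items()
        for pathway_id, members in pathways.items()
    }
```

In practice the resulting p-values would still need multiple-testing correction across all community and pathway pairs, which this sketch omits.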
Affiliation(s)
- Lia X Harrington
  - Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth College, Hanover 03784, USA
102
Allaway RJ, Fischer DA, de Abreu FB, Gardner TB, Gordon SR, Barth RJ, Colacchio TA, Wood M, Kacsoh BZ, Bouley SJ, Cui J, Hamilton J, Choi JA, Lange JT, Peterson JD, Padmanabhan V, Tomlinson CR, Tsongalis GJ, Suriawinata AA, Greene CS, Sanchez Y, Smith KD. Genomic characterization of patient-derived xenograft models established from fine needle aspirate biopsies of a primary pancreatic ductal adenocarcinoma and from patient-matched metastatic sites. Oncotarget 2017; 7:17087-102. [PMID: 26934555 PMCID: PMC4941373 DOI: 10.18632/oncotarget.7718] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.0] [Received: 07/03/2015] [Accepted: 01/13/2016] [Indexed: 12/12/2022]
Abstract
N-of-1 trials target actionable mutations, yet such approaches do not test genomically-informed therapies in patient tumor models prior to patient treatment. To address this, we developed patient-derived xenograft (PDX) models from fine needle aspiration (FNA) biopsies (FNA-PDX) obtained from primary pancreatic ductal adenocarcinoma (PDAC) at the time of diagnosis. Here, we characterize PDX models established from one primary and two metastatic sites of one patient. We identified an activating KRAS G12R mutation among other mutations in these models. In explant cells derived from these PDX tumor models with a KRAS G12R mutation, treatment with inhibitors of CDKs (including CDK9) reduced phosphorylation of a marker of CDK9 activity (phospho-RNAPII CTD Ser2/5) and reduced viability/growth of explant cells derived from PDAC PDX models. Similarly, a CDK inhibitor reduced phospho-RNAPII CTD Ser2/5, increased apoptosis, and inhibited tumor growth in FNA-PDX and patient-matched metastatic-PDX models. In summary, PDX models can be constructed from FNA biopsies of PDAC which in turn can enable genomic characterization and identification of potential therapies.
Affiliation(s)
- Robert J Allaway
  - Department of Pharmacology and Toxicology, Geisel School of Medicine, Dartmouth College, Hanover, NH 03755, USA
- Dawn A Fischer
  - Department of Surgery, Division of Surgical Oncology, Dartmouth-Hitchcock Medical Center, Lebanon, NH 03756, USA
- Francine B de Abreu
  - Department of Pathology, Dartmouth-Hitchcock Medical Center, Lebanon, NH 03756, USA
- Timothy B Gardner
  - Department of Medicine, Section of Gastroenterology and Hepatology, Dartmouth-Hitchcock Medical Center, Lebanon, NH 03756, USA
- Stuart R Gordon
  - Department of Medicine, Section of Gastroenterology and Hepatology, Dartmouth-Hitchcock Medical Center, Lebanon, NH 03756, USA
- Richard J Barth
  - Department of Surgery, Division of Surgical Oncology, Dartmouth-Hitchcock Medical Center, Lebanon, NH 03756, USA
  - Dartmouth-Hitchcock Norris Cotton Cancer Center, Lebanon, NH 03756, USA
- Thomas A Colacchio
  - Department of Surgery, Division of Surgical Oncology, Dartmouth-Hitchcock Medical Center, Lebanon, NH 03756, USA
  - Dartmouth-Hitchcock Norris Cotton Cancer Center, Lebanon, NH 03756, USA
- Matthew Wood
  - Department of Pharmacology and Toxicology, Geisel School of Medicine, Dartmouth College, Hanover, NH 03755, USA
  - Current location: Department of Pathology, University of California, San Francisco, CA 94143, USA
- Balint Z Kacsoh
  - Department of Genetics, Geisel School of Medicine, Dartmouth College, Hanover, NH 03756, USA
- Stephanie J Bouley
  - Department of Pharmacology and Toxicology, Geisel School of Medicine, Dartmouth College, Hanover, NH 03755, USA
- Jingxuan Cui
  - Department of Genetics, Geisel School of Medicine, Dartmouth College, Hanover, NH 03756, USA
- Joanna Hamilton
  - Department of Pharmacology and Toxicology, Geisel School of Medicine, Dartmouth College, Hanover, NH 03755, USA
  - Department of Medicine, Dartmouth-Hitchcock Medical Center, Lebanon, NH 03756, USA
- Jungbin A Choi
  - Department of Pharmacology and Toxicology, Geisel School of Medicine, Dartmouth College, Hanover, NH 03755, USA
- Joshua T Lange
  - Department of Pharmacology and Toxicology, Geisel School of Medicine, Dartmouth College, Hanover, NH 03755, USA
- Jason D Peterson
  - Department of Pathology, Dartmouth-Hitchcock Medical Center, Lebanon, NH 03756, USA
- Craig R Tomlinson
  - Department of Pharmacology and Toxicology, Geisel School of Medicine, Dartmouth College, Hanover, NH 03755, USA
  - Dartmouth-Hitchcock Norris Cotton Cancer Center, Lebanon, NH 03756, USA
  - Department of Medicine, Dartmouth-Hitchcock Medical Center, Lebanon, NH 03756, USA
- Gregory J Tsongalis
  - Department of Pathology, Dartmouth-Hitchcock Medical Center, Lebanon, NH 03756, USA
  - Dartmouth-Hitchcock Norris Cotton Cancer Center, Lebanon, NH 03756, USA
- Arief A Suriawinata
  - Department of Pathology, Dartmouth-Hitchcock Medical Center, Lebanon, NH 03756, USA
- Casey S Greene
  - Dartmouth-Hitchcock Norris Cotton Cancer Center, Lebanon, NH 03756, USA
  - Department of Genetics, Geisel School of Medicine, Dartmouth College, Hanover, NH 03756, USA
  - Institute for Quantitative Biomedical Sciences, Dartmouth College, Hanover, NH 03755, USA
- Yolanda Sanchez
  - Department of Pharmacology and Toxicology, Geisel School of Medicine, Dartmouth College, Hanover, NH 03755, USA
  - Dartmouth-Hitchcock Norris Cotton Cancer Center, Lebanon, NH 03756, USA
- Kerrington D Smith
  - Department of Surgery, Division of Surgical Oncology, Dartmouth-Hitchcock Medical Center, Lebanon, NH 03756, USA
  - Dartmouth-Hitchcock Norris Cotton Cancer Center, Lebanon, NH 03756, USA
103
Tan J, Huyck M, Hu D, Zelaya RA, Hogan DA, Greene CS. ADAGE signature analysis: differential expression analysis with data-defined gene sets. BMC Bioinformatics 2017; 18:512. [PMID: 29166858 PMCID: PMC5700673 DOI: 10.1186/s12859-017-1905-4] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 06/27/2017] [Accepted: 11/01/2017] [Indexed: 12/18/2022]
Abstract
BACKGROUND Gene set enrichment analysis and overrepresentation analyses are commonly used methods to determine the biological processes affected by a differential expression experiment. This approach requires biologically relevant gene sets, which are currently curated manually, limiting their availability and accuracy in many organisms without extensively curated resources. New feature learning approaches can now be paired with existing data collections to directly extract functional gene sets from big data. RESULTS Here we introduce a method to identify perturbed processes. In contrast with methods that use curated gene sets, this approach uses signatures extracted from public expression data. We first extract expression signatures from public data using ADAGE, a neural network-based feature extraction approach. We next identify signatures that are differentially active under a given treatment. Our results demonstrate that these signatures represent biological processes that are perturbed by the experiment. Because these signatures are directly learned from data without supervision, they can identify uncurated or novel biological processes. We implemented ADAGE signature analysis for the bacterial pathogen Pseudomonas aeruginosa. For the convenience of different user groups, we implemented both an R package (ADAGEpath) and a web server (http://adage.greenelab.com) to run these analyses. Both are open-source to allow easy expansion to other organisms or signature generation methods. We applied ADAGE signature analysis to an example dataset in which wild-type and ∆anr mutant cells were grown as biofilms on the Cystic Fibrosis genotype bronchial epithelial cells. We mapped active signatures in the dataset to KEGG pathways and compared with pathways identified using GSEA. The two approaches generally return consistent results; however, ADAGE signature analysis also identified a signature that revealed the molecularly supported link between the MexT regulon and Anr.
CONCLUSIONS We designed ADAGE signature analysis to perform gene set analysis using data-defined functional gene signatures. This approach addresses an important gap for biologists studying non-traditional model organisms and those without extensive curated resources available. We built both an R package and web server to provide ADAGE signature analysis to the community.
Affiliation(s)
- Jie Tan
  - Department of Molecular and Systems Biology, Geisel School of Medicine at Dartmouth, Hanover, NH, 03755, USA
- Matthew Huyck
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, 19104, USA
  - Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, 03755, USA
- Dongbo Hu
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, 19104, USA
- René A Zelaya
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Deborah A Hogan
  - Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, 03755, USA
- Casey S Greene
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, 19104, USA
104
Doherty JA, Peres LC, Wang C, Way GP, Greene CS, Schildkraut JM. Challenges and Opportunities in Studying the Epidemiology of Ovarian Cancer Subtypes. Curr Epidemiol Rep 2017; 4:211-220. [PMID: 29226065 PMCID: PMC5718213 DOI: 10.1007/s40471-017-0115-y] [Citation(s) in RCA: 52] [Impact Index Per Article: 7.4] [Indexed: 12/11/2022]
Abstract
PURPOSE OF REVIEW Only recently has it become clear that epithelial ovarian cancer (EOC) comprises such distinct histotypes (with different cells of origin, morphology, molecular features, epidemiologic factors, clinical features, and survival patterns) that they can be thought of as different diseases sharing an anatomical location. Herein, we review opportunities and challenges in studying EOC heterogeneity. RECENT FINDINGS The 2014 World Health Organization diagnostic guidelines incorporate accumulated evidence that high- and low-grade serous tumors have different underlying pathogenesis, and that, on the basis of shared molecular features, most high-grade tumors, including some previously classified as endometrioid, are now considered to be high-grade serous. At the same time, several studies have reported that high-grade serous EOC, which is the most common histotype, is itself made up of reproducible subtypes discernible by gene expression patterns. SUMMARY These major advances in understanding set the stage for a new era of research on EOC risk and clinical outcomes with the potential to reduce morbidity and mortality. We highlight the need for multidisciplinary studies with pathology review using the current guidelines, further molecular characterization of the histotypes and subtypes, inclusion of women of diverse racial/ethnic and socioeconomic backgrounds, and updated epidemiologic and clinical data relevant to current generations of women at risk of EOC.
Affiliation(s)
- Jennifer Anne Doherty
  - Department of Population Health Sciences, Huntsman Cancer Institute, University of Utah, 2000 Circle of Hope, Rm 4125, Salt Lake City, Utah, 84112
- Lauren Cole Peres
  - Department of Public Health Sciences, University of Virginia, P.O. Box 800765, Charlottesville, Virginia, 22903
- Chen Wang
  - Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota
- Gregory P. Way
  - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania
- Casey S. Greene
  - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania
- Joellen M. Schildkraut
  - Department of Public Health Sciences, University of Virginia, P.O. Box 800765, Charlottesville, Virginia, 22903
105
Tan J, Doing G, Lewis KA, Price CE, Chen KM, Cady KC, Perchuk B, Laub MT, Hogan DA, Greene CS. Unsupervised Extraction of Stable Expression Signatures from Public Compendia with an Ensemble of Neural Networks. Cell Syst 2017; 5:63-71.e6. [PMID: 28711280 PMCID: PMC5532071 DOI: 10.1016/j.cels.2017.06.003] [Citation(s) in RCA: 51] [Impact Index Per Article: 7.3] [Received: 12/02/2016] [Revised: 04/11/2017] [Accepted: 06/08/2017] [Indexed: 01/18/2023]
Abstract
Cross-experiment comparisons in public data compendia are challenged by unmatched conditions and technical noise. The ADAGE method, which performs unsupervised integration with denoising autoencoder neural networks, can identify biological patterns, but because ADAGE models, like many neural networks, are over-parameterized, different ADAGE models perform equally well. To enhance model robustness and better build signatures consistent with biological pathways, we developed an ensemble ADAGE (eADAGE) that integrated stable signatures across models. We applied eADAGE to a compendium of Pseudomonas aeruginosa gene expression profiling experiments performed in 78 media. eADAGE revealed a phosphate starvation response controlled by PhoB in media with moderate phosphate and predicted that a second stimulus provided by the sensor kinase, KinB, is required for this PhoB activation. We validated this relationship using both targeted and unbiased genetic approaches. eADAGE, which captures stable biological patterns, enables cross-experiment comparisons that can highlight measured but undiscovered relationships.
Affiliation(s)
- Jie Tan
  - Department of Molecular and Systems Biology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
- Georgia Doing
  - Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
- Kimberley A Lewis
  - Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
- Courtney E Price
  - Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
- Kathleen M Chen
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Kyle C Cady
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Howard Hughes Medical Institute, Cambridge, MA, USA
- Barret Perchuk
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Howard Hughes Medical Institute, Cambridge, MA, USA
- Michael T Laub
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Howard Hughes Medical Institute, Cambridge, MA, USA
- Deborah A Hogan
  - Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
- Casey S Greene
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
106
Affiliation(s)
- Casey S Greene
  - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Lana X Garmire
  - Cancer Epidemiology Program, University of Hawaii Cancer Center, University of Hawaii, Honolulu, Hawaii, USA
- Jack A Gilbert
  - Department of Surgery, University of Chicago School of Medicine, Chicago, Illinois, USA
- Marylyn D Ritchie
  - Biomedical and Translational Informatics Program, Geisinger Health System, Danville, Pennsylvania, USA
- Lawrence E Hunter
  - Department of Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
107
Taroni JN, Greene CS, Martyanov V, Wood TA, Christmann RB, Farber HW, Lafyatis RA, Denton CP, Hinchcliff ME, Pioli PA, Mahoney JM, Whitfield ML. A novel multi-network approach reveals tissue-specific cellular modulators of fibrosis in systemic sclerosis. Genome Med 2017; 9:27. [PMID: 28330499 PMCID: PMC5363043 DOI: 10.1186/s13073-017-0417-1] [Citation(s) in RCA: 67] [Impact Index Per Article: 9.6] [Received: 08/05/2016] [Accepted: 02/23/2017] [Indexed: 12/22/2022]
Abstract
Background Systemic sclerosis (SSc) is a multi-organ autoimmune disease characterized by skin fibrosis. Internal organ involvement is heterogeneous. It is unknown whether disease mechanisms are common across all affected tissues or if each manifestation has a distinct underlying pathology. Methods We used consensus clustering to compare gene expression profiles of biopsies from four SSc-affected tissues (skin, lung, esophagus, and peripheral blood) from patients with SSc, and the related conditions pulmonary fibrosis (PF) and pulmonary arterial hypertension, and derived a consensus disease-associated signature across all tissues. We used this signature to query tissue-specific functional genomic networks. We performed novel network analyses to contrast the skin and lung microenvironments and to assess the functional role of the inflammatory and fibrotic genes in each organ. Lastly, we tested the expression of macrophage activation state-associated gene sets for enrichment in skin and lung using a Wilcoxon rank sum test. Results We identified a common pathogenic gene expression signature, an immune–fibrotic axis, indicative of pro-fibrotic macrophages (MØs) in multiple tissues (skin, lung, esophagus, and peripheral blood mononuclear cells) affected by SSc. While the co-expression of these genes is common to all tissues, the functional consequences of this upregulation differ by organ. We used this disease-associated signature to query tissue-specific functional genomic networks to identify common and tissue-specific pathologies of SSc and related conditions. In contrast to skin, in the lung-specific functional network we identify a distinct lung-resident MØ signature associated with lipid stimulation and alternative activation. In keeping with our network results, we find distinct MØ alternative activation transcriptional programs in SSc-associated PF lung and in the skin of patients with an “inflammatory” SSc gene expression signature.
Conclusions Our results suggest that the innate immune system is central to SSc disease processes but that subtle distinctions exist between tissues. Our approach provides a framework for examining molecular signatures of disease in fibrosis and autoimmune diseases and for leveraging publicly available data to understand common and tissue-specific disease processes in complex human diseases.
Affiliation(s)
- Jaclyn N Taroni
  - Department of Molecular and Systems Biology, Geisel School of Medicine at Dartmouth, 7400 Remsen, Hanover, NH, 03755, USA
- Casey S Greene
  - Department of Systems Pharmacology & Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Viktor Martyanov
  - Department of Molecular and Systems Biology, Geisel School of Medicine at Dartmouth, 7400 Remsen, Hanover, NH, 03755, USA
- Tammara A Wood
  - Department of Molecular and Systems Biology, Geisel School of Medicine at Dartmouth, 7400 Remsen, Hanover, NH, 03755, USA
- Romy B Christmann
  - Division of Rheumatology, Department of Medicine, Boston University School of Medicine, Boston, MA, USA
- Harrison W Farber
  - Pulmonary Center, Department of Medicine, Boston University School of Medicine, Boston, MA, 02118, USA
- Robert A Lafyatis
  - Division of Rheumatology, Department of Medicine, Boston University School of Medicine, Boston, MA, USA
  - Division of Rheumatology and Clinical Immunology, Department of Medicine, University of Pittsburgh Medical Center, Pittsburgh, PA, 15261, USA
- Monique E Hinchcliff
  - Division of Rheumatology, Department of Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, 60611, USA
- Patricia A Pioli
  - Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Lebanon, NH, 03756, USA
- J Matthew Mahoney
  - Department of Neurological Sciences, Larner College of Medicine, University of Vermont, HSRF 426, 149 Beaumont Avenue, Burlington, VT, 05405, USA
- Michael L Whitfield
  - Department of Molecular and Systems Biology, Geisel School of Medicine at Dartmouth, 7400 Remsen, Hanover, NH, 03755, USA
108
Beaulieu-Jones BK, Greene CS. Reproducibility of computational workflows is automated using continuous analysis. Nat Biotechnol 2017. [PMID: 28288103 DOI: 10.1038/nbt.3780] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
Replication, validation and extension of experiments are crucial for scientific progress. Computational experiments are scriptable and should be easy to reproduce. However, computational analyses are designed and run in a specific computing environment, which may be difficult or impossible to match using written instructions. We report the development of continuous analysis, a workflow that enables reproducible computational analyses. Continuous analysis combines Docker, a container technology akin to virtual machines, with continuous integration, a software development technique, to automatically rerun a computational analysis whenever updates or improvements are made to source code or data. This enables researchers to reproduce results without contacting the study authors. Continuous analysis allows reviewers, editors or readers to verify reproducibility without manually downloading and rerunning code and can provide an audit trail for analyses of data that cannot be shared.
Affiliation(s)
- Brett K Beaulieu-Jones
  - Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Casey S Greene
  - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
109
Abstract
Network neighbors improve yeast to human gene mapping for the study of parkinsonism.
Affiliation(s)
- Casey S. Greene
  - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
110
Way GP, Allaway RJ, Bouley SJ, Fadul CE, Sanchez Y, Greene CS. A machine learning classifier trained on cancer transcriptomes detects NF1 inactivation signal in glioblastoma. BMC Genomics 2017; 18:127. [PMID: 28166733 PMCID: PMC5292791 DOI: 10.1186/s12864-017-3519-7] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 09/17/2016] [Accepted: 01/26/2017] [Indexed: 12/14/2022]
Abstract
BACKGROUND We have identified molecules that exhibit synthetic lethality in cells with loss of the neurofibromin 1 (NF1) tumor suppressor gene. However, recognizing tumors that have inactivation of the NF1 tumor suppressor function is challenging because the loss may occur via mechanisms that do not involve mutation of the genomic locus. Degradation of the NF1 protein, independent of NF1 mutation status, phenocopies inactivating mutations to drive tumors in human glioma cell lines. NF1 inactivation may alter the transcriptional landscape of a tumor and allow a machine learning classifier to detect which tumors will benefit from synthetic lethal molecules. RESULTS We developed a strategy to predict tumors with low NF1 activity and hence tumors that may respond to treatments that target cells lacking NF1. Using RNAseq data from The Cancer Genome Atlas (TCGA), we trained an ensemble of 500 logistic regression classifiers that integrates mutation status with whole transcriptomes to predict NF1 inactivation in glioblastoma (GBM). On TCGA data, the classifier detected NF1 mutated tumors (test set area under the receiver operating characteristic curve (AUROC) mean = 0.77, 95% quantile = 0.53 - 0.95) over 50 random initializations. On RNA-Seq data transformed into the space of gene expression microarrays, this method produced a classifier with similar performance (test set AUROC mean = 0.77, 95% quantile = 0.53 - 0.96). We applied our ensemble classifier trained on the transformed TCGA data to a microarray validation set of 12 samples with matched RNA and NF1 protein-level measurements. The classifier's NF1 score was associated with NF1 protein concentration in these samples. CONCLUSIONS We demonstrate that TCGA can be used to train accurate predictors of NF1 inactivation in GBM. The ensemble classifier performed well for samples with very high or very low NF1 protein concentrations but had mixed performance in samples with intermediate NF1 concentrations. 
Nevertheless, high-performing and validated predictors have the potential to be paired with targeted therapies and personalized medicine.
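The AUROC values reported in this abstract can be read as the probability that a randomly chosen NF1-mutated tumor receives a higher classifier score than a randomly chosen non-mutated tumor. A minimal sketch of that pairwise computation follows; the function name is hypothetical, and the labels and scores in the usage note are made-up illustrative values, not data from the study.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for the positive class (e.g. mutated), 0 for the negative class.
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one sample from each class")
    wins = sum(
        (p > n) + 0.5 * (p == n)
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

For example, `auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` compares four positive-negative pairs, of which three are ranked correctly. Repeating the computation over many random initializations, as the abstract describes, would yield a distribution of AUROC values that can then be summarized by a mean and quantile range.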
Affiliation(s)
- Gregory P Way
  - Genomics and Computational Biology Graduate Program, University of Pennsylvania, Philadelphia, PA, USA
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA
- Robert J Allaway
  - Department of Molecular and Systems Biology, Geisel School of Medicine at Dartmouth, Dartmouth College, HB 7650, Hanover, NH, 03755, USA
- Stephanie J Bouley
  - Department of Molecular and Systems Biology, Geisel School of Medicine at Dartmouth, Dartmouth College, HB 7650, Hanover, NH, 03755, USA
- Camilo E Fadul
  - Department of Neurology, University of Virginia, Charlottesville, VA, USA
- Yolanda Sanchez
  - Department of Molecular and Systems Biology, Geisel School of Medicine at Dartmouth, Dartmouth College, HB 7650, Hanover, NH, 03755, USA
  - Norris Cotton Cancer Center, Dartmouth-Hitchcock Medical Center, Lebanon, NH, USA
- Casey S Greene
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA
111
Greene CS, Himmelstein DS. Genetic Association-Guided Analysis of Gene Networks for the Study of Complex Traits. Circ Cardiovasc Genet 2017; 9:179-84. [PMID: 27094199 DOI: 10.1161/circgenetics.115.001181] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 09/11/2015] [Accepted: 03/08/2016] [Indexed: 12/29/2022]
Affiliation(s)
- Casey S Greene
  - Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia
- Daniel S Himmelstein
  - Biological and Medical Informatics, University of California, San Francisco
112
Moore JH, Jennings SF, Greene CS, Hunter LE, Perkins AD, Williams-Devane C, Wunsch DC, Zhao Z, Huang X. NO-BOUNDARY THINKING IN BIOINFORMATICS. Pac Symp Biocomput 2017; 22:646-648. [PMID: 27897015 DOI: 10.1142/9789813207813_0060] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/06/2023]
Abstract
The following sections are included: Bioinformatics is a Mature Discipline; The Golden Era of Bioinformatics Has Begun; No-Boundary Thinking in Bioinformatics; References.
Affiliation(s)
- Jason H Moore
  - Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
113
Abstract
Optimized workflows analyze RNA-seq samples for less than a dime each.
Affiliation(s)
- Casey S Greene
  - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
114
Abstract
New computational methods predict the unobserved.
Collapse
Affiliation(s)
- Casey S. Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania. Philadelphia, PA 19104, USA
|
116
|
Beaulieu-Jones BK, Greene CS. Semi-supervised learning of the electronic health record for phenotype stratification. J Biomed Inform 2016; 64:168-178. [PMID: 27744022 DOI: 10.1016/j.jbi.2016.10.007] [Citation(s) in RCA: 79] [Impact Index Per Article: 9.9] [Received: 02/18/2016] [Revised: 10/05/2016] [Accepted: 10/08/2016] [Indexed: 12/12/2022]
Abstract
Patient interactions with health care providers result in entries to electronic health records (EHRs). EHRs were built for clinical and billing purposes but contain many data points about an individual. Mining these records provides opportunities to extract electronic phenotypes, which can be paired with genetic data to identify genes underlying common human diseases. This task remains challenging: high quality phenotyping is costly and requires physician review; many fields in the records are sparsely filled; and our definitions of diseases are continuing to improve over time. Here we develop and evaluate a semi-supervised learning method for EHR phenotype extraction using denoising autoencoders for phenotype stratification. By combining denoising autoencoders with random forests we find classification improvements across multiple simulation models and improved survival prediction in ALS clinical trial data. This is particularly evident in cases where only a small number of patients have high quality phenotypes, a common scenario in EHR-based research. Denoising autoencoders perform dimensionality reduction enabling visualization and clustering for the discovery of new subtypes of disease. This method represents a promising approach to clarify disease subtypes and improve genotype-phenotype association studies that leverage EHRs.
Affiliation(s)
- Brett K Beaulieu-Jones
- Graduate Group in Genomics and Computational Biology, Perelman School of Medicine, University of Pennsylvania, United States; Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, United States.
- Casey S Greene
- Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, United States; Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, United States; Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, United States.
|
117
|
Greene CS. A stromal focus reveals tumor immune signatures. Sci Transl Med 2016. [DOI: 10.1126/scitranslmed.aai8224] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2022]
Abstract
A computational analysis of cancer biopsies suggests immunotherapy strategies.
Affiliation(s)
- Casey S. Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
|
119
|
Jiang Y, Oron TR, Clark WT, Bankapur AR, D'Andrea D, Lepore R, Funk CS, Kahanda I, Verspoor KM, Ben-Hur A, Koo DCE, Penfold-Brown D, Shasha D, Youngs N, Bonneau R, Lin A, Sahraeian SME, Martelli PL, Profiti G, Casadio R, Cao R, Zhong Z, Cheng J, Altenhoff A, Skunca N, Dessimoz C, Dogan T, Hakala K, Kaewphan S, Mehryary F, Salakoski T, Ginter F, Fang H, Smithers B, Oates M, Gough J, Törönen P, Koskinen P, Holm L, Chen CT, Hsu WL, Bryson K, Cozzetto D, Minneci F, Jones DT, Chapman S, Bkc D, Khan IK, Kihara D, Ofer D, Rappoport N, Stern A, Cibrian-Uhalte E, Denny P, Foulger RE, Hieta R, Legge D, Lovering RC, Magrane M, Melidoni AN, Mutowo-Meullenet P, Pichler K, Shypitsyna A, Li B, Zakeri P, ElShal S, Tranchevent LC, Das S, Dawson NL, Lee D, Lees JG, Sillitoe I, Bhat P, Nepusz T, Romero AE, Sasidharan R, Yang H, Paccanaro A, Gillis J, Sedeño-Cortés AE, Pavlidis P, Feng S, Cejuela JM, Goldberg T, Hamp T, Richter L, Salamov A, Gabaldon T, Marcet-Houben M, Supek F, Gong Q, Ning W, Zhou Y, Tian W, Falda M, Fontana P, Lavezzo E, Toppo S, Ferrari C, Giollo M, Piovesan D, Tosatto SCE, Del Pozo A, Fernández JM, Maietta P, Valencia A, Tress ML, Benso A, Di Carlo S, Politano G, Savino A, Rehman HU, Re M, Mesiti M, Valentini G, Bargsten JW, van Dijk ADJ, Gemovic B, Glisic S, Perovic V, Veljkovic V, Veljkovic N, Almeida-E-Silva DC, Vencio RZN, Sharan M, Vogel J, Kansakar L, Zhang S, Vucetic S, Wang Z, Sternberg MJE, Wass MN, Huntley RP, Martin MJ, O'Donovan C, Robinson PN, Moreau Y, Tramontano A, Babbitt PC, Brenner SE, Linial M, Orengo CA, Rost B, Greene CS, Mooney SD, Friedberg I, Radivojac P. An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol 2016; 17:184. [PMID: 27604469 PMCID: PMC5015320 DOI: 10.1186/s13059-016-1037-6] [Citation(s) in RCA: 252] [Impact Index Per Article: 31.5] [Received: 10/26/2015] [Accepted: 08/04/2016] [Indexed: 12/02/2022] Open
Abstract
BACKGROUND: A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging. RESULTS: We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1, with regards to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2. CONCLUSIONS: The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and the relative diversity of predictions in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent.
Affiliation(s)
- Yuxiang Jiang
- Department of Computer Science and Informatics, Indiana University, Bloomington, IN, USA
| | | | - Wyatt T Clark
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Asma R Bankapur
- Department of Microbiology, Miami University, Oxford, OH, USA
| | | | | | - Christopher S Funk
- Computational Bioscience Program, University of Colorado School of Medicine, Aurora, CO, USA
| | - Indika Kahanda
- Department of Computer Science, Colorado State University, Fort Collins, CO, USA
| | - Karin M Verspoor
- Department of Computing and Information Systems, University of Melbourne, Parkville, Victoria, Australia
- Health and Biomedical Informatics Centre, University of Melbourne, Parkville, Victoria, Australia
| | - Asa Ben-Hur
- Department of Computer Science, Colorado State University, Fort Collins, CO, USA
| | | | - Duncan Penfold-Brown
- Social Media and Political Participation Lab, New York University, New York, NY, USA
- CY Data Science, New York, NY, USA
| | - Dennis Shasha
- Department of Computer Science, New York University, New York, NY, USA
| | - Noah Youngs
- CY Data Science, New York, NY, USA
- Department of Computer Science, New York University, New York, NY, USA
- Simons Center for Data Analysis, New York, NY, USA
| | - Richard Bonneau
- Department of Computer Science, New York University, New York, NY, USA
- Simons Center for Data Analysis, New York, NY, USA
- Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA
| | - Alexandra Lin
- Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, CA, USA
| | - Sayed M E Sahraeian
- Department of Plant and Microbial Biology, University of California Berkeley, Berkeley, CA, USA
| | | | - Giuseppe Profiti
- Biocomputing Group, BiGeA, University of Bologna, Bologna, Italy
| | - Rita Casadio
- Biocomputing Group, BiGeA, University of Bologna, Bologna, Italy
| | - Renzhi Cao
- Computer Science Department, University of Missouri, Columbia, MO, USA
| | - Zhaolong Zhong
- Computer Science Department, University of Missouri, Columbia, MO, USA
| | - Jianlin Cheng
- Computer Science Department, University of Missouri, Columbia, MO, USA
| | - Adrian Altenhoff
- ETH Zurich, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - Nives Skunca
- ETH Zurich, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - Christophe Dessimoz
- Bioinformatics Group, Department of Computer Science, University College London, London, UK
- University of Lausanne, Lausanne, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Tunca Dogan
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
| | - Kai Hakala
- Department of Information Technology, University of Turku, Turku, Finland
- University of Turku Graduate School, University of Turku, Turku, Finland
| | - Suwisa Kaewphan
- Department of Information Technology, University of Turku, Turku, Finland
- University of Turku Graduate School, University of Turku, Turku, Finland
- Turku Centre for Computer Science, Turku, Finland
| | - Farrokh Mehryary
- Department of Information Technology, University of Turku, Turku, Finland
- University of Turku Graduate School, University of Turku, Turku, Finland
| | - Tapio Salakoski
- Department of Information Technology, University of Turku, Turku, Finland
- Turku Centre for Computer Science, Turku, Finland
| | - Filip Ginter
- Department of Information Technology, University of Turku, Turku, Finland
| | - Hai Fang
- University of Bristol, Bristol, UK
| | | | | | | | - Petri Törönen
- Institute of Biotechnology, University of Helsinki, Helsinki, Finland
| | - Patrik Koskinen
- Institute of Biotechnology, University of Helsinki, Helsinki, Finland
| | - Liisa Holm
- Institute of Biotechnology, University of Helsinki, Helsinki, Finland
- Department of Biological and Environmental Sciences, University of Helsinki, Helsinki, Finland
| | - Ching-Tai Chen
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
| | - Wen-Lian Hsu
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
| | - Kevin Bryson
- Bioinformatics Group, Department of Computer Science, University College London, London, UK
| | - Domenico Cozzetto
- Bioinformatics Group, Department of Computer Science, University College London, London, UK
| | - Federico Minneci
- Bioinformatics Group, Department of Computer Science, University College London, London, UK
| | - David T Jones
- Bioinformatics Group, Department of Computer Science, University College London, London, UK
| | - Samuel Chapman
- Department of Computational Science and Engineering, North Carolina A&T State University, Greensboro, NC, USA
| | - Dukka Bkc
- Department of Computational Science and Engineering, North Carolina A&T State University, Greensboro, NC, USA
| | - Ishita K Khan
- Department of Computer Science, Purdue University, West Lafayette, IN, USA
| | - Daisuke Kihara
- Department of Computer Science, Purdue University, West Lafayette, IN, USA
- Department of Biological Sciences, Purdue University, West Lafayette, IN, USA
| | - Dan Ofer
- Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel
| | - Nadav Rappoport
- Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel
- School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel
| | - Amos Stern
- Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel
- School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel
| | - Elena Cibrian-Uhalte
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
| | - Paul Denny
- Centre for Cardiovascular Genetics, Institute of Cardiovascular Science, University College London, London, UK
| | - Rebecca E Foulger
- Centre for Cardiovascular Genetics, Institute of Cardiovascular Science, University College London, London, UK
| | - Reija Hieta
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
| | - Duncan Legge
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
| | - Ruth C Lovering
- Centre for Cardiovascular Genetics, Institute of Cardiovascular Science, University College London, London, UK
| | - Michele Magrane
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
| | - Anna N Melidoni
- Centre for Cardiovascular Genetics, Institute of Cardiovascular Science, University College London, London, UK
| | | | - Klemens Pichler
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
| | - Aleksandra Shypitsyna
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
| | - Biao Li
- Buck Institute for Research on Aging, Novato, CA, USA
| | - Pooya Zakeri
- Department of Electrical Engineering, STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven, Leuven, Belgium
- iMinds Department Medical Information Technologies, Leuven, Belgium
| | - Sarah ElShal
- Department of Electrical Engineering, STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven, Leuven, Belgium
- iMinds Department Medical Information Technologies, Leuven, Belgium
| | - Léon-Charles Tranchevent
- Inserm UMR-S1052, CNRS UMR5286, Cancer Research Centre of Lyon, Lyon, France
- Université de Lyon 1, Villeurbanne, France
- Centre Léon Bérard, Lyon, France
| | - Sayoni Das
- Institute of Structural and Molecular Biology, University College London, London, UK
| | - Natalie L Dawson
- Institute of Structural and Molecular Biology, University College London, London, UK
| | - David Lee
- Institute of Structural and Molecular Biology, University College London, London, UK
| | - Jonathan G Lees
- Institute of Structural and Molecular Biology, University College London, London, UK
| | - Ian Sillitoe
- Institute of Structural and Molecular Biology, University College London, London, UK
| | | | | | - Alfonso E Romero
- Department of Computer Science, Centre for Systems and Synthetic Biology, Royal Holloway University of London, Egham, UK
| | - Rajkumar Sasidharan
- Department of Molecular, Cell and Developmental Biology, University of California at Los Angeles, Los Angeles, CA, USA
| | - Haixuan Yang
- School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland
| | - Alberto Paccanaro
- Department of Computer Science, Centre for Systems and Synthetic Biology, Royal Holloway University of London, Egham, UK
| | - Jesse Gillis
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, New York, NY, USA
| | | | - Paul Pavlidis
- Department of Psychiatry and Michael Smith Laboratories, University of British Columbia, Vancouver, Canada
| | - Shou Feng
- Department of Computer Science and Informatics, Indiana University, Bloomington, IN, USA
| | - Juan M Cejuela
- Department for Bioinformatics and Computational Biology-I12, Technische Universität München, Garching, Germany
| | - Tatyana Goldberg
- Department for Bioinformatics and Computational Biology-I12, Technische Universität München, Garching, Germany
| | - Tobias Hamp
- Department for Bioinformatics and Computational Biology-I12, Technische Universität München, Garching, Germany
| | - Lothar Richter
- Department for Bioinformatics and Computational Biology-I12, Technische Universität München, Garching, Germany
| | - Asaf Salamov
- DOE Joint Genome Institute, Walnut Creek, CA, USA
| | - Toni Gabaldon
- Bioinformatics and Genomics, Centre for Genomic Regulation, Barcelona, Spain
- Universitat Pompeu Fabra, Barcelona, Spain
- Institució Catalana de Recerca i Estudis Avançats, Barcelona, Spain
| | - Marina Marcet-Houben
- Bioinformatics and Genomics, Centre for Genomic Regulation, Barcelona, Spain
- Universitat Pompeu Fabra, Barcelona, Spain
| | - Fran Supek
- Universitat Pompeu Fabra, Barcelona, Spain
- Division of Electronics, Rudjer Boskovic Institute, Zagreb, Croatia
- EMBL/CRG Systems Biology Research Unit, Centre for Genomic Regulation, Barcelona, Spain
| | - Qingtian Gong
- State Key Laboratory of Genetic Engineering, Collaborative Innovation Center of Genetics and Development, Department of Biostatistics and Computational Biology, School of Life Science, Fudan University, Shanghai, China
- Children's Hospital of Fudan University, Shanghai, China
| | - Wei Ning
- State Key Laboratory of Genetic Engineering, Collaborative Innovation Center of Genetics and Development, Department of Biostatistics and Computational Biology, School of Life Science, Fudan University, Shanghai, China
- Children's Hospital of Fudan University, Shanghai, China
| | - Yuanpeng Zhou
- State Key Laboratory of Genetic Engineering, Collaborative Innovation Center of Genetics and Development, Department of Biostatistics and Computational Biology, School of Life Science, Fudan University, Shanghai, China
- Children's Hospital of Fudan University, Shanghai, China
| | - Weidong Tian
- State Key Laboratory of Genetic Engineering, Collaborative Innovation Center of Genetics and Development, Department of Biostatistics and Computational Biology, School of Life Science, Fudan University, Shanghai, China
- Children's Hospital of Fudan University, Shanghai, China
| | - Marco Falda
- Department of Molecular Medicine, University of Padua, Padua, Italy
| | - Paolo Fontana
- Research and Innovation Center, Edmund Mach Foundation, San Michele all'Adige, Italy
| | - Enrico Lavezzo
- Department of Molecular Medicine, University of Padua, Padua, Italy
| | - Stefano Toppo
- Department of Molecular Medicine, University of Padua, Padua, Italy
| | - Carlo Ferrari
- Department of Information Engineering, University of Padua, Padova, Italy
| | - Manuel Giollo
- Department of Information Engineering, University of Padua, Padova, Italy
- Department of Biomedical Sciences, University of Padua, Padova, Italy
| | - Damiano Piovesan
- Department of Information Engineering, University of Padua, Padova, Italy
| | - Silvio C E Tosatto
- Department of Information Engineering, University of Padua, Padova, Italy
| | - Angela Del Pozo
- Instituto De Genetica Medica y Molecular, Hospital Universitario de La Paz, Madrid, Spain
| | - José M Fernández
- Spanish National Bioinformatics Institute, Spanish National Cancer Research Institute, Madrid, Spain
| | - Paolo Maietta
- Structural and Computational Biology Programme, Spanish National Cancer Research Institute, Madrid, Spain
| | - Alfonso Valencia
- Structural and Computational Biology Programme, Spanish National Cancer Research Institute, Madrid, Spain
| | - Michael L Tress
- Structural and Computational Biology Programme, Spanish National Cancer Research Institute, Madrid, Spain
| | - Alfredo Benso
- Control and Computer Engineering Department, Politecnico di Torino, Torino, Italy
| | - Stefano Di Carlo
- Control and Computer Engineering Department, Politecnico di Torino, Torino, Italy
| | - Gianfranco Politano
- Control and Computer Engineering Department, Politecnico di Torino, Torino, Italy
| | - Alessandro Savino
- Control and Computer Engineering Department, Politecnico di Torino, Torino, Italy
| | - Hafeez Ur Rehman
- National University of Computer & Emerging Sciences, Islamabad, Pakistan
| | - Matteo Re
- Anacleto Lab, Dipartimento di informatica, Università degli Studi di Milano, Milan, Italy
| | - Marco Mesiti
- Anacleto Lab, Dipartimento di informatica, Università degli Studi di Milano, Milan, Italy
| | - Giorgio Valentini
- Anacleto Lab, Dipartimento di informatica, Università degli Studi di Milano, Milan, Italy
| | - Joachim W Bargsten
- Applied Bioinformatics, Bioscience, Wageningen University and Research Centre, Wageningen, Netherlands
| | - Aalt D J van Dijk
- Applied Bioinformatics, Bioscience, Wageningen University and Research Centre, Wageningen, Netherlands
- Biometris, Wageningen University, Wageningen, Netherlands
| | - Branislava Gemovic
- Center for Multidisciplinary Research, Institute of Nuclear Sciences Vinca, University of Belgrade, Belgrade, Serbia
| | - Sanja Glisic
- Center for Multidisciplinary Research, Institute of Nuclear Sciences Vinca, University of Belgrade, Belgrade, Serbia
| | - Vladimir Perovic
- Center for Multidisciplinary Research, Institute of Nuclear Sciences Vinca, University of Belgrade, Belgrade, Serbia
| | - Veljko Veljkovic
- Center for Multidisciplinary Research, Institute of Nuclear Sciences Vinca, University of Belgrade, Belgrade, Serbia
| | - Nevena Veljkovic
- Center for Multidisciplinary Research, Institute of Nuclear Sciences Vinca, University of Belgrade, Belgrade, Serbia
| | | | - Ricardo Z N Vencio
- Department of Computing and Mathematics FFCLRP-USP, University of Sao Paulo, Ribeirao Preto, Brazil
| | - Malvika Sharan
- Institute for Molecular Infection Biology, University of Würzburg, Würzburg, Germany
| | - Jörg Vogel
- Institute for Molecular Infection Biology, University of Würzburg, Würzburg, Germany
| | - Lakesh Kansakar
- Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
| | - Shanshan Zhang
- Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
| | - Slobodan Vucetic
- Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
| | - Zheng Wang
- University of Southern Mississippi, Hattiesburg, MS, USA
| | - Michael J E Sternberg
- Centre for Integrative Systems Biology and Bioinformatics, Department of Life Sciences, Imperial College London, London, UK
| | - Mark N Wass
- School of Biosciences, University of Kent, Canterbury, Kent, UK
| | - Rachael P Huntley
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
| | - Maria J Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
| | - Claire O'Donovan
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
| | - Peter N Robinson
- Institut für Medizinische Genetik und Humangenetik, Charité - Universitätsmedizin Berlin, Berlin, Germany
| | - Yves Moreau
- Department of Electrical Engineering ESAT-SCD and IBBT-KU Leuven Future Health Department, Katholieke Universiteit Leuven, Leuven, Belgium
| | | | - Patricia C Babbitt
- California Institute for Quantitative Biosciences, University of California San Francisco, San Francisco, CA, USA
| | - Steven E Brenner
- Department of Plant and Microbial Biology, University of California Berkeley, Berkeley, CA, USA
| | - Michal Linial
- Department of Chemical Biology, The Hebrew University of Jerusalem, Jerusalem, Israel
| | - Christine A Orengo
- Institute of Structural and Molecular Biology, University College London, London, UK
| | - Burkhard Rost
- Department for Bioinformatics and Computational Biology-I12, Technische Universität München, Garching, Germany
| | - Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Sean D Mooney
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA
| | - Iddo Friedberg
- Department of Microbiology, Miami University, Oxford, OH, USA.
- Department of Computer Science, Miami University, Oxford, OH, USA.
| | - Predrag Radivojac
- Department of Computer Science and Informatics, Indiana University, Bloomington, IN, USA.
|
120
|
Abstract
A new analysis asks which microbiome signatures work across studies.
Affiliation(s)
- Casey S. Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
|
121
|
Doherty JA, Greene CS, Rudd JE, Tafe LJ, Alberg AJ, Bandera EV, Barnholtz-Sloan J, Bondy M, Cote ML, Funkhouser E, Moorman PG, Peters ES, Schwartz AG, Terry P, Bentley R, Berchuck A, Marks JR, Schildkraut JM. Abstract 3407: Gene expression subtypes of high grade serous ovarian cancer in African American women. Cancer Res 2016. [DOI: 10.1158/1538-7445.am2016-3407] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Ovarian cancer accounts for 5% of cancer deaths and is the fifth leading cause of cancer death in women in the United States. While incidence is higher in European American (EA) than African American (AA) women, five-year survival is worse for AA women (36%) than EA women (44%). Access to appropriate surgery and treatment is a major contributor but does not completely explain this disparity. The Cancer Genome Atlas (TCGA) identified four gene expression-based subtypes of the most common and lethal histotype, high grade serous carcinoma (HGSC): mesenchymal, proliferative, differentiated, and immunoreactive. We sought to characterize similarities and differences in gene expression-based subtypes arising in AA and EA women to determine whether there are underlying biologic features that may influence survival. We performed two distinct analyses, first using TCGA data and second using cases from the population-based African American Cancer Epidemiology Study (AACES). For both we summarized differential expression patterns for each subtype with moderated t statistic vectors for >10,000 genes using Significance Analysis of Microarrays. We calculated Pearson's correlations of these vectors to determine concordance of expression patterns between subtypes across EA and AA women. In TCGA, we observed correlations of subtype-specific expression patterns between the 24 AA and 475 EA tumors of 0.52-0.60 for each of the four subtypes. Thus, while analogous subtypes can be identified in AA and EA women, the magnitude of these correlations suggests that there are potential differences in gene expression patterns between AA and EA tumors that are assigned to the same subtype. We generated additional data from 58 AACES HGSC cases using the Affymetrix Human Transcriptome Array 2.0. Instead of assigning these tumors to previously-defined subtypes, we clustered samples to identify four subtypes de novo. 
We observed concordance with two of the TCGA subtypes; correlations for the mesenchymal-like and proliferative-like subtypes were 0.56-0.65. The mesenchymal-like subtype was more common in these AA women than in the TCGA EA women (33% versus 25%), and the proliferative-like subtype was marginally less common (14% versus 19%). Concordance for the differentiated-like subtype was considerably lower, at 0.21, and this subtype was less common in AA than EA women (19% versus 34%). Another subtype comprising 34% of the AA samples was only weakly correlated (-0.21 to 0.10) with any of the TCGA subtypes, suggesting that it is a novel subtype. The limited data available on HGSC in AA women suggest that at least two subtypes are comparable to those in EA women but differ in prevalence, and that there may be a novel subtype in AA women that does not strongly correspond to those described in EA women.
Citation Format: Jennifer A. Doherty, Casey S. Greene, James E. Rudd, Laura J. Tafe, Anthony J. Alberg, Elisa V. Bandera, Jill Barnholtz-Sloan, Melissa Bondy, Michele L. Cote, Ellen Funkhouser, Patricia G. Moorman, Edward S. Peters, Ann G. Schwartz, Paul Terry, Rex Bentley, Andrew Berchuck, Jeffrey R. Marks, Joellen M. Schildkraut. Gene expression subtypes of high grade serous ovarian cancer in African American women. [abstract]. In: Proceedings of the 107th Annual Meeting of the American Association for Cancer Research; 2016 Apr 16-20; New Orleans, LA. Philadelphia (PA): AACR; Cancer Res 2016;76(14 Suppl):Abstract nr 3407.
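The concordance step described in this abstract, correlating per-gene statistic vectors between subtypes in two cohorts, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the function names `pearson` and `subtype_concordance` are hypothetical, the per-gene statistics are taken as already computed (the moderated t statistics themselves are not derived here), and the toy numbers are invented for demonstration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


def subtype_concordance(stats_a, stats_b):
    """Correlate every subtype vector in cohort A with every one in cohort B.

    Each argument maps a subtype label to a list of per-gene statistics,
    with genes in the same order in both cohorts (an assumption of this sketch).
    """
    return {
        (name_a, name_b): pearson(vec_a, vec_b)
        for name_a, vec_a in stats_a.items()
        for name_b, vec_b in stats_b.items()
    }


# Toy, made-up per-gene statistics for four genes (hypothetical values).
cohort_a = {
    "mesenchymal": [2.0, -1.0, 0.5, 3.0],
    "proliferative": [-1.5, 2.5, 0.0, -2.0],
}
cohort_b = {
    "mesenchymal": [1.8, -0.5, 0.2, 2.5],
    "proliferative": [-1.0, 2.0, 0.3, -2.2],
}
table = subtype_concordance(cohort_a, cohort_b)
```

In a table like this, analogous subtypes would be expected to show the highest correlation with each other, and a subtype that correlates only weakly with every reference subtype would be a candidate for a distinct group.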
Affiliation(s)
- Jennifer A. Doherty
- 1Department of Epidemiology, The Geisel School of Medicine at Dartmouth, Lebanon, NH
| | - Casey S. Greene
- 2Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA
| | - James E. Rudd
- 3The Geisel School of Medicine at Dartmouth, Lebanon, NH
| | - Laura J. Tafe
- 3The Geisel School of Medicine at Dartmouth, Lebanon, NH
| | - Anthony J. Alberg
- 4Hollings Cancer Center and Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC
| | - Elisa V. Bandera
- 5Department of Population Science, Rutgers Cancer Institute of New Jersey, New Brunswick, NJ
| | - Jill Barnholtz-Sloan
- 6Case Comprehensive Cancer Center, Case Western Reserve University School of Medicine, Cleveland, OH
| | - Melissa Bondy
- 7Cancer Prevention and Population Sciences Program, Baylor College of Medicine, Houston, TX
| | - Michele L. Cote
- 8Department of Oncology and the Karmanos Cancer Institute Population Studies and Disparities Research Program, Detroit, MI
| | - Ellen Funkhouser
- 9Division of Preventive Medicine, University of Alabama at Birmingham, Birmingham, AL
| | | | - Edward S. Peters
- 11Epidemiology Program, Louisiana State University Health Sciences Center School of Public Health, New Orleans, LA
| | - Ann G. Schwartz
- 8Department of Oncology and the Karmanos Cancer Institute Population Studies and Disparities Research Program, Detroit, MI
| | - Paul Terry
- 12Department of Medicine, University of Tennessee Medical Center-Knoxville, Knoxville, TN
| | - Rex Bentley
- 13Department of Pathology, Duke University, Durham, NC
| | - Andrew Berchuck
- 14Department of Obstetrics and Gynecology, Duke University, Durham, NC
| | | | | |
|
122
|
Rudd J, Shea EK, Way GP, Greene CS, Doherty JA. Abstract 815: Patterns of metagene activation in ovarian cancer subtypes. Cancer Res 2016. [DOI: 10.1158/1538-7445.am2016-815] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
High grade serous ovarian cancer (HGSC) is a complex and aggressive disease. Recently, three or four gene expression-based subtypes, which may be differentially associated with survival, have been reported in several populations. To identify the biological functions that define the subtypes, we determined the extent to which metagenes—linear combinations of gene expression vectors—were differentially activated across subtypes, could be reliably identified across populations, and showed consistent associations with survival.
We previously clustered HGSC samples using gene expression data from TCGA, Tothill (GSE9891), Yoshihara (GSE32062), and Mayo (GSE74357) to identify subtypes across populations. We found subtype-specific genes within each population through differential expression analysis (p < 4.6 × 10^-6). Using the intersection of differentially expressed genes for parallel subtypes across these four populations, we applied non-negative matrix factorization to identify metagenes. To determine whether the metagenes were consistently observed across populations, we performed leave-one-dataset-out cross validation. For each metagene, we performed gene set enrichment analysis against the National Cancer Institute pathway interaction database to annotate metagene pathways. We examined whether increasing tertiles of metagene activity, which we termed low, medium, and high activity, were associated with survival using a random effects meta-analysis of Cox regression estimates adjusting for age at diagnosis, tumor stage, tumor grade, and debulking status.
Five metagenes were consistently identified and significantly associated with HGSC subtypes (p < 0.0001). Of these, a metagene weakly enriched for the CMYB pathway was associated with subtype 1; three metagenes (one significantly enriched with the IL12 pathway and the others weakly enriched with the FCER1 and CXCR4 pathways) were associated with subtype 2; and a metagene weakly enriched with the AVB3 Integrin pathway distinguished between all 3 subtypes. Neither the CMYB metagene nor the IL12 metagene was significantly associated with survival. High activity of the CXCR4 and AVB3 metagenes was associated with poorer survival (hazard ratio [HR] 1.21, 95% confidence interval [CI] 0.99-1.48, and HR 1.34, 95% CI 1.09-1.64, respectively). In contrast, high activity of the FCER1 metagene was associated with improved survival (HR 0.76, 95% CI 0.62-0.93).
Metagenes that are consistently and statistically significantly associated with subtype may be indicative of functional differences between HGSC subtypes. The contrast in hazard estimates for metagenes associated with subtype 2 may indicate that the metagenes capture survival signal distinct from the subtype association. Future work associating metagene activity with subtype uncertainty may better enable the refinement of subtype definitions and the development of subtype specific treatment strategies.
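The grouping of metagene activity into low, medium, and high tertiles can be sketched as follows. This is an illustrative, rank-based version; the function name `tertile_labels`, the tie handling, and the toy scores are assumptions, since the abstract does not say how tertile boundaries were computed.

```python
def tertile_labels(activity):
    """Label each value 'low', 'medium', or 'high' by rank-based tertiles.

    Ties are broken by input order (Python's sort is stable), which is an
    illustrative choice rather than a documented one.
    """
    n = len(activity)
    order = sorted(range(n), key=lambda i: activity[i])
    labels = [None] * n
    for rank, idx in enumerate(order):
        # rank * 3 // n maps ranks 0..n-1 onto groups 0, 1, 2
        labels[idx] = ("low", "medium", "high")[rank * 3 // n]
    return labels

# Hypothetical metagene activity scores for six samples.
scores = [0.9, 0.1, 0.5, 0.3, 0.7, 1.1]
groups = tertile_labels(scores)
```

The resulting labels could then serve as the exposure variable in a survival model; that step is not shown here.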
Citation Format: James Rudd, Emily K. Shea, Gregory P. Way, Casey S. Greene, Jennifer A. Doherty. Patterns of metagene activation in ovarian cancer subtypes. [abstract]. In: Proceedings of the 107th Annual Meeting of the American Association for Cancer Research; 2016 Apr 16-20; New Orleans, LA. Philadelphia (PA): AACR; Cancer Res 2016;76(14 Suppl):Abstract nr 815.
|
123
|
Abstract
A new algorithm infers shared biology behind detected gene expression levels.
Affiliation(s)
- Casey S. Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| |
|
124
|
Greene CS, Voight BF. Pathway and network-based strategies to translate genetic discoveries into effective therapies. Hum Mol Genet 2016; 25:R94-R98. [PMID: 27340225 DOI: 10.1093/hmg/ddw160] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 04/27/2016] [Accepted: 05/19/2016] [Indexed: 11/13/2022] Open
Abstract
One way to design a drug is to attempt to phenocopy a genetic variant that is known to have the desired effect. In general, drugs that are supported by genetic associations progress further in the development pipeline. However, the number of associations that are candidates for development into drugs is limited because many associations are in non-coding regions or difficult to target genes. Approaches that overlay information from pathway databases or biological networks can expand the potential target list. In cases where the initial variant is not targetable or there is no variant with the desired effect, this may reveal new means to target a disease. In this review, we discuss recent examples in the domain of pathway and network-based drug repositioning from genetic associations. We highlight important caveats and challenges for the field, and we discuss opportunities for further development.
Affiliation(s)
- Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine; Institute for Translational Medicine and Therapeutics, Perelman School of Medicine
| | - Benjamin F Voight
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine; Institute for Translational Medicine and Therapeutics, Perelman School of Medicine; Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19103 USA
| |
|
125
|
Abstract
A deep learning algorithm sniffs out chromatin accessibility marks.
Affiliation(s)
- Casey S. Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| |
|
126
|
Abstract
Meta-analysis for unsupervised clustering of clinical data empowers scientists to use small data sets for patient subtype discovery.
Affiliation(s)
- Casey S. Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| |
|
127
|
Himmelstein DS, Greene CS, Moore JH. Erratum to: Evolving hard problems: generating human genetics datasets with a complex etiology. BioData Min 2016; 9:9. [PMID: 26848312 PMCID: PMC4740998 DOI: 10.1186/s13040-016-0085-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2016] [Accepted: 01/18/2016] [Indexed: 12/02/2022] Open
Affiliation(s)
- Daniel S Himmelstein
- Department of Genetics, Dartmouth Medical School, One Medical Center Drive, Lebanon, NH 03756 USA; Lewis-Sigler Institute for Integrative Genomics, Princeton University, Carl Icahn Laboratory, Princeton, NJ 08544 USA
| | - Casey S Greene
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Carl Icahn Laboratory, Princeton, NJ 08544 USA
| | - Jason H Moore
- Department of Genetics, Dartmouth Medical School, One Medical Center Drive, Lebanon, NH 03756 USA
| |
|
128
|
Thompson JA, Tan J, Greene CS. Cross-platform normalization of microarray and RNA-seq data for machine learning applications. PeerJ 2016; 4:e1621. [PMID: 26844019 PMCID: PMC4736986 DOI: 10.7717/peerj.1621] [Citation(s) in RCA: 57] [Impact Index Per Article: 7.1] [Received: 10/30/2015] [Accepted: 01/02/2016] [Indexed: 01/08/2023] Open
Abstract
Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exists in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.
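Of the methods compared above, quantile normalization is the simplest to show in a few lines. The sketch below is a textbook version, not TDM and not the authors' R package; the function name `quantile_normalize`, the list-of-lists input layout, and the toy values are assumptions, and ties are handled naively by sort order.

```python
def quantile_normalize(samples):
    """Give every sample the same empirical distribution.

    samples: list of samples, each a list of expression values for the same
    genes in the same order. Each value is replaced by the mean, across
    samples, of the values that share its within-sample rank.
    """
    n_genes = len(samples[0])
    if any(len(s) != n_genes for s in samples):
        raise ValueError("all samples must cover the same genes")
    sorted_samples = [sorted(s) for s in samples]
    rank_means = [
        sum(s[rank] for s in sorted_samples) / len(samples)
        for rank in range(n_genes)
    ]
    normalized = []
    for s in samples:
        order = sorted(range(n_genes), key=lambda i: s[i])
        out = [0.0] * n_genes
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized

# Two hypothetical samples measured on three genes.
normalized = quantile_normalize([[5, 2, 3], [4, 1, 6]])
```

After normalization both samples contain the same set of values (1.5, 3.5, 5.5), arranged according to each sample's original gene ranking.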
Affiliation(s)
- Jeffrey A. Thompson
- Department of Genetics, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, United States of America
- Quantitative Biomedical Sciences Program, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, United States of America
| | - Jie Tan
- Department of Genetics, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, United States of America
- Molecular and Cellular Biology, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, United States of America
| | - Casey S. Greene
- Department of Genetics, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, United States of America
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
| |
|
129
|
Song A, Yan J, Kim S, Risacher SL, Wong AK, Saykin AJ, Shen L, Greene CS. Network-based analysis of genetic variants associated with hippocampal volume in Alzheimer's disease: a study of ADNI cohorts. BioData Min 2016; 9:3. [PMID: 26788126 PMCID: PMC4717572 DOI: 10.1186/s13040-016-0082-8] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 07/07/2015] [Accepted: 01/14/2016] [Indexed: 12/25/2022] Open
Abstract
Background Alzheimer’s disease (AD) is a neurodegenerative disease that causes dementia. While the molecular basis of AD is not fully understood, genetic factors are expected to participate in the development and progression of the disease. Our goal was to uncover novel genetic underpinnings of Alzheimer’s disease with a bioinformatics approach that accounts for tissue specificity. Findings We performed genome-wide association studies (GWAS) for hippocampal volume in two Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohorts. We used these GWAS in a subsequent tissue-specific network-wide association study (NetWAS), which applied nominally significant associations in the initial GWAS to identify disease-relevant patterns in a functional network for the hippocampus. We compared prioritized gene lists from NetWAS and GWAS with literature-curated AD-associated genes from the Online Mendelian Inheritance in Man (OMIM) database. In the ADNI-1 GWAS, where we also observed an enrichment of low p-values, NetWAS prioritized disease-gene associations in accordance with OMIM annotations. This was not observed in the ADNI-2 dataset. We provide source code to replicate these analyses as well as complete results under permissive licenses. Conclusions We performed the first analysis of hippocampal volume using NetWAS, which uses machine learning algorithms applied to tissue-specific functional interaction networks to prioritize GWAS results. Our findings support the idea that tissue-specific networks may provide helpful context for understanding the etiology of common human diseases and reveal challenges that network-based approaches encounter in some datasets. Our source code and intermediate results files can facilitate the development of methods to address these challenges.
Affiliation(s)
- Ailin Song
- Department of Genetics, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire USA ; Dartmouth-Hitchcock Norris Cotton Cancer Center, Lebanon, New Hampshire USA ; Institute for Quantitative Biomedical Sciences, Dartmouth College, Hanover, New Hampshire USA
| | - Jingwen Yan
- Center for Neuroimaging, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, Indiana USA ; Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana USA ; School of Informatics and Computing, Indiana University Indianapolis, Indianapolis, Indiana USA
| | - Sungeun Kim
- Center for Neuroimaging, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, Indiana USA ; Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana USA ; Indiana Alzheimer Disease Center, Indiana University School of Medicine, Indianapolis, Indiana USA
| | - Shannon Leigh Risacher
- Center for Neuroimaging, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, Indiana USA ; Indiana Alzheimer Disease Center, Indiana University School of Medicine, Indianapolis, Indiana USA
| | - Aaron K Wong
- Simons Center for Data Analysis, Simons Foundation, New York, NY USA
| | - Andrew J Saykin
- Center for Neuroimaging, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, Indiana USA ; Indiana Alzheimer Disease Center, Indiana University School of Medicine, Indianapolis, Indiana USA
| | - Li Shen
- Center for Neuroimaging, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, Indiana USA ; Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana USA ; School of Informatics and Computing, Indiana University Indianapolis, Indianapolis, Indiana USA ; Indiana Alzheimer Disease Center, Indiana University School of Medicine, Indianapolis, Indiana USA
| | - Casey S Greene
- Department of Genetics, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire USA ; Dartmouth-Hitchcock Norris Cotton Cancer Center, Lebanon, New Hampshire USA ; Institute for Quantitative Biomedical Sciences, Dartmouth College, Hanover, New Hampshire USA ; Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania USA
| | | |
|
130
|
Greene CS, Foster JA, Stanton BA, Hogan DA, Bromberg Y. COMPUTATIONAL APPROACHES TO STUDY MICROBES AND MICROBIOMES. Pac Symp Biocomput 2016; 21:557-567. [PMID: 26776218 PMCID: PMC4832978] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2023]
Abstract
Technological advances are making large-scale measurements of microbial communities commonplace. These newly acquired datasets are allowing researchers to ask and answer questions about the composition of microbial communities, the roles of members in these communities, and how genes and molecular pathways are regulated in individual community members and communities as a whole to effectively respond to diverse and changing environments. In addition to providing a more comprehensive survey of the microbial world, this new information allows for the development of computational approaches to model the processes underlying microbial systems. We anticipate that the field of computational microbiology will continue to grow rapidly in the coming years. In this manuscript we highlight both areas of particular interest in microbiology as well as computational approaches that begin to address these challenges.
Affiliation(s)
| | - James A. Foster
- Institute of Bioinformatics and Evolutionary Studies, University of Idaho, Moscow, ID 83844 USA
| | - Bruce A. Stanton
- Department of Microbiology and Immunology, The Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
| | - Deborah A. Hogan
- Department of Microbiology and Immunology, The Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
| | - Yana Bromberg
- Biochemistry and Microbiology, School of Environmental and Biological Sciences, Rutgers University, New Brunswick, NJ 08901, USA, Institute for Advanced Study, Technische Universität München Garching, Germany
| |
|
131
|
Qian DC, Byun J, Han Y, Greene CS, Field JK, Hung RJ, Brhane Y, Mclaughlin JR, Fehringer G, Landi MT, Rosenberger A, Bickeböller H, Malhotra J, Risch A, Heinrich J, Hunter DJ, Henderson BE, Haiman CA, Schumacher FR, Eeles RA, Easton DF, Seminara D, Amos CI. Identification of shared and unique susceptibility pathways among cancers of the lung, breast, and prostate from genome-wide association studies and tissue-specific protein interactions. Hum Mol Genet 2015; 24:7406-20. [PMID: 26483192 PMCID: PMC4664175 DOI: 10.1093/hmg/ddv440] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 07/15/2015] [Revised: 09/11/2015] [Accepted: 10/12/2015] [Indexed: 12/18/2022] Open
Abstract
Results from genome-wide association studies (GWAS) have indicated that strong single-gene effects are the exception, not the rule, for most diseases. We assessed the joint effects of germline genetic variations through a pathway-based approach that considers the tissue-specific contexts of GWAS findings. From GWAS meta-analyses of lung cancer (12 160 cases/16 838 controls), breast cancer (15 748 cases/18 084 controls) and prostate cancer (14 160 cases/12 724 controls) in individuals of European ancestry, we determined the tissue-specific interaction networks of proteins expressed from genes that are likely to be affected by disease-associated variants. Reactome pathways exhibiting enrichment of proteins from each network were compared across the cancers. Our results show that pathways associated with all three cancers tend to be broad cellular processes required for growth and survival. Significant examples include the nerve growth factor (P = 7.86 × 10^-33), epidermal growth factor (P = 1.18 × 10^-31) and fibroblast growth factor (P = 2.47 × 10^-31) signaling pathways. However, within these shared pathways, the genes that influence risk largely differ by cancer. Pathways found to be unique for a single cancer focus on more specific cellular functions, such as interleukin signaling in lung cancer (P = 1.69 × 10^-15), apoptosis initiation by Bad in breast cancer (P = 3.14 × 10^-9) and cellular responses to hypoxia in prostate cancer (P = 2.14 × 10^-9). We present the largest comparative cross-cancer pathway analysis of GWAS to date. Our approach can also be applied to the study of inherited mechanisms underlying risk across multiple diseases in general.
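Pathway enrichment of the kind mentioned above is commonly assessed with a one-sided hypergeometric (overrepresentation) test. The abstract does not state which test the authors used, so the sketch below is a generic illustration; the function name `overrepresentation_pvalue` and the toy counts are assumptions.

```python
from math import comb

def overrepresentation_pvalue(population, pathway_size, selected, overlap):
    """One-sided hypergeometric p-value, P(X >= overlap).

    population:   number of genes or proteins in the background
    pathway_size: how many of those belong to the pathway
    selected:     size of the list being tested
    overlap:      how many selected items fall in the pathway
    """
    max_overlap = min(pathway_size, selected)
    total = comb(population, selected)
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, selected - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total

# Toy example: 10 background genes, 3 in the pathway, 3 selected, all 3 overlap.
p_value = overrepresentation_pvalue(10, 3, 3, 3)
```

Exact integer arithmetic via `math.comb` keeps the sketch dependency-free; very small p-values such as those reported in the abstract would normally be computed with a statistics library and corrected for multiple testing.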
Affiliation(s)
- David C Qian
- Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
| | - Jinyoung Byun
- Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
| | - Younghun Han
- Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
| | - Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA 19104, USA
| | - John K Field
- Department of Molecular and Clinical Cancer Medicine, University of Liverpool Cancer Research Centre, Liverpool L69 3GA, UK
| | - Rayjean J Hung
- Lunenfeld-Tanenbaum Research Institute of Mount Sinai Hospital, Toronto, ON M5G 1X5, Canada
| | - Yonathan Brhane
- Lunenfeld-Tanenbaum Research Institute of Mount Sinai Hospital, Toronto, ON M5G 1X5, Canada
| | - John R Mclaughlin
- Dalla Lana School of Public Health, University of Toronto, Toronto, ON M5T 3M7, Canada
| | - Gordon Fehringer
- Lunenfeld-Tanenbaum Research Institute of Mount Sinai Hospital, Toronto, ON M5G 1X5, Canada
| | - Maria Teresa Landi
- National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Albert Rosenberger
- Department of Genetic Epidemiology, University Medical Centre Göttingen, 37099 Göttingen, Germany
| | - Heike Bickeböller
- Department of Genetic Epidemiology, University Medical Centre Göttingen, 37099 Göttingen, Germany
| | - Jyoti Malhotra
- Division of Hematology and Oncology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
| | - Angela Risch
- Division of Epigenomics and Cancer Risk Factors, German Cancer Research Center, 69120 Heidelberg, Germany
| | - Joachim Heinrich
- Institute of Epidemiology I, German Research Center for Environmental Health, 85764 Neuherberg, Germany
| | - David J Hunter
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
| | - Brian E Henderson
- Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
| | - Christopher A Haiman
- Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
| | - Fredrick R Schumacher
- Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
| | - Rosalind A Eeles
- Department of Cancer Genetics, Institute of Cancer Research, London SW7 3RP, UK
| | - Douglas F Easton
- Centre for Cancer Genetic Epidemiology, Department of Oncology, University of Cambridge, Cambridge CB1 8RN, UK
| | - Daniela Seminara
- National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Christopher I Amos
- Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
| |
|
132
|
Rudd J, Zelaya RA, Demidenko E, Goode EL, Greene CS, Doherty JA. Leveraging global gene expression patterns to predict expression of unmeasured genes. BMC Genomics 2015; 16:1065. [PMID: 26666289 PMCID: PMC4678722 DOI: 10.1186/s12864-015-2250-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 06/18/2015] [Accepted: 11/27/2015] [Indexed: 12/31/2022] Open
Abstract
Background Large collections of paraffin-embedded tissue represent a rich resource to test hypotheses based on gene expression patterns; however, measurement of genome-wide expression is cost-prohibitive on a large scale. Using the known expression correlation structure within a given disease type (in this case, high grade serous ovarian cancer; HGSC), we sought to identify reduced sets of directly measured (DM) genes which could accurately predict the expression of a maximized number of unmeasured genes. Results We developed a greedy gene set selection (GGS) algorithm which returns a DM set of user specified size based on a specific correlation threshold (|rP|) and minimum number of DM genes that must be correlated to an unmeasured gene in order to infer the value of the unmeasured gene (redundancy). We evaluated GGS in the Cancer Genome Atlas (TCGA) HGSC data across 144 combinations of DM size, redundancy (1–3), and |rP| (0.60, 0.65, 0.70). Across the parameter sweep, GGS allows on average 9 times more gene expression information to be captured compared to the DM set alone. GGS successfully augments prognostic HGSC gene sets; the addition of 20 GGS selected genes more than doubles the number of genes whose expression is predictable. Moreover, the expression prediction is highly accurate. After training regression models for the predictable gene set using 2/3 of the TCGA data, the average accuracy (ranked correlation of true and predicted values) in the 1/3 testing partition and four independent populations is above 0.65 and approaches 0.8 for conservative parameter sets. We observe similar accuracies in the TCGA HGSC RNA-sequencing data. Specifically, the prediction accuracy increases with increasing redundancy and increasing |rP|. 
Conclusions GGS-selected genes, which maximize expression information about unmeasured genes, can be combined with candidate gene sets as a cost-effective way to increase the amount of gene expression information obtained in large studies. This method can be applied to any organism, model system, disease, or tissue type for which whole genome gene expression data exist.
Affiliation(s)
- James Rudd
- Department of Epidemiology, Geisel School of Medicine at Dartmouth College, One Medical Center Drive, 7927 Rubin Building, Lebanon, NH, 03756, USA.
| | - René A Zelaya
- Department of Genetics, Geisel School of Medicine at Dartmouth College; Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania Perelman School of Medicine, 10-131 SCTR, 34th & Civic Center Boulevard, Philadelphia, PA, 19104-5158, USA.
| | - Eugene Demidenko
- Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth College, One Medical Center Drive, 7927 Rubin Building, Lebanon, NH, 03756, USA.
| | - Ellen L Goode
- Department of Health Sciences Research, Division of Epidemiology, Mayo Clinic, 200 First St. SW, Rochester, MN, 55905, USA.
| | - Casey S Greene
- Department of Genetics, Geisel School of Medicine at Dartmouth College; Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania Perelman School of Medicine, 10-131 SCTR, 34th & Civic Center Boulevard, Philadelphia, PA, 19104-5158, USA.
| | - Jennifer A Doherty
- Department of Epidemiology, Geisel School of Medicine at Dartmouth College, One Medical Center Drive, 7927 Rubin Building, Lebanon, NH, 03756, USA.
| |
|
133
|
Gonzalez GH, Tahsin T, Goodale BC, Greene AC, Greene CS. Recent Advances and Emerging Applications in Text and Data Mining for Biomedical Discovery. Brief Bioinform 2015; 17:33-42. [PMID: 26420781 PMCID: PMC4719073 DOI: 10.1093/bib/bbv087] [Citation(s) in RCA: 103] [Impact Index Per Article: 11.4] [Received: 02/17/2015] [Indexed: 02/06/2023] Open
Abstract
Precision medicine will revolutionize the way we treat and prevent disease. A major barrier to the implementation of precision medicine that clinicians and translational scientists face is understanding the underlying mechanisms of disease. We are starting to address this challenge through automatic approaches for information extraction, representation and analysis. Recent advances in text and data mining have been applied to a broad spectrum of key biomedical questions in genomics, pharmacogenomics and other fields. We present an overview of the fundamental methods for text and data mining, as well as recent advances and emerging applications toward precision medicine.
|
134
|
Rudd J, Zelaya RA, Demidenko E, Greene CS, Doherty JA. Abstract 2171: Leveraging global gene expression patterns to identify gene sets that predict expression of large numbers of unmeasured genes. Cancer Res 2015. [DOI: 10.1158/1538-7445.am2015-2171] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Large collections of formalin-fixed, paraffin-embedded tissue are a rich resource to test hypotheses based on gene expression patterns; however, measurement of genome-wide expression is cost-prohibitive on a large scale, and a reduced set of “candidate” genes must be selected and assayed with platforms such as NanoString nCounter®. Using the known expression correlation structure within a given tissue (high grade serous ovarian cancer; HGSC), we sought to determine whether reduced sets of directly measured genes could accurately predict a maximized number of unmeasured genes. To maximize the number of unmeasured genes that can be inferred from reduced set assays, we developed an algorithm with three key parameters: the number of genes to directly measure; the Pearson correlation threshold between genes (|rP|); and the number of directly measured genes that must be correlated with each unmeasured gene (i.e., redundancy). We evaluated this algorithm across a range of parameter values: 10-400 directly assayed genes, redundancy of 1-3, and |rP| of 0.60, 0.65, and 0.70. In a training partition of the Cancer Genome Atlas (TCGA) HGSC Affymetrix U133 Plus 2.0 gene expression data (n = 386), we used the selected directly measured genes to build linear models of predicted gene expression. We then evaluated the predicted expression values using true expression values from the following HGSC datasets: TCGA testing partition (n = 159); GSE9891 (Australian, U133 Plus 2.0 array, n = 264); and GSE32062 (Japanese, Agilent 4×44k microarray, n = 258). 
After restricting to genes with median absolute deviation (MAD) > 0.5 and using our most conservative parameters (|rP| = 0.7; redundancy = 3), 400 directly measured genes predicted an additional 198 unmeasured genes, with average Spearman rank coefficients (rS) and bootstrapped standard errors between predicted and true expression values of 0.854 (0.005) in the testing partition of TCGA, 0.871 (0.006) in the Australian data, and 0.832 (0.010) in the Japanese data. Removing MAD filtering predicted 332 unmeasured genes but lowered accuracy, with respective average rS values of 0.800 (0.009), 0.816 (0.011), and 0.750 (0.015). Relaxing redundancy to 2 and |rP| to 0.65 predicted 701 unmeasured genes, but respective average rS values decreased to 0.732 (0.007), 0.733 (0.008), and 0.686 (0.009). The number of predicted genes increases as the parameters become less conservative, with a concomitant decrease in accuracy. In summary, we show that for a given disease type a minimal set of directly measured genes can be used to maximize the amount of gene expression information captured in data sets across populations and assay platforms. Genes selected using this method can be combined with candidate gene sets as a cost-effective way to increase the amount of gene expression information obtained in large studies where using a genome-wide measurement platform is not feasible.
Citation Format: James Rudd, Rene A. Zelaya, Eugene Demidenko, Casey S. Greene, Jennifer A. Doherty. Leveraging global gene expression patterns to identify gene sets that predict expression of large numbers of unmeasured genes. [abstract]. In: Proceedings of the 106th Annual Meeting of the American Association for Cancer Research; 2015 Apr 18-22; Philadelphia, PA. Philadelphia (PA): AACR; Cancer Res 2015;75(15 Suppl):Abstract nr 2171. doi:10.1158/1538-7445.AM2015-2171
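The coverage rule described in the abstract above (an unmeasured gene counts as predictable when enough directly measured genes correlate with it) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `pearson` and `predictable_genes`, the dictionary layout of `expr`, and the toy values are all hypothetical, and the step that chooses which genes to measure is not shown.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def predictable_genes(expr, measured, r_min=0.7, redundancy=3):
    """Map each unmeasured gene to the measured genes that cover it.

    A gene is kept only if at least `redundancy` measured genes have
    |r| >= r_min with it (hypothetical reading of the abstract's rule).
    """
    covered = {}
    for gene, values in expr.items():
        if gene in measured:
            continue
        partners = [m for m in measured
                    if abs(pearson(values, expr[m])) >= r_min]
        if len(partners) >= redundancy:
            covered[gene] = partners
    return covered

# Toy data: gene C tracks both measured genes, gene E tracks neither.
expr = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8.5],
        "C": [4, 3, 2, 1], "E": [1, -1, 1, -1]}
print(predictable_genes(expr, ["A", "B"], r_min=0.7, redundancy=2))
```

Raising `r_min` or `redundancy` shrinks the covered set, which mirrors the trade-off between the number of predicted genes and accuracy reported in the abstract.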
Affiliation(s)
- James Rudd
- 1Department of Epidemiology, Geisel School of Medicine at Dartmouth College, Lebanon, NH
| | - Rene A. Zelaya
- 2Department of Genetics, Geisel School of Medicine at Dartmouth College, Lebanon, NH
| | - Eugene Demidenko
- 3Department of Community & Family Medicine, Geisel School of Medicine at Dartmouth College, Lebanon, NH
| | - Casey S. Greene
- 2Department of Genetics, Geisel School of Medicine at Dartmouth College, Lebanon, NH
| | - Jennifer A. Doherty
- 1Department of Epidemiology, Geisel School of Medicine at Dartmouth College, Lebanon, NH
|
135
|
Way GP, Rudd J, Greene CS, Doherty JA. Abstract 1928: High-grade serous ovarian cancer subtypes are similar across diverse populations. Cancer Res 2015. [DOI: 10.1158/1538-7445.am2015-1928] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
The most common and lethal type of invasive epithelial ovarian cancer is high grade serous (HGSC). Three to four gene expression-based HGSC subtypes have been identified in prior studies. In contrast to most previous studies, which have assessed the performance of survival classifiers in validation sets, we sought to determine the degree of similarity of gene expression patterns in subtypes between populations using systematic unsupervised clustering within populations.
We analyzed publicly available mRNA expression data from studies with >200 HGSC tumors: The Cancer Genome Atlas (TCGA, US, n = 519, Affymetrix HT U133a), Tothill et al. (GSE9891, Australia, n = 242, Affymetrix U133 Plus 2.0) and Yoshihara et al. (GSE32062, Japan, n = 258, Agilent G4112a). We restricted analyses to the 12,249 genes shared across all datasets and selected from these the union of the 1,500 most variant genes per population (2,824). Using these datasets, we performed k-means clustering within each population for k = 3 and k = 4. We compared each cluster to all other clusters using Significance Analysis of Microarrays, which outputs an F score for all 12,249 genes, measuring cluster-specific differential expression. We then calculated the correlation of the resulting F score vectors across populations and within populations across both numbers of centroids (k = 3 or k = 4). We identified analogous clusters by high F score correlations and determined each cluster's similarity to the TCGA subtypes based on cluster-specific differentially expressed genes.
We observed high concordance of gene expression patterns for clusters across populations and across k-means runs, suggesting that analogous clusters exist in most analyses. For k = 3, F score correlations across populations for clusters 1, 2 and 3, respectively, ranged between 0.77-0.85, 0.80-0.90, and 0.66-0.72. For k = 4, F score correlations for clusters 1-4 were, respectively: 0.76-0.85, 0.82-0.85, 0.65-0.78, and 0.52-0.78. Across k = 3 and k = 4, correlations for cluster 1 within TCGA, Tothill, and Yoshihara were 0.99, 1.00 and 1.00, and correlations for cluster 2 were 0.96, 0.98, and 0.95, respectively. Correlations for cluster 3 were less strong: 0.56, 0.88, and 0.60, respectively. For k = 4, cluster 4 was composed mainly of samples that belonged to cluster 3 for k = 3; 88% for TCGA, 54% for Tothill, and 95% for Yoshihara. When compared to TCGA subtypes, cluster 1 corresponded most strongly to mesenchymal, cluster 2 to proliferative, cluster 3 to differentiated, and cluster 4 to immunoreactive.
Our observation of highly correlated gene expression patterns between clusters across populations, across platforms, and across the number of centroids provides strong evidence that at least three biological HGSC subtypes exist. The mesenchymal-like and proliferative-like subtypes are particularly consistent across populations, and could be uniquely targeted for treatment.
Citation Format: Gregory P. Way, James Rudd, Casey S. Greene, Jennifer A. Doherty. High-grade serous ovarian cancer subtypes are similar across diverse populations. [abstract]. In: Proceedings of the 106th Annual Meeting of the American Association for Cancer Research; 2015 Apr 18-22; Philadelphia, PA. Philadelphia (PA): AACR; Cancer Res 2015;75(15 Suppl):Abstract nr 1928. doi:10.1158/1538-7445.AM2015-1928
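The cluster-matching step in the abstract above (pairing clusters from different populations by the correlation of their per-gene score vectors) can be illustrated with a short sketch. The names `pearson` and `match_clusters`, the cluster labels, and the four-gene toy vectors are hypothetical; the clustering and the Significance Analysis of Microarrays scoring are assumed to have been run already.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def match_clusters(scores_a, scores_b):
    """For each cluster in population A, find the best-correlated cluster in B.

    scores_a / scores_b: dict of cluster name -> per-gene score vector,
    with genes in the same order in both populations.
    """
    matches = {}
    for name_a, vec_a in scores_a.items():
        best = max(scores_b, key=lambda nb: pearson(vec_a, scores_b[nb]))
        matches[name_a] = (best, pearson(vec_a, scores_b[best]))
    return matches

# Toy score vectors over four genes for two populations.
pop_a = {"a1": [1, 2, 3, 4], "a2": [4, 1, 3, 2]}
pop_b = {"b1": [4.1, 0.9, 3.2, 2.0], "b2": [0.9, 2.1, 2.9, 4.2]}
for cluster, (partner, r) in match_clusters(pop_a, pop_b).items():
    print(cluster, "->", partner, round(r, 3))
```

This greedy pairing does not force a one-to-one assignment, so two clusters in one population could map to the same partner; a stricter analysis would need to check for that.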
Affiliation(s)
- Gregory P. Way
- 1Institute for Quantitative Biomedical Sciences, Geisel School of Medicine at Dartmouth College, Lebanon, NH
| | - James Rudd
- 2Department of Epidemiology, Geisel School of Medicine at Dartmouth College, Lebanon, NH
| | - Casey S. Greene
- 3Department of Genetics, Geisel School of Medicine at Dartmouth College, Lebanon, NH
| | - Jennifer A. Doherty
- 2Department of Epidemiology, Geisel School of Medicine at Dartmouth College, Lebanon, NH
|
136
|
Gui J, Greene CS, Sullivan C, Taylor W, Moore JH, Kim C. Testing multiple hypotheses through IMP weighted FDR based on a genetic functional network with application to a new zebrafish transcriptome study. BioData Min 2015; 8:17. [PMID: 26097506 PMCID: PMC4474579 DOI: 10.1186/s13040-015-0050-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 12/03/2014] [Accepted: 06/08/2015] [Indexed: 11/10/2022] Open
Abstract
In genome-wide studies, hundreds of thousands of hypothesis tests are performed simultaneously. Bonferroni correction and False Discovery Rate (FDR) can effectively control type I error but often yield a high false negative rate. We aim to develop a more powerful method to detect differentially expressed genes. We present a Weighted False Discovery Rate (WFDR) method that incorporates biological knowledge from genetic networks. We first identify weights using Integrative Multi-species Prediction (IMP) and then apply the weights in WFDR to identify differentially expressed genes through an IMP-WFDR algorithm. We performed a gene expression experiment to identify zebrafish genes that change expression in the presence of arsenic during a systemic Pseudomonas aeruginosa infection. Zebrafish were exposed to arsenic at 10 parts per billion and/or infected with P. aeruginosa. Appropriate controls were included. We then applied IMP-WFDR during the analysis of differentially expressed genes. We compared the mRNA expression for each group and found over 200 differentially expressed genes and several enriched pathways including defense response pathways, arsenic response pathways, and the Notch signaling pathway.
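To make the idea of a weighted FDR concrete, the sketch below implements a generic weighted Benjamini-Hochberg procedure: weights are rescaled to average 1, each p-value is divided by its weight, and the usual step-up rule is applied. This is an assumption about the general family of methods, not a reproduction of the IMP-WFDR algorithm; the function name `weighted_bh` and the example weights are hypothetical, and how IMP would supply the weights is not shown.

```python
def weighted_bh(pvalues, weights, alpha=0.05):
    """Generic weighted Benjamini-Hochberg step-up procedure.

    pvalues: raw p-values, one per hypothesis.
    weights: positive prior weights (larger = more plausible a priori).
    Returns the sorted indices of rejected hypotheses.
    """
    m = len(pvalues)
    mean_w = sum(weights) / m
    # Rescale weights to average 1, then divide each p-value by its weight.
    adjusted = [p / (w / mean_w) for p, w in zip(pvalues, weights)]
    order = sorted(range(m), key=lambda i: adjusted[i])
    # Find the largest rank k whose adjusted p-value is <= k * alpha / m.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if adjusted[i] <= rank * alpha / m:
            cutoff = rank
    return sorted(order[:cutoff])

pvals = [0.001, 0.02, 0.04, 0.5]
print(weighted_bh(pvals, [1, 1, 1, 1]))          # equal weights: plain BH
print(weighted_bh(pvals, [0.5, 0.5, 2.5, 0.5]))  # up-weight the third test
```

With equal weights the procedure reduces to ordinary Benjamini-Hochberg; up-weighting a hypothesis lowers its adjusted p-value and can bring it under the rejection threshold at the expense of the others.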
Affiliation(s)
- Jiang Gui
- Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, Hanover, NH, USA; Dartmouth-Hitchcock Medical Center, 883 Rubin Bldg, HB7927, One Medical Center Dr., Lebanon, NH, USA
| | - Casey S Greene
- Department of Genetics, Geisel School of Medicine, Dartmouth College, Hanover, NH, USA
| | - Con Sullivan
- Department of Molecular and Biomedical Sciences, University of Maine, Orono, ME, USA; Graduate School of Biomedical Science and Engineering, University of Maine, Orono, ME, USA
| | - Walter Taylor
- Department of Genetics, Geisel School of Medicine, Dartmouth College, Hanover, NH, USA
| | - Jason H Moore
- Department of Biostatistics and Epidemiology, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Carol Kim
- Department of Molecular and Biomedical Sciences, University of Maine, Orono, ME, USA; Graduate School of Biomedical Science and Engineering, University of Maine, Orono, ME, USA
|
137
|
Abstract
Modern technologies are capable of generating enormous amounts of data that measure complex biological systems. Computational biologists and bioinformatics scientists are increasingly being asked to use these data to reveal key systems-level properties. We review the extent to which curricula are changing in the era of big data. We identify key competencies that scientists dealing with big data are expected to possess across fields, and we use this information to propose courses to meet these growing needs. While bioinformatics programs have traditionally trained students in data-intensive science, we identify areas of particular biological, computational and statistical emphasis important for this era that can be incorporated into existing curricula. For each area, we propose a course structured around these topics, which can be adapted in whole or in parts into existing curricula. In summary, specific challenges associated with big data provide an important opportunity to update existing curricula, but we do not foresee a wholesale redesign of bioinformatics training programs.
|
138
|
Mahoney JM, Taroni J, Martyanov V, Wood TA, Greene CS, Pioli PA, Hinchcliff ME, Whitfield ML. Systems level analysis of systemic sclerosis shows a network of immune and profibrotic pathways connected with genetic polymorphisms. PLoS Comput Biol 2015; 11:e1004005. [PMID: 25569146 PMCID: PMC4288710 DOI: 10.1371/journal.pcbi.1004005] [Citation(s) in RCA: 94] [Impact Index Per Article: 10.4] [Received: 02/13/2014] [Accepted: 10/27/2014] [Indexed: 12/15/2022] Open
Abstract
Systemic sclerosis (SSc) is a rare systemic autoimmune disease characterized by skin and organ fibrosis. The pathogenesis of SSc and its progression are poorly understood. The SSc intrinsic gene expression subsets (inflammatory, fibroproliferative, normal-like, and limited) are observed in multiple clinical cohorts of patients with SSc. Analysis of longitudinal skin biopsies suggests that a patient's subset assignment is stable over 6–12 months. Genetically, SSc is multi-factorial with many genetic risk loci for SSc generally and for specific clinical manifestations. Here we identify the genes consistently associated with the intrinsic subsets across three independent cohorts, show the relationship between these genes using a gene-gene interaction network, and place the genetic risk loci in the context of the intrinsic subsets. To identify gene expression modules common to three independent datasets from three different clinical centers, we developed a consensus clustering procedure based on mutual information of partitions, an information theory concept, and performed a meta-analysis of these genome-wide gene expression datasets. We created a gene-gene interaction network of the conserved molecular features across the intrinsic subsets and analyzed their connections with SSc-associated genetic polymorphisms. The network is composed of distinct, but interconnected, components related to interferon activation, M2 macrophages, adaptive immunity, extracellular matrix remodeling, and cell proliferation. The network shows extensive connections between the inflammatory- and fibroproliferative-specific genes. The network also shows connections between these subset-specific genes and 30 SSc-associated polymorphic genes including STAT4, BLK, IRF7, NOTCH4, PLAUR, CSK, IRAK1, and several human leukocyte antigen (HLA) genes. 
Our analyses suggest that the gene expression changes underlying the SSc subsets may be long-lived, but mechanistically interconnected and related to a patient's underlying genetic risk. Systemic sclerosis (SSc) is a rare autoimmune disease characterized by skin thickening (fibrosis) and progressive organ failure. Previous studies of SSc skin biopsies have identified molecular subsets of SSc based upon gene expression termed the inflammatory, fibroproliferative, normal-like, and limited intrinsic subsets. These gene expression signatures are large and although the biological processes are conserved, the exact list of genes can vary across datasets due to random variation, as well as minor differences in the composition of the study cohorts (e.g. early vs. late disease). We developed a computational tool to identify the consensus genes underlying the subsets across heterogeneous data and characterized the biological role of the consensus genes in SSc in order to obtain a systems level perspective of the SSc subsets. Our analysis reveals a complex network of genes connecting two of the major SSc intrinsic subsets, inflammatory and fibroproliferative. Many genetic loci associated with SSc risk show connections with the consensus genes of the intrinsic subsets, indicating that differential expression of genes defining the subsets may be related to genetic risk for SSc, thus for the first time placing the genetic risk factors in the context of, and showing putative relationships with, the intrinsic gene expression subsets.
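The abstract mentions a consensus clustering procedure based on the mutual information of partitions. The quantity itself can be computed as in the sketch below, which treats two cluster assignments of the same genes as discrete variables. The function name `partition_mutual_information` and the toy labels are hypothetical, and the way the authors combine this measure into a consensus across datasets is not shown.

```python
from collections import Counter
from math import log

def partition_mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two partitions of the same items.

    labels_a[i] and labels_b[i] are the cluster labels that the two
    partitions assign to item i.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    return sum(
        (n_ab / n) * log(n_ab * n / (count_a[a] * count_b[b]))
        for (a, b), n_ab in joint.items()
    )

# Identical partitions share all their information; crossed ones share none.
print(partition_mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))
print(partition_mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

The value is zero when the two partitions are statistically independent and equals the entropy of the partition when the two agree exactly, up to a relabeling of clusters.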
Affiliation(s)
- J Matthew Mahoney
- Department of Genetics, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, United States of America
| | - Jaclyn Taroni
- Department of Genetics, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, United States of America
| | - Viktor Martyanov
- Department of Genetics, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, United States of America
| | - Tammara A Wood
- Department of Genetics, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, United States of America
| | - Casey S Greene
- Department of Genetics, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, United States of America
| | - Patricia A Pioli
- Department of Obstetrics and Gynecology, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, United States of America
| | - Monique E Hinchcliff
- Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, United States of America
| | - Michael L Whitfield
- Department of Genetics, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, United States of America
|
139
|
Tan J, Ung M, Cheng C, Greene CS. Unsupervised feature construction and knowledge extraction from genome-wide assays of breast cancer with denoising autoencoders. Pac Symp Biocomput 2015; 20:132-143. [PMID: 25592575 DOI: 10.1142/9789814644730_0014] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Indexed: 05/23/2023]
Abstract
Big data bring new opportunities for methods that efficiently summarize and automatically extract knowledge from such compendia. While both supervised learning algorithms and unsupervised clustering algorithms have been successfully applied to biological data, they are either dependent on known biology or limited to discerning the most significant signals in the data. Here we present denoising autoencoders (DAs), which employ a data-defined learning objective independent of known biology, as a method to identify and extract complex patterns from genomic data. We evaluate the performance of DAs by applying them to a large collection of breast cancer gene expression data. Results show that DAs successfully construct features that contain both clinical and molecular information. There are features that represent tumor or normal samples, estrogen receptor (ER) status, and molecular subtypes. Features constructed by the autoencoder generalize to an independent dataset collected using a distinct experimental platform. By integrating data from ENCODE for feature interpretation, we discover a feature representing ER status through association with key transcription factors in breast cancer. We also identify a feature that is highly predictive of patient survival and enriched for the FOXM1 signaling pathway. The features constructed by DAs are often bimodally distributed with one peak near zero and another near one, which facilitates discretization. In summary, we demonstrate that DAs effectively extract key biological principles from gene expression data and summarize them into constructed features with convenient properties.
Affiliation(s)
- Jie Tan
- Department of Genetics, Institute for Quantitative Biomedical Sciences, Norris Cotton Cancer Center, The Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
|
140
|
Greene CS, Tan J, Ung M, Moore JH, Cheng C. Big data bioinformatics. J Cell Physiol 2014; 229:1896-900. [PMID: 24799088 DOI: 10.1002/jcp.24662] [Citation(s) in RCA: 88] [Impact Index Per Article: 8.8] [Received: 04/30/2014] [Accepted: 05/01/2014] [Indexed: 12/17/2022]
Abstract
Recent technological advances allow for high throughput profiling of biological systems in a cost-efficient manner. The low cost of data generation is leading us to the "big data" era. The availability of big data provides unprecedented opportunities but also raises new challenges for data mining and analysis. In this review, we introduce key concepts in the analysis of big data, including both "machine learning" algorithms as well as "unsupervised" and "supervised" examples of each. We note packages for the R programming language that are available to perform machine learning analyses. In addition to programming based solutions, we review webservers that allow users with limited or no programming background to perform these analyses on large data compendia.
Affiliation(s)
- Casey S Greene
- Department of Genetics, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire; Institute for Quantitative Biomedical Sciences, Geisel School of Medicine at Dartmouth, Lebanon, New Hampshire; Norris Cotton Cancer Center, Geisel School of Medicine at Dartmouth, Lebanon, New Hampshire
|
141
|
Zieselman AL, Fisher JM, Hu T, Andrews PC, Greene CS, Shen L, Saykin AJ, Moore JH. Computational genetics analysis of grey matter density in Alzheimer's disease. BioData Min 2014; 7:17. [PMID: 25165488 PMCID: PMC4145360 DOI: 10.1186/1756-0381-7-17] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 01/28/2014] [Accepted: 08/18/2014] [Indexed: 12/24/2022] Open
Abstract
Background: Alzheimer’s disease is the most common form of progressive dementia and there is currently no known cure. The cause of onset is not fully understood but genetic factors are expected to play a significant role. We present here a bioinformatics approach to the genetic analysis of grey matter density as an endophenotype for late onset Alzheimer’s disease. Our approach combines machine learning analysis of gene-gene interactions with large-scale functional genomics data for assessing biological relationships. Results: We found a statistically significant synergistic interaction between two SNPs located in the intergenic region of an olfactory gene cluster. This model did not replicate in an independent dataset. However, genes in this region have high-confidence biological relationships and are consistent with previous findings implicating sensory processes in Alzheimer’s disease. Conclusions: Previous genetic studies of Alzheimer’s disease have revealed only a small portion of the overall variability due to DNA sequence differences. Some of this missing heritability is likely due to complex gene-gene and gene-environment interactions. We have introduced here a novel bioinformatics analysis pipeline that embraces the complexity of the genetic architecture of Alzheimer’s disease while at the same time harnessing the power of functional genomics. These findings represent novel hypotheses about the genetic basis of this complex disease and provide open-access methods that others can use in their own studies.
Affiliation(s)
- Amanda L Zieselman
- Department of Genetics, Institute for Quantitative Biomedical Sciences, Geisel School of Medicine, Dartmouth College, Hanover, New Hampshire 03755, USA
| | - Jonathan M Fisher
- Department of Genetics, Institute for Quantitative Biomedical Sciences, Geisel School of Medicine, Dartmouth College, Hanover, New Hampshire 03755, USA
| | - Ting Hu
- Department of Genetics, Institute for Quantitative Biomedical Sciences, Geisel School of Medicine, Dartmouth College, Hanover, New Hampshire 03755, USA
| | - Peter C Andrews
- Department of Genetics, Institute for Quantitative Biomedical Sciences, Geisel School of Medicine, Dartmouth College, Hanover, New Hampshire 03755, USA
| | - Casey S Greene
- Department of Genetics, Institute for Quantitative Biomedical Sciences, Geisel School of Medicine, Dartmouth College, Hanover, New Hampshire 03755, USA
| | - Li Shen
- Department of Radiology and Imaging Sciences, Center for Neuroimaging and Indiana Alzheimer's Disease Center, Indiana University School of Medicine, Indianapolis, IN 46202, USA
| | - Andrew J Saykin
- Department of Radiology and Imaging Sciences, Center for Neuroimaging and Indiana Alzheimer's Disease Center, Indiana University School of Medicine, Indianapolis, IN 46202, USA
| | - Jason H Moore
- Department of Genetics, Institute for Quantitative Biomedical Sciences, Geisel School of Medicine, Dartmouth College, Hanover, New Hampshire 03755, USA
|
142
|
Bogenberger JM, Rudd JE, Chow D, Kassner M, Yin H, Greene CS, Tibes R. Abstract A28: Identification of HDAC inhibitor potentiating targets in acute myeloid leukemia cells by large-scale RNA-interference. Mol Cancer Ther 2014. [DOI: 10.1158/1535-7163.pms-a28] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
The lysine deacetylase inhibitor suberoylanilide hydroxamic acid (SAHA) has shown promising but limited activity in the treatment of acute myeloid leukemia (AML). To identify potential targets for rational combination therapies that increase the efficacy of SAHA in AML, we used a functional RNA-interference (RNAi) screening approach to identify genes that, when inhibited, potentiate the in vitro anti-leukemic activity of SAHA. A total of 901 kinase, phosphatase and associated signaling genes were silenced, with four different siRNA sequences per gene, in combination with SAHA treatment. Log2 values of the ratio [(siRNA + SAHA)/(siRNA alone)] were calculated, with median and standard deviation determined on a per-plate basis. Hits were defined as ≥ 2 standard deviations from the log2 ratio median. Hit lists for each cell line were overlaid on an integrated functional relationship network. We applied a community detection algorithm to this sub-network and identified siRNA sensitive modules. Each module represents a highly connected set of genes in the integrated network. To identify the pathways represented by each module, we evaluated enrichment using the National Cancer Institute (NCI) Protein Interaction Database (PID) pathways. Several novel sensitizing targets, grouped into a small number of pathways, emerged from these screens. Some hits exhibit little to no anti-leukemic activity when silenced alone, indicative of synthetic lethal interaction with SAHA treatment. Initial validation experiments with siRNA and novel small molecule inhibitors confirm RNAi screen results and pharmacological sensitization is observed. The first reported large-scale HDAC inhibitor RNAi screen in leukemias has identified a novel rational combination that can be translated into design of a clinical trial.
Citation Format: James M. Bogenberger, James E. Rudd, Donald Chow, Michelle Kassner, Holly Yin, Casey S. Greene, Raoul Tibes. Identification of HDAC inhibitor potentiating targets in acute myeloid leukemia cells by large-scale RNA-interference. [abstract]. In: Proceedings of the AACR Precision Medicine Series: Synthetic Lethal Approaches to Cancer Vulnerabilities; May 17-20, 2013; Bellevue, WA. Philadelphia (PA): AACR; Mol Cancer Ther 2013;12(5 Suppl):Abstract nr A28.
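The hit-calling rule in the abstract above (log2 ratio of combination to siRNA-alone signal, compared against the per-plate median and standard deviation) can be sketched in a few lines. The function name `plate_hits`, the well identifiers, and the toy viability values are hypothetical. The abstract does not say whether hits were called in one direction only, so this sketch flags deviations in either direction as an assumption.

```python
from math import log2
from statistics import median, stdev

def plate_hits(combo, sirna_alone, n_sd=2.0):
    """Flag wells on one plate whose log2 ratio is far from the plate median.

    combo: well -> signal with siRNA + drug.
    sirna_alone: well -> signal with siRNA only.
    Returns well -> log2 ratio for wells at least n_sd standard
    deviations from the plate median (either direction, by assumption).
    """
    ratios = {w: log2(combo[w] / sirna_alone[w]) for w in combo}
    values = list(ratios.values())
    center = median(values)
    spread = stdev(values)
    return {w: r for w, r in ratios.items()
            if abs(r - center) >= n_sd * spread}

# Toy plate: seven unaffected wells and one strongly sensitized well.
alone = {f"w{i}": 100.0 for i in range(1, 9)}
combo = dict(alone, w8=12.5)
print(plate_hits(combo, alone))
```

Computing the median and standard deviation separately for each plate, as the abstract describes, keeps plate-to-plate differences in overall signal from producing spurious hits.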
Affiliation(s)
| | - James E. Rudd
- 2Institute for Quantitative Biomedical Sciences, The Geisel School of Medicine at Dartmouth, Hanover, NH,
| | - Donald Chow
- 3Translational Genomics Research Institute, Scottsdale, AZ
| | | | - Holly Yin
- 3Translational Genomics Research Institute, Scottsdale, AZ
| | - Casey S. Greene
- 2Institute for Quantitative Biomedical Sciences, The Geisel School of Medicine at Dartmouth, Hanover, NH,
| | - Raoul Tibes
- 1Mayo Clinic in Arizona, Department of Hematology/Oncology, Scottsdale, AZ,
|
143
|
Penrod NM, Greene CS, Moore JH. Predicting targeted drug combinations based on Pareto optimal patterns of coexpression network connectivity. Genome Med 2014; 6:33. [PMID: 24944582 PMCID: PMC4062052 DOI: 10.1186/gm550] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 09/30/2013] [Accepted: 04/22/2014] [Indexed: 01/05/2023] Open
Abstract
Background: Molecularly targeted drugs promise a safer and more effective treatment modality than conventional chemotherapy for cancer patients. However, tumors are dynamic systems that readily adapt to these agents, activating alternative survival pathways as they evolve resistant phenotypes. Combination therapies can overcome resistance but finding the optimal combinations efficiently presents a formidable challenge. Here we introduce a new paradigm for the design of combination therapy treatment strategies that exploits the tumor adaptive process to identify context-dependent essential genes as druggable targets. Methods: We have developed a framework to mine high-throughput transcriptomic data, based on differential coexpression and Pareto optimization, to investigate drug-induced tumor adaptation. We use this approach to identify tumor-essential genes as druggable candidates. We apply our method to a set of ER+ breast tumor samples, collected before (n = 58) and after (n = 60) neoadjuvant treatment with the aromatase inhibitor letrozole, to prioritize genes as targets for combination therapy with letrozole treatment. We validate letrozole-induced tumor adaptation through coexpression and pathway analyses in an independent data set (n = 18). Results: We find pervasive differential coexpression between the untreated and letrozole-treated tumor samples as evidence of letrozole-induced tumor adaptation. Based on patterns of coexpression, we identify ten genes as potential candidates for combination therapy with letrozole including EPCAM, a letrozole-induced essential gene and a target to which drugs have already been developed as cancer therapeutics. Through replication, we validate six letrozole-induced coexpression relationships and confirm the epithelial-to-mesenchymal transition as a process that is upregulated in the residual tumor samples following letrozole treatment.
Conclusions: To derive the greatest benefit from molecularly targeted drugs it is critical to design combination treatment strategies rationally. Incorporating knowledge of the tumor adaptation process into the design provides an opportunity to match targeted drugs to the evolving tumor phenotype and surmount resistance.
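Pareto optimization, named in the Methods above, keeps every candidate that no other candidate beats on all objectives at once. The sketch below shows that filter for the case where larger values are better on every axis. The function name `pareto_front` and the two-objective toy scores are hypothetical; the abstract does not state which connectivity measures the authors optimized, so the objectives here are placeholders.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    Each point is a tuple of objective values where larger is better.
    Point q dominates p if q is at least as good on every objective
    and strictly better on at least one.
    """
    def dominates(q, p):
        return (all(a >= b for a, b in zip(q, p))
                and any(a > b for a, b in zip(q, p)))

    return [p for p in points
            if not any(dominates(q, p) for q in points)]

# Toy scores for five candidate genes on two placeholder objectives.
candidates = [(1, 5), (3, 3), (5, 1), (2, 2), (3, 1)]
print(pareto_front(candidates))
```

The appeal of this kind of filter is that it avoids collapsing several objectives into a single weighted score, at the cost of returning a set of candidates that still has to be ranked or reviewed.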
Affiliation(s)
- Nadia M Penrod
- Department of Pharmacology and Toxicology, Geisel School of Medicine at Dartmouth College, HB7937 One Medical Center Dr, Lebanon NH 03766, USA
| | - Casey S Greene
- Department of Genetics, Geisel School of Medicine at Dartmouth College, HB7937 One Medical Center Dr, Lebanon NH 03766, USA ; Institute for Quantitative Biomedical Sciences, Geisel School of Medicine at Dartmouth College, HB7937 One Medical Center Dr, Lebanon NH 03766, USA
| | - Jason H Moore
- Department of Genetics, Geisel School of Medicine at Dartmouth College, HB7937 One Medical Center Dr, Lebanon NH 03766, USA ; Institute for Quantitative Biomedical Sciences, Geisel School of Medicine at Dartmouth College, HB7937 One Medical Center Dr, Lebanon NH 03766, USA
|
144
|
Greene CS, Himmelstein DS, Nelson HH, Kelsey KT, Williams SM, Andrew AS, Karagas MR, Moore JH. Enabling personal genomics with an explicit test of epistasis. Pac Symp Biocomput 2013:327-36. [PMID: 19908385 DOI: 10.1142/9789814295291_0035] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.9] [Indexed: 11/18/2022]
Abstract
One goal of personal genomics is to use information about genomic variation to predict who is at risk for various common diseases. Technological advances in genotyping have spawned several personal genetic testing services that market genotyping services directly to the consumer. An important goal of consumer genetic testing is to provide health information along with the genotyping results. This has the potential to integrate detailed personal genetic and genomic information into healthcare decision making. Despite the potential importance of these advances, there are some important limitations. One concern is that much of the literature that is used to formulate personal genetics reports is based on genetic association studies that consider each genetic variant independently of the others. It is our working hypothesis that the true value of personal genomics will only be realized when the complexity of the genotype-to-phenotype mapping relationship is embraced, rather than ignored. We focus here on complexity in genetic architecture due to epistasis or nonlinear gene-gene interaction. We have previously developed a multifactor dimensionality reduction (MDR) algorithm and software package for detecting nonlinear interactions in genetic association studies. In most prior MDR analyses, the permutation testing strategy used to assess statistical significance was unable to differentiate MDR models that captured only interaction effects from those that also detected independent main effects. Statistical interpretation of MDR models required post-hoc analysis using entropy-based measures of interaction information. We introduce here a novel permutation test that allows the effects of nonlinear interactions between multiple genetic variants to be specifically tested in a manner that is not confounded by linear additive effects. 
We show using simulated nonlinear interactions that the power of the explicit test of epistasis is no different from that of a standard permutation test. We also show that the test has the appropriate size or type I error rate of approximately 0.05. We then apply MDR with the new explicit test of epistasis to a large genetic study of bladder cancer and show that a previously reported nonlinear interaction between two XPD gene polymorphisms is indeed significant, even after considering the strong additive effect of smoking in the model. Finally, we evaluated the power of the explicit test of epistasis to detect the nonlinear interaction between two XPD gene polymorphisms by simulating data from the MDR model of bladder cancer susceptibility. The results of this study provide for the first time a simple method for explicitly testing epistasis or gene-gene interaction effects in genetic association studies. Although we demonstrated the method with MDR, an important advantage is that it can be combined with any modeling approach. The explicit test of epistasis brings us a step closer to the type of routine gene-gene interaction analysis that is needed if we are to enable personal genomics.
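The abstract does not spell out how the permutation null is built, so the following is only a minimal sketch under stated assumptions. It assumes the null is generated by shuffling each variant's genotypes independently within cases and within controls, which keeps every variant's per-class genotype frequencies (its main effect) intact while breaking joint structure between variants. The statistic `joint_balanced_accuracy` is a simplified MDR-like stand-in for two variants, not the MDR software itself, and all function names are hypothetical.

```python
import numpy as np


def permute_within_class(genotypes, labels, rng):
    """Shuffle each variant column independently within each class.

    Assumption: this preserves per-class genotype frequencies (main
    effects) and destroys any joint (interaction) structure.
    """
    permuted = genotypes.copy()
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        for j in range(genotypes.shape[1]):
            permuted[idx, j] = genotypes[rng.permutation(idx), j]
    return permuted


def joint_balanced_accuracy(genotypes, labels):
    """MDR-like stand-in statistic for two variants coded 0/1/2.

    Each two-variant genotype cell is called high risk when its
    case:control ratio exceeds the overall ratio; the statistic is the
    balanced accuracy of that rule on the same data.
    """
    cell = genotypes[:, 0] * 3 + genotypes[:, 1]
    cases = np.bincount(cell[labels == 1], minlength=9)
    controls = np.bincount(cell[labels == 0], minlength=9)
    high_risk = cases * controls.sum() > controls * cases.sum()
    sensitivity = cases[high_risk].sum() / cases.sum()
    specificity = controls[~high_risk].sum() / controls.sum()
    return 0.5 * (sensitivity + specificity)


def explicit_epistasis_pvalue(genotypes, labels, n_perm=1000, seed=0):
    """Permutation p-value against the within-class-shuffled null."""
    rng = np.random.default_rng(seed)
    observed = joint_balanced_accuracy(genotypes, labels)
    null = np.array([
        joint_balanced_accuracy(
            permute_within_class(genotypes, labels, rng), labels
        )
        for _ in range(n_perm)
    ])
    # Add-one correction keeps the estimate strictly above zero.
    return (1 + np.sum(null >= observed)) / (1 + n_perm)
```

Because the null data sets retain the main effects, a small p-value here points to joint structure beyond them; any other interaction statistic could be swapped in for the stand-in, in line with the abstract's remark that the test can be combined with any modeling approach.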
Affiliation(s)
- Casey S Greene
- Department of Genetics, Dartmouth Medical School, Lebanon, NH 03756, USA
|
145
|
Ju W, Greene CS, Eichinger F, Nair V, Hodgin JB, Bitzer M, Lee YS, Zhu Q, Kehata M, Li M, Jiang S, Rastaldi MP, Cohen CD, Troyanskaya OG, Kretzler M. Defining cell-type specificity at the transcriptional level in human disease. Genome Res 2013; 23:1862-73. [PMID: 23950145 PMCID: PMC3814886 DOI: 10.1101/gr.155697.113] [Citation(s) in RCA: 171] [Impact Index Per Article: 15.5] [Indexed: 12/12/2022]
Abstract
Cell-lineage–specific transcripts are essential for differentiated tissue function, implicated in hereditary organ failure, and mediate acquired chronic diseases. However, experimental identification of cell-lineage–specific genes in a genome-scale manner is infeasible for most solid human tissues. We developed the first genome-scale method to identify genes with cell-lineage–specific expression, even in lineages not separable by experimental microdissection. Our machine-learning–based approach leverages high-throughput data from tissue homogenates in a novel iterative statistical framework. We applied this method to chronic kidney disease and identified transcripts specific to podocytes, key cells in the glomerular filter responsible for hereditary and most acquired glomerular kidney disease. In a systematic evaluation of our predictions by immunohistochemistry, our in silico approach was significantly more accurate (65% accuracy in human) than predictions based on direct measurement of in vivo fluorescence-tagged murine podocytes (23%). Our method identified genes implicated as causal in hereditary glomerular disease and involved in molecular pathways of acquired and chronic renal diseases. Furthermore, based on expression analysis of human kidney disease biopsies, we demonstrated that expression of the podocyte genes identified by our approach is significantly related to the degree of renal impairment in patients. Our approach is broadly applicable to define lineage specificity in both cell physiology and human disease contexts. We provide a user-friendly website that enables researchers to apply this method to any cell-lineage or tissue of interest. Identified cell-lineage–specific transcripts are expected to play essential tissue-specific roles in organogenesis and disease and can provide starting points for the development of organ-specific diagnostics and therapies.
Affiliation(s)
- Wenjun Ju
- Division of Nephrology, Department of Internal Medicine, University of Michigan, Ann Arbor, Michigan 48109, USA
|
146
|
Park CY, Wong AK, Greene CS, Rowland J, Guan Y, Bongo LA, Burdine RD, Troyanskaya OG. Functional knowledge transfer for high-accuracy prediction of under-studied biological processes. PLoS Comput Biol 2013; 9:e1002957. [PMID: 23516347 PMCID: PMC3597527 DOI: 10.1371/journal.pcbi.1002957] [Citation(s) in RCA: 47] [Impact Index Per Article: 4.3] [Received: 05/29/2012] [Accepted: 01/15/2013] [Indexed: 11/19/2022] Open
Abstract
A key challenge in genetics is identifying the functional roles of genes in pathways. Numerous functional genomics techniques (e.g. machine learning) that predict protein function have been developed to address this question. These methods generally build from existing annotations of genes to pathways and thus are often unable to identify additional genes participating in processes that are not already well studied. Many of these processes are well studied in some organism, but not necessarily in an investigator's organism of interest. Sequence-based search methods (e.g. BLAST) have been used to transfer such annotation information between organisms. We demonstrate that functional genomics can complement traditional sequence similarity to improve the transfer of gene annotations between organisms. Our method transfers annotations only when functionally appropriate as determined by genomic data and can be used with any prediction algorithm to combine transferred gene function knowledge with organism-specific high-throughput data to enable accurate function prediction. We show that diverse state-of-the-art machine learning algorithms leveraging functional knowledge transfer (FKT) dramatically improve their accuracy in predicting gene-pathway membership, particularly for processes with little experimental knowledge in an organism. We also show that our method compares favorably to annotation transfer by sequence similarity. Next, we deploy FKT with a state-of-the-art SVM classifier to predict novel genes for 11,000 biological processes across six diverse organisms and expand the coverage of accurate function predictions to processes that are often ignored because of a dearth of annotated genes in an organism. Finally, we perform in vivo experimental investigation in Danio rerio and confirm the regulatory role of our top predicted novel gene, wnt5b, in leftward cell migration during heart development.
FKT is immediately applicable to many bioinformatics techniques and will help biologists systematically integrate prior knowledge from diverse systems to direct targeted experiments in their organism of study.
Due to technical and ethical challenges, many human diseases or biological processes are studied in model organisms. Discoveries in these organisms are then transferred back to human or other model organisms. Traditional methods for transferring novel gene function annotations have relied on finding genes with high sequence similarity believed to share evolutionary ancestry. However, sequence similarity does not guarantee a shared functional role in molecular pathways. In this study, we show that functional genomics can complement traditional sequence similarity measures to improve the transfer of gene annotations between organisms. We coupled our knowledge transfer method with current state-of-the-art machine learning algorithms and predicted gene function for 11,000 biological processes across six organisms. We experimentally validated our prediction of wnt5b's involvement in the determination of left-right heart asymmetry in zebrafish. Our results show that functional knowledge transfer can improve the coverage and accuracy of machine learning methods used for gene function prediction in a diverse set of organisms. Such an approach can be applied to additional organisms, and will be especially beneficial in organisms that have high-throughput genomic data with sparse annotations.
Affiliation(s)
- Christopher Y. Park
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
- Aaron K. Wong
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
- Casey S. Greene
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Jessica Rowland
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
- Yuanfang Guan
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Lars A. Bongo
- Department of Computer Science, University of Tromsø, Tromsø, Norway
- Rebecca D. Burdine
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
- Olga G. Troyanskaya
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
|
147
|
Abstract
The biggest challenge for text and data mining is to truly impact the biomedical discovery process, enabling scientists to generate novel hypotheses to address the most crucial questions. Among a number of worthy submissions, we have selected six papers that exemplify advances in text and data mining methods that have a demonstrated impact on a wide range of applications. Work presented in this session includes data mining techniques applied to the discovery of 3-way genetic interactions and to the analysis of genetic data in the context of electronic medical records (EMRs), as well as an integrative approach that combines data from genetic (SNP) and transcriptomic (microarray) sources for clinical prediction. Text mining advances include a classification method to determine whether a published article contains pharmacological experiments relevant to drug-drug interactions, a fine-grained text mining approach for detecting the catalytic sites in proteins in the biomedical literature, and a method for automatically extending a taxonomy of health-related terms to integrate consumer-friendly synonyms for medical terminologies.
Affiliation(s)
- Graciela Gonzalez
- Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA.
|
148
|
Abstract
Modern experimental strategies often generate genome-scale measurements of human tissues or cell lines in various physiological states. Investigators often use these datasets individually to help elucidate molecular mechanisms of human diseases. Here we discuss approaches that effectively weight and integrate hundreds of heterogeneous datasets into gene-gene networks that focus on a specific process or disease. Diverse and systematic genome-scale measurements provide such approaches both a great deal of power and a number of challenges. We discuss some such challenges as well as methods to address them. We also raise important considerations for the assessment and evaluation of such approaches. When carefully applied, these integrative data-driven methods can make novel high-quality predictions that can transform our understanding of the molecular basis of human disease.
Affiliation(s)
- Casey S. Greene
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Olga G. Troyanskaya
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
|
149
|
Wong AK, Park CY, Greene CS, Bongo LA, Guan Y, Troyanskaya OG. IMP: a multi-species functional genomics portal for integration, visualization and prediction of protein functions and networks. Nucleic Acids Res 2012; 40:W484-90. [PMID: 22684505 PMCID: PMC3394282 DOI: 10.1093/nar/gks458] [Citation(s) in RCA: 75] [Impact Index Per Article: 6.3] [Indexed: 12/20/2022] Open
Abstract
Integrative multi-species prediction (IMP) is an interactive web server that enables molecular biologists to interpret experimental results and to generate hypotheses in the context of a large cross-organism compendium of functional predictions and networks. The system provides a framework for biologists to analyze their candidate gene sets in the context of functional networks, as they expand or focus these sets by mining functional relationships predicted from integrated high-throughput data. IMP integrates prior knowledge and data collections from multiple organisms in its analyses. Through flexible and interactive visualizations, researchers can compare functional contexts and interpret the behavior of their gene sets across organisms. Additionally, IMP identifies homologs with conserved functional roles for knowledge transfer, allowing for accurate function predictions even for biological processes that have very few experimental annotations in a given organism. IMP currently supports seven organisms (Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Danio rerio, Caenorhabditis elegans and Saccharomyces cerevisiae), does not require any registration or installation and is freely available for use at http://imp.princeton.edu.
Affiliation(s)
- Aaron K Wong
- Department of Computer Science, Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA
|
150
|
Abstract
The development of technology capable of inexpensively performing large-scale measurements of biological systems has generated a wealth of data. Integrative analysis of these data holds the promise of uncovering gene function, regulation, and, in the longer run, understanding complex disease. However, their analysis has proved very challenging, as it is difficult to quickly and effectively assess the relevance and accuracy of these data for individual biological questions. Here, we identify biases that present challenges for the assessment of functional genomics data and methods. We then discuss evaluation methods that, taken together, begin to address these issues. We also argue that the funding of systematic data-driven experiments and of high-quality curation efforts will further improve evaluation metrics so that they more accurately assess functional genomics data and methods. Such metrics will allow researchers in the field of functional genomics to continue to answer important biological questions in a data-driven manner.
Affiliation(s)
- Casey S Greene
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA.
|