1
Phetsouphanh C, Jacka B, Ballouz S, Jackson KJL, Wilson DB, Manandhar B, Klemm V, Tan HX, Wheatley A, Aggarwal A, Akerman A, Milogiannakis V, Starr M, Cunningham P, Turville SG, Kent SJ, Byrne A, Brew BJ, Darley DR, Dore GJ, Kelleher AD, Matthews GV. Improvement of immune dysregulation in individuals with long COVID at 24-months following SARS-CoV-2 infection. Nat Commun 2024; 15:3315. [PMID: 38632311 PMCID: PMC11024141 DOI: 10.1038/s41467-024-47720-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2023] [Accepted: 04/11/2024] [Indexed: 04/19/2024] Open
Abstract
This study investigates the humoral and cellular immune responses and health-related quality of life measures in individuals with mild to moderate long COVID (LC) compared to age- and gender-matched recovered COVID-19 controls (MC) over 24 months. LC participants show elevated nucleocapsid IgG levels at 3 months, and higher neutralizing capacity up to 8 months post-infection. Increased spike-specific and nucleocapsid-specific CD4+ T cells, and increased PD-1 and TIM-3 expression on CD4+ and CD8+ T cells, were observed at 3 and 8 months, but these differences do not persist at 24 months. Some LC participants had detectable IFN-γ and IFN-β, which was attributed to reinfection and antigen re-exposure. Single-cell RNA sequencing at the 24-month timepoint shows similar immune cell proportions and reconstitution of naïve T and B cell subsets in LC and MC. No significant differences in exhaustion scores or antigen-specific T cell clones are observed. These findings suggest resolution of immune activation in LC and return to comparable immune responses between LC and MC over time. Improvement in self-reported health-related quality of life at 24 months was also evident in the majority of LC (62%). PTX3, CRP levels and platelet count are associated with improvements in health-related quality of life.
Affiliation(s)
- Brendan Jacka
- The Kirby Institute, University of New South Wales, Sydney, NSW, Australia
- Sara Ballouz
- Garvan Institute of Medical Research, Sydney, NSW, Australia
- School of Computer Science and Engineering, Faculty of Engineering, University of New South Wales, Sydney, NSW, Australia
- Daniel B Wilson
- The Kirby Institute, University of New South Wales, Sydney, NSW, Australia
- Bikash Manandhar
- The Kirby Institute, University of New South Wales, Sydney, NSW, Australia
- Vera Klemm
- The Kirby Institute, University of New South Wales, Sydney, NSW, Australia
- Hyon-Xhi Tan
- Department of Microbiology and Immunology, Peter Doherty Institute, University of Melbourne, Victoria, VIC, Australia
- Adam Wheatley
- Department of Microbiology and Immunology, Peter Doherty Institute, University of Melbourne, Victoria, VIC, Australia
- Anupriya Aggarwal
- The Kirby Institute, University of New South Wales, Sydney, NSW, Australia
- Anouschka Akerman
- The Kirby Institute, University of New South Wales, Sydney, NSW, Australia
- Mitchell Starr
- NSW State Reference Laboratory for HIV, St. Vincent's Centre for Applied Medical Research, Sydney, NSW, Australia
- Phillip Cunningham
- NSW State Reference Laboratory for HIV, St. Vincent's Centre for Applied Medical Research, Sydney, NSW, Australia
- Stuart G Turville
- The Kirby Institute, University of New South Wales, Sydney, NSW, Australia
- Stephen J Kent
- Department of Microbiology and Immunology, Peter Doherty Institute, University of Melbourne, Victoria, VIC, Australia
- Anthony Byrne
- Heart Lung Clinic, St. Vincent's Hospital Sydney and Faculty of Medicine and Health (UNSW), Sydney, NSW, Australia
- Bruce J Brew
- Peter Duncan Neurosciences Unit, St Vincent's Centre for Applied Medical Research, Sydney, NSW, Australia
- Gregory J Dore
- The Kirby Institute, University of New South Wales, Sydney, NSW, Australia
- St. Vincent's Hospital, Darlinghurst, NSW, Australia
- Anthony D Kelleher
- The Kirby Institute, University of New South Wales, Sydney, NSW, Australia.
- St. Vincent's Hospital, Darlinghurst, NSW, Australia.
- Gail V Matthews
- The Kirby Institute, University of New South Wales, Sydney, NSW, Australia.
- St. Vincent's Hospital, Darlinghurst, NSW, Australia.
2
Ballouz S, Kawaguchi RK, Pena MT, Fischer S, Crow M, French L, Knight FM, Adams LB, Gillis J. The transcriptional legacy of developmental stochasticity. Nat Commun 2023; 14:7226. [PMID: 37940702 PMCID: PMC10632366 DOI: 10.1038/s41467-023-43024-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/06/2022] [Accepted: 10/30/2023] [Indexed: 11/10/2023] Open
Abstract
Genetic and environmental variation are key contributors during organism development, but the influence of minor perturbations or noise is difficult to assess. This study focuses on the stochastic variation in allele-specific expression that persists through cell divisions in the nine-banded armadillo (Dasypus novemcinctus). We investigated the blood transcriptome of five wild monozygotic quadruplets over time to explore the influence of developmental stochasticity on gene expression. We identify an enduring signal of autosomal allelic variability that distinguishes individuals within a quadruplet despite their genetic similarity. This stochastic allelic variation, akin to X-inactivation but broader, provides insight into non-genetic influences on phenotype. The presence of stochastically canalized allelic signatures represents a novel axis for characterizing organismal variability, complementing traditional approaches based on genetic and environmental factors. We also developed a model to explain the inconsistent penetrance associated with these stochastically canalized allelic expressions. By elucidating mechanisms underlying the persistence of allele-specific expression, we enhance understanding of development's role in shaping organismal diversity.
Affiliation(s)
- Sara Ballouz
- Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA
- School of Computer Science and Engineering, Faculty of Engineering, University of New South Wales Sydney, Sydney, NSW, Australia
- Risa Karakida Kawaguchi
- Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA
- Center for iPS Cell Research and Application, Kyoto University, Kyoto, Japan
- Maria T Pena
- US Department of Health and Human Services, Health Resources and Services Administration, Healthcare System Bureau, National Hansen's Disease Program, Baton Rouge, LA, 70803, USA
- Stephan Fischer
- Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA
- Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, Paris, F-75015, France
- Megan Crow
- Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA
- Genentech, Inc., South San Francisco, CA, USA
- Leon French
- Physiology Department and Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON, Canada
- Linda B Adams
- US Department of Health and Human Services, Health Resources and Services Administration, Healthcare System Bureau, National Hansen's Disease Program, Baton Rouge, LA, 70803, USA
- Jesse Gillis
- Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA.
- Physiology Department and Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON, Canada.
3
Stavrou MR, So SS, Finch AM, Ballouz S, Smith NJ. Gene expression analyses of TAS1R taste receptors relevant to the treatment of cardiometabolic disease. Chem Senses 2023; 48:bjad027. [PMID: 37539767 DOI: 10.1093/chemse/bjad027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2022] [Indexed: 08/05/2023] Open
Abstract
The sweet taste receptor (STR) is a G protein-coupled receptor (GPCR) responsible for mediating cellular responses to sweet stimuli. Early evidence suggests that elements of the STR signaling system are present beyond the tongue in metabolically active tissues, where it may act as an extraoral glucose sensor. This study aimed to delineate expression of the STR in extraoral tissues using publicly available RNA-sequencing repositories. Gene expression data was mined for all genes implicated in the structure and function of the STR, and control genes including highly expressed metabolic genes in relevant tissues, other GPCRs and effector G proteins with physiological roles in metabolism, and other GPCRs with expression exclusively outside the metabolic tissues. Since the physiological role of the STR in extraoral tissues is likely related to glucose sensing, expression was then examined in diseases related to glucose-sensing impairment such as type 2 diabetes. An aggregate co-expression network was then generated to precisely determine co-expression patterns among the STR genes in these tissues. We found that STR gene expression was negligible in human pancreatic and adipose tissues, and low in intestinal tissue. Genes encoding the STR did not show significant co-expression or connectivity with other functional genes in these tissues. In addition, STR expression was higher in mouse pancreatic and adipose tissues, and equivalent to human in intestinal tissue. Our results suggest that STR expression in mice is not representative of expression in humans, and the receptor is unlikely to be a promising extraoral target in human cardiometabolic disease.
Affiliation(s)
- Mariah R Stavrou
- Orphan Receptor Laboratory, School of Biomedical Sciences, Faculty of Medicine and Health, UNSW Sydney, Sydney, NSW, Australia
- Sean Souchiart So
- Orphan Receptor Laboratory, School of Biomedical Sciences, Faculty of Medicine and Health, UNSW Sydney, Sydney, NSW, Australia
- Angela M Finch
- Department of Pharmacology, School of Biomedical Sciences, Faculty of Medicine and Health, UNSW Sydney, Sydney, NSW, Australia
- Sara Ballouz
- Garvan-Weizmann Centre for Cellular Genomics, Garvan Institute of Medical Research, Sydney, NSW, Australia
- School of Computer Science and Engineering, Faculty of Engineering, UNSW Sydney, Sydney, NSW, Australia
- Nicola J Smith
- Orphan Receptor Laboratory, School of Biomedical Sciences, Faculty of Medicine and Health, UNSW Sydney, Sydney, NSW, Australia
4
Werner JM, Ballouz S, Hover J, Gillis J. Variability of cross-tissue X-chromosome inactivation characterizes timing of human embryonic lineage specification events. Dev Cell 2022; 57:1995-2008.e5. [PMID: 35914524 PMCID: PMC9398941 DOI: 10.1016/j.devcel.2022.07.007] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 10/18/2021] [Revised: 05/11/2022] [Accepted: 07/07/2022] [Indexed: 12/14/2022]
Abstract
X-chromosome inactivation (XCI) is a random, permanent, and developmentally early epigenetic event that occurs during mammalian embryogenesis. We harness these features to investigate characteristics of early lineage specification events during human development. We initially assess the consistency of X-inactivation and establish a robust set of XCI-escape genes. By analyzing variance in XCI ratios across tissues and individuals, we find that XCI is shared across all tissues, suggesting that XCI is completed in the epiblast (in at least 6-16 cells) prior to specification of the germ layers. Additionally, we exploit tissue-specific variability to characterize the number of cells present during tissue-lineage commitment, ranging from approximately 20 cells in liver and whole blood tissues to 80 cells in brain tissues. By investigating the variability of XCI ratios using adult tissue, we characterize embryonic features of human XCI and lineage specification that are otherwise difficult to ascertain experimentally.
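The link between XCI-ratio variability and cell number can be illustrated with a simple binomial model. The sketch below is an assumption made for illustration, not the authors' published estimator: if each of n founder cells independently inactivates one parental X with probability p, the XCI ratio across individuals has variance p(1 - p)/n, so n can be estimated as p(1 - p) divided by the observed variance. The function name and the sample ratios are hypothetical.

```python
from statistics import pvariance


def estimate_founder_cells(xci_ratios, p=0.5):
    """Estimate the number of founder cells from XCI ratios across individuals.

    Assumes a plain binomial model (an illustrative simplification):
    Var(ratio) = p * (1 - p) / n, hence n = p * (1 - p) / Var(ratio).
    """
    variance = pvariance(xci_ratios)
    if variance == 0:
        raise ValueError("XCI ratios show no variance; n is unbounded")
    return p * (1 - p) / variance


# Hypothetical XCI ratios measured in one tissue across four individuals.
ratios = [0.4, 0.6, 0.5, 0.5]
print(round(estimate_founder_cells(ratios)))
```

Under this model, tissues whose XCI ratios vary more between individuals imply fewer cells at the time of commitment, which is the direction of the inference described in the abstract.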
Affiliation(s)
- Jonathan M Werner
- The Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
- Sara Ballouz
- The Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA; Garvan-Weizmann Centre for Cellular Genomics, Garvan Institute of Medical Research, Darlinghurst, Sydney, NSW Australia
- John Hover
- The Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
- Jesse Gillis
- The Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA; Physiology Department and Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON, Canada.
5
Muzumdar S, Ballouz S, Lam F, Degrange M, Kreuzburg S, Chong H, Zerbe C, Jongco A, Gillis J. A granular view of X-linked chronic granulomatous disease exploiting single-cell transcriptomics. The Journal of Immunology 2022. [DOI: 10.4049/jimmunol.208.supp.159.04] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/04/2023]
Abstract
X-linked chronic granulomatous disease (X-CGD) is a rare monogenetic immunodeficiency primarily affecting phagocytes. Precipitated by mutations in the CYBB gene, patients exhibit a compromised oxidative burst, leading to recurrent infections which can be life-threatening. Curiously, autoimmune manifestations are also common in patients and carriers. Here, exploiting the cell type-specific nature of this disorder, we characterize X-CGD on a transcriptional level using single-cell sequencing. Peripheral blood from 14 X-CGD probands and 10 carriers signed onto IRB approved protocol NCT00404560, as well as from 15 controls was sampled, and PBMCs and isolated monocytes were subjected to single-cell sequencing. Probands exhibited a strong differential expression signal relative to controls. This was composed of not only genes previously described to be up-regulated in X-CGD such as IFI27, and indeed an autoimmunity-associated broader type I interferon response, but also previously undescribed genes involved in monocyte function (ARG1), antimicrobial proteins (CAMP, SLPI), and inflammasome components (AIM2). Surprisingly, expression variability was not greater in carriers relative to probands or controls, indicating a lack of cell autonomous effects from the deletion of CYBB. Interestingly, aggregate expression of differentially expressed genes in the probands was able to classify carriers from sex-matched controls with high accuracy (AUROC = 0.92), indicating the presence of an X-CGD-specific gene signature. This gene signature was also strongly co-expressed across 17 chordate species, pointing towards the disruption of ancestral pathways important in antimicrobial immunity in X-CGD probands and carriers.
This work was partially supported by a Swiss National Science Foundation fellowship to S.M.
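The AUROC reported in the abstract summarizes how well a single aggregate score separates two groups. As a minimal sketch of how such a value is computed (the scores and labels below are invented, and this is not the authors' pipeline), the AUROC equals the fraction of positive/negative pairs in which the positive sample has the higher score, with ties counted as half:

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive (label 1) scores
    higher than a randomly chosen negative (label 0); ties count as half.
    """
    positives = [s for s, lab in zip(scores, labels) if lab == 1]
    negatives = [s for s, lab in zip(scores, labels) if lab == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical aggregate signature scores: 1 = carrier, 0 = control.
scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(auroc(scores, labels))
```

An AUROC of 0.5 corresponds to chance-level separation and 1.0 to perfect separation.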
Affiliation(s)
- Sara Ballouz
- Garvan Inst. of Med. Res., Australia
- Fung Lam
- Feinstein Inst. for Med. Res., Northwell Hlth
6
Kaminow B, Ballouz S, Gillis J, Dobin A. Pan-human consensus genome significantly improves the accuracy of RNA-seq analyses. Genome Res 2022; 32:738-749. [PMID: 35256454 PMCID: PMC8997357 DOI: 10.1101/gr.275613.121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/07/2021] [Accepted: 03/02/2022] [Indexed: 11/25/2022]
Abstract
The Human Reference Genome serves as the foundation for modern genomic analyses. However, in its present form, it does not adequately represent the vast genetic diversity of the human population. In this study, we explored the consensus genome as a potential successor of the current reference genome and assessed its effect on the accuracy of RNA-seq read alignment. In order to find the best haploid genome representation, we constructed consensus genomes at the pan-human, super-population, and population levels, utilizing variant information from the 1000 Genomes Project. Using personal haploid genomes as the ground truth, we compared mapping errors for real RNA-seq reads aligned to the consensus genomes versus the reference genome. For reads overlapping homozygous variants, we found that the mapping error decreased by a factor of ~2-3 when the reference was replaced with the pan-human consensus genome. We also found that using more population-specific consensuses resulted in little to no increase over using the pan-human consensus, suggesting a limit in the utility of incorporating more specific genomic variation. Replacing reference with consensus genomes impacts functional analyses, such as differential expression of isoforms, genes, and splice junctions.
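The idea of a consensus genome can be sketched as replacing each reference allele with the alternate allele wherever the alternate is the majority allele in the chosen population. The example below is a simplified illustration under stated assumptions: it handles single-nucleotide variants only, uses 0-based positions, and the function name, variant tuples, and 0.5 threshold are hypothetical rather than taken from the paper.

```python
def build_consensus(reference, variants, threshold=0.5):
    """Return a haploid consensus sequence from a reference and an SNV list.

    Each variant is (position, ref_allele, alt_allele, alt_frequency) with a
    0-based position. Only single-nucleotide variants are handled here.
    """
    bases = list(reference)
    for pos, ref, alt, freq in variants:
        if bases[pos] != ref:
            raise ValueError(f"reference mismatch at position {pos}")
        # Adopt the alternate allele only when it is the majority allele.
        if freq > threshold:
            bases[pos] = alt
    return "".join(bases)


# Toy reference and allele frequencies (invented for illustration).
reference = "ACGT"
variants = [(1, "C", "T", 0.7), (3, "T", "G", 0.2)]
print(build_consensus(reference, variants))  # prints "ATGT"
```

Computing the frequencies from all samples, from one super-population, or from one population would give the three levels of consensus that the abstract compares.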
Affiliation(s)
- Benjamin Kaminow
- Cold Spring Harbor Laboratory; Weill Cornell Graduate School of Medical Sciences
- Sara Ballouz
- Garvan-Weizmann Centre for Cellular Genomics, Garvan Institute of Medical Research; School of Medical Sciences, University of New South Wales; Cold Spring Harbor Laboratory
7
Lee J, Shah M, Ballouz S, Crow M, Gillis J. CoCoCoNet: conserved and comparative co-expression across a diverse set of species. Nucleic Acids Res 2020; 48:W566-W571. [PMID: 32392296 PMCID: PMC7319556 DOI: 10.1093/nar/gkaa348] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 03/12/2020] [Revised: 04/21/2020] [Accepted: 04/24/2020] [Indexed: 12/19/2022] Open
Abstract
Co-expression analysis has provided insight into gene function in organisms from Arabidopsis to zebrafish. Comparison across species has the potential to enrich these results, for example by prioritizing among candidate human disease genes based on their network properties or by finding alternative model systems where their co-expression is conserved. Here, we present CoCoCoNet as a tool for identifying conserved gene modules and comparing co-expression networks. CoCoCoNet is a resource for both data and methods, providing gold standard networks and sophisticated tools for on-the-fly comparative analyses across 14 species. We show how CoCoCoNet can be used in two use cases. In the first, we demonstrate deep conservation of a nucleolus gene module across very divergent organisms, and in the second, we show how the heterogeneity of autism mechanisms in humans can be broken down by functional groups and translated to model organisms. CoCoCoNet is free to use and available to all at https://milton.cshl.edu/CoCoCoNet, with data and R scripts available at ftp://milton.cshl.edu/data.
Affiliation(s)
- John Lee
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 500 Sunnyside Blvd., Woodbury, NY 11797, USA
- Manthan Shah
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 500 Sunnyside Blvd., Woodbury, NY 11797, USA
- Sara Ballouz
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 500 Sunnyside Blvd., Woodbury, NY 11797, USA
- Megan Crow
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 500 Sunnyside Blvd., Woodbury, NY 11797, USA
- Jesse Gillis
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 500 Sunnyside Blvd., Woodbury, NY 11797, USA
8
Pang CNI, Ballouz S, Weissberger D, Thibaut LM, Hamey JJ, Gillis J, Wilkins MR, Hart-Smith G. Analytical Guidelines for co-fractionation Mass Spectrometry Obtained through Global Profiling of Gold Standard Saccharomyces cerevisiae Protein Complexes. Mol Cell Proteomics 2020; 19:1876-1895. [PMID: 32817346 PMCID: PMC7664123 DOI: 10.1074/mcp.ra120.002154] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 06/02/2020] [Revised: 07/14/2020] [Indexed: 11/06/2022] Open
Abstract
Co-fractionation MS (CF-MS) is a technique with potential to characterize endogenous and unmanipulated protein complexes on an unprecedented scale. However, this potential has been offset by a lack of guidelines for best-practice CF-MS data collection and analysis. To obtain such guidelines, this study thoroughly evaluates novel and published Saccharomyces cerevisiae CF-MS data sets using very high proteome coverage libraries of yeast gold standard complexes. A new method for identifying gold standard complexes in CF-MS data, Reference Complex Profiling, and the Extending 'Guilt-by-Association' by Degree (EGAD) R package are used for these evaluations, which are verified with concurrent analyses of published human data. By evaluating data collection designs, which involve fractionation of cell lysates, it is found that near-maximum recall of complexes can be achieved with fewer samples than published studies. Distributing sample collection across orthogonal fractionation methods, rather than a single high resolution data set, leads to particularly efficient recall. By evaluating 17 different similarity scoring metrics, which are central to CF-MS data analysis, it is found that two metrics rarely used in past CF-MS studies (Spearman and Kendall correlations) and the recently introduced Co-apex metric frequently maximize recall, whereas a popular metric, Euclidean distance, delivers poor recall. The common practice of integrating external genomic data into CF-MS data analysis is also evaluated, revealing that this practice may improve the precision and recall of known complexes but is generally unsuitable for predicting novel complexes in model organisms. If studying nonmodel organisms using orthologous genomic data, it is found that particular subsets of fractionation profiles (e.g. the lowest abundance quartile) should be excluded to minimize false discovery.
These assessments are summarized in a series of universally applicable guidelines for precise, sensitive and efficient CF-MS studies of known complexes, and effective predictions of novel complexes for orthogonal experimental validation.
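To make the contrast between similarity metrics concrete, the sketch below scores two invented elution profiles with Spearman correlation and with Euclidean distance. It is a toy illustration under stated assumptions, not the study's evaluation code: the profiles, function names, and values are hypothetical, and Spearman is implemented here as Pearson correlation on average ranks.

```python
import math


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical profiles: both proteins peak in the same fraction, but
# protein_b is ten times more abundant.
protein_a = [0, 1, 5, 20, 6, 2]
protein_b = [0, 10, 50, 200, 60, 20]
print(spearman(protein_a, protein_b))
print(euclidean(protein_a, protein_b))
```

In this toy case the rank correlation is at its maximum because the profiles rise and fall together, while the Euclidean distance is large purely because of the abundance difference. That is one plausible reason a distance on raw intensities could recall co-eluting complex members poorly, although the abstract itself does not state the cause.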
Affiliation(s)
- Chi Nam Ignatius Pang
- Systems Biology Initiative, School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, New South Wales, Australia
- Sara Ballouz
- Garvan Institute of Medical Research, Darlinghurst, Sydney, New South Wales, Australia
- Daniel Weissberger
- School of Chemistry, University of New South Wales, Sydney, New South Wales, Australia
- Loïc M Thibaut
- School of Mathematics and Statistics, University of New South Wales, Sydney, New South Wales, Australia
- Joshua J Hamey
- Systems Biology Initiative, School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, New South Wales, Australia
- Jesse Gillis
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Woodbury, New York, USA
- Marc R Wilkins
- Systems Biology Initiative, School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, New South Wales, Australia
- Gene Hart-Smith
- Systems Biology Initiative, School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, New South Wales, Australia; Department of Molecular Sciences, Macquarie University, Sydney, New South Wales, Australia.
9
Ballouz S, Mangala MM, Perry MD, Heitmann S, Gillis JA, Hill AP, Vandenberg JI. Co-expression of calcium and hERG potassium channels reduces the incidence of proarrhythmic events. Cardiovasc Res 2020; 117:2216-2227. [PMID: 33002116 DOI: 10.1093/cvr/cvaa280] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 06/16/2020] [Revised: 08/25/2020] [Accepted: 09/17/2020] [Indexed: 01/02/2023] Open
Abstract
AIMS: Cardiac electrical activity is extraordinarily robust. However, when it goes wrong it can have fatal consequences. Electrical activity in the heart is controlled by the carefully orchestrated activity of more than a dozen different ion conductances. While there is considerable variability in cardiac ion channel expression levels between individuals, studies in rodents have indicated that there are modules of ion channels whose expression co-vary. The aim of this study was to investigate whether meta-analytic co-expression analysis of large-scale gene expression datasets could identify modules of co-expressed cardiac ion channel genes in human hearts that are of functional importance.
METHODS AND RESULTS: Meta-analysis of 3653 public human RNA-seq datasets identified a strong correlation between expression of CACNA1C (L-type calcium current, ICaL) and KCNH2 (rapid delayed rectifier K+ current, IKr), which was also observed in human adult heart tissue samples. In silico modelling suggested that co-expression of CACNA1C and KCNH2 would limit the variability in action potential duration seen with variations in expression of ion channel genes and reduce susceptibility to early afterdepolarizations, a surrogate marker for proarrhythmia. We also found that levels of KCNH2 and CACNA1C expression are correlated in human-induced pluripotent stem cell-derived cardiac myocytes and the levels of CACNA1C and KCNH2 expression were inversely correlated with the magnitude of changes in repolarization duration following inhibition of IKr.
CONCLUSION: Meta-analytic approaches of multiple independent human gene expression datasets can be used to identify gene modules that are important for regulating heart function. Specifically, we have verified that there is co-expression of CACNA1C and KCNH2 ion channel genes in human heart tissue, and in silico analyses suggest that CACNA1C-KCNH2 co-expression increases the robustness of cardiac electrical activity.
Affiliation(s)
- Sara Ballouz
- Garvan-Weizmann Centre for Cellular Genomics, Garvan Institute of Medical Research, 384 Victoria Street, Darlinghurst NSW 2010, Australia; University of New South Wales, Sydney, Kensington, NSW 2052, Australia; Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, One Bungtown Road, NY 11724, USA
- Melissa M Mangala
- Victor Chang Cardiac Research Institute, Lowy Packer Building, 405 Liverpool Street, Darlinghurst, New South Wales 2010, Australia
- Matthew D Perry
- University of New South Wales, Sydney, Kensington, NSW 2052, Australia; Victor Chang Cardiac Research Institute, Lowy Packer Building, 405 Liverpool Street, Darlinghurst, New South Wales 2010, Australia
- Stewart Heitmann
- Victor Chang Cardiac Research Institute, Lowy Packer Building, 405 Liverpool Street, Darlinghurst, New South Wales 2010, Australia
- Jesse A Gillis
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, One Bungtown Road, NY 11724, USA
- Adam P Hill
- University of New South Wales, Sydney, Kensington, NSW 2052, Australia; Victor Chang Cardiac Research Institute, Lowy Packer Building, 405 Liverpool Street, Darlinghurst, New South Wales 2010, Australia
- Jamie I Vandenberg
- University of New South Wales, Sydney, Kensington, NSW 2052, Australia; Victor Chang Cardiac Research Institute, Lowy Packer Building, 405 Liverpool Street, Darlinghurst, New South Wales 2010, Australia
10
Ballouz S, Dobin A, Gingeras TR, Gillis J. The fractured landscape of RNA-seq alignment: the default in our STARs. Nucleic Acids Res 2019; 46:5125-5138. [PMID: 29718481 PMCID: PMC6007662 DOI: 10.1093/nar/gky325] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 11/08/2017] [Accepted: 04/16/2018] [Indexed: 12/28/2022] Open
Abstract
Many tools are available for RNA-seq alignment and expression quantification, with comparative value being hard to establish. Benchmarking assessments often highlight methods’ good performance, but are focused on either model data or fail to explain variation in performance. This leaves us to ask, what is the most meaningful way to assess different alignment choices? And importantly, where is there room for progress? In this work, we explore the answers to these two questions by performing an exhaustive assessment of the STAR aligner. We assess STAR’s performance across a range of alignment parameters using common metrics, and then on biologically focused tasks. We find technical metrics such as fraction mapping or expression profile correlation to be uninformative, capturing properties unlikely to have any role in biological discovery. Surprisingly, we find that changes in alignment parameters within a wide range have little impact on both technical and biological performance. Yet, when performance finally does break, it happens in difficult regions, such as X-Y paralogs and MHC genes. We believe improved reporting by developers will help establish where results are likely to be robust or fragile, providing a better baseline to establish where methodological progress can still occur.
Affiliation(s)
- Sara Ballouz
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Woodbury, NY 11797, USA
- Alexander Dobin
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Woodbury, NY 11797, USA
- Thomas R Gingeras
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Woodbury, NY 11797, USA
- Jesse Gillis
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Woodbury, NY 11797, USA
11
Abstract
The use of the human reference genome has shaped methods and data across modern genomics. This has offered many benefits while creating a few constraints. In the following opinion, we outline the history, properties, and pitfalls of the current human reference genome. In a few illustrative analyses, we focus on its use for variant-calling, highlighting its nearness to a 'type specimen'. We suggest that switching to a consensus reference would offer important advantages over the continued use of the current reference with few disadvantages.
Affiliation(s)
- Sara Ballouz
- Cold Spring Harbor Laboratory, The Stanley Institute for Cognitive Genomics, Cold Spring Harbor, NY, 11724, USA
- Alexander Dobin
- Cold Spring Harbor Laboratory, The Stanley Institute for Cognitive Genomics, Cold Spring Harbor, NY, 11724, USA
| | - Jesse A Gillis
- Cold Spring Harbor Laboratory, The Stanley Institute for Cognitive Genomics, Cold Spring Harbor, NY, 11724, USA.
| |
Collapse
|
12
|
Ballouz S, Pavlidis P, Gillis J. Using predictive specificity to determine when gene set analysis is biologically meaningful. Nucleic Acids Res 2018; 45:e20. [PMID: 28204549 PMCID: PMC5389513 DOI: 10.1093/nar/gkw957] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 01/29/2016] [Revised: 10/04/2016] [Accepted: 10/10/2016] [Indexed: 11/14/2022] Open
Abstract
Gene set analysis, which translates gene lists into enriched functions, is among the most common bioinformatic methods. Yet few would advocate taking the results at face value. Not only is there no agreement on the algorithms themselves, there is no agreement on how to benchmark them. In this paper, we evaluate the robustness and uniqueness of enrichment results as a means of assessing methods even where correctness is unknown. We show that heavily annotated (‘multifunctional’) genes are likely to appear in genomics study results and drive the generation of biologically non-specific enrichment results as well as highly fragile significances. By providing a means of determining where enrichment analyses report non-specific and non-robust findings, we are able to assess where we can be confident in their use. We find significant progress in recent bias correction methods for enrichment and provide our own software implementation. Our approach can be readily adapted to any pre-existing package.
Affiliation(s)
- Sara Ballouz
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Woodbury, NY 11797, USA
- Paul Pavlidis
- Department of Psychiatry and Michael Smith Laboratories, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
- Jesse Gillis
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Woodbury, NY 11797, USA
|
13
|
Crow M, Paul A, Ballouz S, Huang ZJ, Gillis J. Characterizing the replicability of cell types defined by single cell RNA-sequencing data using MetaNeighbor. Nat Commun 2018; 9:884. [PMID: 29491377 PMCID: PMC5830442 DOI: 10.1038/s41467-018-03282-0] [Citation(s) in RCA: 142] [Impact Index Per Article: 23.7] [Received: 11/29/2017] [Accepted: 02/02/2018] [Indexed: 12/19/2022] Open
Abstract
Single-cell RNA-sequencing (scRNA-seq) technology provides a new avenue to discover and characterize cell types; however, the experiment-specific technical biases and analytic variability inherent to current pipelines may undermine its replicability. Meta-analysis is further hampered by the use of ad hoc naming conventions. Here we demonstrate our replication framework, MetaNeighbor, that quantifies the degree to which cell types replicate across datasets, and enables rapid identification of clusters with high similarity. We first measure the replicability of neuronal identity, comparing results across eight technically and biologically diverse datasets to define best practices for more complex assessments. We then apply this to novel interneuron subtypes, finding that 24/45 subtypes have evidence of replication, which enables the identification of robust candidate marker genes. Across tasks we find that large sets of variably expressed genes can identify replicable cell types with high accuracy, suggesting a general route forward for large-scale evaluation of scRNA-seq data.
Affiliation(s)
- Megan Crow
- Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY, 11724, USA
- Anirban Paul
- Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY, 11724, USA
- Sara Ballouz
- Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY, 11724, USA
- Z Josh Huang
- Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY, 11724, USA
- Jesse Gillis
- Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY, 11724, USA.
|
14
|
Ballouz S, Weber M, Pavlidis P, Gillis J. EGAD: ultra-fast functional analysis of gene networks. Bioinformatics 2018; 33:612-614. [PMID: 27993773 DOI: 10.1093/bioinformatics/btw695] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Received: 04/20/2016] [Accepted: 11/03/2016] [Indexed: 12/25/2022] Open
Abstract
Summary Evaluating gene networks with respect to known biology is a common task but often a computationally costly one. Many computational experiments are difficult to apply exhaustively in network analysis due to run-times. To permit high-throughput analysis of gene networks, we have implemented a set of very efficient tools to calculate functional properties in networks based on guilt-by-association methods. EGAD (Extending 'Guilt-by-Association' by Degree) allows gene networks to be evaluated with respect to hundreds or thousands of gene sets. The methods predict novel members of gene groups, assess how well a gene network groups known sets of genes, and determine the degree to which generic predictions drive performance. By allowing fast evaluations, whether of random sets or real functional ones, EGAD provides the user with an assessment of performance which can easily be used in controlled evaluations across many parameters. Availability and Implementation The software package is freely available at https://github.com/sarbal/EGAD and implemented for use in R and Matlab. The package is also freely available under the LGPL license from the Bioconductor web site (http://bioconductor.org). Contact JGillis@cshl.edu. Supplementary information Supplementary data are available at Bioinformatics online.
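The guilt-by-association evaluation summarized in this abstract can be illustrated with a minimal neighbor-voting sketch. This is not EGAD's code or API (the package itself is implemented in R and Matlab): the function names `neighbor_voting_scores` and `auroc`, the toy network, and the leave-one-out scheme are all assumptions made for illustration only.

```python
from typing import List, Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve by pairwise comparison; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative gene")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def neighbor_voting_scores(
    network: Sequence[Sequence[float]], labels: Sequence[int]
) -> List[float]:
    """Score each gene by the share of its edge weight that goes to labeled genes.

    A gene's own label is never used for its own score (j != i), so this is a
    simple leave-one-out variant (an assumption of this sketch).
    """
    n = len(network)
    scores = []
    for i in range(n):
        degree = sum(network[i][j] for j in range(n) if j != i)
        votes = sum(network[i][j] * labels[j] for j in range(n) if j != i)
        scores.append(votes / degree if degree > 0 else 0.0)
    return scores


# Hypothetical toy network: genes 0-2 and genes 3-5 form two tightly linked modules.
network = [[1.0 if (i < 3) == (j < 3) else 0.1 for j in range(6)] for i in range(6)]
labels = [1, 1, 1, 0, 0, 0]  # genes 0-2 belong to a hypothetical gene set

scores = neighbor_voting_scores(network, labels)
print(round(auroc(scores, labels), 2))  # 1.0: this toy network separates the set perfectly
```

An AUROC near 0.5 would indicate that the network carries no information about the gene set. Repeating the calculation over many gene sets is the kind of high-throughput evaluation the abstract describes.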
Affiliation(s)
- Sara Ballouz
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Woodbury, NY 11797, USA
- Melanie Weber
- Department of Mathematics and Computer Science, University of Leipzig, Leipzig, Germany
- Paul Pavlidis
- Department of Psychiatry and Michael Smith Laboratories, University of British Columbia, Vancouver, Canada
- Jesse Gillis
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Woodbury, NY 11797, USA
|
15
|
Abstract
BACKGROUND Disagreements over genetic signatures associated with disease have been particularly prominent in the field of psychiatric genetics, creating a sharp divide between disease burdens attributed to common and rare variation, with study designs independently targeting each. Meta-analysis within each of these study designs is routine, whether using raw data or summary statistics, but combining results across study designs is atypical. However, tests of functional convergence are used across all study designs, where candidate gene sets are assessed for overlaps with previously known properties. This suggests one possible avenue for combining not study data, but the functional conclusions that they reach. METHOD In this work, we test for functional convergence in autism spectrum disorder (ASD) across different study types, and specifically whether the degree to which a gene is implicated in autism is correlated with the degree to which it drives functional convergence. Because different study designs are distinguishable by their differences in effect size, this also provides a unified means of incorporating the impact of study design into the analysis of convergence. RESULTS We detected remarkably significant positive trends in aggregate (p < 2.2e-16) with 14 individually significant properties (false discovery rate <0.01), many in areas researchers have targeted based on different reasoning, such as the fragile X mental retardation protein (FMRP) interactor enrichment (false discovery rate 0.003). We are also able to detect novel technical effects and we see that network enrichment from protein-protein interaction data is heavily confounded with study design, arising readily in control data. CONCLUSIONS We see a convergent functional signal for a subset of known and novel functions in ASD from all sources of genetic variation. 
Meta-analytic approaches explicitly accounting for different study designs can be adapted to other diseases to discover novel functional associations and increase statistical power.
Affiliation(s)
- Sara Ballouz
- The Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724 USA
- Jesse Gillis
- The Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724 USA
|
16
|
O’Meara MJ, Ballouz S, Shoichet BK, Gillis J. Ligand Similarity Complements Sequence, Physical Interaction, and Co-Expression for Gene Function Prediction. PLoS One 2016; 11:e0160098. [PMID: 27467773 PMCID: PMC4965129 DOI: 10.1371/journal.pone.0160098] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 04/12/2016] [Accepted: 07/13/2016] [Indexed: 12/13/2022] Open
Abstract
The expansion of protein-ligand annotation databases has enabled large-scale networking of proteins by ligand similarity. These ligand-based protein networks, which implicitly predict the ability of neighboring proteins to bind related ligands, may complement biologically-oriented gene networks, which are used to predict functional or disease relevance. To quantify the degree to which such ligand-based protein associations might complement functional genomic associations, including sequence similarity, physical protein-protein interactions, co-expression, and disease gene annotations, we calculated a network based on the Similarity Ensemble Approach (SEA: sea.docking.org), where protein neighbors reflect the similarity of their ligands. We also measured the similarity with functional genomic networks over a common set of 1,131 genes, and found that the networks had only small overlaps, which were significant only due to the large scale of the data. Consistent with the view that the networks contain different information, combining them substantially improved Molecular Function prediction within GO (from AUROC~0.63–0.75 for the individual data modalities to AUROC~0.8 in the aggregate). We investigated the boost in guilt-by-association gene function prediction when the networks are combined and describe underlying properties that can be further exploited.
Affiliation(s)
- Matthew J. O’Meara
- Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, California, 94158–2550, United States of America
- Sara Ballouz
- Cold Spring Harbor Laboratory, Stanley Institute for Cognitive Genomics, 500 Sunnyside Boulevard, Woodbury, NY, 11797, United States of America
- Brian K. Shoichet
- Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, California, 94158–2550, United States of America
- Jesse Gillis
- Cold Spring Harbor Laboratory, Stanley Institute for Cognitive Genomics, 500 Sunnyside Boulevard, Woodbury, NY, 11797, United States of America
|
17
|
Crow M, Paul A, Ballouz S, Huang ZJ, Gillis J. Exploiting single-cell expression to characterize co-expression replicability. Genome Biol 2016; 17:101. [PMID: 27165153 PMCID: PMC4862082 DOI: 10.1186/s13059-016-0964-6] [Citation(s) in RCA: 59] [Impact Index Per Article: 7.4] [Received: 02/29/2016] [Accepted: 04/25/2016] [Indexed: 01/25/2023] Open
Abstract
Background Co-expression networks have been a useful tool for functional genomics, providing important clues about the cellular and biochemical mechanisms that are active in normal and disease processes. However, co-expression analysis is often treated as a black box with results being hard to trace to their basis in the data. Here, we use both published and novel single-cell RNA sequencing (RNA-seq) data to understand fundamental drivers of gene-gene connectivity and replicability in co-expression networks. Results We perform the first major analysis of single-cell co-expression, sampling from 31 individual studies. Using neighbor voting in cross-validation, we find that single-cell network connectivity is less likely to overlap with known functions than co-expression derived from bulk data, with functional variation within cell types strongly resembling that also occurring across cell types. To identify features and analysis practices that contribute to this connectivity, we perform our own single-cell RNA-seq experiment of 126 cortical interneurons in an experimental design targeted to co-expression. By assessing network replicability, semantic similarity and overall functional connectivity, we identify technical factors influencing co-expression and suggest how they can be controlled for. Many of the technical effects we identify are expression-level dependent, making expression level itself highly predictive of network topology. We show this occurs generally through re-analysis of the BrainSpan RNA-seq data. Conclusions Technical properties of single-cell RNA-seq data create confounds in co-expression networks which can be identified and explicitly controlled for in any supervised analysis. This is useful both in improving co-expression performance and in characterizing single-cell data in generally applicable terms, permitting cross-laboratory comparison within a common framework. 
Affiliation(s)
- Megan Crow
- Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY, 11724, USA
- Anirban Paul
- Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY, 11724, USA
- Sara Ballouz
- Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY, 11724, USA
- Z Josh Huang
- Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY, 11724, USA
- Jesse Gillis
- Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY, 11724, USA.
|
18
|
Abstract
In addition to detecting novel transcripts and higher dynamic range, a principal claim for RNA-sequencing has been greater replicability, typically measured in sample-sample correlations of gene expression levels. Through a re-analysis of ENCODE data, we show that replicability of transcript abundances will provide misleading estimates of the replicability of conditional variation in transcript abundances (i.e., most expression experiments). Heuristics which implicitly address this problem have emerged in quality control measures to obtain ‘good’ differential expression results. However, these methods involve strict filters such as discarding low expressing genes or using technical replicates to remove discordant transcripts, and are costly or simply ad hoc. As an alternative, we model gene-level replicability of differential activity using co-expressing genes. We find that sets of housekeeping interactions provide a sensitive means of estimating the replicability of expression changes, where the co-expressing pair can be regarded as pseudo-replicates of one another. We model the effects of noise that perturbs a gene’s expression within its usual distribution of values and show that perturbing expression by only 5% within that range is readily detectable (AUROC~0.73). We have made our method available as a set of easily implemented R scripts. RNA-sequencing has become a popular means to detect the expression levels of genes. However, quality control is still challenging, requiring both extreme measures and rules which are set in stone from extensive previous analysis. Instead of relying on these rules, we show that co-expression can be used to measure biological replicability with extremely high precision. Co-expression is a well-studied phenomenon in which two genes that are known to form a functional unit are also expressed at similar levels, and change in similar ways across conditions. 
Using this concept, we can detect how well an experiment replicates by measuring how well it has retained the co-expression pattern across defined gene-pairs. We do this by measuring how easy it is to detect a sample to which some noise has been added. We show this method is a useful tool for quality control.
Affiliation(s)
- Sara Ballouz
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- Jesse Gillis
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
|
19
|
Verleyen W, Ballouz S, Gillis J. Positive and negative forms of replicability in gene network analysis. Bioinformatics 2015; 32:1065-73. [PMID: 26668004 DOI: 10.1093/bioinformatics/btv734] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 07/13/2015] [Accepted: 12/09/2015] [Indexed: 02/07/2023] Open
Abstract
MOTIVATION Gene networks have become a central tool in the analysis of genomic data but are widely regarded as hard to interpret. This has motivated a great deal of comparative evaluation and research into best practices. We explore the possibility that this may lead to overfitting in the field as a whole. RESULTS We construct a model of 'research communities' sampling from real gene network data and machine learning methods to characterize performance trends. Our analysis reveals an important principle limiting the value of replication, namely that targeting it directly causes 'easy' or uninformative replication to dominate analyses. We find that when sampling across network data and algorithms with similar variability, the relationship between replicability and accuracy is positive (Spearman's correlation, rs ∼0.33) but where no such constraint is imposed, the relationship becomes negative for a given gene function (rs ∼ -0.13). We predict factors driving replicability in some prior analyses of gene networks and show that they are unconnected with the correctness of the original result, instead reflecting replicable biases. Without these biases, the original results also vanish replicably. We show these effects can occur quite far upstream in network data and that there is a strong tendency within protein-protein interaction data for highly replicable interactions to be associated with poor quality control. AVAILABILITY AND IMPLEMENTATION Algorithms, network data and a guide to the code available at: https://github.com/wimverleyen/AggregateGeneFunctionPrediction CONTACT jgillis@cshl.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- W Verleyen
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 500 Sunnyside Boulevard Woodbury, NY 11797, USA
- S Ballouz
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 500 Sunnyside Boulevard Woodbury, NY 11797, USA
- J Gillis
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 500 Sunnyside Boulevard Woodbury, NY 11797, USA
|
20
|
Ballouz S, Verleyen W, Gillis J. Guidance for RNA-seq co-expression network construction and analysis: safety in numbers. Bioinformatics 2015; 31:2123-30. [PMID: 25717192 DOI: 10.1093/bioinformatics/btv118] [Citation(s) in RCA: 134] [Impact Index Per Article: 14.9] [Received: 06/19/2014] [Accepted: 02/19/2015] [Indexed: 12/11/2022]
Abstract
MOTIVATION RNA-seq co-expression analysis is in its infancy and reasonable practices remain poorly defined. We assessed a variety of RNA-seq expression data to determine factors affecting functional connectivity and topology in co-expression networks. RESULTS We examine RNA-seq co-expression data generated from 1970 RNA-seq samples using a Guilt-By-Association framework, in which genes are assessed for the tendency of co-expression to reflect shared function. Minimal experimental criteria to obtain performance on par with microarrays were >20 samples with read depth >10 M per sample. While the aggregate network constructed shows good performance (area under the receiver operator characteristic curve ∼0.71), the dependency on number of experiments used is nearly identical to that present in microarrays, suggesting thousands of samples are required to obtain 'gold-standard' co-expression. We find a major topological difference between RNA-seq and microarray co-expression in the form of low overlaps between hub-like genes from each network due to changes in the correlation of expression noise within each technology. CONTACT jgillis@cshl.edu or sballouz@cshl.edu SUPPLEMENTARY INFORMATION Networks are available at: http://gillislab.labsites.cshl.edu/supplements/rna-seq-networks/ and supplementary data are available at Bioinformatics online.
Affiliation(s)
- S Ballouz
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 500 Sunnyside Boulevard Woodbury, NY 11797, USA
- W Verleyen
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 500 Sunnyside Boulevard Woodbury, NY 11797, USA
- J Gillis
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 500 Sunnyside Boulevard Woodbury, NY 11797, USA
|
21
|
|
22
|
Grover MP, Ballouz S, Mohanasundaram KA, George RA, Sherman CDH, Crowley TM, Wouters MA. Identification of novel therapeutics for complex diseases from genome-wide association data. BMC Med Genomics 2014; 7 Suppl 1:S8. [PMID: 25077696 PMCID: PMC4101352 DOI: 10.1186/1755-8794-7-s1-s8] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.6] [Indexed: 12/03/2022] Open
Abstract
Background Human genome sequencing has enabled the association of phenotypes with genetic loci, but our ability to effectively translate this data to the clinic has not kept pace. Over the past 60 years, pharmaceutical companies have successfully demonstrated the safety and efficacy of over 1,200 novel therapeutic drugs via costly clinical studies. While this process must continue, better use can be made of the existing valuable data. In silico tools such as candidate gene prediction systems allow rapid identification of disease genes by identifying the most probable candidate genes linked to genetic markers of the disease or phenotype under investigation. Integration of drug-target data with candidate gene prediction systems can identify novel phenotypes which may benefit from current therapeutics. Such a drug repositioning tool can save valuable time and money spent on preclinical studies and phase I clinical trials. Methods We previously used Gentrepid (http://www.gentrepid.org) as a platform to predict 1,497 candidate genes for the seven complex diseases considered in the Wellcome Trust Case-Control Consortium genome-wide association study; namely Type 2 Diabetes, Bipolar Disorder, Crohn's Disease, Hypertension, Type 1 Diabetes, Coronary Artery Disease and Rheumatoid Arthritis. Here, we adopted a simple approach to integrate drug data from three publicly available drug databases: the Therapeutic Target Database, the Pharmacogenomics Knowledgebase and DrugBank; with candidate gene predictions from Gentrepid at the systems level. Results Using the publicly available drug databases as sources of drug-target association data, we identified a total of 428 candidate genes as novel therapeutic targets for the seven phenotypes of interest, and 2,130 drugs feasible for repositioning against the predicted novel targets. 
Conclusions By integrating genetic, bioinformatic and drug data, we have demonstrated that currently available drugs may be repositioned as novel therapeutics for the seven diseases studied here, quickly taking advantage of prior work in pharmaceutics to translate ground-breaking results in genetics to clinical treatments.
|
23
|
Gillis J, Ballouz S, Pavlidis P. Bias tradeoffs in the creation and analysis of protein-protein interaction networks. J Proteomics 2014; 100:44-54. [PMID: 24480284 DOI: 10.1016/j.jprot.2014.01.020] [Citation(s) in RCA: 53] [Impact Index Per Article: 5.3] [Received: 06/07/2013] [Revised: 01/13/2014] [Accepted: 01/17/2014] [Indexed: 02/04/2023]
Abstract
UNLABELLED Networks constructed from aggregated protein-protein interaction data are commonplace in biology. But the studies these data are derived from were conducted with their own hypotheses and foci. Focusing on data from budding yeast present in BioGRID, we determine that many of the downstream signals present in network data are significantly impacted by biases in the original data. We determine the degree to which selection bias in favor of biologically interesting bait proteins goes down with study size, while we also find that promiscuity in prey contributes more substantially in larger studies. We analyze interaction studies over time with respect to data in the Gene Ontology and find that reproducibly observed interactions are less likely to favor multifunctional proteins. We find that strong alignment between co-expression and protein-protein interaction data occurs only for extreme co-expression values, and use this data to suggest candidates for targets likely to reveal novel biology in follow-up studies. BIOLOGICAL SIGNIFICANCE Protein-protein interaction data finds particularly heavy use in the interpretation of disease-causal variants. In principle, network data allows researchers to find novel commonalities among candidate genes. In this study, we detail several of the most salient biases contributing to aggregated protein-protein interaction databases. We find strong evidence for the role of selection and laboratory biases. Many of these effects contribute to the commonalities researchers find for disease genes. In order for characterization of disease genes and their interactions to not simply be an artifact of researcher preference, it is imperative to identify data biases explicitly. Based on this, we also suggest ways to move forward in producing candidates less influenced by prior knowledge. This article is part of a Special Issue entitled: Can Proteomics Fill the Gap Between Genomics and Phenotypes?
Affiliation(s)
- Jesse Gillis
- Cold Spring Harbor Laboratory, Stanley Institute for Cognitive Genomics, 500 Sunnyside Boulevard, Woodbury, NY 11797, United States.
- Sara Ballouz
- Cold Spring Harbor Laboratory, Stanley Institute for Cognitive Genomics, 500 Sunnyside Boulevard, Woodbury, NY 11797, United States.
- Paul Pavlidis
- Department of Psychiatry and Centre for High-Throughput Biology, University of British Columbia, 2185 East Mall, Vancouver, BC V6T 1Z4, Canada.
|
24
|
Ballouz S, Liu JY, Oti M, Gaeta B, Fatkin D, Bahlo M, Wouters MA. Candidate disease gene prediction using Gentrepid: application to a genome-wide association study on coronary artery disease. Mol Genet Genomic Med 2013; 2:44-57. [PMID: 24498628 PMCID: PMC3907915 DOI: 10.1002/mgg3.40] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 12/05/2012] [Accepted: 08/19/2013] [Indexed: 12/12/2022] Open
Abstract
Current single-locus-based analyses and candidate disease gene prediction methodologies used in genome-wide association studies (GWAS) do not capitalize on the wealth of the underlying genetic data, nor functional data available from molecular biology. Here, we analyzed GWAS data from the Wellcome Trust Case Control Consortium (WTCCC) on coronary artery disease (CAD). Gentrepid uses a multiple-locus-based approach, drawing on protein pathway- or domain-based data to make predictions. Known disease genes may be used as additional information (seeded method) or predictions can be based entirely on GWAS single nucleotide polymorphisms (SNPs) (ab initio method). We looked in detail at specific predictions made by Gentrepid for CAD and compared these with known genetic data and the scientific literature. Gentrepid was able to extract known disease genes from the candidate search space and predict plausible novel disease genes from both known and novel WTCCC-implicated loci. The disease gene candidates are consistent with known biological information. The results demonstrate that this computational approach is feasible and a valuable discovery tool for geneticists.
Affiliation(s)
- Sara Ballouz
- Structural and Computational Biology Division, Victor Chang Cardiac Research Institute, Darlinghurst, NSW, 2010, Australia; School of Computer Science and Engineering, University of New South Wales, Kensington, NSW, 2052, Australia
- Jason Y Liu
- Structural and Computational Biology Division, Victor Chang Cardiac Research Institute, Darlinghurst, NSW, 2010, Australia
- Martin Oti
- Centre for Molecular and Biomolecular Informatics, Radboud University Nijmegen Medical Centre, Nijmegen, The Netherlands
- Bruno Gaeta
- School of Computer Science and Engineering, University of New South Wales, Kensington, NSW, 2052, Australia
- Diane Fatkin
- School of Medical Sciences, University of New South Wales, Kensington, NSW, 2052, Australia; Molecular Cardiology and Biophysics Division, Victor Chang Cardiac Research Institute, Darlinghurst, NSW, 2010, Australia
- Melanie Bahlo
- Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, 3052, Australia
- Merridee A Wouters
- School of Medicine, Deakin University, Geelong, VIC, 3217, Australia; School of Life and Environmental Sciences, Deakin University, Geelong, VIC, 3217, Australia
|
25
|
Ballouz S, Liu JY, George RA, Bains N, Liu A, Oti M, Gaeta B, Fatkin D, Wouters MA. Gentrepid V2.0: a web server for candidate disease gene prediction. BMC Bioinformatics 2013; 14:249. [PMID: 23947436 PMCID: PMC3844418 DOI: 10.1186/1471-2105-14-249] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 11/22/2012] [Accepted: 08/13/2013] [Indexed: 01/06/2023] Open
Abstract
Background: Candidate disease gene prediction is a rapidly developing area of bioinformatics research with the potential to deliver great benefits to human health. As experimental studies detecting associations between genetic intervals and disease proliferate, better bioinformatic techniques that can expand and exploit the data are required.
Description: Gentrepid is a web resource which predicts and prioritizes candidate disease genes for both Mendelian and complex diseases. The system can take input from linkage analysis of single genetic intervals or multiple marker loci from genome-wide association studies. The underlying database of the Gentrepid tool sources data from numerous gene and protein resources, taking advantage of the wealth of biological information available. Using known disease gene information from OMIM, the system predicts and prioritizes disease gene candidates that participate in the same protein pathways or share similar protein domains. Alternatively, using an ab initio approach, the system can detect enrichment of these protein annotations without prior knowledge of the phenotype.
Conclusions: The system aims to integrate the wealth of protein information currently available with known and novel phenotype/genotype information to acquire knowledge of biological mechanisms underpinning disease. We have updated the system to facilitate analysis of GWAS data and the study of complex diseases. Application of the system to GWAS data on hypertension using the ICBP data is provided as an example. An interesting prediction is a ZIP transporter additional to the one found by the ICBP analysis. The webserver URL is https://www.gentrepid.org/.
Affiliation(s)
- Sara Ballouz
- School of Medicine, Deakin University, Geelong, VIC 3217, Australia.
|
26
|
Ballouz S, Liu JY, Oti M, Gaeta B, Fatkin D, Bahlo M, Wouters MA. Analysis of genome-wide association study data using the protein knowledge base. BMC Genet 2011; 12:98. [PMID: 22077927 PMCID: PMC3261104 DOI: 10.1186/1471-2156-12-98] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 03/30/2011] [Accepted: 11/13/2011] [Indexed: 12/25/2022] Open
Abstract
Background: Genome-wide association studies (GWAS) aim to identify causal variants and genes for complex disease by independently testing a large number of SNP markers for disease association. Although genes have been implicated in these studies, few utilise the multiple-hit model of complex disease to identify causal candidates. A major benefit of multi-locus comparison is that it compensates for some shortcomings of current statistical analyses that test the frequency of each SNP in isolation for the phenotype population versus control.
Results: Here we developed and benchmarked several protocols for GWAS data analysis using different in-silico gene prediction and prioritisation methodologies. We adopted a high sensitivity approach to the data, using less conservative statistical SNP associations. Multiple gene search spaces, either of fixed-widths or proximity-based, were generated around each SNP marker. We used the candidate disease gene prediction system Gentrepid to identify candidates based on shared biomolecular pathways or domain-based protein homology. Predictions were made either with phenotype-specific known disease genes as input; or without a priori knowledge, by exhaustive comparison of genes in distinct loci. Because Gentrepid uses biomolecular data to find interactions and common features between genes in distinct loci of the search spaces, it takes advantage of the multi-locus aspect of the data.
Conclusions: Results suggest testing multiple SNP-to-gene search spaces compensates for differences in phenotypes, populations and SNP platforms. Surprisingly, domain-based homology information was more informative when benchmarked against gene candidates reported by GWA studies compared to previously determined disease genes, possibly suggesting a larger contribution of gene homologs to complex diseases than Mendelian diseases.
Affiliation(s)
- Sara Ballouz
- Structural and Computational Biology Division, Victor Chang Cardiac Research Institute, Darlinghurst, NSW, 2010, Australia
- School of Computer Science and Engineering, University of New South Wales, Kensington, NSW, 2052, Australia
- Jason Y Liu
- Structural and Computational Biology Division, Victor Chang Cardiac Research Institute, Darlinghurst, NSW, 2010, Australia
- Martin Oti
- Centre for Molecular and Biomolecular Informatics, Radboud University Nijmegen Medical Centre, Nijmegen, The Netherlands
- Bruno Gaeta
- School of Computer Science and Engineering, University of New South Wales, Kensington, NSW, 2052, Australia
- Diane Fatkin
- School of Medical Sciences, University of New South Wales, Kensington, NSW, 2052, Australia
- Molecular Cardiology and Biophysics Division, Victor Chang Cardiac Research Institute, Darlinghurst, NSW, 2010, Australia
- Melanie Bahlo
- Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, 3052, Australia
- Merridee A Wouters
- School of Life and Environmental Sciences, Deakin University, Geelong, VIC, 3217, Australia
|
27
|
Abstract
Despite increasing sequencing capacity, genetic disease investigation still frequently results in the identification of loci containing multiple candidate disease genes that need to be tested for involvement in the disease. This process can be expedited by prioritizing the candidates prior to testing. Over the last decade, a large number of computational methods and tools have been developed to assist the clinical geneticist in prioritizing candidate disease genes. In this chapter, we give an overview of computational tools that can be used for this purpose, all of which are freely available over the web.
Affiliation(s)
- Martin Oti
- Structural and Computational Biology Division, Victor Chang Cardiac Research Institute, 2010, Darlinghurst, NSW, Australia.
|
28
|
Ballouz S, Francis AR, Lan R, Tanaka MM. Conditions for the evolution of gene clusters in bacterial genomes. PLoS Comput Biol 2010; 6:e1000672. [PMID: 20168992 PMCID: PMC2820515 DOI: 10.1371/journal.pcbi.1000672] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.9] [Received: 06/15/2009] [Accepted: 01/07/2010] [Indexed: 11/18/2022] Open
Abstract
Genes encoding proteins in a common pathway are often found near each other along bacterial chromosomes. Several explanations have been proposed to account for the evolution of these structures. For instance, natural selection may directly favour gene clusters through a variety of mechanisms, such as increased efficiency of coregulation. An alternative and controversial hypothesis is the selfish operon model, which asserts that clustered arrangements of genes are more easily transferred to other species, thus improving the prospects for survival of the cluster. According to another hypothesis (the persistence model), genes that are in close proximity are less likely to be disrupted by deletions. Here we develop computational models to study the conditions under which gene clusters can evolve and persist. First, we examine the selfish operon model by re-implementing the simulation and running it under a wide range of conditions. Second, we introduce and study a Moran process in which there is natural selection for gene clustering and rearrangement occurs by genome inversion events. Finally, we develop and study a model that includes selection and inversion, which tracks the occurrence and fixation of rearrangements. Surprisingly, gene clusters fail to evolve under a wide range of conditions. Factors that promote the evolution of gene clusters include a low number of genes in the pathway, a high population size, and in the case of the selfish operon model, a high horizontal transfer rate. The computational analysis here has shown that the evolution of gene clusters can occur under both direct and indirect selection as long as certain conditions hold. Under these conditions the selfish operon model is still viable as an explanation for the evolution of gene clusters.
Genes involved in a common pathway or function are frequently found near each other on bacterial chromosomes. A number of hypotheses have been previously presented to explain this observation. A particularly influential theory is the selfish operon model, which posits that horizontal transfer could promote gene clustering by favouring transfer of arrangements of genes that are close together. Subsequent theoretical development and analysis of genomic data have contributed to the debate about the plausibility of this model. Here, by re-examining the evolutionary dynamics of gene clusters, we provide and discuss conditions under which gene clusters can evolve. We find that first, some form of bias for clustering is required for clusters to evolve. This bias can be in the form of bias in horizontal transfer towards genes that are close together, or direct natural selection for gene proximity. Our computational work does not present a theoretical obstacle to the selfish operon model as a possible explanation for the evolution of gene clusters.
Affiliation(s)
- Sara Ballouz
- School of Biotechnology and Biomolecular Sciences, University of New South Wales, Kensington, New South Wales, Australia
|
29
|
Robertson J, Ballouz S, Jaiyesimi I, Jury R, Margolis J. A Phase I Study of Dose Escalating Conformal Radiation Therapy with Concurrent Full-dose Gemcitabine and Erlotinib for Unresected Pancreas Cancer. Int J Radiat Oncol Biol Phys 2009. [DOI: 10.1016/j.ijrobp.2009.07.620] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
|
30
|
Teber ET, Liu JY, Ballouz S, Fatkin D, Wouters MA. Comparison of automated candidate gene prediction systems using genes implicated in type 2 diabetes by genome-wide association studies. BMC Bioinformatics 2009; 10 Suppl 1:S69. [PMID: 19208173 PMCID: PMC2648789 DOI: 10.1186/1471-2105-10-s1-s69] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Indexed: 12/21/2022] Open
Abstract
Background: Automated candidate gene prediction systems allow geneticists to home in on disease genes more rapidly by identifying the most probable candidate genes linked to the disease phenotypes under investigation. Here we assessed the ability of eight different candidate gene prediction systems to predict disease genes in intervals previously associated with type 2 diabetes by benchmarking their performance against genes implicated by recent genome-wide association studies.
Results: Using a search space of 9556 genes, all but one of the systems pruned the genome in favour of genes associated with moderate to highly significant SNPs. Of the 11 genes associated with highly significant SNPs identified by the genome-wide association studies, eight were flagged as likely candidates by at least one of the prediction systems. A list of candidates produced by a previous consensus approach did not match any of the genes implicated by 706 moderate to highly significant SNPs flagged by the genome-wide association studies. We prioritized genes associated with medium significance SNPs.
Conclusion: The study appraises the relative success of several candidate gene prediction systems against independent genetic data. Even when confronted with challengingly large intervals, the candidate gene prediction systems can successfully select likely disease genes. Furthermore, they can be used to filter statistically less-well-supported genetic data to select more likely candidates. We suggest consensus approaches fail because they penalize novel predictions made from independent underlying databases. To realize their full potential, further work needs to be done on prioritization and annotation of genes.
Affiliation(s)
- Erdahl T Teber
- Victor Chang Cardiac Research Institute, 384 Victoria St, Darlinghurst, 2010, NSW, Australia.
|
31
|
Robertson J, Hardy M, Ballouz S, Jaiyesimi I, Margolis J, Jury R, Wallace M, Maino H. Conformal Radiation Therapy with Concurrent Full-dose Gemcitabine and Erlotinib for Unresected Pancreas Cancer: A Phase I Trial. Int J Radiat Oncol Biol Phys 2008. [DOI: 10.1016/j.ijrobp.2008.06.1725] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
|