1
Oshternian SR, Loipfinger S, Bhattacharya A, Fehrmann RSN. Exploring combinations of dimensionality reduction, transfer learning, and regularization methods for predicting binary phenotypes with transcriptomic data. BMC Bioinformatics 2024; 25:167. [PMID: 38671342 PMCID: PMC11046904 DOI: 10.1186/s12859-024-05795-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2023] [Accepted: 04/22/2024] [Indexed: 04/28/2024] Open
Abstract
BACKGROUND: Numerous transcriptomic-based models have been developed to predict or understand the fundamental mechanisms driving biological phenotypes. However, few models have successfully transitioned into clinical practice due to challenges associated with generalizability and interpretability. To address these issues, researchers have turned to dimensionality reduction methods and have begun implementing transfer learning approaches.
METHODS: In this study, we aimed to determine the optimal combination of dimensionality reduction and regularization methods for predictive modeling. We applied seven dimensionality reduction methods to various datasets, including two supervised methods (linear optimal low-rank projection and low-rank canonical correlation analysis), two unsupervised methods [principal component analysis and consensus independent component analysis (c-ICA)], and three methods [autoencoder (AE), adversarial variational autoencoder, and c-ICA] within a transfer learning framework, trained on > 140,000 transcriptomic profiles. To assess the performance of the different combinations, we used a cross-validation setup encapsulated within a permutation testing framework, analyzing 30 different transcriptomic datasets with binary phenotypes. Furthermore, we included datasets with small sample sizes and phenotypes of varying degrees of predictability, and we employed independent datasets for validation.
RESULTS: Our findings revealed that regularized models without dimensionality reduction achieved the highest predictive performance, challenging the necessity of dimensionality reduction when the primary goal is to achieve optimal predictive performance. However, models using AE and c-ICA with transfer learning for dimensionality reduction showed comparable performance, with enhanced interpretability and robustness of predictors, compared to models using non-dimensionality-reduced data.
CONCLUSION: These findings offer valuable insights into the optimal combination of strategies for enhancing the predictive performance, interpretability, and generalizability of transcriptomic-based models.
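The cross-validation setup wrapped in a permutation test, as described in the Methods, can be illustrated with a minimal sketch. This is not the authors' pipeline: the nearest-centroid classifier, the fold assignment, the toy data, and every function name below are assumptions chosen to keep the example dependency-free.

```python
import random


def nearest_centroid_cv_accuracy(X, y, k=5):
    """k-fold cross-validated accuracy of a nearest-centroid classifier."""
    n = len(y)
    correct = 0
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        # One centroid per class that is present in the training fold.
        centroids = {}
        for label in set(y[i] for i in train_idx):
            rows = [X[i] for i in train_idx if y[i] == label]
            centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        for i in test_idx:
            dist = {
                label: sum((a - b) ** 2 for a, b in zip(X[i], c))
                for label, c in centroids.items()
            }
            correct += int(min(dist, key=dist.get) == y[i])
    return correct / n


def permutation_p_value(X, y, n_permutations=200, seed=0):
    """Compare the observed CV accuracy with accuracies on shuffled labels."""
    rng = random.Random(seed)
    observed = nearest_centroid_cv_accuracy(X, y)
    exceed = 0
    for _ in range(n_permutations):
        shuffled = y[:]
        rng.shuffle(shuffled)
        if nearest_centroid_cv_accuracy(X, shuffled) >= observed:
            exceed += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (exceed + 1) / (n_permutations + 1)


# Hypothetical toy data: 40 samples, 5 features, class 1 shifted by 2 units.
data_rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[data_rng.gauss(2.0 * label, 1.0) for _ in range(5)] for label in y]
observed, p_value = permutation_p_value(X, y)
```

Re-running the whole cross-validation for each shuffled label vector is what makes the null distribution comparable to the observed score; shuffling only once, or only within the test folds, would not give the same guarantee.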
Affiliation(s)
- S R Oshternian
  - Department of Medical Oncology, University Medical Center Groningen, University of Groningen, P.O. Box 30.001, 9700 RB, Groningen, The Netherlands
- S Loipfinger
  - Department of Medical Oncology, University Medical Center Groningen, University of Groningen, P.O. Box 30.001, 9700 RB, Groningen, The Netherlands
- A Bhattacharya
  - Department of Medical Oncology, University Medical Center Groningen, University of Groningen, P.O. Box 30.001, 9700 RB, Groningen, The Netherlands
- R S N Fehrmann
  - Department of Medical Oncology, University Medical Center Groningen, University of Groningen, P.O. Box 30.001, 9700 RB, Groningen, The Netherlands
2
Wang J, Wan YW, Al-Ouran R, Huang M, Liu Z. CoRegNet: unraveling gene co-regulation networks from public RNA-Seq repositories using a beta-binomial statistical model. Brief Bioinform 2023; 25:bbad380. [PMID: 38113079 PMCID: PMC10729864 DOI: 10.1093/bib/bbad380] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2023] [Revised: 09/13/2023] [Indexed: 12/21/2023] Open
Abstract
Millions of RNA sequencing samples have been deposited into public databases, providing a rich resource for biological research. These datasets encompass tens of thousands of experiments and offer comprehensive insights into human cellular regulation. However, a major challenge is how to integrate these experiments, which were acquired under different conditions. We propose a new statistical tool based on beta-binomial distributions that can construct a robust gene co-regulation network (CoRegNet) across tens of thousands of experiments. Our analysis of over 12 000 experiments involving human tissues and cells shows that CoRegNet significantly outperforms existing gene co-expression-based methods. Although the majority of the genes are linearly co-regulated, we did discover an interesting set of genes that are non-linearly co-regulated; half of the time they change in the same direction and the other half they change in the opposite direction. Additionally, we identified a set of gene pairs that follow Simpson's paradox. By utilizing public domain data, CoRegNet offers a powerful approach for identifying functionally related gene pairs, thereby revealing new biological insights.
Affiliation(s)
- Jiasheng Wang
  - Jan and Dan Duncan Neurological Research Institute at Texas Children’s Hospital, Houston, TX 77030, USA
  - Graduate Program in Quantitative and Computational Biosciences, Baylor College of Medicine, Houston, TX, 77030, USA
  - Department of Pediatrics, Baylor College of Medicine, Houston, TX 77030, USA
- Ying-Wooi Wan
  - Jan and Dan Duncan Neurological Research Institute at Texas Children’s Hospital, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Howard Hughes Medical Institute, Houston, TX 77030, USA
- Meichen Huang
  - Jan and Dan Duncan Neurological Research Institute at Texas Children’s Hospital, Houston, TX 77030, USA
  - Department of Neurology, Baylor College of Medicine, Houston, TX 77030, USA
- Zhandong Liu
  - Jan and Dan Duncan Neurological Research Institute at Texas Children’s Hospital, Houston, TX 77030, USA
  - Graduate Program in Quantitative and Computational Biosciences, Baylor College of Medicine, Houston, TX, 77030, USA
  - Department of Pediatrics, Baylor College of Medicine, Houston, TX 77030, USA
3
Chow YL, Singh S, Carpenter AE, Way GP. Predicting drug polypharmacology from cell morphology readouts using variational autoencoder latent space arithmetic. PLoS Comput Biol 2022; 18:e1009888. [PMID: 35213530 PMCID: PMC8906577 DOI: 10.1371/journal.pcbi.1009888] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 11/12/2021] [Revised: 03/09/2022] [Accepted: 02/01/2022] [Indexed: 01/13/2023] Open
Abstract
A variational autoencoder (VAE) is a machine learning algorithm useful for generating a compressed and interpretable latent space. These representations have been generated from various biomedical data types and can be used to produce realistic-looking simulated data. However, vanilla VAEs suffer from entangled and uninformative latent spaces, which can be mitigated using other types of VAEs such as β-VAE and MMD-VAE. In this project, we evaluated the ability of VAEs to learn cell morphology characteristics derived from cell images. We trained and evaluated three VAE variants (vanilla VAE, β-VAE, and MMD-VAE) on cell morphology readouts and explored the generative capacity of each model to predict compound polypharmacology (the interactions of a drug with more than one target) using an approach called latent space arithmetic (LSA). To test the generalizability of the strategy, we also trained these VAEs using gene expression data of the same compound perturbations and found that gene expression provides complementary information. We found that the β-VAE and MMD-VAE disentangle morphology signals and reveal a more interpretable latent space. We reliably simulated morphology and gene expression readouts from certain compounds, thereby predicting cell states perturbed with compounds of known polypharmacology. Inferring cell state for specific drug mechanisms could aid researchers in developing and identifying targeted therapeutics and categorizing off-target effects in the future.
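Latent space arithmetic can be sketched on toy vectors. The abstract does not spell out the exact arithmetic, so the form used here (latent of target A plus latent of target B minus the control latent) is an assumption, as are the function names and the three-dimensional example latents; a trained VAE encoder and decoder would normally supply and consume these vectors.

```python
def mean_vector(vectors):
    """Average a list of equal-length latent vectors, dimension by dimension."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def latent_space_arithmetic(z_a, z_b, z_control):
    """Estimate the latent state of a dual-target perturbation as A + B - control."""
    return [a + b - c for a, b, c in zip(z_a, z_b, z_control)]


# Hypothetical encoded profiles (two replicates per condition, 3 latent dimensions).
z_control = mean_vector([[0.0, 0.0, 1.0], [0.2, -0.2, 1.0]])
z_target_a = mean_vector([[1.0, 0.0, 1.0], [1.2, 0.0, 1.0]])
z_target_b = mean_vector([[0.0, 2.0, 1.0], [0.2, 2.0, 1.0]])

# Predicted latent vector for a compound hitting both targets; in a full
# workflow this would be passed through the decoder to simulate a readout.
z_predicted = latent_space_arithmetic(z_target_a, z_target_b, z_control)
```

Subtracting the control latent once keeps the shared baseline state from being counted twice when the two single-target effects are added.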
Affiliation(s)
- Yuen Ler Chow
  - Imaging Platform, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
  - Brookline High School, Brookline, Massachusetts, United States of America
- Shantanu Singh
  - Imaging Platform, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Anne E. Carpenter
  - Imaging Platform, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Gregory P. Way
  - Imaging Platform, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
  - Center for Health AI and Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, Colorado, United States of America
4
Ashenova A, Daniyarov A, Molkenov A, Sharip A, Zinovyev A, Kairov U. Meta-Analysis of Esophageal Cancer Transcriptomes Using Independent Component Analysis. Front Genet 2021; 12:683632. [PMID: 34795689 PMCID: PMC8594933 DOI: 10.3389/fgene.2021.683632] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2021] [Accepted: 10/05/2021] [Indexed: 11/17/2022] Open
Abstract
Independent Component Analysis (ICA) is a matrix factorization method for data dimension reduction. ICA has been widely applied for the analysis of transcriptomic data for blind separation of biological, environmental, and technical factors affecting gene expression. The study aimed to analyze the publicly available esophageal cancer data using ICA for identification and comprehensive analysis of reproducible signaling pathways and molecular signatures involved in this cancer type. In this study, four independent esophageal cancer transcriptomic datasets from the GEO database were used. A bioinformatics tool, "BiODICA: Independent Component Analysis of Big Omics Data", was applied to compute independent components (ICs). Gene Set Enrichment Analysis (GSEA) and ToppGene uncovered the most significantly enriched pathways. Construction and visualization of gene networks and graphs were performed using Cytoscape and the HPRD database. The correlation graph between decompositions into 30 ICs was built with absolute correlation values exceeding 0.3. Clusters of components (pseudocliques) were observed in the structure of the correlation graph. The top 1,000 most contributing genes of each IC in the pseudocliques were mapped to the PPI network to construct associated signaling pathways. Some cliques were composed of densely interconnected nodes and included components common to most cancer types (such as cell cycle and extracellular matrix signals), while others were specific to EC. The results of this investigation may reveal potential biomarkers of esophageal carcinogenesis, functional subsystems dysregulated in the tumor cells, and be helpful in predicting the early development of a tumor.
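The correlation graph between decompositions can be sketched as follows. Only the 0.3 absolute-correlation threshold comes from the abstract; the use of Pearson correlation on gene weights, the assumption that both decompositions share the same gene order, the toy weights, and the function names are all illustrative choices.

```python
from itertools import product
from math import sqrt


def pearson(u, v):
    """Pearson correlation between two equal-length weight vectors."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / sqrt(var_u * var_v)


def correlation_graph(components_a, components_b, threshold=0.3):
    """Link components from two decompositions when |r| exceeds the threshold."""
    edges = []
    pairs = product(enumerate(components_a), enumerate(components_b))
    for (i, u), (j, v) in pairs:
        r = pearson(u, v)
        if abs(r) > threshold:
            edges.append((i, j, round(r, 3)))
    return edges


# Hypothetical gene weights for two components per dataset, over 5 shared genes.
dataset_a = [[1, 2, 3, 4, 5], [2, -1, 0, 1, -2]]
dataset_b = [[5, 4, 3, 2, 1], [1, 0, 2, 0, 1]]
edges = correlation_graph(dataset_a, dataset_b)
```

Using the absolute value matters because the sign of an independent component is arbitrary, so two components describing the same signal can be anti-correlated.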
Affiliation(s)
- Ainur Ashenova
  - Laboratory of Bioinformatics and Systems Biology, National Laboratory Astana, Center for Life Sciences, Nazarbayev University, Nur-Sultan, Kazakhstan
  - Department of Biology, School of Sciences and Humanities, Nazarbayev University, Nur-Sultan, Kazakhstan
- Asset Daniyarov
  - Laboratory of Bioinformatics and Systems Biology, National Laboratory Astana, Center for Life Sciences, Nazarbayev University, Nur-Sultan, Kazakhstan
- Askhat Molkenov
  - Laboratory of Bioinformatics and Systems Biology, National Laboratory Astana, Center for Life Sciences, Nazarbayev University, Nur-Sultan, Kazakhstan
- Aigul Sharip
  - Laboratory of Bioinformatics and Systems Biology, National Laboratory Astana, Center for Life Sciences, Nazarbayev University, Nur-Sultan, Kazakhstan
- Andrei Zinovyev
  - Institut Curie, PSL Research University, INSERM U900, Paris, France
  - Laboratory of Advanced Methods for High-dimensional Data Analysis, Lobachevsky University, Nizhny Novgorod, Russia
- Ulykbek Kairov
  - Laboratory of Bioinformatics and Systems Biology, National Laboratory Astana, Center for Life Sciences, Nazarbayev University, Nur-Sultan, Kazakhstan
5
Altman MC, Rinchai D, Baldwin N, Toufiq M, Whalen E, Garand M, Syed Ahamed Kabeer B, Alfaki M, Presnell SR, Khaenam P, Ayllón-Benítez A, Mougin F, Thébault P, Chiche L, Jourde-Chiche N, Phillips JT, Klintmalm G, O'Garra A, Berry M, Bloom C, Wilkinson RJ, Graham CM, Lipman M, Lertmemongkolchai G, Bedognetti D, Thiebaut R, Kheradmand F, Mejias A, Ramilo O, Palucka K, Pascual V, Banchereau J, Chaussabel D. Development of a fixed module repertoire for the analysis and interpretation of blood transcriptome data. Nat Commun 2021; 12:4385. [PMID: 34282143 PMCID: PMC8289976 DOI: 10.1038/s41467-021-24584-w] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Received: 08/13/2020] [Accepted: 06/21/2021] [Indexed: 01/21/2023] Open
Abstract
As the capacity for generating large-scale molecular profiling data continues to grow, the ability to extract meaningful biological knowledge from it remains a limitation. Here, we describe the development of a new fixed repertoire of transcriptional modules, BloodGen3, that is designed to serve as a stable reusable framework for the analysis and interpretation of blood transcriptome data. The construction of this repertoire is based on co-clustering patterns observed across sixteen immunological and physiological states encompassing 985 blood transcriptome profiles. Interpretation is supported by customized resources, including module-level analysis workflows, fingerprint grid plot visualizations, interactive web applications and an extensive annotation framework comprising functional profiling reports and reference transcriptional profiles. Taken together, this well-characterized and well-supported transcriptional module repertoire can be employed for the interpretation and benchmarking of blood transcriptome profiles within and across patient cohorts. Blood transcriptome fingerprints for the 16 reference cohorts can be accessed interactively via https://drinchai.shinyapps.io/BloodGen3Module/.
Affiliation(s)
- Matthew C Altman
  - Systems Immunology, Benaroya Research Institute, Seattle, WA, USA
  - Division of Allergy and Infectious Diseases, University of Washington, Seattle, WA, USA
- Nicole Baldwin
  - Baylor Institute for Immunology Research, Baylor Research Institute, Dallas, TX, USA
- Elizabeth Whalen
  - Systems Immunology, Benaroya Research Institute, Seattle, WA, USA
- Scott R Presnell
  - Systems Immunology, Benaroya Research Institute, Seattle, WA, USA
- Prasong Khaenam
  - Systems Immunology, Benaroya Research Institute, Seattle, WA, USA
- Aaron Ayllón-Benítez
  - Inserm U1219 Bordeaux Population Health Research Center, Bordeaux University, Bordeaux, France
- Fleur Mougin
  - Inserm U1219 Bordeaux Population Health Research Center, Bordeaux University, Bordeaux, France
- Laurent Chiche
  - Department of Internal Medicine, Hopital Européen, Marseille, France
- J Theodore Phillips
  - Baylor Institute for Immunology Research, Baylor Research Institute, Dallas, TX, USA
- Goran Klintmalm
  - Baylor Institute for Immunology Research, Baylor Research Institute, Dallas, TX, USA
- Anne O'Garra
  - Laboratory of Immunoregulation and Infection, The Francis Crick Institute, London, UK
  - National Heart and Lung Institute, Imperial College London, London, UK
- Chloe Bloom
  - National Heart and Lung Institute, Imperial College London, London, UK
- Robert J Wilkinson
  - The Francis Crick Institute, London, UK
  - Department of Infectious Disease, Imperial College, London, UK
  - Wellcome Center for Infectious Diseases Research in Africa and Department of Medicine, Institute of Infectious Diseases and Molecular Medicine, University of Cape Town Observatory, 7925, Cape Town, Republic of South Africa
- Christine M Graham
  - Laboratory of Immunoregulation and Infection, The Francis Crick Institute, London, UK
- Marc Lipman
  - UCL Respiratory, Division of Medicine, University College London, London, UK
- Ganjana Lertmemongkolchai
  - Centre for Research and Development of Medical Diagnostic Laboratories, Faculty of Associated Medical Sciences, Khon Kaen University, Khon Kaen, Thailand
- Rodolphe Thiebaut
  - Inserm U1219 Bordeaux Population Health Research Center, Bordeaux University, Bordeaux, France
- Farrah Kheradmand
  - Baylor College of Medicine & Center for Translational Research on Inflammatory Diseases, Michael E. DeBakey VAMC, Houston, TX, USA
- Asuncion Mejias
  - Abigail Wexner Research Institute at Nationwide Children's Hospital and the Ohio State University School of Medicine, Columbus, OH, USA
- Octavio Ramilo
  - Abigail Wexner Research Institute at Nationwide Children's Hospital and the Ohio State University School of Medicine, Columbus, OH, USA
- Karolina Palucka
  - Baylor Institute for Immunology Research, Baylor Research Institute, Dallas, TX, USA
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Virginia Pascual
  - Baylor Institute for Immunology Research, Baylor Research Institute, Dallas, TX, USA
  - Weill Cornell Medicine, New York, NY, USA
- Jacques Banchereau
  - Baylor Institute for Immunology Research, Baylor Research Institute, Dallas, TX, USA
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Damien Chaussabel
  - Systems Immunology, Benaroya Research Institute, Seattle, WA, USA
  - Research Branch, Sidra Medicine, Doha, Qatar
6
Rinchai D, Roelands J, Toufiq M, Hendrickx W, Altman MC, Bedognetti D, Chaussabel D. BloodGen3Module: Blood transcriptional module repertoire analysis and visualization using R. Bioinformatics 2021; 37:2382-2389. [PMID: 33624743 PMCID: PMC8388021 DOI: 10.1093/bioinformatics/btab121] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 07/29/2020] [Revised: 01/14/2021] [Accepted: 02/23/2021] [Indexed: 11/28/2022] Open
Abstract
Motivation: We previously described the construction and characterization of fixed reusable blood transcriptional module repertoires. More recently we released a third iteration (‘BloodGen3’ module repertoire) that comprises 382 functionally annotated modules and encompasses 14 168 transcripts. Custom bioinformatic tools are needed to support downstream analysis, visualization and interpretation relying on such fixed module repertoires.
Results: We have developed and describe here an R package, BloodGen3Module. The functions of our package permit group comparison analyses to be performed at the module-level, and to display the results as annotated fingerprint grid plots. A parallel workflow for computing module repertoire changes for individual samples rather than groups of samples is also available; these results are displayed as fingerprint heatmaps. An illustrative case is used to demonstrate the steps involved in generating blood transcriptome repertoire fingerprints of septic patients. Taken together, this resource could facilitate the analysis and interpretation of changes in blood transcript abundance observed across a wide range of pathological and physiological states.
Availability and implementation: The BloodGen3Module package and documentation are freely available from Github: https://github.com/Drinchai/BloodGen3Module.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Matthew C Altman
  - Division of Allergy and Infectious Diseases, University of Washington, Seattle, Washington, USA
  - Systems Immunology, Benaroya Research Institute, Seattle, Washington, USA
7
Lee AJ, Park Y, Doing G, Hogan DA, Greene CS. Correcting for experiment-specific variability in expression compendia can remove underlying signals. Gigascience 2020; 9:giaa117. [PMID: 33140829 PMCID: PMC7607552 DOI: 10.1093/gigascience/giaa117] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 05/06/2020] [Revised: 08/28/2020] [Accepted: 09/29/2020] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION: In the past two decades, scientists in different laboratories have assayed gene expression from millions of samples. These experiments can be combined into compendia and analyzed collectively to extract novel biological patterns. Technical variability, or "batch effects," may result from combining samples collected and processed at different times and in different settings. Such variability may distort our ability to extract true underlying biological patterns. As more integrative analysis methods arise and data collections get bigger, we must determine how technical variability affects our ability to detect desired patterns when many experiments are combined.
OBJECTIVE: We sought to determine the extent to which an underlying signal was masked by technical variability by simulating compendia comprising data aggregated across multiple experiments.
METHOD: We developed a generative multi-layer neural network to simulate compendia of gene expression experiments from large-scale microbial and human datasets. We compared simulated compendia before and after introducing varying numbers of sources of undesired variability.
RESULTS: The signal from a baseline compendium was obscured when the number of added sources of variability was small. Applying statistical correction methods rescued the underlying signal in these cases. However, as the number of sources of variability increased, it became easier to detect the original signal even without correction. In fact, statistical correction reduced our power to detect the underlying signal.
CONCLUSION: When combining a modest number of experiments, it is best to correct for experiment-specific noise. However, when many experiments are combined, statistical correction reduces our ability to extract underlying patterns.
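The idea of adding sources of experiment-specific variability to a baseline compendium and then correcting for them can be sketched on toy data. This is a simplified stand-in, not the study's method: the abstract describes a generative neural network and unspecified statistical correction methods, whereas the random per-gene shifts, the per-experiment mean-centering, and all names below are assumptions made for illustration.

```python
import random


def add_experiment_effects(samples, n_sources, scale=1.0, seed=0):
    """Assign samples to n_sources groups and add one random per-gene shift per group."""
    rng = random.Random(seed)
    n_genes = len(samples[0])
    shifts = [[rng.gauss(0.0, scale) for _ in range(n_genes)] for _ in range(n_sources)]
    labels = [i % n_sources for i in range(len(samples))]
    shifted = [
        [x + s for x, s in zip(row, shifts[lab])]
        for row, lab in zip(samples, labels)
    ]
    return shifted, labels


def center_per_experiment(samples, labels):
    """Subtract each group's per-gene mean (a simple stand-in for batch correction)."""
    n_genes = len(samples[0])
    corrected = [None] * len(samples)
    for lab in set(labels):
        idx = [i for i, l in enumerate(labels) if l == lab]
        means = [sum(samples[i][g] for i in idx) / len(idx) for g in range(n_genes)]
        for i in idx:
            corrected[i] = [x - m for x, m in zip(samples[i], means)]
    return corrected


# Hypothetical baseline compendium: 12 samples, 4 genes.
data_rng = random.Random(42)
baseline = [[data_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(12)]
noisy, labels = add_experiment_effects(baseline, n_sources=3, scale=5.0)
corrected = center_per_experiment(noisy, labels)
```

Centering removes the added shifts, but it also removes any real difference in mean expression between groups, which is one simple way a correction step can discard underlying signal.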
Affiliation(s)
- Alexandra J Lee
  - Genomics and Computational Biology Graduate Program, University of Pennsylvania, 3400 Civic Center Blvd, Philadelphia, PA, 19104, USA
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 3400 Civic Center Blvd, Philadelphia, PA, 19104, USA
- YoSon Park
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 3400 Civic Center Blvd, Philadelphia, PA, 19104, USA
- Georgia Doing
  - Department of Microbiology and Immunology, Geisel School of Medicine, Dartmouth, 1 Rope Ferry Rd, Hanover, NH, 03755, USA
- Deborah A Hogan
  - Department of Microbiology and Immunology, Geisel School of Medicine, Dartmouth, 1 Rope Ferry Rd, Hanover, NH, 03755, USA
- Casey S Greene
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 3400 Civic Center Blvd, Philadelphia, PA, 19104, USA
  - Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, 1429 Walnut St, Floor 10, Philadelphia, PA, 19102 USA
8
Application of Transcriptional Gene Modules to Analysis of Caenorhabditis elegans' Gene Expression Data. G3-GENES GENOMES GENETICS 2020; 10:3623-3638. [PMID: 32759329 PMCID: PMC7534440 DOI: 10.1534/g3.120.401270] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 12/24/2022]
Abstract
Identification of co-expressed sets of genes (gene modules) is used widely for grouping functionally related genes during transcriptomic data analysis. An organism-wide atlas of high-quality gene modules would provide a powerful tool for unbiased detection of biological signals from gene expression data. Here, using a method based on independent component analysis that we call DEXICA, we have defined and optimized 209 modules that broadly represent transcriptional wiring of the key experimental organism C. elegans. These modules represent responses to changes in the environment (e.g., starvation, exposure to xenobiotics), genes regulated by transcription factors (e.g., ATFS-1, DAF-16), genes specific to tissues (e.g., neurons, muscle), genes that change during development, and other complex transcriptional responses to genetic, environmental and temporal perturbations. Interrogation of these modules reveals processes that are activated in long-lived mutants in cases where traditional analyses of differentially expressed genes fail to do so. Additionally, we show that modules can inform the strength of the association between a gene and an annotation (e.g., GO term). Analysis of “module-weighted annotations” improves on several aspects of traditional annotation-enrichment tests and can aid in functional interpretation of poorly annotated genes. We provide an online interactive resource with tutorials at http://genemodules.org/, in which users can find detailed information on each module, check genes for module-weighted annotations, and use both of these to analyze their own gene expression data (generated using any platform) or gene sets of interest.
9
Byrd JB, Greene AC, Prasad DV, Jiang X, Greene CS. Responsible, practical genomic data sharing that accelerates research. Nat Rev Genet 2020; 21:615-629. [PMID: 32694666 PMCID: PMC7974070 DOI: 10.1038/s41576-020-0257-5] [Citation(s) in RCA: 58] [Impact Index Per Article: 14.5] [Accepted: 06/08/2020] [Indexed: 12/13/2022]
Abstract
Data sharing anchors reproducible science, but expectations and best practices are often nebulous. Communities of funders, researchers and publishers continue to grapple with what should be required or encouraged. To illuminate the rationales for sharing data, the technical challenges and the social and cultural challenges, we consider the stakeholders in the scientific enterprise. In biomedical research, participants are key among those stakeholders. Ethical sharing requires considering both the value of research efforts and the privacy costs for participants. We discuss current best practices for various types of genomic data, as well as opportunities to promote ethical data sharing that accelerates science by aligning incentives.
Affiliation(s)
- James Brian Byrd
  - Department of Internal Medicine, Medical School, University of Michigan, Ann Arbor, MI, USA
- Anna C Greene
  - Alex's Lemonade Stand Foundation, Bala Cynwyd, PA, USA
- Xiaoqian Jiang
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Casey S Greene
  - Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, Philadelphia, PA, USA
  - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
10
Rinchai D, Syed Ahamed Kabeer B, Toufiq M, Tatari-Calderone Z, Deola S, Brummaier T, Garand M, Branco R, Baldwin N, Alfaki M, Altman MC, Ballestrero A, Bassetti M, Zoppoli G, De Maria A, Tang B, Bedognetti D, Chaussabel D. A modular framework for the development of targeted Covid-19 blood transcript profiling panels. J Transl Med 2020; 18:291. [PMID: 32736569 PMCID: PMC7393249 DOI: 10.1186/s12967-020-02456-z] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 06/04/2020] [Accepted: 07/21/2020] [Indexed: 02/07/2023] Open
Abstract
BACKGROUND: Covid-19 morbidity and mortality are associated with a dysregulated immune response. Tools are needed to enhance existing immune profiling capabilities in affected patients. Here we aimed to develop an approach to support the design of targeted blood transcriptome panels for profiling the immune response to SARS-CoV-2 infection.
METHODS: We designed a pool of candidates based on a pre-existing and well-characterized repertoire of blood transcriptional modules. Available Covid-19 blood transcriptome data was also used to guide this process. Further selection steps relied on expert curation. Additionally, we developed several custom web applications to support the evaluation of candidates.
RESULTS: As a proof of principle, we designed three targeted blood transcript panels, each with a different translational connotation: immunological relevance, therapeutic development relevance and SARS biology relevance.
CONCLUSION: Altogether the work presented here may contribute to the future expansion of immune profiling capabilities via targeted profiling of blood transcript abundance in Covid-19 patients.
Affiliation(s)
- Tobias Brummaier
  - Shoklo Malaria Research Unit, Mahidol-Oxford Tropical Medicine Research Unit, Faculty of Tropical Medicine, Mahidol University, Mae Sot, Thailand
  - Centre for Tropical Medicine and Global Health, Nuffield Department of Medicine, University of Oxford, Oxford, UK
  - Swiss Tropical and Public Health Institute, Basel, Switzerland
  - University of Basel, Basel, Switzerland
- Nicole Baldwin
  - Baylor Institute for Immunology Research and Baylor Research Institute, Dallas, TX, USA
- Matthew C Altman
  - Division of Allergy and Infectious Diseases, University of Washington, Seattle, WA, USA
  - Systems Immunology, Benaroya Research Institute, Seattle, WA, USA
- Alberto Ballestrero
  - Department of Internal Medicine, Università degli Studi di Genova, Genoa, Italy
  - IRCCS Ospedale Policlinico San Martino, Genoa, Italy
- Matteo Bassetti
  - Division of Infectious and Tropical Diseases, IRCCS Ospedale Policlinico San Martino, Genoa, Italy
  - Department of Health Sciences, University of Genoa, Genoa, Italy
- Gabriele Zoppoli
  - Department of Internal Medicine, Università degli Studi di Genova, Genoa, Italy
  - IRCCS Ospedale Policlinico San Martino, Genoa, Italy
- Andrea De Maria
  - Division of Infectious and Tropical Diseases, IRCCS Ospedale Policlinico San Martino, Genoa, Italy
  - Department of Health Sciences, University of Genoa, Genoa, Italy
- Benjamin Tang
  - Nepean Clinical School, University of Sydney, Sydney, NSW, Australia
- Davide Bedognetti
  - Sidra Medicine, Doha, Qatar
  - Department of Internal Medicine, Università degli Studi di Genova, Genoa, Italy
11
Abstract
Over the last several years, next-generation sequencing and its recent push toward single-cell resolution have transformed the landscape of immunology research by revealing novel complexities about all components of the immune system. With the vast amounts of diverse data currently being generated, and with the methods of analyzing and combining diverse data improving as well, integrative systems approaches are becoming more powerful. Previous integrative approaches have combined multiple data types and revealed ways that the immune system, both as a whole and as individual parts, is affected by genetics, the microbiome, and other factors. In this review, we explore the data types that are available for studying immunology with an integrative systems approach, as well as the current strategies and challenges for conducting such analyses.
Affiliation(s)
- Silvia Pineda
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, California 94158, USA
  - Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre, 28029 Madrid, Spain
- Daniel G. Bunis
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, California 94158, USA
- Idit Kosti
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, California 94158, USA
  - Department of Pediatrics, University of California, San Francisco, California 94143, USA
- Marina Sirota
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, California 94158, USA
  - Department of Pediatrics, University of California, San Francisco, California 94143, USA
12
Way GP, Zietz M, Rubinetti V, Himmelstein DS, Greene CS. Compressing gene expression data using multiple latent space dimensionalities learns complementary biological representations. Genome Biol 2020; 21:109. [PMID: 32393369 PMCID: PMC7212571 DOI: 10.1186/s13059-020-02021-3] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Received: 05/07/2019] [Accepted: 04/16/2020] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND Unsupervised compression algorithms applied to gene expression data extract latent or hidden signals representing technical and biological sources of variation. However, these algorithms require a user to select a biologically appropriate latent space dimensionality. In practice, most researchers fit a single algorithm and latent dimensionality. We sought to determine the extent to which selecting only one fit limits the biological features captured in the latent representations and, consequently, limits what can be discovered with subsequent analyses. RESULTS We compress gene expression data from three large datasets consisting of adult normal tissue, adult cancer tissue, and pediatric cancer tissue. We train many different models across a large range of latent space dimensionalities and observe various performance differences. We identify more curated pathway gene sets significantly associated with individual dimensions in denoising autoencoder and variational autoencoder models trained using an intermediate number of latent dimensionalities. Combining compressed features across algorithms and dimensionalities captures the most pathway-associated representations. When trained with different latent dimensionalities, models learn strongly associated and generalizable biological representations including sex, neuroblastoma MYCN amplification, and cell types. Stronger signals, such as tumor type, are best captured in models trained at lower dimensionalities, while more subtle signals such as pathway activity are best identified in models trained with more latent dimensionalities. CONCLUSIONS There is no single best latent dimensionality or compression algorithm for analyzing gene expression data. Instead, using features derived from different compression models across multiple latent space dimensionalities enhances biological representations.
Affiliation(s)
- Gregory P Way
  - Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA
  - Imaging Platform, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
- Michael Zietz
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA
- Vincent Rubinetti
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA
- Daniel S Himmelstein
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA
- Casey S Greene
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA
  - Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, Philadelphia, PA, 19102, USA
13
Sompairac N, Nazarov PV, Czerwinska U, Cantini L, Biton A, Molkenov A, Zhumadilov Z, Barillot E, Radvanyi F, Gorban A, Kairov U, Zinovyev A. Independent Component Analysis for Unraveling the Complexity of Cancer Omics Datasets. Int J Mol Sci 2019; 20:E4414. [PMID: 31500324 PMCID: PMC6771121 DOI: 10.3390/ijms20184414] [Citation(s) in RCA: 44] [Impact Index Per Article: 8.8] [Received: 08/03/2019] [Revised: 09/02/2019] [Accepted: 09/04/2019] [Indexed: 12/13/2022] Open
Abstract
Independent component analysis (ICA) is a matrix factorization approach where the signals captured by each individual matrix factor are optimized to become as mutually independent as possible. Initially suggested for solving blind source separation problems in various fields, ICA was shown to be successful in analyzing functional magnetic resonance imaging (fMRI) and other types of biomedical data. In the last twenty years, ICA became a part of the standard machine learning toolbox, together with other matrix factorization methods such as principal component analysis (PCA) and non-negative matrix factorization (NMF). Here, we review a number of recent works where ICA was shown to be a useful tool for unraveling the complexity of cancer biology from the analysis of different types of omics data, mainly collected for tumoral samples. Such works highlight the use of ICA in dimensionality reduction, deconvolution, data pre-processing, meta-analysis, and others applied to different data types (transcriptome, methylome, proteome, single-cell data). We particularly focus on the technical aspects of ICA application in omics studies such as using different protocols, determining the optimal number of components, assessing and improving reproducibility of the ICA results, and comparison with other popular matrix factorization techniques. We discuss the emerging ICA applications to the integrative analysis of multi-level omics datasets and introduce a conceptual view on ICA as a tool for defining functional subsystems of a complex biological system and their interactions under various conditions. Our review is accompanied by a Jupyter notebook which illustrates the discussed concepts and provides a practical tool for applying ICA to the analysis of cancer omics datasets.
Affiliation(s)
- Nicolas Sompairac
  - Institut Curie, PSL Research University, 75005 Paris, France
  - INSERM U900, 75248 Paris, France
  - CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, 75006 Paris, France
  - Centre de Recherches Interdisciplinaires, Université Paris Descartes, 75004 Paris, France
- Petr V Nazarov
  - Multiomics Data Science Research Group, Quantitative Biology Unit, Luxembourg Institute of Health (LIH), L-1445 Strassen, Luxembourg
- Urszula Czerwinska
  - Institut Curie, PSL Research University, 75005 Paris, France
  - INSERM U900, 75248 Paris, France
  - CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, 75006 Paris, France
- Laura Cantini
  - Computational Systems Biology Team, Institut de Biologie de l'Ecole Normale Supérieure, CNRS UMR8197, INSERM U1024, Ecole Normale Supérieure, PSL Research University, 75005 Paris, France
- Anne Biton
  - Centre de Bioinformatique, Biostatistique et Biologie Intégrative (C3BI, USR 3756 Institut Pasteur et CNRS), 75015 Paris, France
- Askhat Molkenov
  - Laboratory of Bioinformatics and Systems Biology, Center for Life Sciences, National Laboratory Astana, Nazarbayev University, 010000 Nur-Sultan, Kazakhstan
- Zhaxybay Zhumadilov
  - Laboratory of Bioinformatics and Systems Biology, Center for Life Sciences, National Laboratory Astana, Nazarbayev University, 010000 Nur-Sultan, Kazakhstan
  - University Medical Center, Nazarbayev University, 010000 Nur-Sultan, Kazakhstan
- Emmanuel Barillot
  - Institut Curie, PSL Research University, 75005 Paris, France
  - INSERM U900, 75248 Paris, France
  - CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, 75006 Paris, France
- Francois Radvanyi
  - Institut Curie, PSL Research University, 75005 Paris, France
  - CNRS, UMR 144, 75248 Paris, France
- Alexander Gorban
  - Center for Mathematical Modeling, University of Leicester, Leicester LE1 7RH, UK
  - Lobachevsky University, 603022 Nizhny Novgorod, Russia
- Ulykbek Kairov
  - Laboratory of Bioinformatics and Systems Biology, Center for Life Sciences, National Laboratory Astana, Nazarbayev University, 010000 Nur-Sultan, Kazakhstan
- Andrei Zinovyev
  - Institut Curie, PSL Research University, 75005 Paris, France
  - INSERM U900, 75248 Paris, France
  - CBIO-Centre for Computational Biology, Mines ParisTech, PSL Research University, 75006 Paris, France
14
Way GP, Greene CS. Discovering Pathway and Cell Type Signatures in Transcriptomic Compendia with Machine Learning. Annu Rev Biomed Data Sci 2019. [DOI: 10.1146/annurev-biodatasci-072018-021348] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 11/09/2022]
Abstract
Pathway and cell type signatures are patterns present in transcriptome data that are associated with biological processes or phenotypic consequences. These signatures result from specific cell type and pathway expression but can require large transcriptomic compendia to detect. Machine learning techniques can be powerful tools for signature discovery through their ability to provide accurate and interpretable results. In this review, we discuss various machine learning applications to extract pathway and cell type signatures from transcriptomic compendia. We focus on the biological motivations and interpretation for both supervised and unsupervised learning approaches in this setting. We consider recent advances, including deep learning, and their applications to expanding bulk and single-cell RNA data. As data and computational resources increase, there will be more opportunities for machine learning to aid in revealing biological signatures.
Affiliation(s)
- Gregory P. Way
  - Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA
- Casey S. Greene
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA
15
Pal A, Chiu HY, Taneja R. Genetics, epigenetics and redox homeostasis in rhabdomyosarcoma: Emerging targets and therapeutics. Redox Biol 2019; 25:101124. [PMID: 30709791 PMCID: PMC6859585 DOI: 10.1016/j.redox.2019.101124] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 11/28/2018] [Revised: 01/20/2019] [Accepted: 01/24/2019] [Indexed: 12/16/2022] Open
Abstract
Rhabdomyosarcoma (RMS) is the most common soft tissue sarcoma, accounting for 5-8% of malignant tumours in children and adolescents. Children with high-risk disease have a poor prognosis. Anti-RMS therapies include surgery, radiation and combination chemotherapy. While these strategies improved survival rates, they have plateaued since the 1990s as drugs that target differentiation and self-renewal of tumour cells have not been identified. Moreover, prevailing treatments are aggressive, with drug resistance and metastasis causing failure of several treatment regimes. Significant advances have been made recently in understanding the genetic and epigenetic landscape in RMS. These studies have identified novel diagnostic and prognostic markers and opened new avenues for treatment. An important target identified in high-throughput drug screening studies is reactive oxygen species (ROS). Indeed, many drugs in clinical trials for RMS impact tumour progression through ROS. In light of such emerging evidence, we discuss recent findings highlighting key pathways, epigenetic alterations and their impacts on ROS that form the basis of developing novel molecularly targeted therapies in RMS. Such targeted therapies in combination with conventional therapy could reduce adverse side effects in young survivors and lead to a decline in long-term morbidity.
Affiliation(s)
- Ananya Pal
  - Department of Physiology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore 117593, Singapore
- Hsin Yao Chiu
  - Department of Physiology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore 117593, Singapore
- Reshma Taneja
  - Department of Physiology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore 117593, Singapore