1
Gorla A, Witonsky J, Elhawary JR, Chen ZJ, Mefford J, Perez-Garcia J, Huntsman S, Hu D, Eng C, Woodruff PG, Sankararaman S, Ziv E, Flint J, Zaitlen N, Burchard E, Rahmani E. Epigenetic patient stratification via contrastive machine learning refines hallmark biomarkers in minoritized children with asthma. Research Square 2024:rs.3.rs-5066762. [PMID: 39315258 PMCID: PMC11419268 DOI: 10.21203/rs.3.rs-5066762/v1] [Indexed: 09/25/2024]
Abstract
Identifying and refining clinically significant patient stratification is a critical step toward realizing the promise of precision medicine in asthma. Several peripheral blood hallmarks, including total peripheral blood eosinophil count (BEC) and immunoglobulin E (IgE) levels, are routinely used in asthma clinical practice for endotype classification and predicting response to state-of-the-art targeted biologic drugs. However, these biomarkers appear ineffective in predicting treatment outcomes in some patients, and they differ in distribution between racially and ethnically diverse populations, potentially compromising medical care and hindering health equity due to biases in drug eligibility. Here, we propose constructing an unbiased patient stratification score based on DNA methylation (DNAm) and utilizing it to refine the efficacy of hallmark biomarkers for predicting drug response. We developed Phenotype Aware Component Analysis (PACA), a novel contrastive machine-learning method for learning combinations of DNAm sites reflecting biomedically meaningful patient stratifications. Leveraging whole-blood DNAm from Latino (discovery; n=1,016) and African American (replication; n=756) pediatric asthma case-control cohorts, we applied PACA to refine the prediction of bronchodilator response (BDR) to the short-acting β2-agonist albuterol, the most used drug to treat acute bronchospasm worldwide. While BEC and IgE correlate with BDR in the general patient population, our PACA-derived DNAm score renders these biomarkers predictive of drug response only in patients with high DNAm scores. BEC correlates with BDR in patients with upper-quartile DNAm scores (OR 1.12; 95% CI [1.04, 1.22]; P=7.9e-4) but not in patients with lower-quartile scores (OR 1.05; 95% CI [0.95, 1.17]; P=0.21); and IgE correlates with BDR in above-median (OR for response 1.42; 95% CI [1.24, 1.63]; P=3.9e-7) but not in below-median patients (OR 1.05; 95% CI [0.92, 1.2]; P=0.57).
These results hold within the commonly recognized type 2 (T2)-high asthma endotype but not in T2-low patients, suggesting that our DNAm score primarily represents an unknown variation of T2 asthma. Among T2-high patients with high DNAm scores, elevated BEC or IgE also corresponds to baseline clinical presentation that is known to benefit more from biologic treatment, including higher exacerbation scores, higher allergen sensitization, lower BMI, more recent oral corticosteroids prescription, and lower lung function. Our findings suggest that BEC and IgE, the traditional asthma biomarkers of T2-high asthma, are poor biomarkers for millions worldwide. Revisiting existing drug eligibility criteria relying on these biomarkers in asthma medical care may enhance precision and equity in treatment.
Affiliation(s)
- Aditya Gorla
  - Bioinformatics Interdepartmental Program, University of California Los Angeles, Los Angeles, CA, USA
- Jonathan Witonsky
  - Division of Allergy, Immunology, and Bone Marrow Transplant, Department of Pediatrics, University of California San Francisco, San Francisco, CA, USA
- Jennifer R Elhawary
  - Department of Medicine, University of California, San Francisco, San Francisco, CA, USA
- Zeyuan Johnson Chen
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, USA
- Joel Mefford
  - Department of Neurology, University of California Los Angeles, Los Angeles, CA, USA
- Javier Perez-Garcia
  - Genomics and Health Group, Department of Biochemistry, Microbiology, Cell Biology, and Genetics, University of La Laguna, La Laguna, Spain
- Scott Huntsman
  - Department of Medicine, University of California, San Francisco, San Francisco, CA, USA
- Donglei Hu
  - Department of Medicine, University of California, San Francisco, San Francisco, CA, USA
- Celeste Eng
  - Department of Medicine, University of California, San Francisco, San Francisco, CA, USA
- Prescott G Woodruff
  - Department of Medicine, University of California, San Francisco, San Francisco, CA, USA
- Sriram Sankararaman
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, USA
  - Department of Computational Medicine, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, USA
  - Department of Human Genetics, University of California Los Angeles, Los Angeles, CA, USA
- Elad Ziv
  - Department of Medicine, University of California, San Francisco, San Francisco, CA, USA
- Jonathan Flint
  - Department of Psychiatry and Behavioral Sciences, Brain Research Institute, University of California Los Angeles, Los Angeles, CA, USA
- Noah Zaitlen
  - Department of Computational Medicine, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, USA
  - Department of Human Genetics, University of California Los Angeles, Los Angeles, CA, USA
  - Department of Neurology, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, USA
- Esteban Burchard
  - Department of Medicine, University of California, San Francisco, San Francisco, CA, USA
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Elior Rahmani
  - Department of Computational Medicine, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, USA

2
Ferro dos Santos MR, Giuili E, De Koker A, Everaert C, De Preter K. Computational deconvolution of DNA methylation data from mixed DNA samples. Brief Bioinform 2024; 25:bbae234. [PMID: 38762790 PMCID: PMC11102637 DOI: 10.1093/bib/bbae234] [Received: 02/07/2024] [Revised: 03/30/2024] [Accepted: 04/30/2024] [Indexed: 05/20/2024]
Abstract
In this review, we provide a comprehensive overview of the different computational tools that have been published for the deconvolution of bulk DNA methylation (DNAm) data. Here, deconvolution refers to the estimation of cell-type proportions that constitute a mixed sample. The paper reviews and compares 25 deconvolution methods (supervised, unsupervised or hybrid) developed between 2012 and 2023 and compares the strengths and limitations of each approach. Moreover, in this study, we describe the impact of the platform used for the generation of methylation data (including microarrays and sequencing), the applied data pre-processing steps and the used reference dataset on the deconvolution performance. Next to reference-based methods, we also examine methods that require only partial reference datasets or require no reference set at all. In this review, we provide guidelines for the use of specific methods dependent on the DNA methylation data type and data availability.
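As a minimal illustration of the estimation problem described in this abstract (not any specific method covered by the review), reference-based deconvolution can be sketched as non-negative least squares: given a reference matrix of per-cell-type methylation levels and a bulk profile, find non-negative weights that best reconstruct the bulk signal, then rescale them to sum to one. The function name `estimate_proportions`, the toy reference values, and the projected-gradient solver are all assumptions made for this sketch.

```python
def estimate_proportions(reference, bulk, n_iter=2000):
    """Estimate cell-type proportions by non-negative least squares.

    reference: list of rows, one per CpG site; each row holds the
               methylation level of that site in every cell type.
    bulk:      list of methylation levels of the mixed sample, one per site.
    Hypothetical helper for illustration only.
    """
    n_sites = len(reference)
    n_types = len(reference[0])
    # Safe step size: 1 / (squared Frobenius norm), which is no larger than
    # 1 / (largest eigenvalue of R^T R), so gradient steps cannot diverge.
    step = 1.0 / sum(v * v for row in reference for v in row)
    weights = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        # Residual of the current reconstruction at every site.
        residual = [
            sum(r * w for r, w in zip(row, weights)) - b
            for row, b in zip(reference, bulk)
        ]
        # Gradient of 0.5 * ||R w - b||^2 with respect to w.
        grad = [
            sum(reference[i][k] * residual[i] for i in range(n_sites))
            for k in range(n_types)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        weights = [max(0.0, w - step * g) for w, g in zip(weights, grad)]
    total = sum(weights)
    if total == 0.0:
        return weights
    # Rescale so the estimates can be read as proportions.
    return [w / total for w in weights]


# Toy example: 4 CpG sites, 2 cell types, bulk mixed at 30% / 70%.
toy_reference = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
toy_bulk = [0.3 * a + 0.7 * b for a, b in toy_reference]
print(estimate_proportions(toy_reference, toy_bulk))
```

Reference-free and partial-reference methods, which the review also covers, must additionally estimate the reference matrix itself; this sketch assumes it is fully known.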
Affiliation(s)
- Maísa R Ferro dos Santos
  - VIB-UGent Center for Medical Biotechnology (CMB), Technologiepark-Zwijnaarde 75, 9052 Zwijnaarde, Belgium
  - Cancer Research Institute Ghent (CRIG), 9000 Ghent, Belgium
- Edoardo Giuili
  - VIB-UGent Center for Medical Biotechnology (CMB), Technologiepark-Zwijnaarde 75, 9052 Zwijnaarde, Belgium
  - Cancer Research Institute Ghent (CRIG), 9000 Ghent, Belgium
- Andries De Koker
  - VIB-UGent Center for Medical Biotechnology (CMB), Technologiepark-Zwijnaarde 75, 9052 Zwijnaarde, Belgium
  - Cancer Research Institute Ghent (CRIG), 9000 Ghent, Belgium
- Celine Everaert
  - VIB-UGent Center for Medical Biotechnology (CMB), Technologiepark-Zwijnaarde 75, 9052 Zwijnaarde, Belgium
  - Cancer Research Institute Ghent (CRIG), 9000 Ghent, Belgium
- Katleen De Preter
  - VIB-UGent Center for Medical Biotechnology (CMB), Technologiepark-Zwijnaarde 75, 9052 Zwijnaarde, Belgium
  - Cancer Research Institute Ghent (CRIG), 9000 Ghent, Belgium

3
Gorla A, Sankararaman S, Burchard E, Flint J, Zaitlen N, Rahmani E. Phenotypic subtyping via contrastive learning. bioRxiv 2023:2023.01.05.522921. [PMID: 36711575 PMCID: PMC9881932 DOI: 10.1101/2023.01.05.522921] [Indexed: 01/09/2023]
Abstract
Defining and accounting for subphenotypic structure has the potential to increase statistical power and provide a deeper understanding of the heterogeneity in the molecular basis of complex disease. Existing phenotype subtyping methods primarily rely on clinically observed heterogeneity or metadata clustering. However, they generally tend to capture the dominant sources of variation in the data, which often originate from variation that is not descriptive of the mechanistic heterogeneity of the phenotype of interest; in fact, such dominant sources of variation, such as population structure or technical variation, are, in general, expected to be independent of subphenotypic structure. We instead aim to find a subspace with signal that is unique to a group of samples for which we believe that subphenotypic variation exists (e.g., cases of a disease). To that end, we introduce Phenotype Aware Components Analysis (PACA), a contrastive learning approach leveraging canonical correlation analysis to robustly capture weak sources of subphenotypic variation. In the context of disease, PACA learns a gradient of variation unique to cases in a given dataset, while leveraging control samples for accounting for variation and imbalances of biological and technical confounders between cases and controls. We evaluated PACA using an extensive simulation study, as well as on various subtyping tasks using genotypes, transcriptomics, and DNA methylation data. Our results provide multiple strong evidence that PACA allows us to robustly capture weak unknown variation of interest while being calibrated and well-powered, far superseding the performance of alternative methods. This renders PACA as a state-of-the-art tool for defining de novo subtypes that are more likely to reflect molecular heterogeneity, especially in challenging cases where the phenotypic heterogeneity may be masked by a myriad of strong unrelated effects in the data.
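The contrastive idea in this abstract, finding variation present in cases but not in controls, can be illustrated with contrastive PCA, a simpler and related technique. This is explicitly not PACA, which according to the abstract builds on canonical correlation analysis. The sketch below takes the top eigenvectors of the case covariance minus a scaled control covariance; the function name `contrastive_pca`, the `alpha` weight, and the toy data are assumptions made for illustration.

```python
import numpy as np


def contrastive_pca(cases, controls, alpha=1.0, n_components=1):
    """Return directions with high variance in cases but low in controls.

    cases, controls: arrays of shape (n_samples, n_features).
    alpha: how strongly control variance is penalized (illustrative default).
    Contrastive PCA sketch, not the PACA algorithm itself.
    """
    cases_c = cases - cases.mean(axis=0)
    controls_c = controls - controls.mean(axis=0)
    cov_cases = cases_c.T @ cases_c / (len(cases) - 1)
    cov_controls = controls_c.T @ controls_c / (len(controls) - 1)
    # The contrast matrix is symmetric, so eigh applies; eigenvalues come
    # back in ascending order and are re-sorted to descending here.
    eigvals, eigvecs = np.linalg.eigh(cov_cases - alpha * cov_controls)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order]


# Toy data: feature 0 varies strongly in both groups (a shared confounder),
# feature 1 varies only among cases (the case-specific signal).
toy_cases = np.array([[4.0, 2.0], [-4.0, 2.0], [4.0, -2.0], [-4.0, -2.0]])
toy_controls = np.array([[4.0, 0.0], [-4.0, 0.0], [4.0, 0.0], [-4.0, 0.0]])
direction = contrastive_pca(toy_cases, toy_controls)
print(np.abs(direction).ravel())
```

In this toy setup, ordinary PCA on the cases alone would pick feature 0 because it has the larger variance, whereas the contrastive direction loads on feature 1, the variation unique to cases.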
Affiliation(s)
- Aditya Gorla
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA, USA
- Sriram Sankararaman
  - Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA, USA
- Esteban Burchard
  - Department of Medicine, University of California, San Francisco, San Francisco, CA, USA
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, USA
- Jonathan Flint
  - Department of Psychiatry and Behavioral Sciences, Brain Research Institute, University of California, Los Angeles, Los Angeles, CA, USA
- Noah Zaitlen
  - Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Neurology, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- Elior Rahmani
  - Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA

4
Scherer M, Schmidt F, Lazareva O, Walter J, Baumbach J, Schulz MH, List M. Machine learning for deciphering cell heterogeneity and gene regulation. Nature Computational Science 2021; 1:183-191. [PMID: 38183187 DOI: 10.1038/s43588-021-00038-7] [Received: 10/15/2020] [Accepted: 02/08/2021] [Indexed: 12/14/2022]
Abstract
Epigenetics studies inheritable and reversible modifications of DNA that allow cells to control gene expression throughout their development and in response to environmental conditions. In computational epigenomics, machine learning is applied to study various epigenetic mechanisms genome wide. Its aim is to expand our understanding of cell differentiation, that is their specialization, in health and disease. Thus far, most efforts focus on understanding the functional encoding of the genome and on unraveling cell-type heterogeneity. Here, we provide an overview of state-of-the-art computational methods and their underlying statistical concepts, which range from matrix factorization and regularized linear regression to deep learning methods. We further show how the rise of single-cell technology leads to new computational challenges and creates opportunities to further our understanding of epigenetic regulation.
Affiliation(s)
- Michael Scherer
  - Department of Genetics/Epigenetics, Saarland University, Saarbrücken, Germany
  - Computational Biology Group, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
  - Graduate School of Computer Science, Saarland Informatics Campus, Saarbrücken, Germany
- Olga Lazareva
  - Chair of Experimental Bioinformatics, TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany
- Jörn Walter
  - Computational Biology Group, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
- Jan Baumbach
  - Chair of Experimental Bioinformatics, TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany
  - Computational BioMedicine Lab, Institute of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
  - Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Marcel H Schulz
  - Institute of Cardiovascular Regeneration, University Hospital and Goethe University Frankfurt, Frankfurt, Germany
- Markus List
  - Chair of Experimental Bioinformatics, TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany

5
Scherer M, Nazarov PV, Toth R, Sahay S, Kaoma T, Maurer V, Vedeneev N, Plass C, Lengauer T, Walter J, Lutsik P. Reference-free deconvolution, visualization and interpretation of complex DNA methylation data using DecompPipeline, MeDeCom and FactorViz. Nat Protoc 2020; 15:3240-3263. [PMID: 32978601 DOI: 10.1038/s41596-020-0369-6] [Received: 12/20/2019] [Accepted: 05/29/2020] [Indexed: 12/13/2022]
Abstract
DNA methylation profiling offers unique insights into human development and diseases. Often the analysis of complex tissues and cell mixtures is the only feasible option to study methylation changes across large patient cohorts. Since DNA methylomes are highly cell type specific, deconvolution methods can be used to recover cell type-specific information in the form of latent methylation components (LMCs) from such 'bulk' samples. Reference-free deconvolution methods retrieve these components without the need for DNA methylation profiles of purified cell types. Currently no integrated and guided procedure is available for data preparation and subsequent interpretation of deconvolution results. Here, we describe a three-stage protocol for reference-free deconvolution of DNA methylation data comprising: (i) data preprocessing, confounder adjustment using independent component analysis (ICA) and feature selection using DecompPipeline, (ii) deconvolution with multiple parameters using MeDeCom, RefFreeCellMix or EDec and (iii) guided biological inference and validation of deconvolution results with the R/Shiny graphical user interface FactorViz. Our protocol simplifies the analysis and guides the initial interpretation of DNA methylation data derived from complex samples. The harmonized approach is particularly useful to dissect and evaluate cell heterogeneity in complex systems such as tumors. We apply the protocol to lung cancer methylomes from The Cancer Genome Atlas (TCGA) and show that our approach identifies the proportions of stromal cells and tumor-infiltrating immune cells, as well as associations of the detected components with clinical parameters. The protocol takes slightly >3 d to complete and requires basic R skills.
Affiliation(s)
- Michael Scherer
  - Department of Genetics/Epigenetics, Saarland University, Saarbrücken, Germany
  - Computational Biology, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
- Petr V Nazarov
  - Quantitative Biology Unit, Luxembourg Institute of Health, Strassen, Luxembourg
- Reka Toth
  - Division of Cancer Epigenomics, German Cancer Research Center (DKFZ), Heidelberg, Germany
  - Division of Thoracic Oncology, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Shashwat Sahay
  - Department of Genetics/Epigenetics, Saarland University, Saarbrücken, Germany
  - Center for Digital Health, Berlin Institute of Health and Charité-Universitätsmedizin Berlin, Berlin, Germany
- Tony Kaoma
  - Quantitative Biology Unit, Luxembourg Institute of Health, Strassen, Luxembourg
- Valentin Maurer
  - Division of Cancer Epigenomics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Christoph Plass
  - Division of Cancer Epigenomics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Thomas Lengauer
  - Computational Biology, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
- Jörn Walter
  - Department of Genetics/Epigenetics, Saarland University, Saarbrücken, Germany
- Pavlo Lutsik
  - Division of Cancer Epigenomics, German Cancer Research Center (DKFZ), Heidelberg, Germany

6
BATMAN: Fast and Accurate Integration of Single-Cell RNA-Seq Datasets via Minimum-Weight Matching. iScience 2020; 23:101185. [PMID: 32504875 PMCID: PMC7276436 DOI: 10.1016/j.isci.2020.101185] [Received: 01/17/2020] [Revised: 04/17/2020] [Accepted: 05/15/2020] [Indexed: 11/23/2022]
Abstract
Single-cell RNA-sequencing (scRNA-seq) is a set of technologies used to profile gene expression at the level of individual cells. Although the throughput of scRNA-seq experiments is steadily growing in terms of the number of cells, large datasets are not yet commonly generated owing to prohibitively high costs. Integrating multiple datasets into one can improve power in scRNA-seq experiments, and efficient integration is very important for downstream analyses such as identifying cell-type-specific eQTLs. State-of-the-art scRNA-seq integration methods are based on the mutual nearest neighbor paradigm and fail to both correct for batch effects and maintain the local structure of the datasets. In this paper, we propose a novel scRNA-seq dataset integration method called BATMAN (BATch integration via minimum-weight MAtchiNg). Across multiple simulations and real datasets, we show that our method significantly outperforms state-of-the-art tools with respect to existing metrics for batch effects by up to 80% while retaining cell-to-cell relationships.
- Current methods for scRNA-seq dataset integration are based on MNN paradigm
- MNN paradigm has drawbacks, e.g., it fails in case of non-orthogonal batch effects
- BATMAN proposes a new paradigm based on minimum-weight bipartite matching
- BATMAN outperforms the existing scRNA-seq integration methods in the gene space
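To make the matching paradigm named in this abstract concrete, here is a minimal sketch of minimum-weight bipartite matching between two small batches. It assumes, for illustration only, that cells are paired across batches by Euclidean distance between expression vectors; the abstract does not give BATMAN's actual cost function or algorithm. The function name `min_weight_matching` and the toy batches are hypothetical, and the brute-force search is chosen for clarity (practical tools would use a polynomial-time solver such as the Hungarian algorithm).

```python
from itertools import permutations
from math import dist


def min_weight_matching(batch_a, batch_b):
    """Pair each cell in batch_a with a distinct cell in batch_b so that the
    total Euclidean distance over all pairs is minimal.

    Brute force over all permutations: only suitable for tiny inputs.
    Assumes both batches contain the same number of cells.
    """
    n = len(batch_a)
    # cost[i][j]: distance between cell i of batch A and cell j of batch B.
    cost = [[dist(a, b) for b in batch_b] for a in batch_a]
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(enumerate(best_perm)), best_total


# Toy batches: batch B is batch A shifted by (1, 1) and listed in another order.
toy_a = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]
toy_b = [[11.0, 1.0], [1.0, 1.0], [6.0, 6.0]]
pairs, total = min_weight_matching(toy_a, toy_b)
print(pairs, total)
```

Unlike a mutual-nearest-neighbor rule, a perfect matching forces every cell to receive exactly one partner, which is one way a global assignment can behave differently from purely local neighbor lookups.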