1
Słowiński P, Li M, Restrepo P, Alomran N, Spurr LF, Miller C, Tsaneva-Atanasova K, Horvath A. GeTallele: A Method for Analysis of DNA and RNA Allele Frequency Distributions. Front Bioeng Biotechnol 2020; 8:1021. [PMID: 33042959 PMCID: PMC7525018 DOI: 10.3389/fbioe.2020.01021] [Received: 04/21/2020] [Accepted: 08/04/2020] [Indexed: 12/12/2022]
Abstract
Variant allele frequencies (VAF) are an important measure of genetic variation that can be estimated at single-nucleotide variant (SNV) sites. RNA and DNA VAFs are used as indicators of a wide range of biological traits, including tumor purity and ploidy changes, allele-specific expression and gene-dosage transcriptional response. Here we present a novel methodology to assess gene and chromosomal allele asymmetries and to aid in identifying genomic alterations in RNA and DNA datasets. Our approach is based on analysis of the VAF distributions in chromosomal segments (continuous multi-SNV genomic regions). In each segment we estimate variant probability, a parameter of a random process that can generate synthetic VAF samples that closely resemble the observed data. We show that variant probability is a biologically interpretable quantitative descriptor of the VAF distribution in chromosomal segments and that it is consistent with other approaches. To this end, we apply the proposed methodology to data from 72 samples obtained from patients with breast invasive carcinoma (BRCA) from The Cancer Genome Atlas (TCGA). We compare DNA and RNA VAF distributions from matched RNA and whole exome sequencing (WES) datasets and find that both genomic signals give very similar segmentation and estimated variant probability profiles. We also find a correlation between variant probability and copy number alterations (CNA). Finally, to demonstrate a practical application of variant probabilities, we use them to estimate tumor purity. Tumor purity estimates based on variant probabilities demonstrate good concordance with other approaches (Pearson's correlation between 0.44 and 0.76). Our evaluation suggests that variant probabilities can serve as a dependable descriptor of VAF distribution, further enabling the statistical comparison of matched DNA and RNA datasets. Finally, they provide conceptual and mechanistic insights into relations between the structure of VAF distributions and genetic events. The methodology is implemented in a Matlab toolbox that provides a suite of functions for analysis, statistical assessment and visualization of Genome and Transcriptome allele frequency distributions. GeTallele is available at: https://github.com/SlowinskiPiotr/GeTallele.
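The abstract describes variant probability as a parameter of a random process that generates synthetic VAF samples. The Python sketch below illustrates that idea only and is not the GeTallele implementation (which is a Matlab toolbox). It assumes a simple binomial read-sampling model in which each site shows the variant allele with probability p or 1 - p, chosen at random. The function names `vaf` and `simulate_vafs`, the `depths` argument and the mirrored-binomial model are all hypothetical choices made for illustration.

```python
import random


def vaf(variant_reads, total_reads):
    """Variant allele frequency at one SNV site: variant reads / total reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return variant_reads / total_reads


def simulate_vafs(variant_probability, depths, seed=0):
    """Draw one synthetic VAF per site for a single chromosomal segment.

    Assumed model (not taken from the paper): at each site the variant
    allele is read with probability p or 1 - p, picked at random, and the
    number of variant reads is binomial in the site's read depth.
    """
    rng = random.Random(seed)
    samples = []
    for depth in depths:
        # Randomly mirror the probability around 0.5 for this site.
        if rng.random() < 0.5:
            p = variant_probability
        else:
            p = 1.0 - variant_probability
        # Binomial draw: count reads that carry the variant allele.
        variant_reads = sum(rng.random() < p for _ in range(depth))
        samples.append(vaf(variant_reads, depth))
    return samples
```

Under this assumed model, a variant probability of 0.5 gives synthetic VAFs centered on 0.5, while values further from 0.5 give a distribution with two modes placed symmetrically around 0.5. Estimating variant probability would then amount to finding the value whose synthetic samples best resemble the observed VAFs in a segment; the abstract does not specify the fitting criterion, so it is not sketched here.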
Affiliation(s)
- Piotr Słowiński
  - Department of Mathematics, College of Engineering, Mathematics and Physical Sciences, Living Systems Institute, Translational Research Exchange @ Exeter and The Engineering and Physical Sciences Research Council Centre for Predictive Modelling in Healthcare, University of Exeter, Exeter, United Kingdom
- Muzi Li
  - McCormick Genomics and Proteomics Center, School of Medicine and Health Sciences, The George Washington University, Washington, DC, United States
- Paula Restrepo
  - McCormick Genomics and Proteomics Center, School of Medicine and Health Sciences, The George Washington University, Washington, DC, United States
  - Department of Genetics and Genomics Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Nawaf Alomran
  - McCormick Genomics and Proteomics Center, School of Medicine and Health Sciences, The George Washington University, Washington, DC, United States
- Liam F Spurr
  - McCormick Genomics and Proteomics Center, School of Medicine and Health Sciences, The George Washington University, Washington, DC, United States
  - Cancer Program, Broad Institute of MIT and Harvard, Cambridge, MA, United States
  - Medical Oncology, Dana-Farber Cancer Institute, Boston, MA, United States
  - Biological Sciences Division, Pritzker School of Medicine, The University of Chicago, Chicago, IL, United States
- Christian Miller
  - McCormick Genomics and Proteomics Center, School of Medicine and Health Sciences, The George Washington University, Washington, DC, United States
- Krasimira Tsaneva-Atanasova
  - Department of Mathematics, College of Engineering, Mathematics and Physical Sciences, Living Systems Institute, Translational Research Exchange @ Exeter and The Engineering and Physical Sciences Research Council Centre for Predictive Modelling in Healthcare, University of Exeter, Exeter, United Kingdom
  - Department of Bioinformatics and Mathematical Modelling, Institute of Biophysics and Biomedical Engineering, Bulgarian Academy of Sciences, Sofia, Bulgaria
- Anelia Horvath
  - McCormick Genomics and Proteomics Center, School of Medicine and Health Sciences, The George Washington University, Washington, DC, United States
  - Department of Pharmacology and Physiology, School of Medicine and Health Sciences, The George Washington University, Washington, DC, United States
  - Department of Biochemistry and Molecular Medicine, School of Medicine and Health Sciences, The George Washington University, Washington, DC, United States
2
Yang S, Wachtel MS, Wu J. DFseq: Distribution-Free Method to Detect Differential Gene Expression for RNA-Sequencing Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:558-565. [PMID: 30176602 DOI: 10.1109/tcbb.2018.2866994] [Indexed: 06/08/2023]
Abstract
Many current RNA-sequencing data analysis methods compare expression one gene at a time, taking little account of the correlations among genes. In this study, we propose a method to convert such a one-dimensional comparison approach into a two-dimensional evaluation of the ratio of standard deviations (SD) of two constructed random variables. This method allows the identification of differentially expressed genes while controlling a preset significance level conditional on the read count mean-variance relationship. Meanwhile, correlations among genes are naturally accommodated due to the clustering of genes with similar distributions in the proposed σ-σ plot. The proposed distribution-free method is designated DFseq because it does not depend on a parametric distribution to fit read counts. As a result, compared with parametric methods, DFseq can effectively handle genes with a bimodal-like distribution and/or genes with excessive zero read counts, as well as genes with outlying observations. In addition, DFseq is an ideal platform for comparing the performance of different differential gene expression detection methods.