201
van der Velde KJ, de Boer EN, van Diemen CC, Sikkema-Raddatz B, Abbott KM, Knopperts A, Franke L, Sijmons RH, de Koning TJ, Wijmenga C, Sinke RJ, Swertz MA. GAVIN: Gene-Aware Variant INterpretation for medical sequencing. Genome Biol 2017; 18:6. [PMID: 28093075 PMCID: PMC5240400 DOI: 10.1186/s13059-016-1141-7] [Citation(s) in RCA: 43] [Impact Index Per Article: 6.1] [Received: 07/13/2016] [Accepted: 12/19/2016] [Indexed: 01/08/2023]
Abstract
We present Gene-Aware Variant INterpretation (GAVIN), a new method that accurately classifies variants for clinical diagnostic purposes. Classifications are based on gene-specific calibrations of allele frequencies from the ExAC database, likely variant impact using SnpEff, and estimated deleteriousness based on CADD scores for >3000 genes. In a benchmark on 18 clinical gene sets, we achieve a sensitivity of 91.4% and a specificity of 76.9%. This accuracy is unmatched by 12 other tools. We provide GAVIN as an online MOLGENIS service to annotate VCF files and as an open source executable for use in bioinformatic pipelines. It can be found at http://molgenis.org/gavin.
Affiliation(s)
- K Joeri van der Velde
  - University of Groningen, University Medical Center Groningen, Genomics Coordination Center, Groningen, The Netherlands
  - Department of Genetics, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
- Eddy N de Boer
  - Department of Genetics, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
- Cleo C van Diemen
  - Department of Genetics, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
- Birgit Sikkema-Raddatz
  - Department of Genetics, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
- Kristin M Abbott
  - Department of Genetics, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
- Alain Knopperts
  - Department of Genetics, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
- Lude Franke
  - Department of Genetics, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
- Rolf H Sijmons
  - Department of Genetics, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
- Tom J de Koning
  - Department of Genetics, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
- Cisca Wijmenga
  - Department of Genetics, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
- Richard J Sinke
  - Department of Genetics, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
- Morris A Swertz
  - University of Groningen, University Medical Center Groningen, Genomics Coordination Center, Groningen, The Netherlands
  - Department of Genetics, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
202
Piraino SW, Furney SJ. Identification of coding and non-coding mutational hotspots in cancer genomes. BMC Genomics 2017; 18:17. [PMID: 28056774 PMCID: PMC5217664 DOI: 10.1186/s12864-016-3420-9] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 08/05/2016] [Accepted: 12/14/2016] [Indexed: 12/21/2022]
Abstract
Background: The identification of mutations that play a causal role in tumour development, so-called “driver” mutations, is of critical importance for understanding how cancers form and how they might be treated. Several large cancer sequencing projects have identified genes that are recurrently mutated in cancer patients, suggesting a role in tumourigenesis. While the landscape of coding drivers has been extensively studied and many of the most prominent driver genes are well characterised, comparatively less is known about the role of mutations in the non-coding regions of the genome in cancer development. The continuing fall in genome sequencing costs has resulted in a concomitant increase in the number of cancer whole genome sequences being produced, facilitating systematic interrogation of both the coding and non-coding regions of cancer genomes. Results: To examine the mutational landscapes of tumour genomes we have developed a novel method to identify mutational hotspots in tumour genomes using both mutational data and information on evolutionary conservation. We have applied our methodology to over 1300 whole cancer genomes and show that it identifies prominent coding and non-coding regions that are known or highly suspected to play a role in cancer. Importantly, we applied our method to the entire genome, rather than relying on predefined annotations (e.g. promoter regions) and we highlight recurrently mutated regions that may have resulted from increased exposure to mutational processes rather than selection, some of which have been identified previously as targets of selection. Finally, we implicate several pan-cancer and cancer-specific candidate non-coding regions, which could be involved in tumourigenesis. Conclusions: We have developed a framework to identify mutational hotspots in cancer genomes, which is applicable to the entire genome. This framework identifies known and novel coding and non-coding mutational hotspots and can be used to differentiate candidate driver regions from likely passenger regions susceptible to somatic mutation. Electronic supplementary material: The online version of this article (doi:10.1186/s12864-016-3420-9) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Scott W Piraino
  - School of Biomolecular and Biomedical Science, Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Dublin, Ireland
- Simon J Furney
  - School of Biomolecular and Biomedical Science, Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Belfield, Dublin 4, Ireland
203
Handel AE, Gallone G, Zameel Cader M, Ponting CP. Most brain disease-associated and eQTL haplotypes are not located within transcription factor DNase-seq footprints in brain. Hum Mol Genet 2017; 26:79-89. [PMID: 27798116 PMCID: PMC5351933 DOI: 10.1093/hmg/ddw369] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 09/19/2016] [Revised: 09/19/2016] [Accepted: 10/24/2016] [Indexed: 11/20/2022]
Abstract
Dense genotyping approaches have revealed much about the genetic architecture both of gene expression and disease susceptibility. However, assigning causality to genetic variants associated with a transcriptomic or phenotypic trait presents a far greater challenge. The development of epigenomic resources by ENCODE, the Epigenomic Roadmap and others has led to strategies that seek to infer the likely functional variants underlying these genome-wide association signals. It is known, for example, that such variants tend to be located within areas of open chromatin, as detected by techniques such as DNase-seq and FAIRE-seq. We aimed to assess what proportion of variants associated with phenotypic or transcriptomic traits in the human brain are located within transcription factor binding sites. The bioinformatic tools, Wellington and HINT, were used to infer transcription factor footprints from existing DNase-seq data derived from central nervous system tissues with high spatial resolution. This dataset was then employed to assess the likely contribution of altered transcription factor binding to both expression quantitative trait loci (eQTL) and genome-wide association study (GWAS) signals. Surprisingly, we show that most haplotypes associated with GWAS or eQTL phenotypes are located outside of DNase-seq footprints. This could imply that DNase-seq footprinting is too insensitive an approach to identify a large proportion of true transcription factor binding sites. Importantly, this suggests that prioritising variants for genome engineering studies to establish causality will continue to be frustrated by an inability of footprinting to identify the causative variant within a haplotype.
Affiliation(s)
- Adam E. Handel
  - MRC Functional Genomics Unit, Department of Physiology, Anatomy and Genetics
  - Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, Oxfordshire, UK
- Giuseppe Gallone
  - MRC Functional Genomics Unit, Department of Physiology, Anatomy and Genetics
- M. Zameel Cader
  - Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, Oxfordshire, UK
- Chris P. Ponting
  - MRC Functional Genomics Unit, Department of Physiology, Anatomy and Genetics
204
Chadaeva IV, Ponomarenko MP, Rasskazov DA, Sharypova EB, Kashina EV, Matveeva MY, Arshinova TV, Ponomarenko PM, Arkova OV, Bondar NP, Savinkova LK, Kolchanov NA. Candidate SNP markers of aggressiveness-related complications and comorbidities of genetic diseases are predicted by a significant change in the affinity of TATA-binding protein for human gene promoters. BMC Genomics 2016; 17:995. [PMID: 28105927 PMCID: PMC5249025 DOI: 10.1186/s12864-016-3353-3] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Indexed: 02/06/2023]
Abstract
BACKGROUND: Aggressiveness in humans is a hereditary behavioral trait that mobilizes all systems of the body (first of all, the nervous and endocrine systems, and then the respiratory, vascular, muscular, and others), e.g., for the defense of oneself, children, family, shelter, territory, and other possessions as well as personal interests. The level of aggressiveness of a person determines many other characteristics of quality of life and lifespan, acting as a stress factor. Aggressive behavior depends on many parameters such as age, gender, diseases and treatment, diet, and environmental conditions. Among them, genetic factors are believed to be the main parameters that are well-studied at the factual level, but in actuality, genome-wide studies of aggressive behavior appeared relatively recently. One of the biggest projects of modern science, 1000 Genomes, involves identification of single nucleotide polymorphisms (SNPs), i.e., differences of individual genomes from the reference genome. SNPs can be associated with hereditary diseases, their complications, comorbidities, and responses to stress or a drug. Clinical comparisons between cohorts of patients and healthy volunteers (as a control) allow for identifying SNPs whose allele frequencies significantly separate them from one another as markers of the above conditions. Computer-based preliminary analysis of millions of SNPs detected by the 1000 Genomes project can accelerate clinical search for SNP markers due to preliminary whole-genome search for the most meaningful candidate SNP markers and discarding of neutral and poorly substantiated SNPs. RESULTS: Here, we combine two computer-based search methods for SNPs (that alter gene expression): (i) the Web service SNP_TATA_Comparator (DNA sequence analysis) and (ii) a PubMed-based manual search for articles on aggressiveness using heuristic keywords. Near the known binding sites for TATA-binding protein (TBP) in human gene promoters, we found aggressiveness-related candidate SNP markers, including rs1143627 (associated with higher aggressiveness in patients undergoing cytokine immunotherapy), rs544850971 (higher aggressiveness in old women taking lipid-lowering medication), and rs10895068 (childhood aggressiveness-related obesity in adolescence with cardiovascular complications in adulthood). CONCLUSIONS: After validation of these candidate markers by clinical protocols, these SNPs may become useful for physicians (may help to improve treatment of patients) and for the general population (a lifestyle choice preventing aggressiveness-related complications).
Affiliation(s)
- Irina V. Chadaeva
  - Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, 10 Lavrentyev Avenue, Novosibirsk, 630090 Russia
  - Novosibirsk State University, 2 Pirogova Street, Novosibirsk, 630090 Russia
- Mikhail P. Ponomarenko
  - Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, 10 Lavrentyev Avenue, Novosibirsk, 630090 Russia
  - Novosibirsk State University, 2 Pirogova Street, Novosibirsk, 630090 Russia
- Dmitry A. Rasskazov
  - Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, 10 Lavrentyev Avenue, Novosibirsk, 630090 Russia
- Ekaterina B. Sharypova
  - Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, 10 Lavrentyev Avenue, Novosibirsk, 630090 Russia
- Elena V. Kashina
  - Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, 10 Lavrentyev Avenue, Novosibirsk, 630090 Russia
- Marina Yu Matveeva
  - Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, 10 Lavrentyev Avenue, Novosibirsk, 630090 Russia
- Tatjana V. Arshinova
  - Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, 10 Lavrentyev Avenue, Novosibirsk, 630090 Russia
- Petr M. Ponomarenko
  - Children’s Hospital Los Angeles, 4640 Hollywood Boulevard, University of Southern California, Los Angeles, CA 90027 USA
- Olga V. Arkova
  - Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, 10 Lavrentyev Avenue, Novosibirsk, 630090 Russia
  - Vector-Best Inc, Koltsovo, Novosibirsk Region 630559 Russia
- Natalia P. Bondar
  - Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, 10 Lavrentyev Avenue, Novosibirsk, 630090 Russia
- Ludmila K. Savinkova
  - Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, 10 Lavrentyev Avenue, Novosibirsk, 630090 Russia
- Nikolay A. Kolchanov
  - Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, 10 Lavrentyev Avenue, Novosibirsk, 630090 Russia
  - Novosibirsk State University, 2 Pirogova Street, Novosibirsk, 630090 Russia
205
Dong C, Guo Y, Yang H, He Z, Liu X, Wang K. iCAGES: integrated CAncer GEnome Score for comprehensively prioritizing driver genes in personal cancer genomes. Genome Med 2016; 8:135. [PMID: 28007024 PMCID: PMC5180414 DOI: 10.1186/s13073-016-0390-0] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.0] [Received: 06/07/2016] [Accepted: 12/05/2016] [Indexed: 12/31/2022]
Abstract
Cancer results from the acquisition of somatic driver mutations. Several computational tools can predict driver genes from population-scale genomic data, but tools for analyzing personal cancer genomes are underdeveloped. Here we developed iCAGES, a novel statistical framework that infers driver variants by integrating contributions from coding, non-coding, and structural variants, identifies driver genes by combining genomic information and prior biological knowledge, then generates prioritized drug treatment. Analysis on The Cancer Genome Atlas (TCGA) data showed that iCAGES predicts whether patients respond to drug treatment (P = 0.006 by Fisher's exact test) and long-term survival (P = 0.003 from Cox regression). iCAGES is available at http://icages.wglab.org.
Affiliation(s)
- Chengliang Dong
  - Zilkha Neurogenetic Institute, University of Southern California, Los Angeles, CA, 90089, USA
  - Biostatistics Graduate Program, Department of Preventive Medicine, University of Southern California, Los Angeles, CA, 90089, USA
- Yunfei Guo
  - Zilkha Neurogenetic Institute, University of Southern California, Los Angeles, CA, 90089, USA
  - Biostatistics Graduate Program, Department of Preventive Medicine, University of Southern California, Los Angeles, CA, 90089, USA
- Hui Yang
  - Zilkha Neurogenetic Institute, University of Southern California, Los Angeles, CA, 90089, USA
  - Neuroscience Graduate Program, University of Southern California, Los Angeles, CA, 90089, USA
- Zeyu He
  - Department of Computer Science, New York University, New York, NY, 10012, USA
- Xiaoming Liu
  - Human Genetics Center, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
  - Division of Epidemiology, Human Genetics and Environmental Sciences, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
- Kai Wang
  - Zilkha Neurogenetic Institute, University of Southern California, Los Angeles, CA, 90089, USA
  - Institute for Genomic Medicine, Columbia University, 630 W. 168th St, Room 11-451, New York, NY, 10032, USA
206
Lu Y, Quan C, Chen H, Bo X, Zhang C. 3DSNP: a database for linking human noncoding SNPs to their three-dimensional interacting genes. Nucleic Acids Res 2016; 45:D643-D649. [PMID: 27789693 PMCID: PMC5210526 DOI: 10.1093/nar/gkw1022] [Citation(s) in RCA: 77] [Impact Index Per Article: 9.6] [Received: 08/11/2016] [Revised: 10/16/2016] [Accepted: 10/18/2016] [Indexed: 12/02/2022]
Abstract
The vast noncoding portion of the human genome harbors a rich array of functional elements and disease-causing regulatory variants. Recent high-throughput chromosome conformation capture studies have outlined the principles of these elements interacting and regulating the expression of distal target genes through three-dimensional (3D) chromatin looping. Here we present 3DSNP, an integrated database for annotating human noncoding variants by exploring their roles in the distal interactions between genes and regulatory elements. 3DSNP integrates 3D chromatin interactions, local chromatin signatures in different cell types and linkage disequilibrium (LD) information from the 1000 Genomes Project. 3DSNP provides informative visualization tools to display the integrated local and 3D chromatin signatures and the genetic associations among variants. Data from different functional categories are integrated in a scoring system that quantitatively measures the functionality of SNPs to help select important variants from a large pool. 3DSNP is a valuable resource for the annotation of human noncoding genome sequence and investigating the impact of noncoding variants on clinical phenotypes. The 3DSNP database is available at http://biotech.bmi.ac.cn/3dsnp/.
Affiliation(s)
- Yiming Lu
  - Beijing Institute of Radiation Medicine, State Key Laboratory of Proteomics, Beijing 100850, China
- Cheng Quan
  - Beijing Institute of Radiation Medicine, State Key Laboratory of Proteomics, Beijing 100850, China
- Hebing Chen
  - Beijing Institute of Radiation Medicine, State Key Laboratory of Proteomics, Beijing 100850, China
- Xiaochen Bo
  - Beijing Institute of Radiation Medicine, State Key Laboratory of Proteomics, Beijing 100850, China
- Chenggang Zhang
  - Beijing Institute of Radiation Medicine, State Key Laboratory of Proteomics, Beijing 100850, China
207
Li H, He Z, Gu Y, Fang L, Lv X. Prioritization of non-coding disease-causing variants and long non-coding RNAs in liver cancer. Oncol Lett 2016; 12:3987-3994. [PMID: 27895760 DOI: 10.3892/ol.2016.5135] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 03/20/2015] [Accepted: 06/16/2016] [Indexed: 01/10/2023]
Abstract
There are multiple bioinformatics tools available for the detection of coding driver mutations in cancers. However, the prioritization of pathogenic non-coding variants remains a challenging and demanding task. The present study was performed to discriminate non-coding disease-causing mutations and prioritize potential cancer-implicated long non-coding RNAs (lncRNAs) in liver cancer using a logistic regression model. A logistic regression model was constructed by combining 19,153 disease-associated ClinVar and Human Gene Mutation Database pathogenic variants as the response variable and non-coding features as the predictor variables. Genome-wide association study (GWAS) disease or trait-associated variants and recurrent somatic mutations were used to validate the model. Non-coding gene features with the highest fractions of load were characterized and potential cancer-associated lncRNA candidates were prioritized by combining the fraction of high-scoring regions and average score predicted by the logistic regression model. H3K9me3 and conserved regions were the most negatively and positively informative for the model, respectively. The area under the receiver operating characteristic curve of the model was 0.92. The average score of GWAS disease-associated variants was significantly increased compared with neutral single nucleotide polymorphisms (5.8642 vs. 5.4707; P<0.001), and the average score of recurrent somatic mutations of liver cancer was significantly increased compared with non-recurrent somatic mutations (5.4101 vs. 5.2768; P=0.0125). The present study found regions in lncRNAs and introns/untranslated regions of protein coding genes where mutations are most likely to be damaging. In total, 847 lncRNAs were filtered out from the background. Characterization of this subset of lncRNAs showed that these lncRNAs are more conserved, less mutated and more highly expressed compared with other control lncRNAs. In addition, 23 of these lncRNAs were differentially expressed between 12 pairs of liver cancer and adjacent normal specimens. The logistic regression model is a useful tool to prioritize non-coding pathogenic variants and lncRNAs, and paves the way for the detection of non-coding driver lncRNAs in liver cancer.
Affiliation(s)
- Hua Li
  - Department of Anesthesiology, Shanghai Pulmonary Hospital, Tongji University School of Medicine, Shanghai 200433, P.R. China
- Zekun He
  - Department of Clinical Medicine, Fuzhou Medical College of Nanchang University, Fuzhou, Jiangxi 344000, P.R. China
- Yang Gu
  - Department of Anesthesiology, Shanghai Pulmonary Hospital, Tongji University School of Medicine, Shanghai 200433, P.R. China
- Lin Fang
  - Department of Thyroid and Breast Surgery, Shanghai Tenth People's Hospital, Tongji University, School of Medicine, Shanghai 200072, P.R. China
- Xin Lv
  - Department of Anesthesiology, Shanghai Pulmonary Hospital, Tongji University School of Medicine, Shanghai 200433, P.R. China
208
Candidate SNP Markers of Chronopathologies Are Predicted by a Significant Change in the Affinity of TATA-Binding Protein for Human Gene Promoters. BIOMED RESEARCH INTERNATIONAL 2016; 2016:8642703. [PMID: 27635400 PMCID: PMC5011241 DOI: 10.1155/2016/8642703] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 03/04/2016] [Revised: 06/25/2016] [Accepted: 06/28/2016] [Indexed: 01/14/2023]
Abstract
Variations in the human genome (e.g., single nucleotide polymorphisms, SNPs) may be associated with hereditary diseases, their complications, comorbidities, and drug responses. Using the Web service SNP_TATA_Comparator presented in our previous paper, we analyzed the immediate surroundings of known SNP markers of diseases and identified several candidate SNP markers that can significantly change the affinity of TATA-binding protein for human gene promoters, with circadian consequences. For example, rs572527200 may be related to asthma, where symptoms are circadian (worse at night), and rs367732974 may be associated with heart attacks that are characterized by a circadian preference (early morning). By the same method, we analyzed the 90 bp proximal promoter region of each protein-coding transcript of each human gene of the circadian clock core. This analysis yielded 53 candidate SNP markers, such as rs181985043 (susceptibility to acute Q fever in male patients), rs192518038 (higher risk of a heart attack in patients with diabetes), and rs374778785 (emphysema and lung cancer in smokers). If they are properly validated according to clinical standards, these candidate SNP markers may turn out to be useful for physicians (to select optimal treatment for each patient) and for the general population (to choose a lifestyle preventing possible circadian complications of diseases).
209
Poulos RC, Sloane MA, Hesson LB, Wong JWH. The search for cis-regulatory driver mutations in cancer genomes. Oncotarget 2016; 6:32509-25. [PMID: 26356674 PMCID: PMC4741709 DOI: 10.18632/oncotarget.5085] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 06/11/2015] [Accepted: 08/06/2015] [Indexed: 12/16/2022]
Abstract
With the advent of high-throughput and relatively inexpensive whole-genome sequencing technology, the focus of cancer research has begun to shift toward analyses of somatic mutations in non-coding cis-regulatory elements of the cancer genome. Cis-regulatory elements play an important role in gene regulation, with mutations in these elements potentially resulting in changes to the expression of linked genes. The recent discoveries of recurrent TERT promoter mutations in melanoma, and recurrent mutations that create a super-enhancer regulating TAL1 expression in T-cell acute lymphoblastic leukaemia (T-ALL), have sparked significant interest in the search for other somatic cis-regulatory mutations driving cancer development. In this review, we look more closely at the TERT promoter and TAL1 enhancer alterations and use these examples to ask whether other cis-regulatory mutations may play a role in cancer susceptibility. In doing so, we make observations from the data emerging from recent research in this field, and describe the experimental and analytical approaches which could be adopted in the hope of better uncovering the true functional significance of somatic cis-regulatory mutations in cancer.
Affiliation(s)
- Rebecca C Poulos
  - Prince of Wales Clinical School and Lowy Cancer Research Centre, UNSW Australia, Sydney, Australia
- Mathew A Sloane
  - Prince of Wales Clinical School and Lowy Cancer Research Centre, UNSW Australia, Sydney, Australia
- Luke B Hesson
  - Prince of Wales Clinical School and Lowy Cancer Research Centre, UNSW Australia, Sydney, Australia
- Jason W H Wong
  - Prince of Wales Clinical School and Lowy Cancer Research Centre, UNSW Australia, Sydney, Australia
210
Li MJ, Pan Z, Liu Z, Wu J, Wang P, Zhu Y, Xu F, Xia Z, Sham PC, Kocher JPA, Li M, Liu JS, Wang J. Predicting regulatory variants with composite statistic. Bioinformatics 2016; 32:2729-36. [PMID: 27273672 DOI: 10.1093/bioinformatics/btw288] [Citation(s) in RCA: 35] [Impact Index Per Article: 4.4] [Received: 12/28/2015] [Accepted: 04/29/2016] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Prediction and prioritization of human non-coding regulatory variants is critical for understanding the regulatory mechanisms of disease pathogenesis and promoting personalized medicine. Existing tools utilize functional genomics data and evolutionary information to evaluate the pathogenicity or regulatory functions of non-coding variants. However, different algorithms lead to inconsistent and even conflicting predictions. Combining multiple methods may increase accuracy in regulatory variant prediction. RESULTS: Here, we compiled an integrative resource for predictions from eight different tools on functional annotation of non-coding variants. We further developed a composite strategy to integrate multiple predictions and computed the composite likelihood of a given variant being a regulatory variant. Benchmarking on multiple independent causal variant datasets, we demonstrated that our composite model significantly improves the prediction performance. AVAILABILITY AND IMPLEMENTATION: We implemented our model and scoring procedure as a tool, named PRVCS, which is freely available for academic and non-profit use at http://jjwanglab.org/PRVCS. CONTACT: wang.junwen@mayo.edu, jliu@stat.harvard.edu, or limx54@gmail.com. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mulin Jun Li
  - Department of Statistics, Harvard University, Cambridge, Boston, 02138-2901 MA, USA, Centre for Genomic Sciences
- Zhicheng Pan
  - Centre for Genomic Sciences, Department of Psychiatry
- Zipeng Liu
  - Centre for Genomic Sciences, Department of Anaesthesiology
- Jiexing Wu
  - Department of Statistics, Harvard University, Cambridge, Boston, 02138-2901 MA, USA
- Yun Zhu
  - Centre for Genomic Sciences, School of Biomedical Sciences
- Jean-Pierre A Kocher
  - Department of Health Sciences Research, Center for Individualized Medicine, Mayo Clinic, Scottsdale, AZ 85259, USA
- Miaoxin Li
  - Centre for Genomic Sciences, Department of Psychiatry, Centre for Reproduction, Development and Growth, LKS Faculty of Medicine, the University of Hong Kong, Hong Kong SAR, China
- Jun S Liu
  - Department of Statistics, Harvard University, Cambridge, Boston, 02138-2901 MA, USA, Center for Statistical Science, Tsinghua University, Beijing 100084, China
- Junwen Wang
  - Centre for Genomic Sciences, Department of Health Sciences Research, Center for Individualized Medicine, Mayo Clinic, Scottsdale, AZ 85259, USA and Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA
211
Umer HM, Cavalli M, Dabrowski MJ, Diamanti K, Kruczyk M, Pan G, Komorowski J, Wadelius C. A Significant Regulatory Mutation Burden at a High-Affinity Position of the CTCF Motif in Gastrointestinal Cancers. Hum Mutat 2016; 37:904-13. [PMID: 27174533 DOI: 10.1002/humu.23014] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Received: 03/04/2016] [Accepted: 05/03/2016] [Indexed: 12/22/2022]
Abstract
Somatic mutations drive cancer and there are established ways to study those in coding sequences. It has been shown that some regulatory mutations are over-represented in cancer. We develop a new strategy to find putative regulatory mutations based on experimentally established motifs for transcription factors (TFs). In total, we find 1,552 candidate regulatory mutations predicted to significantly reduce binding affinity of many TFs in hepatocellular carcinoma and also affecting binding of CTCF in esophageal, gastric, and pancreatic cancers. Near mutated motifs, there is a significant enrichment of (1) genes mutated in cancer, (2) tumor-suppressor genes, (3) genes in KEGG cancer pathways, and (4) sets of genes previously associated with cancer. Experimental and functional validations support the findings. The strategy can be applied to identify regulatory mutations in any cell type with established TF motifs and will aid identification of genes contributing to cancer.
Affiliation(s)
- Husen M Umer
- Science for Life Laboratory, Department of Cell and Molecular Biology, Uppsala University, Uppsala, SE-751-24, Sweden
| | - Marco Cavalli
- Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, SE-751 08, Sweden
| | - Michal J Dabrowski
- Institute of Computer Science, Polish Academy of Sciences, Warsaw, 01-248, Poland
| | - Klev Diamanti
- Science for Life Laboratory, Department of Cell and Molecular Biology, Uppsala University, Uppsala, SE-751-24, Sweden
| | - Marcin Kruczyk
- Science for Life Laboratory, Department of Cell and Molecular Biology, Uppsala University, Uppsala, SE-751-24, Sweden
| | - Gang Pan
- Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, SE-751 08, Sweden
| | - Jan Komorowski
- Science for Life Laboratory, Department of Cell and Molecular Biology, Uppsala University, Uppsala, SE-751-24, Sweden.,Institute of Computer Science, Polish Academy of Sciences, Warsaw, 01-248, Poland
| | - Claes Wadelius
- Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, SE-751 08, Sweden
| |
|
212
|
Bendl J, Musil M, Štourač J, Zendulka J, Damborský J, Brezovský J. PredictSNP2: A Unified Platform for Accurately Evaluating SNP Effects by Exploiting the Different Characteristics of Variants in Distinct Genomic Regions. PLoS Comput Biol 2016; 12:e1004962. [PMID: 27224906 PMCID: PMC4880439 DOI: 10.1371/journal.pcbi.1004962] [Citation(s) in RCA: 133] [Impact Index Per Article: 16.6] [Received: 12/16/2015] [Accepted: 05/05/2016] [Indexed: 12/20/2022] Open
Abstract
An important message taken from human genome sequencing projects is that the human population exhibits approximately 99.9% genetic similarity. Variations in the remaining parts of the genome determine our identity, trace our history and reveal our heritage. The precise delineation of phenotypically causal variants plays a key role in providing accurate personalized diagnosis, prognosis, and treatment of inherited diseases. Several computational methods for achieving such delineation have been reported recently. However, their ability to pinpoint potentially deleterious variants is limited by the fact that their mechanisms of prediction do not account for the existence of different categories of variants. Consequently, their output is biased towards the variant categories that are most strongly represented in the variant databases. Moreover, most such methods provide numeric scores but not binary predictions of the deleteriousness of variants or confidence scores that would be more easily understood by users. We have constructed three datasets covering different types of disease-related variants, which were divided across five categories: (i) regulatory, (ii) splicing, (iii) missense, (iv) synonymous, and (v) nonsense variants. These datasets were used to develop category-optimal decision thresholds and to evaluate six tools for variant prioritization: CADD, DANN, FATHMM, FitCons, FunSeq2 and GWAVA. This evaluation revealed some important advantages of the category-based approach. The results obtained with the five best-performing tools were then combined into a consensus score. Additional comparative analyses showed that in the case of missense variations, protein-based predictors perform better than DNA sequence-based predictors. A user-friendly web interface was developed that provides easy access to the five tools’ predictions, and their consensus scores, in a user-understandable format tailored to the specific features of different categories of variations. 
To enable comprehensive evaluation of variants, the predictions are complemented with annotations from eight databases. The web server is freely available to the community at http://loschmidt.chemi.muni.cz/predictsnp2.
Affiliation(s)
- Jaroslav Bendl
- Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment RECETOX, Masaryk University, Brno, Czech Republic
- Department of Information Systems, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Brno, Czech Republic
| | - Miloš Musil
- Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment RECETOX, Masaryk University, Brno, Czech Republic
- Department of Information Systems, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic
| | - Jan Štourač
- Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment RECETOX, Masaryk University, Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Brno, Czech Republic
| | - Jaroslav Zendulka
- Department of Information Systems, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic
| | - Jiří Damborský
- Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment RECETOX, Masaryk University, Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Brno, Czech Republic
- * E-mail: (JD); (JBr)
| | - Jan Brezovský
- Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment RECETOX, Masaryk University, Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Brno, Czech Republic
- * E-mail: (JD); (JBr)
| |
|
213
|
Li H, Lv X. Functional annotation of noncoding variants and prioritization of cancer-associated lncRNAs in lung cancer. Oncol Lett 2016; 12:222-230. [PMID: 27347129 DOI: 10.3892/ol.2016.4604] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 03/12/2015] [Accepted: 04/01/2016] [Indexed: 11/05/2022] Open
Abstract
Multiple computational tools have been widely applied to the detection of coding driver mutations in cancer; however, the prioritization of pathogenic non-coding variants remains a difficult and demanding task. The present study was performed to distinguish non-coding disease-causing mutations from neutral ones, and to prioritize potential cancer-associated long non-coding RNAs (lncRNAs) with a logistic regression model in lung cancer. A logistic regression model was constructed, combining 19,153 disease-associated ClinVar and Human Gene Mutation Database pathogenic variants as the response variable and non-coding features as the predictor variable. Validation of the model was conducted with genome-wide association study (GWAS) disease- or trait-associated single nucleotide polymorphisms (SNPs) and recurrent somatic mutations. High scoring regions were characterized with respect to their distribution in various features and gene classes; potential cancer-associated lncRNA candidates were prioritized, combining the fraction of high-scoring regions and average score predicted by the logistic regression model. H3K79me2 was the most negative factor that contributed to the model, while conserved regions were most positively informative to the model. The area under the receiver operating characteristic curve of the model was 0.89. The model assigned a significantly higher score to GWAS SNPs and recurrent somatic mutations compared with neutral SNPs (mean, 5.9012 vs. 5.5238; P<0.001, Mann-Whitney U test) and non-recurrent mutations (mean, 5.4677 vs. 5.2277, P<0.001, Mann-Whitney U test), respectively. It was observed that regions, including splicing sites and untranslated regions, and gene classes, including cancer genes and cancer-associated lncRNAs, had an increased enrichment of high-scoring regions. In total, 2,679 cancer-associated lncRNAs were determined and characterized. 
A total of 104 of these lncRNAs were differentially expressed between lung cancer and normal specimens. The logistic regression model is a useful and efficient scoring system to prioritize non-coding pathogenic variants and lncRNAs, and may provide the basis for detecting non-coding driver lncRNAs in lung cancer.
Affiliation(s)
- Hua Li
- Department of Anesthesiology, Shanghai Pulmonary Hospital, School of Medicine, Tongji University, Shanghai 200072, P.R. China
| | - Xin Lv
- Department of Anesthesiology, Shanghai Pulmonary Hospital, School of Medicine, Tongji University, Shanghai 200072, P.R. China
| |
|
214
|
|
215
|
Kelley DR, Snoek J, Rinn JL. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res 2016; 26:990-9. [PMID: 27197224 PMCID: PMC4937568 DOI: 10.1101/gr.200535.115] [Citation(s) in RCA: 519] [Impact Index Per Article: 64.9] [Received: 10/04/2015] [Accepted: 04/26/2016] [Indexed: 12/22/2022]
Abstract
The complex language of eukaryotic gene expression remains incompletely understood. Despite the importance suggested by many noncoding variants statistically associated with human disease, nearly all such variants have unknown mechanisms. Here, we address this challenge using an approach based on a recent machine learning advance: deep convolutional neural networks (CNNs). We introduce the open source package Basset to apply CNNs to learn the functional activity of DNA sequences from genomics data. We trained Basset on a compendium of accessible genomic sites mapped in 164 cell types by DNase-seq, and demonstrate greater predictive accuracy than previous methods. Basset predictions for the change in accessibility between variant alleles were far greater for genome-wide association study (GWAS) SNPs that are likely to be causal relative to nearby SNPs in linkage disequilibrium with them. With Basset, a researcher can perform a single sequencing assay in their cell type of interest and simultaneously learn that cell's chromatin accessibility code and annotate every mutation in the genome with its influence on present accessibility and latent potential for accessibility. Thus, Basset offers a powerful computational approach to annotate and interpret the noncoding genome.
Affiliation(s)
- David R Kelley
- Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, Massachusetts 02138, USA
| | - Jasper Snoek
- School of Engineering and Applied Science, Harvard University, Cambridge, Massachusetts 02138, USA
| | - John L Rinn
- Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, Massachusetts 02138, USA
| |
|
216
|
A uniform survey of allele-specific binding and expression over 1000-Genomes-Project individuals. Nat Commun 2016; 7:11101. [PMID: 27089393 PMCID: PMC4837449 DOI: 10.1038/ncomms11101] [Citation(s) in RCA: 51] [Impact Index Per Article: 6.4] [Received: 06/21/2015] [Accepted: 02/19/2016] [Indexed: 02/07/2023] Open
Abstract
Large-scale sequencing in the 1000 Genomes Project has revealed multitudes of single nucleotide variants (SNVs). Here, we provide insights into the functional effect of these variants using allele-specific behaviour. This can be assessed for an individual by mapping ChIP-seq and RNA-seq reads to a personal genome, and then measuring 'allelic imbalances' between the numbers of reads mapped to the paternal and maternal chromosomes. We annotate variants associated with allele-specific binding and expression in 382 individuals by uniformly processing 1,263 functional genomics data sets, developing approaches to reduce the heterogeneity between data sets due to overdispersion and mapping bias. Since many allelic variants are rare, aggregation across multiple individuals is necessary to identify broadly applicable 'allelic elements'. We also found SNVs for which we can anticipate allelic imbalance from the disruption of a binding motif. Our results serve as an allele-specific annotation for the 1000 Genomes variant catalogue and are distributed as an online resource (alleledb.gersteinlab.org).
|
217
|
Ponomarenko MP, Arkova O, Rasskazov D, Ponomarenko P, Savinkova L, Kolchanov N. Candidate SNP Markers of Gender-Biased Autoimmune Complications of Monogenic Diseases Are Predicted by a Significant Change in the Affinity of TATA-Binding Protein for Human Gene Promoters. Front Immunol 2016; 7:130. [PMID: 27092142 PMCID: PMC4819121 DOI: 10.3389/fimmu.2016.00130] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 09/25/2015] [Accepted: 03/21/2016] [Indexed: 12/17/2022] Open
Abstract
Some variations of human genome [for example, single nucleotide polymorphisms (SNPs)] are markers of hereditary diseases and drug responses. Analysis of them can help to improve treatment. Computer-based analysis of millions of SNPs in the 1000 Genomes project makes a search for SNP markers more targeted. Here, we combined two computer-based approaches: DNA sequence analysis and keyword search in databases. In the binding sites for TATA-binding protein (TBP) in human gene promoters, we found candidate SNP markers of gender-biased autoimmune diseases, including rs1143627 [cachexia in rheumatoid arthritis (double prevalence among women)]; rs11557611 [demyelinating diseases (thrice more prevalent among young white women than among non-white individuals)]; rs17231520 and rs569033466 [both: atherosclerosis comorbid with related diseases (double prevalence among women)]; rs563763767 [Hughes syndrome-related thrombosis (lethal during pregnancy)]; rs2814778 [autoimmune diseases (excluding multiple sclerosis and rheumatoid arthritis) underlying hypergammaglobulinemia in women]; rs72661131 and rs562962093 (both: preterm delivery in pregnant diabetic women); and rs35518301, rs34166473, rs34500389, rs33981098, rs33980857, rs397509430, rs34598529, rs33931746, rs281864525, and rs63750953 (all: autoimmune diseases underlying hypergammaglobulinemia in women). Validation of these predicted candidate SNP markers using the clinical standards may advance personalized medicine.
Affiliation(s)
- Mikhail P. Ponomarenko
- Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, Novosibirsk, Russia
- Novosibirsk State University, Novosibirsk, Russia
| | - Olga Arkova
- Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, Novosibirsk, Russia
| | - Dmitry Rasskazov
- Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, Novosibirsk, Russia
| | | | - Ludmila Savinkova
- Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, Novosibirsk, Russia
| | - Nikolay Kolchanov
- Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, Novosibirsk, Russia
- Novosibirsk State University, Novosibirsk, Russia
| |
|
218
|
Turnaev II, Rasskazov DA, Arkova OV, Ponomarenko MP, Ponomarenko PM, Savinkova LK, Kolchanov NA. Hypothetical SNP markers that significantly affect the affinity of the TATA-binding protein to VEGFA, ERBB2, IGF1R, FLT1, KDR, and MET oncogene promoters as chemotherapy targets. Mol Biol 2016. [DOI: 10.1134/s0026893316010209] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Indexed: 11/23/2022]
|
219
|
Abstract
The interpretation of noncoding alterations in cancer genomes presents an unresolved problem in cancer studies. While the impact of somatic variations in protein-coding regions is widely accepted, noncoding aberrations are mostly considered as passenger events. However, with the advance of genome-wide profiling strategies, alterations outside the coding context entered the focus, and multiple examples highlight the role of gene deregulation as cancer-driving events. This review describes the implication of noncoding alterations in oncogenesis and provides a theoretical framework for the identification of causal somatic variants using quantitative trait loci (QTL) analysis. Assuming that functional noncoding alterations affect quantifiable regulatory processes, somatic QTL studies constitute a valuable strategy to pinpoint cancer gene deregulation. Eventually, the comprehensive identification and interpretation of coding and noncoding alterations will guide our future understanding of cancer biology.
|
220
|
Du C, Wu X, Li J. Mutation pattern is an influential factor on functional mutation rates in cancer. Cancer Cell Int 2016; 16:2. [PMID: 26865835 PMCID: PMC4748466 DOI: 10.1186/s12935-016-0278-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/11/2015] [Accepted: 02/03/2016] [Indexed: 11/11/2022] Open
Abstract
BACKGROUND: Mutation rates vary consistently across the cancer genome and play an important role in tumorigenesis; however, little is known about their functional potential and impact on the distribution of functional mutations. In this study, we investigated genomic features that affect mutation pattern and the functional importance of mutation pattern in cancer. METHODS: Somatic mutations of clear-cell renal cell carcinoma, liver cancer, lung cancer and melanoma and single nucleotide polymorphisms (SNPs) were intersected with 54 distinct genomic features. Somatic mutation and SNP densities were then computed for each feature type. We constructed 2856 1-Mb windows, in which each row (1-Mb window) contains somatic mutation and SNP densities and 54 feature vectors. Correlation analyses were conducted between somatic mutation and SNP densities and each feature vector. We also built two random forest models, namely a somatic mutation model (CSM) and an SNP model, to predict somatic mutation and SNP densities on a 1-Kb scale. The relation of CSM and SNP scores was further analyzed with the distributions of deleterious coding variants predicted by SIFT and Mutation Assessor, non-coding functional variants evaluated with FunSeq 2 and GWAVA, and disease-causing variants from the HGMD and ClinVar databases. RESULTS: We observed a wide range of genomic features that affect local mutation rates, such as replication time, transcription levels, histone marks and regulatory elements. Repressive histone marks, replication time and promoters contributed most to the CSM models, while recombination rate and chromatin organization were most important for the SNP model. We showed that low-mutated regions preferentially have higher densities of deleterious coding mutations, higher average scores of non-coding variants, a higher fraction of functional regions and higher enrichment of disease-causing variants compared with high-mutated regions. 
CONCLUSIONS: Somatic mutation densities vary widely across the cancer genome; mutation frequency is a major indicator of function and influences the distribution of functional mutations in cancer.
Affiliation(s)
- Chuance Du
- Department of Urology, Ganzhou Hospital Affiliated to Nanchang University, Ganzhou, Jiangxi province China
| | - Xiaoyuan Wu
- Department of Rehabilitation, Ganzhou Hospital Affiliated to Nanchang University, Nan Chang, Jiangxi province China
| | - Jia Li
- Department of Thyroid and Breast, Shanghai Tenth People’s Hospital, Tongji University, Shanghai, 200072 China
| |
|
221
|
Khurana E, Fu Y, Chakravarty D, Demichelis F, Rubin MA, Gerstein M. Role of non-coding sequence variants in cancer. Nat Rev Genet 2016; 17:93-108. [PMID: 26781813 DOI: 10.1038/nrg.2015.17] [Citation(s) in RCA: 319] [Impact Index Per Article: 39.9] [Indexed: 02/07/2023]
Abstract
Patients with cancer carry somatic sequence variants in their tumour in addition to the germline variants in their inherited genome. Although variants in protein-coding regions have received the most attention, numerous studies have noted the importance of non-coding variants in cancer. Moreover, the overwhelming majority of variants, both somatic and germline, occur in non-coding portions of the genome. We review the current understanding of non-coding variants in cancer, including the great diversity of the mutation types (from single nucleotide variants to large genomic rearrangements) and the wide range of mechanisms by which they affect gene expression to promote tumorigenesis, such as disrupting transcription factor-binding sites or functions of non-coding RNAs. We highlight specific case studies of somatic and germline variants, and discuss how non-coding variants can be interpreted on a large scale through computational and experimental methods.
Affiliation(s)
- Ekta Khurana
- Meyer Cancer Center, Weill Cornell Medical College, New York, New York 10065, USA.,Institute for Precision Medicine, Weill Cornell Medical College, New York, New York 10065, USA.,Institute for Computational Biomedicine, Weill Cornell Medical College, New York, New York 10021, USA.,Department of Physiology and Biophysics, Weill Cornell Medical College, New York, New York 10065, USA
| | - Yao Fu
- Bina Technologies, Roche Sequencing, Redwood City, California 94065, USA
| | - Dimple Chakravarty
- Institute for Precision Medicine, Weill Cornell Medical College, New York, New York 10065, USA.,Department of Pathology and Laboratory Medicine, Weill Cornell Medical College, New York, New York 10065, USA
| | - Francesca Demichelis
- Institute for Precision Medicine, Weill Cornell Medical College, New York, New York 10065, USA.,Institute for Computational Biomedicine, Weill Cornell Medical College, New York, New York 10021, USA.,Centre for Integrative Biology, University of Trento, 38123 Trento, Italy
| | - Mark A Rubin
- Meyer Cancer Center, Weill Cornell Medical College, New York, New York 10065, USA.,Institute for Precision Medicine, Weill Cornell Medical College, New York, New York 10065, USA.,Department of Pathology and Laboratory Medicine, Weill Cornell Medical College, New York, New York 10065, USA
| | - Mark Gerstein
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA.,Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA.,Department of Computer Science, Yale University, New Haven, Connecticut 06520, USA
| |
|
222
|
Arkova OV, Ponomarenko MP, Rasskazov DA, Drachkova IA, Arshinova TV, Ponomarenko PM, Savinkova LK, Kolchanov NA. Obesity-related known and candidate SNP markers can significantly change affinity of TATA-binding protein for human gene promoters. BMC Genomics 2015; 16 Suppl 13:S5. [PMID: 26694100 PMCID: PMC4686794 DOI: 10.1186/1471-2164-16-s13-s5] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Indexed: 01/11/2023] Open
Abstract
BACKGROUND: Obesity affects quality of life and life expectancy and is associated with cardiovascular disorders, cancer, diabetes, reproductive disorders in women, prostate diseases in men, and congenital anomalies in children. The use of single nucleotide polymorphism (SNP) markers of diseases and drug responses (i.e., significant differences of personal genomes of patients from the reference human genome) can help physicians to improve treatment. Clinical research can validate SNP markers via genotyping of patients and demonstration that SNP alleles are significantly more frequent in patients than in healthy people. The search for biomedical SNP markers of interest can be accelerated by computer-based analysis of hundreds of millions of SNPs in the 1000 Genomes project because of selection of the most meaningful candidate SNP markers and elimination of neutral SNPs. RESULTS: We cross-validated the output of two computer-based methods: DNA sequence analysis using Web service SNP_TATA_Comparator and keyword search for articles on comorbidities of obesity. Near the sites binding to TATA-binding protein (TBP) in human gene promoters, we found 22 obesity-related candidate SNP markers, including rs10895068 (male breast cancer in obesity); rs35036378 (reduced risk of obesity after ovariectomy); rs201739205 (reduced risk of obesity-related cancers due to weight loss by diet/exercise in obese postmenopausal women); rs183433761 (obesity resistance during a high-fat diet); rs367732974 and rs549591993 (both: cardiovascular complications in obese patients with type 2 diabetes mellitus); rs200487063 and rs34104384 (both: obesity-caused hypertension); rs35518301, rs72661131, and rs562962093 (all: obesity); and rs397509430, rs33980857, rs34598529, rs33931746, rs33981098, rs34500389, rs63750953, rs281864525, rs35518301, and rs34166473 (all: chronic inflammation in comorbidities of obesity). 
Using an electrophoretic mobility shift assay under nonequilibrium conditions, we empirically validated the statistical significance (α < 0.00025) of the differences in TBP affinity values between the minor and ancestral alleles of 4 out of the 22 SNPs: rs200487063, rs201381696, rs34104384, and rs183433761. We also measured half-life (t1/2), Gibbs free energy change (ΔG), and the association and dissociation rate constants, ka and kd, of the TBP-DNA complex for these SNPs. CONCLUSIONS: Validation of the 22 candidate SNP markers by proper clinical protocols appears to have a strong rationale and may advance postgenomic predictive preventive personalized medicine.
Affiliation(s)
- Olga V Arkova
- Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, 10 Lavrentyeva Avenue, Novosibirsk 630090, Russia
| | - Mikhail P Ponomarenko
- Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, 10 Lavrentyeva Avenue, Novosibirsk 630090, Russia
- Novosibirsk State University, 2 Pirogova Street, Novosibirsk 630090, Russia
- Laboratory of Evolutionary Bioinformatics and Theoretical Genetics, Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, 10 Lavrentyev Avenue, Novosibirsk 630090, Russia
| | - Dmitry A Rasskazov
- Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, 10 Lavrentyeva Avenue, Novosibirsk 630090, Russia
| | - Irina A Drachkova
- Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, 10 Lavrentyeva Avenue, Novosibirsk 630090, Russia
| | - Tatjana V Arshinova
- Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, 10 Lavrentyeva Avenue, Novosibirsk 630090, Russia
| | - Petr M Ponomarenko
- Children's Hospital Los Angeles, 4640 Hollywood Boulevard, University of Southern California, Los Angeles, CA 90027, USA
| | - Ludmila K Savinkova
- Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, 10 Lavrentyeva Avenue, Novosibirsk 630090, Russia
| | - Nikolay A Kolchanov
- Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, 10 Lavrentyeva Avenue, Novosibirsk 630090, Russia
- Novosibirsk State University, 2 Pirogova Street, Novosibirsk 630090, Russia
| |
|
223
|
Piraino SW, Furney SJ. Beyond the exome: the role of non-coding somatic mutations in cancer. Ann Oncol 2015; 27:240-8. [PMID: 26598542 DOI: 10.1093/annonc/mdv561] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Received: 07/27/2015] [Accepted: 11/04/2015] [Indexed: 02/06/2023] Open
Abstract
The comprehensive identification of mutations contributing to the development of cancer is a priority of large cancer sequencing projects. To date, most studies have scrutinized mutations in coding regions of the genome, but several recent discoveries, including the identification of recurrent somatic mutations in the TERT promoter in multiple cancer types, support the idea that mutations in non-coding regions are also important in tumour development. Furthermore, analysis of whole-genome sequencing data from tumours has elucidated novel mutational patterns and processes etched into cancer genomes. Here, we present an overview of insights gleaned from the analysis of mutations from sequenced cancer genomes. We then review the mechanisms by which non-coding mutations can play a role in cancer. Finally, we discuss recent efforts aimed at identifying non-coding driver mutations, as well as the unique challenges that the analysis of non-coding mutations present in contrast to the identification of driver mutations in coding regions.
Affiliation(s)
- S W Piraino
- School of Medicine, Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Dublin, Ireland
| | - S J Furney
- School of Medicine, Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Dublin, Ireland
| |
|
224
|
A Dual Model for Prioritizing Cancer Mutations in the Non-coding Genome Based on Germline and Somatic Events. PLoS Comput Biol 2015; 11:e1004583. [PMID: 26588488 PMCID: PMC4654583 DOI: 10.1371/journal.pcbi.1004583] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 06/11/2015] [Accepted: 10/04/2015] [Indexed: 11/19/2022] Open
Abstract
We address here the issue of prioritizing non-coding mutations in the tumoral genome. To this aim, we created two independent computational models. The first (germline) model estimates purifying selection based on population SNP data. The second (somatic) model estimates tumor mutation density based on whole genome tumor sequencing. We show that each model reflects a different set of constraints acting either on the normal or tumor genome, and we identify the specific genome features that most contribute to these constraints. Importantly, we show that the somatic mutation model carries independent functional information that can be used to narrow down the non-coding regions that may be relevant to cancer progression. On this basis, we identify positions in non-coding RNAs and the non-coding parts of mRNAs that are both under purifying selection in the germline and protected from mutation in tumors, thus introducing a new strategy for future detection of cancer driver elements in the expressed non-coding genome. Cancer cells undergo a mutation/selection process that resembles that of any living cell. Most mutations in cancer cell DNA occur in the so-called "non-coding" regions that represent 98.5% of the genome length. Pinning down which of these mutations contribute to the fitness of cancer cells would be important for identifying new "cancer drivers", which may in turn lead to future treatments. Unfortunately, predicting the impact of a non-coding DNA alteration remains extremely difficult. In this study, we analyze millions of non-coding cancer mutations and show cancer-specific mutational patterns can be used to predict non-coding regions that are preserved from mutations and may thus be important for cancer cell survival. Combining this information with population data, we propose a new scoring system that should help prioritize important non-coding mutations in future studies.
|
225
|
Svetlichnyy D, Imrichova H, Fiers M, Kalender Atak Z, Aerts S. Identification of High-Impact cis-Regulatory Mutations Using Transcription Factor Specific Random Forest Models. PLoS Comput Biol 2015; 11:e1004590. [PMID: 26562774 PMCID: PMC4642938 DOI: 10.1371/journal.pcbi.1004590] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 05/07/2015] [Accepted: 10/10/2015] [Indexed: 02/02/2023] Open
Abstract
Cancer genomes contain vast amounts of somatic mutations, many of which are passenger mutations not involved in oncogenesis. Whereas driver mutations in protein-coding genes can be distinguished from passenger mutations based on their recurrence, non-coding mutations are usually not recurrent at the same position. Therefore, it is still unclear how to identify cis-regulatory driver mutations, particularly when chromatin data from the same patient is not available, thus relying only on sequence and expression information. Here we use machine-learning methods to predict functional regulatory regions using sequence information alone, and compare the predicted activity of the mutated region with the reference sequence. This way we define the Predicted Regulatory Impact of a Mutation in an Enhancer (PRIME). We find that the recently identified driver mutation in the TAL1 enhancer has a high PRIME score, representing a “gain-of-target” for MYB, whereas the highly recurrent TERT promoter mutation has a surprisingly low PRIME score. We trained Random Forest models for 45 cancer-related transcription factors, and used these to score variations in the HeLa genome and somatic mutations across more than five hundred cancer genomes. Each model predicts only a small fraction of non-coding mutations with a potential impact on the function of the encompassing regulatory region. Nevertheless, as these few candidate driver mutations are often linked to gains in chromatin activity and gene expression, they may contribute to the oncogenic program by altering the expression levels of specific oncogenes and tumor suppressor genes.
Precise regulation of gene expression is controlled by cis-regulatory modules (CRM) containing binding sites for transcription factors (TF). The genome-wide location of all TF binding sites can often be obtained by ChIP-seq (chromatin immunoprecipitation followed by deep sequencing), yet in most cases only a minority of the binding peaks actually represent functional CRMs that control the transcription initiation of a bona fide TF target gene. Here, we investigated for 45 cancer-related TFs how machine-learning approaches can be used to predict functional TF target CRMs. After careful evaluation of their performance, we used these TF-target classifiers to predict which cis-regulatory mutations may have a significant impact on gene regulation by evaluating whether the mutation causes a significant gain or loss in the probability that the CRM is a functional TF target. We found that Random Forest classifiers can achieve more than 100-fold higher specificity for mutation prediction compared to the simple approaches based on scanning with position weight matrices. By scanning somatic mutations in breast cancer genomes and in the HeLa genome, we finally show that our TF-target classifiers can identify high impact non-coding mutations that are associated with concordant TF binding, gene expression changes and chromatin activity. In conclusion, TF-specific Random Forest classifiers can be used to prioritize cis-regulatory mutations in cancer genomes with high accuracy.
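The comparison described above, scoring the mutated region against the reference sequence, can be sketched in a few lines. This is only an illustration under stated assumptions: `toy_score` is a hypothetical stand-in for a trained classifier (the abstract's models are Random Forests, which are not reproduced here), and the motif, sequence, and function names are invented for the example.

```python
def apply_snv(seq, pos, alt):
    """Return seq with the base at 0-based position pos replaced by alt."""
    if not 0 <= pos < len(seq):
        raise IndexError("position outside sequence")
    return seq[:pos] + alt + seq[pos + 1:]

def toy_score(seq, motif="CAGGTG"):
    """Hypothetical scorer: best fraction of motif bases matched in any window.

    A real pipeline would call a trained classifier here instead.
    """
    best = 0.0
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        matches = sum(a == b for a, b in zip(window, motif))
        best = max(best, matches / len(motif))
    return best

def predicted_impact(score_fn, ref_seq, pos, alt):
    """Score of the mutated sequence minus score of the reference sequence."""
    return score_fn(apply_snv(ref_seq, pos, alt)) - score_fn(ref_seq)

# Invented example: an A>G change at position 6 completes the assumed motif.
ref = "TTACAGATGTTA"
delta = predicted_impact(toy_score, ref, 6, "G")
```

A positive `delta` corresponds to a predicted gain and a negative one to a predicted loss; any scoring function with the same call signature can be passed in place of `toy_score`.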
Affiliation(s)
- Dmitry Svetlichnyy
- Laboratory of Computational Biology, KU Leuven Center for Human Genetics, Leuven, Belgium
| | - Hana Imrichova
- Laboratory of Computational Biology, KU Leuven Center for Human Genetics, Leuven, Belgium
| | - Mark Fiers
- VIB Center for the Biology of Disease, Leuven, Belgium
| | - Zeynep Kalender Atak
- Laboratory of Computational Biology, KU Leuven Center for Human Genetics, Leuven, Belgium
| | - Stein Aerts
- Laboratory of Computational Biology, KU Leuven Center for Human Genetics, Leuven, Belgium
|
226
|
Lu Q, Yao X, Hu Y, Zhao H. GenoWAP: GWAS signal prioritization through integrated analysis of genomic functional annotation. Bioinformatics 2015; 32:542-8. [PMID: 26504140 DOI: 10.1093/bioinformatics/btv610] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.9] [Received: 05/18/2015] [Accepted: 10/16/2015] [Indexed: 12/29/2022] Open
Abstract
MOTIVATION: Genome-wide association study (GWAS) has been a great success in the past decade. However, significant challenges still remain in both identifying new risk loci and interpreting results. Bonferroni-corrected significance level is known to be conservative, leading to insufficient statistical power when the effect size is moderate at risk locus. Complex structure of linkage disequilibrium also makes it challenging to separate causal variants from nonfunctional ones in large haplotype blocks. Under such circumstances, a computational approach that may increase signal replication rate and identify potential functional sites among correlated markers is urgently needed.
RESULTS: We describe GenoWAP, a GWAS signal prioritization method that integrates genomic functional annotation and GWAS test statistics. The effectiveness of GenoWAP is demonstrated through its applications to Crohn's disease and schizophrenia using the largest studies available, where highly ranked loci show substantially stronger signals in the whole dataset after prioritization based on a subset of samples. At the single nucleotide polymorphism (SNP) level, top ranked SNPs after prioritization have both higher replication rates and consistently stronger enrichment of eQTLs. Within each risk locus, GenoWAP may be able to distinguish functional sites from groups of correlated SNPs.
AVAILABILITY AND IMPLEMENTATION: GenoWAP is freely available on the web at http://genocanyon.med.yale.edu/GenoWAP.
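The remark about Bonferroni correction being conservative is easier to see with numbers. The sketch below is not part of GenoWAP; the test count, p-values, and SNP names are assumptions chosen for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

# Assumed numbers for illustration: one million tests at a 5% family-wise level.
threshold = bonferroni_threshold(0.05, 1_000_000)

# Hypothetical p-values for three SNPs.
p_values = {"snp_a": 1e-9, "snp_b": 3e-7, "snp_c": 2e-4}
significant = [name for name, p in p_values.items() if p < threshold]
```

Under these assumptions the per-test threshold is 5e-8, so a SNP with a moderate effect and a p-value of 3e-7 is discarded even though it may be a true signal, which is the loss of power the abstract refers to.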
Affiliation(s)
- Qiongshi Lu
- Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
| | | | - Yiming Hu
- Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
| | - Hongyu Zhao
- Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA, Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA and VA Cooperative Studies Program Coordinating Center, West Haven, CT, USA
|
227
|
Ponomarenko M, Rasskazov D, Arkova O, Ponomarenko P, Suslov V, Savinkova L, Kolchanov N. How to Use SNP_TATA_Comparator to Find a Significant Change in Gene Expression Caused by the Regulatory SNP of This Gene's Promoter via a Change in Affinity of the TATA-Binding Protein for This Promoter. Biomed Res Int 2015; 2015:359835. [PMID: 26516624 PMCID: PMC4609514 DOI: 10.1155/2015/359835] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 07/03/2015] [Accepted: 08/24/2015] [Indexed: 01/11/2023]
Abstract
The use of biomedical SNP markers of diseases can improve effectiveness of treatment. Genotyping of patients with subsequent searching for SNPs more frequent than in norm is the only commonly accepted method for identification of SNP markers within the framework of translational research. The bioinformatics applications aimed at millions of unannotated SNPs of the "1000 Genomes" can make this search for SNP markers more focused and less expensive. We used our Web service involving Fisher's Z-score for candidate SNP markers to find a significant change in a gene's expression. Here we analyzed the change caused by SNPs in the gene's promoter via a change in affinity of the TATA-binding protein for this promoter. We provide examples and discuss how to use this bioinformatics application in the course of practical analysis of unannotated SNPs from the "1000 Genomes" project. Using known biomedical SNP markers, we identified 17 novel candidate SNP markers nearby: rs549858786 (rheumatoid arthritis); rs72661131 (cardiovascular events in rheumatoid arthritis); rs562962093 (stroke); rs563558831 (cyclophosphamide bioactivation); rs55878706 (malaria resistance, leukopenia), rs572527200 (asthma, systemic sclerosis, and psoriasis), rs371045754 (hemophilia B), rs587745372 (cardiovascular events); rs372329931, rs200209906, rs367732974, and rs549591993 (all four: cancer); rs17231520 and rs569033466 (both: atherosclerosis); rs63750953, rs281864525, and rs34166473 (all three: malaria resistance, thalassemia).
Affiliation(s)
- Mikhail Ponomarenko
- Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, Novosibirsk 630090, Russia
- Department of Natural Sciences, Novosibirsk State University, Novosibirsk 630090, Russia
| | - Dmitry Rasskazov
- Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, Novosibirsk 630090, Russia
| | - Olga Arkova
- Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, Novosibirsk 630090, Russia
| | - Petr Ponomarenko
- Children's Hospital Los Angeles, University of Southern California, Los Angeles, CA 90027, USA
| | - Valentin Suslov
- Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, Novosibirsk 630090, Russia
| | - Ludmila Savinkova
- Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, Novosibirsk 630090, Russia
| | - Nikolay Kolchanov
- Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences, Novosibirsk 630090, Russia
- Department of Natural Sciences, Novosibirsk State University, Novosibirsk 630090, Russia
|
228
|
Li J, Drubay D, Michiels S, Gautheret D. Mining the coding and non-coding genome for cancer drivers. Cancer Lett 2015; 369:307-15. [PMID: 26433158 DOI: 10.1016/j.canlet.2015.09.015] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 06/03/2015] [Revised: 09/24/2015] [Accepted: 09/24/2015] [Indexed: 12/20/2022]
Abstract
Progress in next-generation sequencing provides unprecedented opportunities to fully characterize the spectrum of somatic mutations of cancer genomes. Given the large number of somatic mutations identified by such technologies, the prioritization of cancer-driving events is a consistent bottleneck. Most bioinformatics tools concentrate on driver mutations in the coding fraction of the genome, those causing changes in protein products. As more non-coding pathogenic variants are identified and characterized, the development of computational approaches to effectively prioritize cancer-driving variants within the non-coding fraction of human genome is becoming critical. After a short summary of methods for coding variant prioritization, we here review the highly diverse non-coding elements that may act as cancer drivers and describe recent methods that attempt to evaluate the deleteriousness of sequence variation in these elements. With such tools, the prioritization and identification of cancer-implicated regulatory elements and non-coding RNAs is becoming a reality.
Affiliation(s)
- Jia Li
- Institute for Integrative Biology of the Cell (I2BC), CNRS, CEA, Université Paris-Sud, Université Paris-Saclay, 91198 Gif sur Yvette, France
| | - Damien Drubay
- Service de Biostatistique et d'Epidemiologie, Gustave Roussy, Villejuif, France; INSERM U1018, CESP, Université Paris-Sud, Université Paris-Saclay, Villejuif, France
| | - Stefan Michiels
- Service de Biostatistique et d'Epidemiologie, Gustave Roussy, Villejuif, France; INSERM U1018, CESP, Université Paris-Sud, Université Paris-Saclay, Villejuif, France
| | - Daniel Gautheret
- Institute for Integrative Biology of the Cell (I2BC), CNRS, CEA, Université Paris-Sud, Université Paris-Saclay, 91198 Gif sur Yvette, France.
|
229
|
Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods 2015; 12:931-4. [PMID: 26301843 PMCID: PMC4768299 DOI: 10.1038/nmeth.3547] [Citation(s) in RCA: 1116] [Impact Index Per Article: 124.0] [Received: 02/26/2015] [Accepted: 06/11/2015] [Indexed: 12/18/2022]
Abstract
Identifying functional effects of noncoding variants is a major challenge in human genetics. To predict the noncoding-variant effects de novo from sequence, we developed a deep learning-based algorithmic framework, DeepSEA (http://deepsea.princeton.edu/), that directly learns a regulatory sequence code from large-scale chromatin-profiling data, enabling prediction of chromatin effects of sequence alterations with single-nucleotide sensitivity. We further used this capability to improve prioritization of functional variants including expression quantitative trait loci (eQTLs) and disease-associated variants.
Affiliation(s)
- Jian Zhou
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA
- Graduate Program in Quantitative and Computational Biology, Princeton University, Princeton, New Jersey, USA
| | - Olga G Troyanskaya
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA
- Department of Computer Science, Princeton University, Princeton, New Jersey, USA
- Simons Center for Data Analysis, Simons Foundation, New York, New York, USA
|
230
|
Lochovsky L, Zhang J, Fu Y, Khurana E, Gerstein M. LARVA: an integrative framework for large-scale analysis of recurrent variants in noncoding annotations. Nucleic Acids Res 2015; 43:8123-34. [PMID: 26304545 PMCID: PMC4787796 DOI: 10.1093/nar/gkv803] [Citation(s) in RCA: 52] [Impact Index Per Article: 5.8] [Received: 01/26/2015] [Accepted: 07/28/2015] [Indexed: 01/22/2023] Open
Abstract
In cancer research, background models for mutation rates have been extensively calibrated in coding regions, leading to the identification of many driver genes, recurrently mutated more than expected. Noncoding regions are also associated with disease; however, background models for them have not been investigated in as much detail. This is partially due to limited noncoding functional annotation. Also, great mutation heterogeneity and potential correlations between neighboring sites give rise to substantial overdispersion in mutation count, resulting in problematic background rate estimation. Here, we address these issues with a new computational framework called LARVA. It integrates variants with a comprehensive set of noncoding functional elements, modeling the mutation counts of the elements with a β-binomial distribution to handle overdispersion. LARVA, moreover, uses regional genomic features such as replication timing to better estimate local mutation rates and mutational hotspots. We demonstrate LARVA's effectiveness on 760 whole-genome tumor sequences, showing that it identifies well-known noncoding drivers, such as mutations in the TERT promoter. Furthermore, LARVA highlights several novel highly mutated regulatory sites that could potentially be noncoding drivers. We make LARVA available as a software tool and release our highly mutated annotations as an online resource (larva.gersteinlab.org).
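The point about overdispersion can be illustrated numerically. The sketch below is not LARVA's implementation: it compares upper-tail probabilities under a plain binomial and under a beta-binomial with the same mean, using invented counts and hypothetical shape parameters `alpha` and `beta`.

```python
import math

def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def binom_pmf(k, n, p):
    """P(K = k) for a binomial with n trials and success probability p."""
    return math.exp(log_choose(n, k) + k * math.log(p) + (n - k) * math.log1p(-p))

def betabinom_pmf(k, n, alpha, beta):
    """P(K = k) for a beta-binomial with n trials and shape parameters alpha, beta."""
    return math.exp(log_choose(n, k)
                    + log_beta(k + alpha, n - k + beta)
                    - log_beta(alpha, beta))

def upper_tail(pmf, k, n, *params):
    """P(K >= k): probability of seeing at least k mutations."""
    return sum(pmf(i, n, *params) for i in range(k, n + 1))

# Invented element: 1000 mutable sites, 8 observed mutations, mean rate 0.002.
n, k, rate = 1000, 8, 0.002
alpha, beta = 0.5, 249.5  # hypothetical shapes; alpha / (alpha + beta) == rate

p_binomial = upper_tail(binom_pmf, k, n, rate)
p_betabinomial = upper_tail(betabinom_pmf, k, n, alpha, beta)
```

With these assumed values the beta-binomial assigns a larger tail probability than the binomial, so the same count looks less surprising once variability in the underlying rate is allowed for. That is the sense in which a binomial background can overstate recurrence.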
Affiliation(s)
- Lucas Lochovsky
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
| | - Jing Zhang
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
| | - Yao Fu
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
| | - Ekta Khurana
- Institute for Computational Biomedicine, Weill Cornell Medical College, New York, NY 10065, USA; Department of Physiology and Biophysics, Weill Cornell Medical College, New York, New York 10065
| | - Mark Gerstein
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA; Department of Computer Science, Yale University, New Haven, CT 06520, USA
|
231
|
Poulos RC, Thoms JAI, Shah A, Beck D, Pimanda JE, Wong JWH. Systematic Screening of Promoter Regions Pinpoints Functional Cis-Regulatory Mutations in a Cutaneous Melanoma Genome. Mol Cancer Res 2015; 13:1218-26. [PMID: 26082173 DOI: 10.1158/1541-7786.mcr-15-0146] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Received: 03/26/2015] [Accepted: 06/04/2015] [Indexed: 11/16/2022]
Abstract
With the recent discovery of recurrent mutations in the TERT promoter in melanoma, identification of other somatic causal promoter mutations is of considerable interest. Yet, the impact of sequence variation on the regulatory potential of gene promoters has not been systematically evaluated. This study assesses the impact of promoter mutations on promoter activity in the whole-genome sequenced malignant melanoma cell line COLO-829. Combining somatic mutation calls from COLO-829 with genome-wide chromatin accessibility and histone modification data revealed mutations within promoter elements. Interestingly, a high number of potential promoter mutations (n = 23) were found, a result mirrored in subsequent analysis of TCGA whole-melanoma genomes. The impact of wild-type and mutant promoter sequences was evaluated by subcloning into luciferase reporter vectors and testing their transcriptional activity in COLO-829 cells. Of the 23 promoter regions tested, four mutations significantly altered reporter activity relative to wild-type sequences. These data were then subjected to multiple computational algorithms that score the cis-regulatory altering potential of mutations. These analyses identified one mutation, located within the promoter region of NDUFB9, which encodes the mitochondrial NADH dehydrogenase (ubiquinone) 1 beta subcomplex 9, to be recurrent in 4.4% (19 of 432) of TCGA whole-melanoma exomes. The mutation is predicted to disrupt a highly conserved SP1/KLF transcription factor binding motif and its frequent co-occurrence with mutations in the coding sequence of NF1 supports a pathologic role for this mutation in melanoma. Taken together, these data show the relatively high prevalence of promoter mutations in the COLO-829 melanoma genome, and indicate that a proportion of these significantly alter the regulatory potential of gene promoters.
IMPLICATIONS: Genomic-based screening within gene promoter regions suggests that functional cis-regulatory mutations may be common in melanoma genomes, highlighting the need to examine their role in tumorigenesis.
Affiliation(s)
- Rebecca C Poulos
- Prince of Wales Clinical School, University of New South Wales Australia, Sydney, Australia. Lowy Cancer Research Centre, University of New South Wales Australia, Sydney, Australia
| | - Julie A I Thoms
- Prince of Wales Clinical School, University of New South Wales Australia, Sydney, Australia. Lowy Cancer Research Centre, University of New South Wales Australia, Sydney, Australia
| | - Anushi Shah
- Prince of Wales Clinical School, University of New South Wales Australia, Sydney, Australia. Lowy Cancer Research Centre, University of New South Wales Australia, Sydney, Australia
| | - Dominik Beck
- Prince of Wales Clinical School, University of New South Wales Australia, Sydney, Australia. Lowy Cancer Research Centre, University of New South Wales Australia, Sydney, Australia
| | - John E Pimanda
- Prince of Wales Clinical School, University of New South Wales Australia, Sydney, Australia. Lowy Cancer Research Centre, University of New South Wales Australia, Sydney, Australia. Department of Haematology, Prince of Wales Hospital, Sydney, Australia
| | - Jason W H Wong
- Prince of Wales Clinical School, University of New South Wales Australia, Sydney, Australia. Lowy Cancer Research Centre, University of New South Wales Australia, Sydney, Australia.
|
232
|
|
233
|
Wang J, Batmanov K. BayesPI-BAR: a new biophysical model for characterization of regulatory sequence variations. Nucleic Acids Res 2015. [PMID: 26202972 PMCID: PMC4666384 DOI: 10.1093/nar/gkv733] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 11/25/2022] Open
Abstract
Sequence variations in regulatory DNA regions are known to cause functionally important consequences for gene expression. DNA sequence variations may have an essential role in determining phenotypes and may be linked to disease; however, their identification through analysis of massive genome-wide sequencing data is a great challenge. In this work, a new computational pipeline, a Bayesian method for protein–DNA interaction with binding affinity ranking (BayesPI-BAR), is proposed for quantifying the effect of sequence variations on protein binding. BayesPI-BAR uses biophysical modeling of protein–DNA interactions to predict single nucleotide polymorphisms (SNPs) that cause significant changes in the binding affinity of a regulatory region for transcription factors (TFs). The method includes two new parameters (TF chemical potentials or protein concentrations and direct TF binding targets) that are neglected by previous methods. The new method is verified on 67 known human regulatory SNPs, of which 47 (70%) have predicted true TFs ranked in the top 10. Importantly, the performance of BayesPI-BAR, which uses principal component analysis to integrate multiple predictions from various TF chemical potentials, is found to be better than that of existing programs, such as sTRAP and is-rSNP, when evaluated on the same SNPs. BayesPI-BAR is a publicly available tool and is able to carry out parallelized computation, which helps to investigate a large number of TFs or SNPs and to detect disease-associated regulatory sequence variations in the sea of genome-wide noncoding regions.
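Why the chemical potential matters for ranking binding changes can be shown with a generic occupancy model. The abstract does not give BayesPI-BAR's equations, so the logistic form below is an assumption (a textbook two-state binding model with energies in units of kT), and all energies and potentials are invented.

```python
import math

def occupancy(energy, mu):
    """Probability that a site is bound, given binding energy and chemical potential.

    Assumed two-state form: 1 / (1 + exp(energy - mu)), energies in units of kT.
    """
    return 1.0 / (1.0 + math.exp(energy - mu))

def occupancy_shift(e_ref, e_alt, mu):
    """Change in occupancy when a variant moves the binding energy from e_ref to e_alt."""
    return occupancy(e_alt, mu) - occupancy(e_ref, mu)

# Invented variant that weakens binding (higher energy), evaluated at three
# hypothetical chemical potentials: low, intermediate, and high TF concentration.
shifts = {mu: occupancy_shift(2.0, 4.0, mu) for mu in (-2.0, 3.0, 8.0)}
```

In this toy setting the same energy change produces a large drop in occupancy at the intermediate potential and almost none when the factor is scarce or saturating, which suggests why predictions made at several potentials might usefully be combined rather than relying on a single value.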
Affiliation(s)
- Junbai Wang
- Pathology Department, Oslo University Hospital-Norwegian Radium Hospital, Montebello 0310, Oslo, Norway
| | - Kirill Batmanov
- Pathology Department, Oslo University Hospital-Norwegian Radium Hospital, Montebello 0310, Oslo, Norway
|
234
|
Wang Q, Lu Q, Zhao H. A review of study designs and statistical methods for genomic epidemiology studies using next generation sequencing. Front Genet 2015; 6:149. [PMID: 25941534 PMCID: PMC4403555 DOI: 10.3389/fgene.2015.00149] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.8] [Received: 01/25/2015] [Accepted: 03/30/2015] [Indexed: 12/22/2022] Open
Abstract
Results from numerous linkage and association studies have greatly deepened scientists’ understanding of the genetic basis of many human diseases, yet some important questions remain unanswered. For example, although a large number of disease-associated loci have been identified from genome-wide association studies in the past 10 years, it is challenging to interpret these results as most disease-associated markers have no clear functional roles in disease etiology, and all the identified genomic factors only explain a small portion of disease heritability. With the help of next-generation sequencing (NGS), diverse types of genomic and epigenetic variations can be detected with high accuracy. More importantly, instead of using linkage disequilibrium to detect association signals based on a set of pre-set probes, NGS allows researchers to directly study all the variants in each individual, therefore promises opportunities for identifying functional variants and a more comprehensive dissection of disease heritability. Although the current scale of NGS studies is still limited due to the high cost, the success of several recent studies suggests the great potential for applying NGS in genomic epidemiology, especially as the cost of sequencing continues to drop. In this review, we discuss several pioneer applications of NGS, summarize scientific discoveries for rare and complex diseases, and compare various study designs including targeted sequencing and whole-genome sequencing using population-based and family-based cohorts. Finally, we highlight recent advancements in statistical methods proposed for sequencing analysis, including group-based association tests, meta-analysis techniques, and annotation tools for variant prioritization.
Affiliation(s)
- Qian Wang
- Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
| | - Qiongshi Lu
- Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
| | - Hongyu Zhao
- Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA; Veterans Affairs Cooperative Studies Program Coordinating Center, West Haven, CT, USA
|
235
|
Vuong H, Che A, Ravichandran S, Luke BT, Collins JR, Mudunuri US. AVIA v2.0: annotation, visualization and impact analysis of genomic variants and genes. Bioinformatics 2015; 31:2748-50. [PMID: 25861966 DOI: 10.1093/bioinformatics/btv200] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 12/23/2014] [Accepted: 04/05/2015] [Indexed: 12/17/2022] Open
Abstract
As sequencing becomes cheaper and more widely available, there is a greater need to quickly and effectively analyze large-scale genomic data. While the functionality of AVIA v1.0, whose implementation was based on ANNOVAR, was comparable with other annotation web servers, AVIA v2.0 represents an enhanced web-based server that extends genomic annotations to cell-specific transcripts and protein-level functional annotations. With AVIA's improved interface, users can better visualize their data, perform comprehensive searches and categorize both coding and non-coding variants.
AVAILABILITY AND IMPLEMENTATION: AVIA is freely available through the web at http://avia.abcc.ncifcrf.gov.
CONTACT: Hue.Vuong@fnlcr.nih.gov
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hue Vuong
- Advanced Biomedical Computing Center, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA
| | - Anney Che
- Advanced Biomedical Computing Center, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA
| | - Sarangan Ravichandran
- Advanced Biomedical Computing Center, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA
| | - Brian T Luke
- Advanced Biomedical Computing Center, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA
| | - Jack R Collins
- Advanced Biomedical Computing Center, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA
| | - Uma S Mudunuri
- Advanced Biomedical Computing Center, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA
|
236
|
Mathelier A, Shi W, Wasserman WW. Identification of altered cis-regulatory elements in human disease. Trends Genet 2015; 31:67-76. [DOI: 10.1016/j.tig.2014.12.003] [Citation(s) in RCA: 82] [Impact Index Per Article: 9.1] [Received: 11/07/2014] [Revised: 12/19/2014] [Accepted: 12/19/2014] [Indexed: 02/01/2023]
|