1
Huang B, Fan C, Chen K, Rao J, Ou P, Tian C, Yang Y, Cooper DN, Zhao H. VCAT: an integrated variant function annotation tools. Hum Genet 2024:10.1007/s00439-024-02699-6. [PMID: 39192052 DOI: 10.1007/s00439-024-02699-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/21/2024] [Accepted: 08/14/2024] [Indexed: 08/29/2024]
Abstract
The development of sequencing technology has promoted the discovery of variants in the human genome. Identifying the functions of these variants is important for linking genotype to phenotype and for diagnosing disease. However, this usually requires researchers to visit multiple databases. Here, we present a one-stop webserver for variant function annotation (VCAT, https://biomed.nscc-gz.cn/zhaolab/VCAT/) that is the first to connect variants to functions via the epigenome, protein, drug and RNA. VCAT is also the first to visualize all annotations in interactive charts or molecular structures. VCAT allows users to upload data in VCF format and to download results via a URL. Moreover, VCAT has annotated a huge number (1,262,041,068) of variants collected from dbSNP, the 1000 Genomes Project, gnomAD, ICGC, TCGA, and the HPRC Pangenome project. For these variants, users can search VCAT for their functions, related diseases and drugs. In summary, VCAT provides a one-stop webserver to explore the potential functions of human genomic variants, including their relationship with diseases and drugs.
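The abstract mentions VCF upload. As a minimal sketch of what such input looks like (this is not VCAT's code; the `Variant` type and `parse_vcf_lines` function are hypothetical names introduced here), the first five tab-separated VCF columns (CHROM, POS, ID, REF, ALT) can be read as follows:

```python
from typing import Iterable, Iterator, NamedTuple


class Variant(NamedTuple):
    """One alternate allele at one position (hypothetical helper type)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_vcf_lines(lines: Iterable[str]) -> Iterator[Variant]:
    """Yield one Variant per ALT allele from the text lines of a VCF file."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines, "##" meta lines and the "#CHROM" header line.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # too few columns to be a data line; ignored in this sketch
        chrom, pos, _variant_id, ref, alts = fields[:5]
        # Multi-allelic sites list several ALT alleles separated by commas.
        for alt in alts.split(","):
            yield Variant(chrom, int(pos), ref, alt)
```

Splitting multi-allelic records into one tuple per allele is a design choice of this sketch; it keeps any later per-variant lookup simple.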
Affiliation(s)
- Bi Huang
  - Department of Medical Research Center, Sun Yat-Sen Memorial Hospital, Sun Yat-Sen University, 107 Yan Jiang West Road, Guangzhou, 500001, People's Republic of China
  - Guangdong Provincial Key Laboratory of Malignant Tumor Epigenetics and Gene Regulation, Guangzhou, People's Republic of China
- Cong Fan
  - Department of Medical Research Center, Sun Yat-Sen Memorial Hospital, Sun Yat-Sen University, 107 Yan Jiang West Road, Guangzhou, 500001, People's Republic of China
  - Guangdong Provincial Key Laboratory of Malignant Tumor Epigenetics and Gene Regulation, Guangzhou, People's Republic of China
- Ken Chen
  - School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, People's Republic of China
- Jiahua Rao
  - School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, People's Republic of China
- Peihua Ou
  - School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, People's Republic of China
- Chong Tian
  - School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, People's Republic of China
- Yuedong Yang
  - School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, People's Republic of China
- David N Cooper
  - School of Medicine, Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff, CF14 4XN, UK
- Huiying Zhao
  - Department of Medical Research Center, Sun Yat-Sen Memorial Hospital, Sun Yat-Sen University, 107 Yan Jiang West Road, Guangzhou, 500001, People's Republic of China
  - Guangdong Provincial Key Laboratory of Malignant Tumor Epigenetics and Gene Regulation, Guangzhou, People's Republic of China
2
Morova T, Ding Y, Huang CCF, Sar F, Schwarz T, Giambartolomei C, Baca S, Grishin D, Hach F, Gusev A, Freedman M, Pasaniuc B, Lack N. Optimized high-throughput screening of non-coding variants identified from genome-wide association studies. Nucleic Acids Res 2022; 51:e18. [PMID: 36546757 PMCID: PMC9943666 DOI: 10.1093/nar/gkac1198] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/26/2022] [Revised: 11/19/2022] [Accepted: 12/06/2022] [Indexed: 12/24/2022]
Abstract
The vast majority of disease-associated single nucleotide polymorphisms (SNPs) identified from genome-wide association studies (GWAS) are localized in non-coding regions. A significant fraction of these variants affect transcription factor binding to enhancer elements and alter gene expression. To functionally interrogate the activity of such variants we developed snpSTARRseq, a high-throughput experimental method that can interrogate the functional impact of hundreds to thousands of non-coding variants on enhancer activity. snpSTARRseq dramatically improves signal-to-noise by utilizing a novel sequencing and bioinformatic approach that increases both insert size and the number of variants tested per locus. Using this strategy, we interrogated known prostate cancer (PCa) risk-associated loci and demonstrated that 35% of them harbor SNPs that significantly altered enhancer activity. Combining these results with chromosomal looping data, we could identify interacting genes and provide a mechanism of action for 20 PCa GWAS risk regions. When benchmarked against orthogonal methods, snpSTARRseq showed a strong correlation with in vivo experimental allelic-imbalance studies, whereas there was no correlation with predictive in silico approaches. Overall, snpSTARRseq provides an integrated experimental and computational framework to functionally test non-coding genetic variants.
Affiliation(s)
- Tunc Morova
  - Vancouver Prostate Centre, Vancouver, BC V6H 3Z6, Canada
- Yi Ding
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Funda Sar
  - Vancouver Prostate Centre, Vancouver, BC V6H 3Z6, Canada
- Tommer Schwarz
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Claudia Giambartolomei
  - Central RNA Lab, Istituto Italiano di Tecnologia, Genova 16163, Italy
  - Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Sylvan C Baca
  - Department of Medical Oncology, The Center for Functional Cancer Epigenetics, Dana Farber Cancer Institute, Boston, MA 02215, USA
- Dennis Grishin
  - Department of Medical Oncology, The Center for Functional Cancer Epigenetics, Dana Farber Cancer Institute, Boston, MA 02215, USA
- Faraz Hach
  - Vancouver Prostate Centre, Vancouver, BC V6H 3Z6, Canada
  - Department of Urologic Science, University of British Columbia, Vancouver, BC V5Z 1M9, Canada
- Alexander Gusev
  - Department of Medical Oncology, The Center for Functional Cancer Epigenetics, Dana Farber Cancer Institute, Boston, MA 02215, USA
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
- Matthew L Freedman
  - Department of Medical Oncology, The Center for Functional Cancer Epigenetics, Dana Farber Cancer Institute, Boston, MA 02215, USA
  - The Center for Cancer Genome Discovery, Dana Farber Cancer Institute, Boston, MA 02215, USA
- Bogdan Pasaniuc
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA 90095, USA
  - Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA
  - Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Nathan A Lack
  - To whom correspondence should be addressed. Tel: +1 604 875 4411;
3
Lange M, Begolli R, Giakountis A. Non-Coding Variants in Cancer: Mechanistic Insights and Clinical Potential for Personalized Medicine. Noncoding RNA 2021; 7:47. [PMID: 34449663 PMCID: PMC8395730 DOI: 10.3390/ncrna7030047] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 06/28/2021] [Revised: 07/26/2021] [Accepted: 08/01/2021] [Indexed: 12/11/2022]
Abstract
The cancer genome is characterized by extensive variability, in the form of Single Nucleotide Polymorphisms (SNPs) or structural variations such as Copy Number Alterations (CNAs) across wider genomic areas. At the molecular level, most SNPs and/or CNAs reside in non-coding sequences, ultimately affecting the regulation of oncogenes and/or tumor suppressors in a cancer-specific manner. Notably, inherited non-coding variants can predispose to cancer decades prior to disease onset. Furthermore, the accumulation of additional non-coding driver mutations during progression of the disease gives rise to genomic instability, acting as the driving force of neoplastic development and malignant evolution. Therefore, detection and characterization of such mutations can improve risk assessment for healthy carriers and expand the diagnostic and therapeutic toolbox for the patient. This review focuses on functional variants that reside in transcribed or non-transcribed non-coding regions of the cancer genome and presents a collection of appropriate state-of-the-art methodologies to study them.
Affiliation(s)
- Marios Lange
  - Department of Biochemistry and Biotechnology, University of Thessaly, Biopolis, 41500 Larissa, Greece
- Rodiola Begolli
  - Department of Biochemistry and Biotechnology, University of Thessaly, Biopolis, 41500 Larissa, Greece
- Antonis Giakountis
  - Department of Biochemistry and Biotechnology, University of Thessaly, Biopolis, 41500 Larissa, Greece
  - Institute for Fundamental Biomedical Research, B.S.R.C “Alexander Fleming”, 34 Fleming Str., 16672 Vari, Greece
4
Wang Y, Xue H, Pourcel C, Du Y, Gautheret D. 2-kupl: mapping-free variant detection from DNA-seq data of matched samples. BMC Bioinformatics 2021; 22:304. [PMID: 34090332 PMCID: PMC8180056 DOI: 10.1186/s12859-021-04185-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2021] [Accepted: 05/11/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND: The detection of genome variants, including point mutations, indels and structural variants, is a fundamental and challenging computational problem. We address here the problem of variant detection between two deep-sequencing (DNA-seq) samples, such as two human samples from an individual patient, or two samples from distinct bacterial strains. The preferred strategy in such a case is to align each sample to a common reference genome, collect all variants and compare these variants between samples. Such mapping-based protocols have several limitations. DNA sequences with large indels, aggregated mutations and structural variants are hard to map to the reference. Furthermore, DNA sequences cannot be mapped reliably to genomic low complexity regions and repeats.
RESULTS: We introduce 2-kupl, a k-mer based, mapping-free protocol to detect variants between two DNA-seq samples. On simulated and actual data, 2-kupl achieves higher accuracy than other mapping-free protocols. Applying 2-kupl to prostate cancer whole exome sequencing data, we identify a number of candidate variants in hard-to-map regions and propose potential novel recurrent variants in this disease.
CONCLUSIONS: We developed a mapping-free protocol for variant calling between matched DNA-seq samples. Our protocol is suitable for variant detection in unmappable genome regions or in the absence of a reference genome.
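To illustrate the general idea of comparing two samples by k-mers rather than by alignment, here is a toy sketch. It is not the 2-kupl algorithm: the function names, the `min_count` threshold and the reads are assumptions made for this example, and real tools additionally handle reverse complements, sequencing errors and assembly of the resulting k-mers.

```python
from collections import Counter
from typing import Iterable, Set


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every length-k substring across all reads."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def sample_specific_kmers(case_reads: Iterable[str],
                          control_reads: Iterable[str],
                          k: int = 5,
                          min_count: int = 2) -> Set[str]:
    """Return k-mers seen at least min_count times in the case sample
    and never in the control sample."""
    case = count_kmers(case_reads, k)
    control = count_kmers(control_reads, k)
    return {kmer for kmer, n in case.items()
            if n >= min_count and kmer not in control}
```

With k = 5, a single substitution in the middle of a read changes the five k-mers that overlap it, so these are the ones reported as case-specific; requiring `min_count` occurrences is a crude guard against k-mers created by isolated sequencing errors.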
Affiliation(s)
- Yunfeng Wang
  - Institute of Integrative Cell Biology (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190 Gif-sur-Yvette, France
  - Annoroad Gene Technology Co., Ltd, Beijing, 100176 China
- Haoliang Xue
  - Institute of Integrative Cell Biology (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190 Gif-sur-Yvette, France
- Christine Pourcel
  - Institute of Integrative Cell Biology (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190 Gif-sur-Yvette, France
- Yang Du
  - Annoroad Gene Technology Co., Ltd, Beijing, 100176 China
- Daniel Gautheret
  - Institute of Integrative Cell Biology (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190 Gif-sur-Yvette, France
  - IHU PRISM, Gustave Roussy, 114 rue Edouard Vaillant, 94800 Villejuif, France
5
Biggs H, Parthasarathy P, Gavryushkina A, Gardner PP. ncVarDB: a manually curated database for pathogenic non-coding variants and benign controls. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2020; 2020:6013764. [PMID: 33258967 PMCID: PMC7706182 DOI: 10.1093/database/baaa105] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 04/28/2020] [Revised: 10/13/2020] [Accepted: 11/12/2020] [Indexed: 11/22/2022]
Abstract
Variants within the non-coding genome are frequently associated with phenotypes in genome-wide association studies. These non-coding regions may be involved in the regulation of gene expression, encode functional non-coding RNAs, or influence splicing and other cellular functions. We have curated a list of characterized non-coding human genome variants based on the published evidence that indicates phenotypic consequences of the variation. In order to minimize annotation errors, two curators have independently verified the supporting evidence for pathogenicity of each non-coding variant in the published literature. The database consists of 721 non-coding variants linked to the published literature describing the evidence of functional consequences. We have also sampled 7228 covariate-matched benign controls with a population frequency of over 5% from the single nucleotide polymorphism database (dbSNP151). These were sampled controlling for potential confounding factors such as linkage with pathogenic variants, annotation type (untranslated region, intron, intergenic, etc.) and variant type (substitution or indel). The dataset presented here represents a curated repository, with a potential use for the training or evaluation of algorithms used in the prediction of non-coding variant functionality. Database URL: https://github.com/Gardner-BinfLab/ncVarDB.
Affiliation(s)
- Harry Biggs
  - Department of Biochemistry, University of Otago, PO Box 56, Dunedin 9054, New Zealand
- Padmini Parthasarathy
  - Department of Biochemistry, University of Otago, PO Box 56, Dunedin 9054, New Zealand
- Alexandra Gavryushkina
  - Department of Biochemistry, University of Otago, PO Box 56, Dunedin 9054, New Zealand
  - Bio-Protection Research Centre, University of Otago, PO Box 56, Dunedin 9054, New Zealand
- Paul P Gardner
  - Department of Biochemistry, University of Otago, PO Box 56, Dunedin 9054, New Zealand
  - Bio-Protection Research Centre, University of Otago, PO Box 56, Dunedin 9054, New Zealand
6
Ergoren MC, Cobanogulları H, Temel SG, Mocan G. Functional coding/non-coding variants in EGFR, ROS1 and ALK genes and their role in liquid biopsy as a personalized therapy. Crit Rev Oncol Hematol 2020; 156:103113. [PMID: 33038629 DOI: 10.1016/j.critrevonc.2020.103113] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/24/2020] [Revised: 09/17/2020] [Accepted: 09/18/2020] [Indexed: 02/06/2023]
Abstract
Personalized medicine holds promise for tailoring treatment options to patients' unique genetic make-up and behavioral and environmental background. Liquid biopsy is a non-invasive technique that supports precise diagnosis and treatment. NGS technologies have revolutionized genomic medicine by identifying novel SNPs and indel mutations in both coding and non-coding regions, and they are also promising for accelerating early detection and finding new biomarkers for diagnosis and treatment. The number of bioinformatics tools has been increasing rapidly, with the aim of learning whether detected mutations have a pathogenic role. EGFR, ROS1 and ALK are members of the RTK family. Mutations within these genes have been associated with many cancers and are involved in the development of resistance to TKIs. This review article summarizes findings on the most investigated variations in EGFR, ROS1 and ALK and their potential role in the liquid biopsy approach.
Affiliation(s)
- Mahmut Cerkez Ergoren
  - Department of Medical Biology, Faculty of Medicine, Near East University, Nicosia, 99138, Cyprus
  - DESAM Institute, Near East University, 99138, Nicosia, Cyprus
- Havva Cobanogulları
  - Department of Medical Biology, Faculty of Medicine, Near East University, Nicosia, 99138, Cyprus
  - DESAM Institute, Near East University, 99138, Nicosia, Cyprus
- Sehime Gulsun Temel
  - Department of Medical Genetics, Faculty of Medicine, Bursa Uludag University, Bursa, Turkey
  - Department of Histology & Embryology, Faculty of Medicine, Bursa Uludag University, Bursa, Turkey
  - Department of Translational Medicine, Institute of Health Sciences, Bursa Uludag University, Bursa, Turkey
- Gamze Mocan
  - Department of Medical Biology, Faculty of Medicine, Near East University, Nicosia, 99138, Cyprus
  - Department of Medical Pathology, Faculty of Medicine, Near East University, Nicosia, 99138, Cyprus
7
Drubay D, Gautheret D, Michiels S. A benchmark study of scoring methods for non-coding mutations. Bioinformatics 2019; 34:1635-1641. [PMID: 29340599 DOI: 10.1093/bioinformatics/bty008] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 07/12/2017] [Accepted: 01/09/2018] [Indexed: 01/06/2023]
Abstract
Motivation: Detailed knowledge of coding sequences has led to different candidate models for pathogenic variant prioritization. Several deleteriousness scores have been proposed for the non-coding part of the genome, but no large-scale comparison has been realized to date to assess their performance.
Results: We compared the leading scoring tools (CADD, FATHMM-MKL, Funseq2 and GWAVA) and some recent competitors (DANN, SNP and SOM scores) for their ability to discriminate assumed pathogenic variants from assumed benign variants (using the ClinVar, COSMIC and 1000 genomes project databases). Using the ClinVar benchmark, CADD was the best tool for detecting the pathogenic variants that are mainly located in protein coding gene regions. Using the COSMIC benchmark, FATHMM-MKL, GWAVA and SOMliver outperformed the other tools for pathogenic variants that are typically located in lincRNAs, pseudogenes and other parts of the non-coding genome. However, all tools had low precision, which could potentially be improved by future non-coding genome feature discoveries. These results may have been influenced by the presence of potential benign variants in the COSMIC database. The development of a gold standard as consistent as ClinVar for these regions will be necessary to confirm our tool ranking.
Availability and implementation: The Snakemake, C++ and R codes are freely available from https://github.com/Oncostat/BenchmarkNCVTools and supported on Linux.
Contact: damien.drubay@gustaveroussy.fr or stefan.michiels@gustaveroussy.fr.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Damien Drubay
  - INSERM U1018, CESP, Fac. de Médecine-Univ. Paris-Sud-UVSQ, INSERM, Université Paris-Saclay, 94807 Villejuif cedex, France
  - Gustave Roussy, Service de Biostatistique et d'Epidémiologie, Villejuif F-94805, France
- Daniel Gautheret
  - Institute for Integrative Biology of the Cell, Université Paris-Sud, CNRS, CEA, 91198 Gif-sur-Yvette, France
- Stefan Michiels
  - INSERM U1018, CESP, Fac. de Médecine-Univ. Paris-Sud-UVSQ, INSERM, Université Paris-Saclay, 94807 Villejuif cedex, France
  - Gustave Roussy, Service de Biostatistique et d'Epidémiologie, Villejuif F-94805, France
8
Agajanian S, Oluyemi O, Verkhivker GM. Integration of Random Forest Classifiers and Deep Convolutional Neural Networks for Classification and Biomolecular Modeling of Cancer Driver Mutations. Front Mol Biosci 2019; 6:44. [PMID: 31245384 PMCID: PMC6579812 DOI: 10.3389/fmolb.2019.00044] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 02/28/2019] [Accepted: 05/23/2019] [Indexed: 12/21/2022]
Abstract
The development of machine learning solutions for predicting the functional and clinical significance of cancer driver genes and mutations is paramount in modern biomedical research and has gained significant momentum in the last decade. In this work, we integrate different machine learning approaches, including tree-based methods, random forest and gradient boosted tree (GBT) classifiers, along with deep convolutional neural networks (CNN), for prediction of cancer driver mutations in genomic datasets. The feasibility of using CNN on raw nucleotide sequences for classification of cancer driver mutations was initially explored by employing label encoding, one-hot encoding, and embedding to preprocess the DNA information. These classifiers were benchmarked against their tree-based alternatives in order to evaluate the performance on a relative scale. We then integrated DNA-based scores generated by CNN with various categories of conservational, evolutionary and functional features into a generalized random forest classifier. The results of this study have demonstrated that CNN can learn high-level features from genomic information that are complementary to the ensemble-based predictors often employed for classification of cancer mutations. By combining a deep learning-generated score with only two main ensemble-based functional features, we can achieve superior performance across various machine learning classifiers. Our findings have also suggested that the synergy of nucleotide-based deep learning scores and integrated metrics derived from protein sequence conservation scores can allow for robust classification of cancer driver mutations with a limited number of highly informative features. Machine learning predictions are leveraged in molecular simulations, protein stability, and network-based analysis of cancer mutations in the protein kinase genes to obtain insights about molecular signatures of driver mutations and enhance the interpretability of cancer-specific classification models.
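Label encoding and one-hot encoding of a nucleotide sequence, as named in the abstract, can be sketched in a few lines. The A/C/G/T ordering and the handling of unknown bases (label -1, all-zero row) are assumptions of this example and are not taken from the paper.

```python
from typing import List

# Assumed base order; any fixed order works as long as it is used consistently.
INDEX = {base: i for i, base in enumerate("ACGT")}


def label_encode(seq: str) -> List[int]:
    """Map each base to an integer; unknown bases (e.g. N) become -1."""
    return [INDEX.get(base, -1) for base in seq.upper()]


def one_hot_encode(seq: str) -> List[List[int]]:
    """Map each base to a length-4 indicator row; unknown bases become all zeros."""
    rows = []
    for label in label_encode(seq):
        row = [0, 0, 0, 0]
        if label >= 0:
            row[label] = 1
        rows.append(row)
    return rows
```

One-hot rows avoid implying an ordering between bases, which integer labels do; that is the usual reason to prefer them as input to a convolutional layer.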
Affiliation(s)
- Steve Agajanian
  - Graduate Program in Computational and Data Sciences, Schmid College of Science and Technology, Chapman University, Orange, CA, United States
- Odeyemi Oluyemi
  - Graduate Program in Computational and Data Sciences, Schmid College of Science and Technology, Chapman University, Orange, CA, United States
- Gennady M Verkhivker
  - Graduate Program in Computational and Data Sciences, Schmid College of Science and Technology, Chapman University, Orange, CA, United States
  - Department of Biomedical and Pharmaceutical Sciences, Chapman University School of Pharmacy, Irvine, CA, United States
9
Lowdon RF, Wang T. Epigenomic annotation of noncoding mutations identifies mutated pathways in primary liver cancer. PLoS One 2017; 12:e0174032. [PMID: 28333948 PMCID: PMC5363827 DOI: 10.1371/journal.pone.0174032] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 11/18/2016] [Accepted: 03/02/2017] [Indexed: 11/19/2022]
Abstract
Evidence that noncoding mutation can result in cancer driver events is mounting. However, it is more difficult to assign molecular biological consequences to noncoding mutations than to coding mutations, and a typical cancer genome contains many more noncoding mutations than protein-coding mutations. Accordingly, parsing functional noncoding mutation signal from noise remains an important challenge. Here we use an empirical approach to identify putatively functional noncoding somatic single nucleotide variants (SNVs) from liver cancer genomes. Annotation of candidate variants by publicly available epigenome datasets finds that 40.5% of SNVs fall in regulatory elements. When assigned to specific regulatory elements, we find that the distribution of regulatory element mutation mirrors that of nonsynonymous coding mutation, where few regulatory elements are recurrently mutated in a patient population but many are singly mutated. We find potential gain-of-binding site events among candidate SNVs, suggesting a mechanism of action for these variants. When aggregating noncoding somatic mutation in promoters, we find that genes in the ERBB signaling and MAPK signaling pathways are significantly enriched for promoter mutations. Altogether, our results suggest that functional somatic SNVs in cancer are sporadic, but occasionally occur in regulatory elements and may affect phenotype by creating binding sites for transcriptional regulators. Accordingly, we propose that noncoding mutation should be formally accounted for when determining gene- and pathway-mutation burden in cancer.
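The share of variants that fall in regulatory elements is, at its core, an interval-overlap count. The sketch below shows one way to compute such a fraction on a single chromosome; it is not the authors' pipeline, and the function name, the half-open coordinate convention and the requirement that elements be sorted and non-overlapping are assumptions of this example.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple


def fraction_in_elements(positions: Sequence[int],
                         elements: List[Tuple[int, int]]) -> float:
    """Fraction of positions inside any element.

    elements must be sorted, non-overlapping, half-open (start, end) intervals
    on the same chromosome as the positions.
    """
    if not positions:
        return 0.0
    starts = [start for start, _ in elements]
    hits = 0
    for pos in positions:
        # Index of the last element whose start is <= pos, or -1 if none.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < elements[i][1]:
            hits += 1
    return hits / len(positions)
```

Binary search keeps each lookup logarithmic in the number of elements, which matters when millions of somatic variants are checked against a large annotation set.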
Affiliation(s)
- Rebecca F. Lowdon
  - Center for Genome Sciences and Systems Biology, Department of Genetics, Washington University in St. Louis, Saint Louis, Missouri, United States of America
- Ting Wang
  - Center for Genome Sciences and Systems Biology, Department of Genetics, Washington University in St. Louis, Saint Louis, Missouri, United States of America
10
Li H, He Z, Gu Y, Fang L, Lv X. Prioritization of non-coding disease-causing variants and long non-coding RNAs in liver cancer. Oncol Lett 2016; 12:3987-3994. [PMID: 27895760 DOI: 10.3892/ol.2016.5135] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 03/20/2015] [Accepted: 06/16/2016] [Indexed: 01/10/2023]
Abstract
There are multiple bioinformatics tools available for the detection of coding driver mutations in cancers. However, the prioritization of pathogenic non-coding variants remains a challenging and demanding task. The present study was performed to discriminate non-coding disease-causing mutations and prioritize potential cancer-implicated long non-coding RNAs (lncRNAs) in liver cancer using a logistic regression model. A logistic regression model was constructed by combining 19,153 disease-associated ClinVar and Human Gene Mutation Database pathogenic variants as the response variable and non-coding features as the predictor variable. Genome-wide association study (GWAS) disease- or trait-associated variants and recurrent somatic mutations were used to validate the model. Non-coding gene features with the highest fractions of load were characterized and potential cancer-associated lncRNA candidates were prioritized by combining the fraction of high-scoring regions and average score predicted by the logistic regression model. H3K9me3 and conserved regions were the most negatively and positively informative for the model, respectively. The area under the receiver operating characteristic curve of the model was 0.92. The average score of GWAS disease-associated variants was significantly increased compared with neutral single nucleotide polymorphisms (5.8642 vs. 5.4707; P<0.001); the average score of recurrent somatic mutations of liver cancer was significantly increased compared with non-recurrent somatic mutations (5.4101 vs. 5.2768; P=0.0125). The present study found regions in lncRNAs and introns/untranslated regions of protein coding genes where mutations are most likely to be damaging. In total, 847 lncRNAs were filtered out from the background. Characterization of this subset of lncRNAs showed that these lncRNAs are more conserved, less mutated and more highly expressed compared with other control lncRNAs. In addition, 23 of these lncRNAs were differentially expressed between 12 pairs of liver cancer and adjacent normal specimens. The logistic regression model is a useful tool to prioritize non-coding pathogenic variants and lncRNAs, and paves the way for the detection of non-coding driver lncRNAs in liver cancer.
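The area under the receiver operating characteristic curve reported above has a simple rank interpretation: it is the probability that a randomly chosen positive (pathogenic) variant scores higher than a randomly chosen negative (neutral) one, with ties counted as one half. This equals the Mann-Whitney U statistic divided by the number of positive-negative pairs. A minimal sketch follows; the function name is introduced here for illustration and no data from the study are used.

```python
from typing import Sequence


def roc_auc(scores_pos: Sequence[float], scores_neg: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties contribute 0.5. Runs in O(n_pos * n_neg), which is fine for a sketch.
    """
    if not scores_pos or not scores_neg:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives the reported 0.92 its scale.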
Affiliation(s)
- Hua Li
  - Department of Anesthesiology, Shanghai Pulmonary Hospital, Tongji University School of Medicine, Shanghai 200433, P.R. China
- Zekun He
  - Department of Clinical Medicine, Fuzhou Medical College of Nanchang University, Fuzhou, Jiangxi 344000, P.R. China
- Yang Gu
  - Department of Anesthesiology, Shanghai Pulmonary Hospital, Tongji University School of Medicine, Shanghai 200433, P.R. China
- Lin Fang
  - Department of Thyroid and Breast Surgery, Shanghai Tenth People's Hospital, Tongji University, School of Medicine, Shanghai 200072, P.R. China
- Xin Lv
  - Department of Anesthesiology, Shanghai Pulmonary Hospital, Tongji University School of Medicine, Shanghai 200433, P.R. China
11
Li H, Lv X. Functional annotation of noncoding variants and prioritization of cancer-associated lncRNAs in lung cancer. Oncol Lett 2016; 12:222-230. [PMID: 27347129 DOI: 10.3892/ol.2016.4604] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 03/12/2015] [Accepted: 04/01/2016] [Indexed: 11/05/2022]
Abstract
Multiple computational tools have been widely applied to the detection of coding driver mutations in cancer; however, the prioritization of pathogenic non-coding variants remains a difficult and demanding task. The present study was performed to distinguish non-coding disease-causing mutations from neutral ones, and to prioritize potential cancer-associated long non-coding RNAs (lncRNAs) with a logistic regression model in lung cancer. A logistic regression model was constructed, combining 19,153 disease-associated ClinVar and Human Gene Mutation Database pathogenic variants as the response variable and non-coding features as the predictor variable. Validation of the model was conducted with genome-wide association study (GWAS) disease- or trait-associated single nucleotide polymorphisms (SNPs) and recurrent somatic mutations. High scoring regions were characterized with respect to their distribution in various features and gene classes; potential cancer-associated lncRNA candidates were prioritized, combining the fraction of high-scoring regions and average score predicted by the logistic regression model. H3K79me2 was the most negative factor that contributed to the model, while conserved regions were most positively informative to the model. The area under the receiver operating characteristic curve of the model was 0.89. The model assigned a significantly higher score to GWAS SNPs and recurrent somatic mutations compared with neutral SNPs (mean, 5.9012 vs. 5.5238; P<0.001, Mann-Whitney U test) and non-recurrent mutations (mean, 5.4677 vs. 5.2277, P<0.001, Mann-Whitney U test), respectively. It was observed that regions, including splicing sites and untranslated regions, and gene classes, including cancer genes and cancer-associated lncRNAs, had an increased enrichment of high-scoring regions. In total, 2,679 cancer-associated lncRNAs were determined and characterized. A total of 104 of these lncRNAs were differentially expressed between lung cancer and normal specimens. The logistic regression model is a useful and efficient scoring system to prioritize non-coding pathogenic variants and lncRNAs, and may provide the basis for detecting non-coding driver lncRNAs in lung cancer.
Affiliation(s)
- Hua Li
- Department of Anesthesiology, Shanghai Pulmonary Hospital, School of Medicine, Tongji University, Shanghai 200072, P.R. China
| | - Xin Lv
- Department of Anesthesiology, Shanghai Pulmonary Hospital, School of Medicine, Tongji University, Shanghai 200072, P.R. China
| |
|
12
|
Wang Q, Zhang J, Liu Y, Zhang W, Zhou J, Duan R, Pu P, Kang C, Han L. A novel cell cycle-associated lncRNA, HOXA11-AS, is transcribed from the 5-prime end of the HOXA transcript and is a biomarker of progression in glioma. Cancer Lett 2016; 373:251-9. [PMID: 26828136 DOI: 10.1016/j.canlet.2016.01.039] [Citation(s) in RCA: 137] [Impact Index Per Article: 17.1] [Received: 11/25/2015] [Revised: 01/08/2016] [Accepted: 01/24/2016] [Indexed: 01/17/2023]
Abstract
The comprehensive lncRNA expression signature in glioma has not yet been fully elucidated. We performed a high-throughput microarray to detect the ncRNA expression profiles of 220 human glioma tissues. Here, we found that a novel lncRNA, HOXA11-AS, was the antisense transcript of the HOXA11 gene. It was shown that HOXA11-AS was closely associated with glioma grade and poor prognosis. Multivariate Cox regression analysis revealed that HOXA11-AS was an independent prognostic factor in glioblastoma multiforme patients, and its expression was correlated with the glioma molecular subtypes of the Cancer Genome Atlas. Gene set enrichment analysis indicated that the gene sets most correlated with HOXA11-AS expression were involved in cell cycle progression. Over-expression of the HOXA11-AS transcript promoted cell proliferation in vitro, while knockdown of HOXA11-AS expression repressed cell proliferation via regulation of cell cycle progression. The growth-promoting and growth-inhibiting effects of HOXA11-AS were also demonstrated in a xenograft mouse model. Our data confirm, for the first time, that HOXA11-AS is an important long non-coding RNA that primarily serves as a prognostic factor for glioma patient survival. HOXA11-AS could serve as a biomarker for identifying glioma molecular subtypes and as a therapeutic target for glioma patients.
Affiliation(s)
- Qixue Wang
- Department of Neurosurgery, Tianjin Medical University General Hospital, Tianjin 300052, China; Laboratory of Neuro-Oncology, Tianjin Neurological Institute, Tianjin 300052, China; Key Laboratory of Post-trauma Neuro-Repair and Regeneration in Central Nervous System, Ministry of Education, Tianjin 300052, China; Tianjin Key Laboratory of Injuries, Variations and Regeneration of Nervous System, Tianjin 300052, China; Chinese Glioma Cooperative Group (CGCG), 6 Tiantanxi Li, Beijing 100050, China
| | - Junxia Zhang
- Chinese Glioma Cooperative Group (CGCG), 6 Tiantanxi Li, Beijing 100050, China; Department of Neurosurgery, The First Affiliated Hospital of Nanjing Medical University, Nanjing 210029, China
| | - Yanwei Liu
- Chinese Glioma Cooperative Group (CGCG), 6 Tiantanxi Li, Beijing 100050, China; Glioma Center, Department of Neurosurgery, Beijing Tiantan Hospital, Capital Medical University, Beijing 100050, China
| | - Wei Zhang
- Chinese Glioma Cooperative Group (CGCG), 6 Tiantanxi Li, Beijing 100050, China; Glioma Center, Department of Neurosurgery, Beijing Tiantan Hospital, Capital Medical University, Beijing 100050, China
| | - Junhu Zhou
- Department of Neurosurgery, Tianjin Medical University General Hospital, Tianjin 300052, China; Laboratory of Neuro-Oncology, Tianjin Neurological Institute, Tianjin 300052, China; Key Laboratory of Post-trauma Neuro-Repair and Regeneration in Central Nervous System, Ministry of Education, Tianjin 300052, China; Tianjin Key Laboratory of Injuries, Variations and Regeneration of Nervous System, Tianjin 300052, China; Chinese Glioma Cooperative Group (CGCG), 6 Tiantanxi Li, Beijing 100050, China
| | - Ran Duan
- Chinese Glioma Cooperative Group (CGCG), 6 Tiantanxi Li, Beijing 100050, China; Glioma Center, Department of Neurosurgery, Beijing Tiantan Hospital, Capital Medical University, Beijing 100050, China
| | - Peiyu Pu
- Department of Neurosurgery, Tianjin Medical University General Hospital, Tianjin 300052, China; Laboratory of Neuro-Oncology, Tianjin Neurological Institute, Tianjin 300052, China; Key Laboratory of Post-trauma Neuro-Repair and Regeneration in Central Nervous System, Ministry of Education, Tianjin 300052, China; Tianjin Key Laboratory of Injuries, Variations and Regeneration of Nervous System, Tianjin 300052, China; Chinese Glioma Cooperative Group (CGCG), 6 Tiantanxi Li, Beijing 100050, China
| | - Chunsheng Kang
- Department of Neurosurgery, Tianjin Medical University General Hospital, Tianjin 300052, China; Laboratory of Neuro-Oncology, Tianjin Neurological Institute, Tianjin 300052, China; Key Laboratory of Post-trauma Neuro-Repair and Regeneration in Central Nervous System, Ministry of Education, Tianjin 300052, China; Tianjin Key Laboratory of Injuries, Variations and Regeneration of Nervous System, Tianjin 300052, China; Chinese Glioma Cooperative Group (CGCG), 6 Tiantanxi Li, Beijing 100050, China
| | - Lei Han
- Department of Neurosurgery, Tianjin Medical University General Hospital, Tianjin 300052, China; Laboratory of Neuro-Oncology, Tianjin Neurological Institute, Tianjin 300052, China; Key Laboratory of Post-trauma Neuro-Repair and Regeneration in Central Nervous System, Ministry of Education, Tianjin 300052, China; Tianjin Key Laboratory of Injuries, Variations and Regeneration of Nervous System, Tianjin 300052, China; Chinese Glioma Cooperative Group (CGCG), 6 Tiantanxi Li, Beijing 100050, China.
| |
|