1
Lynn N, Tuller T. Detecting and understanding meaningful cancerous mutations based on computational models of mRNA splicing. NPJ Syst Biol Appl 2024; 10:25. [PMID: 38453965 PMCID: PMC10920900 DOI: 10.1038/s41540-024-00351-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2023] [Accepted: 02/22/2024] [Indexed: 03/09/2024]
Abstract
Cancer research has long relied on non-silent mutations. Yet, it has become overwhelmingly clear that silent mutations can affect gene expression and cancer cell fitness. One fundamental mechanism that apparently silent mutations can severely disrupt is alternative splicing. Here we introduce Oncosplice, a tool that scores mutations based on models of proteomes generated using aberrant splicing predictions. Oncosplice leverages a highly accurate neural network that predicts splice sites within arbitrary mRNA sequences, a greedy transcript constructor that considers alternate arrangements of splicing blueprints, and an algorithm that grades the functional divergence between proteins based on evolutionary conservation. By applying this tool to 12M somatic mutations we identify 8K deleterious variants that are significantly depleted within the healthy population; we demonstrate the tool's ability to identify clinically validated pathogenic variants with a positive predictive value of 94%; we show strong enrichment of predicted deleterious mutations across pan-cancer drivers. We also achieve improved patient survival estimation using a proposed set of novel cancer-involved genes. Ultimately, this pipeline enables accelerated insight-gathering of sequence-specific consequences for a class of understudied mutations and provides an efficient way of filtering through massive variant datasets - functionalities with immediate experimental and clinical applications.
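The positive predictive value quoted above is the fraction of variants called deleterious that are truly pathogenic. A minimal Python sketch of that arithmetic follows; the function name and the counts are hypothetical illustrations, not figures from the study.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """PPV = TP / (TP + FP): the share of positive calls that are correct."""
    total_calls = true_positives + false_positives
    if total_calls == 0:
        raise ValueError("PPV is undefined when no positive calls were made")
    return true_positives / total_calls


# Hypothetical counts, chosen only to illustrate a PPV of 94%.
example_ppv = positive_predictive_value(true_positives=94, false_positives=6)
```

PPV depends on how common pathogenic variants are in the evaluated set, so the same classifier could yield a different value on a dataset with a different class balance.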
Affiliation(s)
- Nicolas Lynn
- Department of Biomedical Engineering, the Engineering Faculty, Tel Aviv University, Tel-Aviv, 69978, Israel
- Tamir Tuller
- Department of Biomedical Engineering, the Engineering Faculty, Tel Aviv University, Tel-Aviv, 69978, Israel.
2
Tobacco Mosaic Virus Infection of Chrysanthemums in Thailand: Development of Colorimetric Reverse-Transcription Loop-Mediated Isothermal Amplification (RT–LAMP) Technique for Sensitive and Rapid Detection. PLANTS 2022; 11:plants11141788. [PMID: 35890422 PMCID: PMC9325109 DOI: 10.3390/plants11141788] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2022] [Revised: 07/04/2022] [Accepted: 07/04/2022] [Indexed: 11/23/2022]
Abstract
We detected tobacco mosaic virus (TMV), a member of the genus Tobamovirus and one of the most significant plant-infecting viruses, for the first time in a chrysanthemum in Thailand using reverse-transcription polymerase chain reaction (RT–PCR). The TMV-infected chrysanthemum leaves exhibited mosaic symptoms. We conducted a sequence analysis of the coat protein (CP) gene and found that the TMV detected in the chrysanthemum had 98% identity with other TMV isolates in GenBank. We carried out bioassays and showed that TMV induced mosaic and stunting symptoms in inoculated chrysanthemums. We observed the rigid rod structure of TMV under a transmission electron microscope (TEM). To enhance the speed and sensitivity of detection, we developed a colorimetric RT loop-mediated isothermal amplification (LAMP) technique. We achieved LAMP detection after 30 min incubation in isothermal conditions at 65 °C, and distinguished the positive results according to the color change from pink to yellow. The sensitivity of the LAMP technique was 1000-fold greater than that of RT–PCR, and we found no cross-reactivity with other viruses or viroids. This is the first reported case of a TMV-infected chrysanthemum in Thailand, and our colorimetric RT–LAMP TMV detection method is the first of its kind.
3
Gong Y, Srinivasan SS, Zhang R, Kessenbrock K, Zhang J. scEpiLock: A Weakly Supervised Learning Framework for cis-Regulatory Element Localization and Variant Impact Quantification for Single-Cell Epigenetic Data. Biomolecules 2022; 12:874. [PMID: 35883430 PMCID: PMC9312957 DOI: 10.3390/biom12070874] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2022] [Revised: 06/16/2022] [Accepted: 06/16/2022] [Indexed: 02/04/2023]
Abstract
Recent advances in single-cell transposase-accessible chromatin using a sequencing assay (scATAC-seq) allow cellular heterogeneity dissection and regulatory landscape reconstruction with an unprecedented resolution. However, compared to bulk-sequencing, its ultra-high missingness remarkably reduces usable reads in each cell type, resulting in broader, fuzzier peak boundary definitions and limiting our ability to pinpoint functional regions and interpret variant impacts precisely. We propose a weakly supervised learning method, scEpiLock, to directly identify core functional regions from coarse peak labels and quantify variant impacts in a cell-type-specific manner. First, scEpiLock uses a multi-label classifier to predict chromatin accessibility via a deep convolutional neural network. Then, its weakly supervised object detection module further refines the peak boundary definition using gradient-weighted class activation mapping (Grad-CAM). Finally, scEpiLock provides cell-type-specific variant impacts within a given peak region. We applied scEpiLock to various scATAC-seq datasets and found that it achieves an area under receiver operating characteristic curve (AUC) of ~0.9 and an area under precision recall (AUPR) above 0.7. Besides, scEpiLock's object detection condenses coarse peaks to only ⅓ of their original size while still reporting higher conservation scores. In addition, we applied scEpiLock on brain scATAC-seq data and reported several genome-wide association studies (GWAS) variants disrupting regulatory elements around known risk genes for Alzheimer's disease, demonstrating its potential to provide cell-type-specific biological insights in disease studies.
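The boundary-refinement step can be pictured with a small Grad-CAM sketch along a single sequence axis. This is an illustration under stated assumptions, not scEpiLock's code: the function names, the toy inputs, and the rule of keeping positions at or above a fraction of the maximum activation are all hypothetical.

```python
from typing import List, Tuple


def grad_cam_1d(activations: List[List[float]],
                gradients: List[List[float]]) -> List[float]:
    """Grad-CAM along one sequence axis.

    activations[k][i]: feature map of channel k at position i.
    gradients[k][i]: gradient of the class score w.r.t. that activation.
    """
    n_positions = len(activations[0])
    # Channel weight = gradient averaged over positions (global average pooling).
    weights = [sum(g) / len(g) for g in gradients]
    cam = []
    for i in range(n_positions):
        value = sum(w * a[i] for w, a in zip(weights, activations))
        cam.append(max(0.0, value))  # ReLU keeps only positive evidence
    return cam


def refine_peak(cam: List[float], fraction: float = 0.5) -> Tuple[int, int]:
    """Return inclusive (start, end) of positions with cam >= fraction * max(cam).

    The thresholding rule is an assumption made for this sketch.
    """
    threshold = fraction * max(cam)
    kept = [i for i, v in enumerate(cam) if v >= threshold]
    return kept[0], kept[-1]


# Toy input: two channels over five positions.
toy_activations = [[0.0, 1.0, 2.0, 1.0, 0.0], [1.0, 1.0, 1.0, 1.0, 1.0]]
toy_gradients = [[1.0] * 5, [-0.5] * 5]
toy_cam = grad_cam_1d(toy_activations, toy_gradients)
toy_core = refine_peak(toy_cam)
```

In this toy case the coarse five-position window shrinks to the single position with the strongest class-specific signal, which is the kind of condensation the abstract describes for peak regions.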
Affiliation(s)
- Yanwen Gong
- Center for Complex Biological Systems, University of California, Irvine, CA 92697, USA;
- Department of Biological Chemistry, School of Medicine, University of California, Irvine, CA 92697, USA
- Ruiyi Zhang
- Department of Computer Science, University of California, Irvine, CA 92697, USA; (S.S.S.); (R.Z.)
- Kai Kessenbrock
- Center for Complex Biological Systems, University of California, Irvine, CA 92697, USA;
- Department of Biological Chemistry, School of Medicine, University of California, Irvine, CA 92697, USA
- Jing Zhang
- Department of Computer Science, University of California, Irvine, CA 92697, USA; (S.S.S.); (R.Z.)
4
Morales-Laverde L, Echeverz M, Trobos M, Solano C, Lasa I. Experimental Polymorphism Survey in Intergenic Regions of the icaADBCR Locus in Staphylococcus aureus Isolates from Periprosthetic Joint Infections. Microorganisms 2022; 10:600. [PMID: 35336176 PMCID: PMC8955882 DOI: 10.3390/microorganisms10030600] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 02/15/2022] [Revised: 03/03/2022] [Accepted: 03/04/2022] [Indexed: 12/18/2022]
Abstract
Staphylococcus aureus is a leading cause of prosthetic joint infections (PJI) characterized by bacterial biofilm formation and recalcitrance to immune-mediated clearance and antibiotics. The molecular events behind PJI infection are yet to be unraveled. In this sense, identification of polymorphisms in bacterial genomes may help to establish associations between sequence variants and the ability of S. aureus to cause PJI. Here, we report an experimental nucleotide-level survey specifically aimed at the intergenic regions (IGRs) of the icaADBCR locus, which is responsible for the synthesis of the biofilm exopolysaccharide PIA/PNAG, in a collection of strains sampled from PJI and wounds. IGRs of the icaADBCR locus were highly conserved and no PJI-specific SNPs were found. Moreover, polymorphisms in these IGRs did not significantly affect transcription of the icaADBC operon under in vitro laboratory conditions. In contrast, an SNP within the icaR coding region, resulting in a V176E change in the transcriptional repressor IcaR, led to a significant increase in icaADBC operon transcription and PIA/PNAG production and a reduction in S. aureus virulence in a Galleria mellonella infection model. In conclusion, SNPs in icaADBCR IGRs of S. aureus isolates from PJI are not associated with icaADBC expression, PIA/PNAG production and adaptation to PJI.
Affiliation(s)
- Liliana Morales-Laverde
- Laboratory of Microbial Pathogenesis, Navarrabiomed, Hospital Universitario de Navarra (HUN), Universidad Pública de Navarra (UPNA), IdiSNA, 31008 Pamplona, Spain; (L.M.-L.); (M.E.); (C.S.)
- Maite Echeverz
- Laboratory of Microbial Pathogenesis, Navarrabiomed, Hospital Universitario de Navarra (HUN), Universidad Pública de Navarra (UPNA), IdiSNA, 31008 Pamplona, Spain; (L.M.-L.); (M.E.); (C.S.)
- Margarita Trobos
- Department of Biomaterials, Institute of Clinical Sciences, Sahlgrenska Academy at University of Gothenburg, 40530 Gothenburg, Sweden;
- Cristina Solano
- Laboratory of Microbial Pathogenesis, Navarrabiomed, Hospital Universitario de Navarra (HUN), Universidad Pública de Navarra (UPNA), IdiSNA, 31008 Pamplona, Spain; (L.M.-L.); (M.E.); (C.S.)
- Iñigo Lasa
- Laboratory of Microbial Pathogenesis, Navarrabiomed, Hospital Universitario de Navarra (HUN), Universidad Pública de Navarra (UPNA), IdiSNA, 31008 Pamplona, Spain; (L.M.-L.); (M.E.); (C.S.)
5
Genetic load: genomic estimates and applications in non-model animals. Nat Rev Genet 2022; 23:492-503. [PMID: 35136196 DOI: 10.1038/s41576-022-00448-x] [Citation(s) in RCA: 60] [Impact Index Per Article: 30.0] [Accepted: 01/10/2022] [Indexed: 12/11/2022]
Abstract
Genetic variation, which is generated by mutation, recombination and gene flow, can reduce the mean fitness of a population, both now and in the future. This 'genetic load' has been estimated in a wide range of animal taxa using various approaches. Advances in genome sequencing and computational techniques now enable us to estimate the genetic load in populations and individuals without direct fitness estimates. Here, we review the classic and contemporary literature of genetic load. We describe approaches to quantify the genetic load in whole-genome sequence data based on evolutionary conservation and annotations. We show that splitting the load into its two components - the realized load (or expressed load) and the masked load (or inbreeding load) - can improve our understanding of the population genetics of deleterious mutations.
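The split into realized and masked load can be made concrete with a short Python sketch. It assumes a classical single-population formalization that the abstract itself does not state: for each locus with deleterious allele frequency q, selection coefficient s and dominance coefficient h, under random mating, the total load is the sum of q·s, the realized load is the sum of q²·s + 2q(1−q)·h·s, and the masked load is the difference. The function name and example values are hypothetical.

```python
from typing import Sequence, Tuple


def partition_load(freqs: Sequence[float],
                   sel_coeffs: Sequence[float],
                   dominance: Sequence[float]) -> Tuple[float, float, float]:
    """Return (total, realized, masked) load summed over loci.

    Assumed model: random mating, independent loci, load measured as the
    expected fitness reduction contributed by each deleterious allele.
    """
    total = 0.0
    realized = 0.0
    for q, s, h in zip(freqs, sel_coeffs, dominance):
        total += q * s                      # load if every copy were expressed
        realized += q * q * s               # homozygotes express s in full
        realized += 2 * q * (1 - q) * h * s  # heterozygotes express h * s
    masked = total - realized               # hidden in heterozygotes
    return total, realized, masked


# Hypothetical locus: fully recessive allele (h = 0) at frequency 0.1, s = 0.5.
total_load, realized_load, masked_load = partition_load([0.1], [0.5], [0.0])
```

For a fully recessive allele most of the load is masked, which is why inbreeding, by raising homozygosity, can convert masked load into realized load.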
6
Sivaprakasam B, Sadagopan P. Development of shiny dashboard application for “genome-wide association study on analysis of SNPs injected in Homo sapiens genome (snips-HsG)”. GENE REPORTS 2021. [DOI: 10.1016/j.genrep.2021.101033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
7
Chen D, Cremona MA, Qi Z, Mitra RD, Chiaromonte F, Makova KD. Human L1 Transposition Dynamics Unraveled with Functional Data Analysis. Mol Biol Evol 2021; 37:3576-3600. [PMID: 32722770 DOI: 10.1093/molbev/msaa194] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/14/2022]
Abstract
Long INterspersed Elements-1 (L1s) constitute >17% of the human genome and still actively transpose in it. Characterizing L1 transposition across the genome is critical for understanding genome evolution and somatic mutations. However, to date, L1 insertion and fixation patterns have not been studied comprehensively. To fill this gap, we investigated three genome-wide data sets of L1s that integrated at different evolutionary times: 17,037 de novo L1s (from an L1 insertion cell-line experiment conducted in-house), and 1,212 polymorphic and 1,205 human-specific L1s (from public databases). We characterized 49 genomic features (proxying chromatin accessibility, transcriptional activity, replication, recombination, etc.) in the ±50 kb flanks of these elements. These features were contrasted between the three L1 data sets and L1-free regions using state-of-the-art Functional Data Analysis statistical methods, which treat high-resolution data as mathematical functions. Our results indicate that de novo, polymorphic, and human-specific L1s are surrounded by different genomic features acting at specific locations and scales. This led to an integrative model of L1 transposition, according to which L1s preferentially integrate into open-chromatin regions enriched in non-B DNA motifs, whereas they are fixed in regions largely free of purifying selection, that is, depleted of genes and of the most conserved noncoding elements. Intriguingly, our results suggest that L1 insertions modify the local genomic landscape by extending CpG methylation and increasing mononucleotide microsatellite density. Altogether, our findings substantially facilitate understanding of L1 integration and fixation preferences, pave the way for uncovering their role in aging and cancer, and inform their use as mutagenesis tools in genetic studies.
Affiliation(s)
- Di Chen
- Intercollege Graduate Degree Program in Genetics, The Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA
- Marzia A Cremona
- Department of Statistics, The Pennsylvania State University, University Park, PA; Department of Operations and Decision Systems, Université Laval, Québec, Canada
- Zongtai Qi
- Department of Genetics and Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO
- Robi D Mitra
- Department of Genetics and Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO
- Francesca Chiaromonte
- Department of Statistics, The Pennsylvania State University, University Park, PA; EMbeDS, Sant'Anna School of Advanced Studies, Pisa, Italy; The Huck Institutes of the Life Sciences, Center for Medical Genomics, The Pennsylvania State University, University Park, PA
- Kateryna D Makova
- The Huck Institutes of the Life Sciences, Center for Medical Genomics, The Pennsylvania State University, University Park, PA; Department of Biology, The Pennsylvania State University, University Park, PA
8
Mustofa I, Susilowati S, Wurlina W, Hernawati T, Oktanella Y. Green tea extract increases the quality and reduced DNA mutation of post-thawed Kacang buck sperm. Heliyon 2021; 7:e06372. [PMID: 33732926 PMCID: PMC7944040 DOI: 10.1016/j.heliyon.2021.e06372] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 01/07/2021] [Revised: 01/29/2021] [Accepted: 02/23/2021] [Indexed: 01/11/2023]
Abstract
The study aimed to determine the effect of adding green tea extract (GTE) to the extender on the quality and DNA mutation of post-thawed Kacang buck sperm. Sperm DNA mutation was assessed in the nicotinamide adenine dinucleotide hydride (NADH) dehydrogenase 1 (ND1) gene of mitochondrial deoxyribonucleic acid (mtDNA). A pool of 12 Kacang buck ejaculates was diluted in skim milk-egg yolk extender containing 0, 0.05, 0.10, and 0.15 mg of GTE/100 mL for the T0, T1, T2, and T3 groups, respectively. Each aliquot group was packaged in 0.25 mL French mini straws containing 60 million live sperm and frozen according to the protocol. ND1 mtDNA was amplified by polymerase chain reaction (PCR), followed by DNA sequencing using the Sanger method. The phylogenetic tree was constructed using the neighbor-joining (NJ) method with MEGA 7.0 software. The results showed that the T2 group maintained the highest quality of post-thawed Kacang buck semen: it had the highest percentages of sperm viability, motility, and intact plasma membrane (IPM), and the lowest malondialdehyde (MDA) concentration, sperm DNA fragmentation (SDF), and total frequency and types of ND1 mtDNA mutation. The phylogenetic tree analysis revealed that the clade of the T2 group was most closely related to the reference sequence. However, there was no correlation between the semen quality parameters (sperm viability, motility, IPM, MDA concentration, and SDF) and ND1 mtDNA mutation of post-thawed Kacang buck semen. It could be concluded that GTE is useful as an antioxidant in Kacang buck semen extender for frozen sperm.
Affiliation(s)
- Imam Mustofa
- Department of Veterinary Reproduction, Faculty of Veterinary Medicine, Universitas Airlangga, Kampus C Unair, Mulyorejo, Surabaya, 60115, East Java, Indonesia
- Suherni Susilowati
- Department of Veterinary Reproduction, Faculty of Veterinary Medicine, Universitas Airlangga, Kampus C Unair, Mulyorejo, Surabaya, 60115, East Java, Indonesia
- Wurlina Wurlina
- Department of Veterinary Reproduction, Faculty of Veterinary Medicine, Universitas Airlangga, Kampus C Unair, Mulyorejo, Surabaya, 60115, East Java, Indonesia
- Tatik Hernawati
- Department of Veterinary Reproduction, Faculty of Veterinary Medicine, Universitas Airlangga, Kampus C Unair, Mulyorejo, Surabaya, 60115, East Java, Indonesia
- Yudit Oktanella
- Department of Veterinary Reproduction, Faculty of Veterinary Medicine, Brawijaya University, Jl. Veteran, Ketawanggede, Lowokwaru, Malang, 65145, Indonesia
9
Huber CD, Kim BY, Lohmueller KE. Population genetic models of GERP scores suggest pervasive turnover of constrained sites across mammalian evolution. PLoS Genet 2020; 16:e1008827. [PMID: 32469868 PMCID: PMC7286533 DOI: 10.1371/journal.pgen.1008827] [Citation(s) in RCA: 47] [Impact Index Per Article: 11.8] [Received: 01/27/2020] [Revised: 06/10/2020] [Accepted: 05/05/2020] [Indexed: 01/20/2023]
Abstract
Comparative genomic approaches have been used to identify sites where mutations are under purifying selection and of functional consequence by searching for sequences that are conserved across distantly related species. However, the performance of these approaches has not been rigorously evaluated under population genetic models. Further, short-lived functional elements may not leave a footprint of sequence conservation across many species. We use simulations to study how one measure of conservation, the Genomic Evolutionary Rate Profiling (GERP) score, relates to the strength of selection (Nes). We show that the GERP score is related to the strength of purifying selection. However, changes in selection coefficients or functional elements over time (i.e. functional turnover) can strongly affect the GERP distribution, leading to unexpected relationships between GERP and Nes. Further, we show that for functional elements that have a high turnover rate, adding more species to the analysis does not necessarily increase statistical power. Finally, we use the distribution of GERP scores across the human genome to compare models with and without turnover of sites where mutations are under purifying selection. We show that mutations in 4.51% of the noncoding human genome are under purifying selection and that most of this sequence has likely experienced changes in selection coefficients throughout mammalian evolution. Our work reveals limitations to using comparative genomic approaches to identify deleterious mutations. Commonly used GERP score thresholds miss over half of the noncoding sites in the human genome where mutations are under purifying selection.
One of the most significant and challenging tasks in modern genomics is to assess the functional consequences of a particular nucleotide change in a genome. A common approach to address this challenge prioritizes sequences that share similar nucleotides across distantly related species, with the rationale that mutations at such positions were deleterious and removed from the population by purifying natural selection. Our manuscript shows that one popular measure of sequence conservation, the GERP score, performs well at identifying selected mutations if mutations at a site were under selection across all of mammalian evolution. Changes in selection at a given site dramatically reduce the power of GERP to detect selected mutations in humans. We also combine population genetic models with the distribution of GERP scores at noncoding sites across the human genome to show that the degree of selection at individual sites has changed throughout mammalian evolution. Importantly, we demonstrate that at least 80 Mb of noncoding sequence under purifying selection in humans will not have extreme GERP scores and will likely be missed by modern comparative genomic approaches. Our work argues that new approaches, potentially based on genetic variation within species, will be required to identify deleterious mutations.
Affiliation(s)
- Christian D. Huber
- School of Biological Sciences, University of Adelaide, Adelaide, South Australia, Australia
- Bernard Y. Kim
- Department of Biology, Stanford University, Stanford, California, United States of America
- Kirk E. Lohmueller
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, California, United States of America
- Interdepartmental Program in Bioinformatics, University of California, Los Angeles, California, United States of America
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, California, United States of America
10
Lou S, Cotter KA, Li T, Liang J, Mohsen H, Liu J, Zhang J, Cohen S, Xu J, Yu H, Rubin MA, Gerstein M. GRAM: A GeneRAlized Model to predict the molecular effect of a non-coding variant in a cell-type specific manner. PLoS Genet 2019; 15:e1007860. [PMID: 31469829 PMCID: PMC6742416 DOI: 10.1371/journal.pgen.1007860] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 11/26/2018] [Revised: 09/12/2019] [Accepted: 07/22/2019] [Indexed: 12/19/2022]
Abstract
There has been much effort to prioritize genomic variants with respect to their impact on "function". However, function is often not precisely defined: sometimes it is the disease association of a variant; on other occasions, it reflects a molecular effect on transcription or epigenetics. Here, we coupled multiple genomic predictors to build GRAM, a GeneRAlized Model, to predict a well-defined experimental target: the expression-modulating effect of a non-coding variant on its associated gene, in a transferable, cell-specific manner. First, we performed feature engineering: using LASSO, a regularized linear model, we found transcription factor (TF) binding most predictive, especially for TFs that are hubs in the regulatory network; in contrast, evolutionary conservation, a popular feature in many other variant-impact predictors, has almost no contribution. Moreover, TF binding inferred from in vitro SELEX is as effective as that from in vivo ChIP-Seq. Second, we implemented GRAM integrating only SELEX features and expression profiles; thus, the program combines a universal regulatory score with an easily obtainable modifier reflecting the particular cell type. We benchmarked GRAM on large-scale MPRA datasets, achieving AUROC scores of 0.72 in GM12878 and 0.66 in a multi-cell line dataset. We then evaluated the performance of GRAM on targeted regions using luciferase assays in the MCF7 and K562 cell lines. We noted that changing the insertion position of the construct relative to the reporter gene gave very different results, highlighting the importance of carefully defining the exact prediction target of the model. Finally, we illustrated the utility of GRAM in fine-mapping causal variants and developed a practical software pipeline to carry this out. In particular, we demonstrated in specific examples how the pipeline could pinpoint variants that directly modulate gene expression within a larger linkage-disequilibrium block associated with a phenotype of interest (e.g., for an eQTL).
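The AUROC values reported above can be read as the probability that a randomly chosen expression-modulating variant receives a higher score than a randomly chosen non-modulating one. A minimal Python sketch of that pairwise definition follows; the function name and the toy scores are hypothetical and unrelated to the GRAM benchmarks.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Runs in O(P * N), which is fine for
    an illustration but slow for large benchmarks.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy scores: three of the four positive/negative pairs are ordered correctly.
toy_auroc = auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for reading the 0.72 and 0.66 figures.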
Affiliation(s)
- Shaoke Lou
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America
- Kellie A Cotter
- Department for BioMedical Research, University of Bern, CH, Bern, Switzerland
- Tianxiao Li
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America
- Jin Liang
- Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, United States of America
- Hussein Mohsen
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America; Program in the History of Science and Medicine, Yale University, New Haven, Connecticut, United States of America
- Jason Liu
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America
- Jing Zhang
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America
- Sandra Cohen
- Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, Cornell University, New York, New York, United States of America
- Jinrui Xu
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America
- Haiyuan Yu
- Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, United States of America; Department of Computational Biology, Cornell University, Ithaca, New York, United States of America
- Mark A Rubin
- Department for BioMedical Research, University of Bern, CH, Bern, Switzerland; Weill Cornell Medicine, New York, United States of America
- Mark Gerstein
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America
11
Bonjoch L, Mur P, Arnau-Collell C, Vargas-Parra G, Shamloo B, Franch-Expósito S, Pineda M, Capellà G, Erman B, Castellví-Bel S. Approaches to functionally validate candidate genetic variants involved in colorectal cancer predisposition. Mol Aspects Med 2019; 69:27-40. [PMID: 30935834 DOI: 10.1016/j.mam.2019.03.004] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 01/10/2019] [Revised: 03/26/2019] [Accepted: 03/26/2019] [Indexed: 02/07/2023]
Abstract
Most next generation sequencing (NGS) studies identify candidate genetic variants predisposing to colorectal cancer (CRC) but do not tackle their functional interpretation to unequivocally recognize a new hereditary CRC gene. Besides, germline variants in already established hereditary CRC-predisposing genes or somatic variants share the same need when trying to categorize those with relevant significance. Functional genomics approaches have an important role in identifying the causal links between genetic architecture and phenotypes, in order to decipher cellular function in health and disease. Therefore, functional interpretation of genetic variants identified by NGS platforms is now essential. Currently available approaches include bioinformatics, cell and molecular biology and animal models. Recent advances, such as the CRISPR-Cas9, ZFN and TALEN systems, have already been used as powerful tools with this objective. However, the use of cell lines is of limited value due to CRC heterogeneity and its close interaction with the microenvironment. Access to three-dimensional cultures or organoids and xenograft models that mimic the in vivo tissue architecture could revolutionize functional analysis. This review will focus on the application of state-of-the-art functional studies to better tackle new genes involved in germline predisposition to this neoplasm.
Affiliation(s)
- Laia Bonjoch
- Gastroenterology Department, Hospital Clínic, Institut d'Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBEREHD), University of Barcelona, Barcelona, Spain
- Pilar Mur
- Hereditary Cancer Program, Catalan Institute of Oncology, Institut d'Investigació Biomèdica de Bellvitge (IDIBELL), ONCOBELL Program, L'Hospitalet de Llobregat, Barcelona, Spain; Centro de Investigación Biomédica en Red de Cáncer (CIBERONC), Spain
- Coral Arnau-Collell
- Gastroenterology Department, Hospital Clínic, Institut d'Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBEREHD), University of Barcelona, Barcelona, Spain
- Gardenia Vargas-Parra
- Hereditary Cancer Program, Catalan Institute of Oncology, Institut d'Investigació Biomèdica de Bellvitge (IDIBELL), ONCOBELL Program, L'Hospitalet de Llobregat, Barcelona, Spain; Centro de Investigación Biomédica en Red de Cáncer (CIBERONC), Spain
- Bahar Shamloo
- Molecular Biology, Genetics, and Bioengineering Department, Legacy Research Institute, Portland, OR, USA
- Sebastià Franch-Expósito
- Gastroenterology Department, Hospital Clínic, Institut d'Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBEREHD), University of Barcelona, Barcelona, Spain
- Marta Pineda
- Hereditary Cancer Program, Catalan Institute of Oncology, Institut d'Investigació Biomèdica de Bellvitge (IDIBELL), ONCOBELL Program, L'Hospitalet de Llobregat, Barcelona, Spain; Centro de Investigación Biomédica en Red de Cáncer (CIBERONC), Spain
- Gabriel Capellà
- Hereditary Cancer Program, Catalan Institute of Oncology, Institut d'Investigació Biomèdica de Bellvitge (IDIBELL), ONCOBELL Program, L'Hospitalet de Llobregat, Barcelona, Spain; Centro de Investigación Biomédica en Red de Cáncer (CIBERONC), Spain
| | - Batu Erman
- Molecular Biology, Genetics and Bioengineering Program, Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey
| | - Sergi Castellví-Bel
- Gastroenterology Department, Hospital Clínic, Institut d'Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBEREHD), University of Barcelona, Barcelona, Spain.
|
12
|
Zhou Y, Fujikura K, Mkrtchian S, Lauschke VM. Computational Methods for the Pharmacogenetic Interpretation of Next Generation Sequencing Data. Front Pharmacol 2018; 9:1437. [PMID: 30564131 PMCID: PMC6288784 DOI: 10.3389/fphar.2018.01437] [Citation(s) in RCA: 48] [Impact Index Per Article: 8.0] [Received: 08/03/2018] [Accepted: 11/20/2018] [Indexed: 12/21/2022] Open
Abstract
Up to half of all patients do not respond to pharmacological treatment as intended. A substantial fraction of these inter-individual differences is due to heritable factors, and a growing number of associations between genetic variations and drug response phenotypes have been identified. Importantly, the rapid progress in Next Generation Sequencing technologies in recent years has unveiled the true complexity of the genetic landscape in pharmacogenes, with tens of thousands of rare genetic variants. As each individual was found to harbor numerous such rare variants, they are anticipated to be important contributors to the genetically encoded inter-individual variability in drug effects. The fundamental challenge, however, is their functional interpretation: the sheer scale of the problem currently renders systematic experimental characterization of these variants unfeasible. Here, we review concepts and important progress in the development of computational prediction methods that allow the effect of amino acid sequence alterations in drug metabolizing enzymes and transporters to be evaluated. In addition, we discuss recent advances in the interpretation of functional effects of non-coding variants, such as variations in splice sites, regulatory regions and miRNA binding sites. We anticipate that these methodologies will provide a useful toolkit to facilitate the integration of the vast extent of rare genetic variability into drug response predictions in a precision medicine framework.
Affiliation(s)
- Yitian Zhou
- Section of Pharmacogenetics, Department of Physiology and Pharmacology, Karolinska Institutet, Stockholm, Sweden
| | - Kohei Fujikura
- Department of Diagnostic Pathology, Kobe University Graduate School of Medicine, Kobe, Japan
| | - Souren Mkrtchian
- Section of Pharmacogenetics, Department of Physiology and Pharmacology, Karolinska Institutet, Stockholm, Sweden
| | - Volker M. Lauschke
- Section of Pharmacogenetics, Department of Physiology and Pharmacology, Karolinska Institutet, Stockholm, Sweden
|
13
|
Hoffmann RD, Palmgren M. Purifying selection acts on coding and non-coding sequences of paralogous genes in Arabidopsis thaliana. BMC Genomics 2016; 17:456. [PMID: 27296049 PMCID: PMC4906602 DOI: 10.1186/s12864-016-2803-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 11/25/2015] [Accepted: 05/27/2016] [Indexed: 01/13/2023] Open
Abstract
Background: Whole-genome duplications in the ancestors of many diverse species provided the genetic material for evolutionary novelty. Several models explain the retention of paralogous genes. However, how these models are reflected in the evolution of coding and non-coding sequences of paralogous genes is unknown. Results: Here, we analyzed the coding and non-coding sequences of paralogous genes in Arabidopsis thaliana and compared these sequences with those of orthologous genes in Arabidopsis lyrata. Paralogs with lower expression than their duplicate had more nonsynonymous substitutions, were more likely to fractionate, and exhibited less similar expression patterns with their orthologs in the other species. Also, lower-expressed genes had greater tissue specificity. Orthologous conserved non-coding sequences in the promoters, introns, and 3′ untranslated regions were less abundant at lower-expressed genes compared to their higher-expressed paralogs. A gene ontology (GO) term enrichment analysis showed that paralogs with similar expression levels were enriched in GO terms related to ribosomes, whereas paralogs with different expression levels were enriched in terms associated with stress responses. Conclusions: Loss of conserved non-coding sequences in one gene of a paralogous gene pair correlates with reduced expression levels that are more tissue specific. Together with increased mutation rates in the coding sequences, this suggests that similar forces of purifying selection act on coding and non-coding sequences. We propose that coding and non-coding sequences evolve concurrently following gene duplication.
Affiliation(s)
- Robert D Hoffmann
- Center for Membrane Pumps in Cells and Disease - PUMPKIN, Danish National Research Foundation, Department of Plant and Environmental Sciences, University of Copenhagen, 1871, Frederiksberg C, Denmark.
| | - Michael Palmgren
- Center for Membrane Pumps in Cells and Disease - PUMPKIN, Danish National Research Foundation, Department of Plant and Environmental Sciences, University of Copenhagen, 1871, Frederiksberg C, Denmark
|
14
|
Bendl J, Musil M, Štourač J, Zendulka J, Damborský J, Brezovský J. PredictSNP2: A Unified Platform for Accurately Evaluating SNP Effects by Exploiting the Different Characteristics of Variants in Distinct Genomic Regions. PLoS Comput Biol 2016; 12:e1004962. [PMID: 27224906 PMCID: PMC4880439 DOI: 10.1371/journal.pcbi.1004962] [Citation(s) in RCA: 133] [Impact Index Per Article: 16.6] [Received: 12/16/2015] [Accepted: 05/05/2016] [Indexed: 12/20/2022] Open
Abstract
An important message taken from human genome sequencing projects is that the human population exhibits approximately 99.9% genetic similarity. Variations in the remaining parts of the genome determine our identity, trace our history and reveal our heritage. The precise delineation of phenotypically causal variants plays a key role in providing accurate personalized diagnosis, prognosis, and treatment of inherited diseases. Several computational methods for achieving such delineation have been reported recently. However, their ability to pinpoint potentially deleterious variants is limited by the fact that their mechanisms of prediction do not account for the existence of different categories of variants. Consequently, their output is biased towards the variant categories that are most strongly represented in the variant databases. Moreover, most such methods provide numeric scores but not binary predictions of the deleteriousness of variants or confidence scores that would be more easily understood by users. We have constructed three datasets covering different types of disease-related variants, which were divided across five categories: (i) regulatory, (ii) splicing, (iii) missense, (iv) synonymous, and (v) nonsense variants. These datasets were used to develop category-optimal decision thresholds and to evaluate six tools for variant prioritization: CADD, DANN, FATHMM, FitCons, FunSeq2 and GWAVA. This evaluation revealed some important advantages of the category-based approach. The results obtained with the five best-performing tools were then combined into a consensus score. Additional comparative analyses showed that in the case of missense variations, protein-based predictors perform better than DNA sequence-based predictors. A user-friendly web interface was developed that provides easy access to the five tools’ predictions, and their consensus scores, in a user-understandable format tailored to the specific features of different categories of variations. 
To enable comprehensive evaluation of variants, the predictions are complemented with annotations from eight databases. The web server is freely available to the community at http://loschmidt.chemi.muni.cz/predictsnp2.
Affiliation(s)
- Jaroslav Bendl
- Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment RECETOX, Masaryk University, Brno, Czech Republic
- Department of Information Systems, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Brno, Czech Republic
| | - Miloš Musil
- Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment RECETOX, Masaryk University, Brno, Czech Republic
- Department of Information Systems, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic
| | - Jan Štourač
- Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment RECETOX, Masaryk University, Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Brno, Czech Republic
| | - Jaroslav Zendulka
- Department of Information Systems, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic
| | - Jiří Damborský
- Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment RECETOX, Masaryk University, Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Brno, Czech Republic
| | - Jan Brezovský
- Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment RECETOX, Masaryk University, Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Brno, Czech Republic
|
15
|
Liu M, Watson LT, Zhang L. HMMvar-func: a new method for predicting the functional outcome of genetic variants. BMC Bioinformatics 2015; 16:351. [PMID: 26518340 PMCID: PMC4628267 DOI: 10.1186/s12859-015-0781-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 05/08/2015] [Accepted: 10/16/2015] [Indexed: 11/14/2022] Open
Abstract
Background: Numerous tools have been developed to predict the fitness effects (i.e., neutral, deleterious, or beneficial) of genetic variants on corresponding proteins. However, predicting whether a variant causes the variant-bearing protein to lose its original function or gain a new function is also needed to better understand how the variant contributes to disease or cancer. To address this problem, the present work introduces and computationally defines four types of functional outcome of a variant: gain, loss, switch, and conservation of function. The deployment of multiple hidden Markov models is proposed to computationally classify mutations by the four functional impact types. Results: The functional outcome is predicted for over a hundred thyroid stimulating hormone receptor (TSHR) mutations, as well as cancer-related mutations in oncogenes or tumor suppressor genes. The results show that the proposed computational method is effective in fine-grained prediction of the functional outcome of a mutation, and can be used to help elucidate the molecular mechanism of mutations that cause disease or cancer. The program is freely available at http://bioinformatics.cs.vt.edu/zhanglab/HMMvar/download.php. Conclusion: This work is the first to computationally define and predict the functional impact of mutations as loss, switch, gain, or conservation of function. These fine-grained predictions can be especially useful for identifying mutations that cause or are linked to cancer.
Affiliation(s)
- Mingming Liu
- Department of Computer Science, Virginia Polytechnic Institute & State University, Blacksburg, USA.
| | - Layne T Watson
- Department of Computer Science, Virginia Polytechnic Institute & State University, Blacksburg, USA; Department of Mathematics, Virginia Polytechnic Institute & State University, Blacksburg, USA; Department of Aerospace and Ocean Engineering, Virginia Polytechnic Institute & State University, Blacksburg, USA.
| | - Liqing Zhang
- Department of Computer Science, Virginia Polytechnic Institute & State University, Blacksburg, USA.
|
16
|
Francioli LC, Polak PP, Koren A, Menelaou A, Chun S, Renkens I, van Duijn CM, Swertz M, Wijmenga C, van Ommen G, Slagboom PE, Boomsma DI, Ye K, Guryev V, Arndt PF, Kloosterman WP, de Bakker PIW, Sunyaev SR. Genome-wide patterns and properties of de novo mutations in humans. Nat Genet 2015; 47:822-826. [PMID: 25985141 PMCID: PMC4485564 DOI: 10.1038/ng.3292] [Citation(s) in RCA: 252] [Impact Index Per Article: 28.0] [Received: 10/06/2014] [Accepted: 04/07/2015] [Indexed: 12/12/2022]
Abstract
Mutations create variation in the population, fuel evolution, and cause genetic diseases. Current knowledge about de novo mutations is incomplete and mostly indirect [1–10]. Here, we analyze 11,020 de novo mutations from whole genomes of 250 families. We show that de novo mutations in offspring of older fathers are not only more numerous [11–13] but also occur more frequently in early-replicating, genic regions. Functional regions exhibit higher mutation rates due to CpG dinucleotides and reveal signatures of transcription-coupled repair, while mutation clusters with a unique signature point to a novel mutational mechanism. Mutation and recombination rates independently associate with nucleotide diversity, and regional variation in human-chimpanzee divergence is only partly explained by mutation rate heterogeneity. Finally, we provide a genome-wide mutation rate map for medical and population genetics applications. Our results reveal novel insights and refine long-standing hypotheses about human mutagenesis.
Affiliation(s)
- Laurent C Francioli
- Department of Medical Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, The Netherlands
| | - Paz P Polak
- Division of Genetics, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - Amnon Koren
- Department of Genetics, Harvard Medical School, Boston, MA, USA
| | - Androniki Menelaou
- Department of Medical Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, The Netherlands
| | - Sung Chun
- Division of Genetics, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - Ivo Renkens
- Department of Medical Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, The Netherlands
| | - Morris Swertz
- University of Groningen, University Medical Center Groningen, Department of Genetics, Groningen, The Netherlands; University of Groningen, University Medical Center Groningen, Genomics Coordination Center, Groningen, The Netherlands
| | - Cisca Wijmenga
- University of Groningen, University Medical Center Groningen, Department of Genetics, Groningen, The Netherlands; University of Groningen, University Medical Center Groningen, Genomics Coordination Center, Groningen, The Netherlands
| | - Gertjan van Ommen
- Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
| | - P Eline Slagboom
- Section of Molecular Epidemiology, Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, The Netherlands
| | - Dorret I Boomsma
- Department of Biological Psychology, VU University Amsterdam, Amsterdam, The Netherlands
| | - Kai Ye
- Section of Molecular Epidemiology, Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, The Netherlands; The Genome Institute, Washington University, St. Louis, MO, USA
| | - Victor Guryev
- European Research Institute for the Biology of Ageing, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
| | - Peter F Arndt
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - Wigard P Kloosterman
- Department of Medical Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, The Netherlands
| | - Paul I W de Bakker
- Department of Medical Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht, The Netherlands; Department of Epidemiology, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, The Netherlands
| | - Shamil R Sunyaev
- Division of Genetics, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
|
17
|
Gulko B, Hubisz MJ, Gronau I, Siepel A. A method for calculating probabilities of fitness consequences for point mutations across the human genome. Nat Genet 2015; 47:276-83. [PMID: 25599402 PMCID: PMC4342276 DOI: 10.1038/ng.3196] [Citation(s) in RCA: 181] [Impact Index Per Article: 20.1] [Received: 08/13/2014] [Accepted: 12/19/2014] [Indexed: 12/17/2022]
Abstract
We describe a novel computational method for estimating the probability that a point mutation at each position in a genome will influence fitness. These fitness consequence (fitCons) scores serve as evolution-based measures of potential genomic function. Our approach is to cluster genomic positions into groups exhibiting distinct “fingerprints” based on high-throughput functional genomic data, then to estimate a probability of fitness consequences for each group from associated patterns of genetic polymorphism and divergence. We have generated fitCons scores for three human cell types based on public data from ENCODE. Compared with conventional conservation scores, fitCons scores show considerably improved prediction power for cis-regulatory elements. In addition, fitCons scores indicate that 4.2–7.5% of nucleotides in the human genome have influenced fitness since the human-chimpanzee divergence, and they suggest that recent evolutionary turnover has had limited impact on the functional content of the genome.
Affiliation(s)
- Brad Gulko
- Graduate Field of Computer Science, Cornell University, Ithaca, New York, USA
| | - Melissa J Hubisz
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, USA
| | - Ilan Gronau
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, USA
| | - Adam Siepel
- [1] Graduate Field of Computer Science, Cornell University, Ithaca, New York, USA. [2] Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, USA
|
18
|
Taher L, Narlikar L, Ovcharenko I. Identification and computational analysis of gene regulatory elements. Cold Spring Harb Protoc 2015; 2015:pdb.top083642. [PMID: 25561628 PMCID: PMC5885252 DOI: 10.1101/pdb.top083642] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 06/04/2023]
Abstract
Over the last two decades, advances in experimental and computational technologies have greatly facilitated genomic research. Next-generation sequencing technologies have made de novo sequencing of large genomes affordable, and powerful computational approaches have enabled accurate annotations of genomic DNA sequences. Charting functional regions in genomes must account for not only the coding sequences, but also noncoding RNAs, repetitive elements, chromatin states, epigenetic modifications, and gene regulatory elements. A mix of comparative genomics, high-throughput biological experiments, and machine learning approaches has played a major role in this truly global effort. Here we describe some of these approaches and provide an account of our current understanding of the complex landscape of the human genome. We also present overviews of different publicly available, large-scale experimental data sets and computational tools, which we hope will prove beneficial for researchers working with large and complex genomes.
Affiliation(s)
- Leila Taher
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
- Institute for Biostatistics and Informatics in Medicine and Ageing Research, University of Rostock, 18051 Rostock, Germany
| | - Leelavati Narlikar
- Chemical Engineering and Process Development Division, National Chemical Laboratory, CSIR, Pune 411008, India
| | - Ivan Ovcharenko
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
|
19
|
Modolo L, Picard F, Lerat E. A new genome-wide method to track horizontally transferred sequences: application to Drosophila. Genome Biol Evol 2015; 6:416-32. [PMID: 24497602 PMCID: PMC3942030 DOI: 10.1093/gbe/evu026] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Indexed: 12/19/2022] Open
Abstract
Because of methodological breakthroughs and the availability of an increasing amount of whole-genome sequence data, horizontal transfers (HTs) in eukaryotes have received much attention recently. In contrast to similar analyses in prokaryotes, most studies in eukaryotes investigate particular sequences corresponding to transposable elements (TEs), neglecting the other components of the genome. We present a new methodological framework for the genome-wide detection of all putative horizontally transferred sequences between two species that requires no prior knowledge of the transferred sequences. This method provides a broader picture of HTs in eukaryotes by fully exploiting complete-genome sequence data. In contrast to previous genome-wide approaches, we used a well-defined statistical framework to control for the number of false positives in the results, and we propose two new validation procedures to control for confounding factors. The first validation procedure relies on a comparative analysis with other species of the phylogeny to validate HTs for the nonrepeated sequences detected, whereas the second one builds upon the study of the dynamics of the detected TEs. We applied our method to two closely related Drosophila species, Drosophila melanogaster and D. simulans, in which we discovered 10 new HTs in addition to all the HTs previously detected in different studies, which underscores our method’s high sensitivity and specificity. Our results favor the hypothesis of multiple independent HTs of TEs while unraveling a small portion of the network of HTs in the Drosophila phylogeny.
Affiliation(s)
- Laurent Modolo
- Université de Lyon, France, Université Lyon 1, CNRS, UMR 5558, Laboratoire de Biométrie et Biologie Evolutive, Villeurbanne, France
|
20
|
Abstract
Identifying sequence variants that play a mechanistic role in human disease and other phenotypes is a fundamental goal in human genetics and will be important in translating the results of variation studies. Experimental validation to confirm that a variant causes the biochemical changes responsible for a given disease or phenotype is considered the gold standard, but this cannot currently be applied to the 3 million or so variants expected in an individual genome. This has prompted the development of a wide variety of computational approaches that use several different sources of information to identify functional variation. Here, we review and assess the limitations of computational techniques for categorizing variants according to functional classes, prioritizing variants for experimental follow-up and generating hypotheses about the possible molecular mechanisms to inform downstream experiments. We discuss the main current bioinformatics approaches to identifying functional variation, including widely used algorithms for coding variation such as SIFT and PolyPhen and also novel techniques for interpreting variation across the genome.
Affiliation(s)
- Graham RS Ritchie
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD UK
- Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA UK
| | - Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD UK
- Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA UK
|
21
|
Dib L, Silvestro D, Salamin N. Evolutionary footprint of coevolving positions in genes. Bioinformatics 2014; 30:1241-9. [DOI: 10.1093/bioinformatics/btu012] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Indexed: 01/18/2023] Open
|
22
|
Liu M, Watson LT, Zhang L. Quantitative prediction of the effect of genetic variation using hidden Markov models. BMC Bioinformatics 2014; 15:5. [PMID: 24405700 PMCID: PMC3893606 DOI: 10.1186/1471-2105-15-5] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 09/02/2013] [Accepted: 01/02/2014] [Indexed: 11/10/2022] Open
Abstract
Background: With the development of sequencing technologies, more and more sequence variants are available for investigation. Different classes of variants in the human genome have been identified, including single nucleotide substitutions, insertion and deletion, and large structural variations such as duplications and deletions. Insertion and deletion (indel) variants comprise a major proportion of human genetic variation. However, little is known about their effects on humans. The absence of understanding is largely due to the lack of both biological data and computational resources. Results: This paper presents a new indel functional prediction method HMMvar based on HMM profiles, which capture the conservation information in sequences. The results demonstrate that a scoring strategy based on HMM profiles can achieve good performance in identifying deleterious or neutral variants for different data sets, and can predict the protein functional effects of both single and multiple mutations. Conclusions: This paper proposed a quantitative prediction method, HMMvar, to predict the effect of genetic variation using hidden Markov models. The HMM based pipeline program implementing the method HMMvar is freely available at https://bioinformatics.cs.vt.edu/zhanglab/hmm.
Affiliation(s)
- Liqing Zhang
- Department of Computer Science, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA.
|
23
|
Peterson TA, Doughty E, Kann MG. Towards precision medicine: advances in computational approaches for the analysis of human variants. J Mol Biol 2013; 425:4047-63. [PMID: 23962656 PMCID: PMC3807015 DOI: 10.1016/j.jmb.2013.08.008] [Citation(s) in RCA: 93] [Impact Index Per Article: 8.5] [Received: 06/04/2013] [Revised: 08/07/2013] [Accepted: 08/08/2013] [Indexed: 12/26/2022]
Abstract
Variations and similarities in our individual genomes are part of our history, our heritage, and our identity. Some human genomic variants are associated with common traits such as hair and eye color, while others are associated with susceptibility to disease or response to drug treatment. Identifying the human variations producing clinically relevant phenotypic changes is critical for providing accurate and personalized diagnosis, prognosis, and treatment for diseases. Furthermore, a better understanding of the molecular underpinning of disease can lead to development of new drug targets for precision medicine. Several resources have been designed for collecting and storing human genomic variations in highly structured, easily accessible databases. Unfortunately, a vast amount of information about these genetic variants and their functional and phenotypic associations is currently buried in the literature, only accessible by manual curation or sophisticated text-mining technology to extract the relevant information. In addition, the low cost of sequencing technologies coupled with increasing computational power has enabled the development of numerous computational methodologies to predict the pathogenicity of human variants. This review provides a detailed comparison of current human variant resources, including HGMD, OMIM, ClinVar, and UniProt/Swiss-Prot, followed by an overview of the computational methods and techniques used to leverage the available data to predict novel deleterious variants. We expect these resources and tools to become the foundation for understanding the molecular details of genomic variants leading to disease, which in turn will enable the promise of precision medicine.
Affiliation(s)
- Thomas A Peterson
- Department of Biological Sciences, University of Maryland, Baltimore County, 1000 Hilltop Circle, Baltimore, MD 21250, USA
| | - Emily Doughty
- Biomedical Informatics Program, Stanford University, Stanford, CA 94305, USA
| | - Maricel G Kann
- Department of Biological Sciences, University of Maryland, Baltimore County, 1000 Hilltop Circle, Baltimore, MD 21250, USA
|
24
|
Eren AM, Maignien L, Sul WJ, Murphy LG, Grim SL, Morrison HG, Sogin ML. Oligotyping: Differentiating between closely related microbial taxa using 16S rRNA gene data. Methods Ecol Evol 2013; 4. [PMID: 24358444 PMCID: PMC3864673 DOI: 10.1111/2041-210x.12114] [Citation(s) in RCA: 423] [Impact Index Per Article: 38.5] [Indexed: 11/27/2022]
Abstract
Bacteria comprise the most diverse domain of life on Earth, where they occupy nearly every possible ecological niche and play key roles in biological and chemical processes. Studying the composition and ecology of bacterial ecosystems and understanding their function are of prime importance. High-throughput sequencing technologies enable nearly comprehensive descriptions of bacterial diversity through 16S ribosomal RNA gene amplicons. Analyses of these communities generally rely upon taxonomic assignments through reference databases or clustering approaches using de facto sequence similarity thresholds to identify operational taxonomic units. However, these methods often fail to resolve ecologically meaningful differences between closely related organisms in complex microbial data sets. In this paper, we describe oligotyping, a novel supervised computational method that allows researchers to investigate the diversity of closely related but distinct bacterial organisms in final operational taxonomic units identified in environmental data sets through 16S ribosomal RNA gene data by the canonical approaches. Our analysis of two data sets from two different environments demonstrates the capacity of oligotyping to discriminate distinct microbial populations of ecological importance. Oligotyping can resolve the distribution of closely related organisms across environments and unveil previously overlooked ecological patterns for microbial communities. The URL http://oligotyping.org offers an open-source software pipeline for oligotyping.
Affiliation(s)
- A Murat Eren
- Josephine Bay Paul Center for Comparative Molecular Biology and Evolution, Marine Biological Laboratory, Woods Hole, MA 02543 USA
- Loïs Maignien
- Josephine Bay Paul Center for Comparative Molecular Biology and Evolution, Marine Biological Laboratory, Woods Hole, MA 02543 USA
- Woo Jun Sul
- Josephine Bay Paul Center for Comparative Molecular Biology and Evolution, Marine Biological Laboratory, Woods Hole, MA 02543 USA
- Leslie G Murphy
- Josephine Bay Paul Center for Comparative Molecular Biology and Evolution, Marine Biological Laboratory, Woods Hole, MA 02543 USA
- Sharon L Grim
- Josephine Bay Paul Center for Comparative Molecular Biology and Evolution, Marine Biological Laboratory, Woods Hole, MA 02543 USA
- Hilary G Morrison
- Josephine Bay Paul Center for Comparative Molecular Biology and Evolution, Marine Biological Laboratory, Woods Hole, MA 02543 USA
- Mitchell L Sogin
- Josephine Bay Paul Center for Comparative Molecular Biology and Evolution, Marine Biological Laboratory, Woods Hole, MA 02543 USA
|
25
|
Dorn C, Grunert M, Sperling SR. Application of high-throughput sequencing for studying genomic variations in congenital heart disease. Brief Funct Genomics 2013; 13:51-65. [PMID: 24095982 DOI: 10.1093/bfgp/elt040] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Indexed: 12/30/2022] Open
Abstract
Congenital heart diseases (CHD) represent the most common birth defect in humans. The majority of cases are caused by a combination of complex genetic alterations and environmental influences. In the past, many disease-causing mutations have been identified; however, there is still a large proportion of cardiac malformations with unknown precise origin. High-throughput sequencing technologies established in recent years offer novel opportunities to further study the genetic background underlying the disease. In this review, we provide a roadmap for designing and analyzing high-throughput sequencing studies focused on CHD, but also with general applicability to other complex diseases. The three main next-generation sequencing (NGS) platforms including their particular advantages and disadvantages are presented. To identify potentially disease-related genomic variations and genes, different filtering steps and gene prioritization strategies are discussed. In addition, available control datasets based on NGS are summarized. Finally, we provide an overview of current studies already using NGS technologies and showing that these techniques will help to further unravel the complex genetics underlying CHD.
Affiliation(s)
- Cornelia Dorn
- Department of Cardiovascular Genetics, Experimental and Clinical Research Center (ECRC), Charité-University Medicine Berlin and Max Delbrück Center (MDC) for Molecular Medicine, Lindenberger Weg 80, 13125 Berlin, Germany. Department of Biochemistry, Free University Berlin, Berlin, Germany. Tel.: +49-(0)30-450540123; Fax: +49-(0)30-84131699;
|
26
|
Effect of genetic regions on the correlation between single point mutation variability and morbidity. Comput Biol Med 2013; 43:594-9. [DOI: 10.1016/j.compbiomed.2013.01.017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/09/2011] [Revised: 07/27/2012] [Accepted: 01/19/2013] [Indexed: 11/19/2022]
|
27
|
Kenigsberg E, Tanay A. Drosophila functional elements are embedded in structurally constrained sequences. PLoS Genet 2013; 9:e1003512. [PMID: 23750124 PMCID: PMC3671938 DOI: 10.1371/journal.pgen.1003512] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 07/31/2012] [Accepted: 03/04/2013] [Indexed: 12/22/2022] Open
Abstract
Modern functional genomics uncovered numerous functional elements in metazoan genomes. Nevertheless, only a small fraction of the typical non-exonic genome contains elements that code for function directly. On the other hand, a much larger fraction of the genome is associated with significant evolutionary constraints, suggesting that much of the non-exonic genome is weakly functional. Here we show that in flies, local (30–70 bp) conserved sequence elements that are associated with multiple regulatory functions serve as focal points to a pattern of punctuated regional increase in G/C nucleotide frequencies. We show that this pattern, which covers a region tenfold larger than the conserved elements themselves, is an evolutionary consequence of a shift in the balance between gain and loss of G/C nucleotides and that it is correlated with nucleosome occupancy across multiple classes of epigenetic state. Evidence for compensatory evolution and analysis of SNP allele frequencies show that the evolutionary regime underlying this balance shift is likely to be non-neutral. These data suggest that current gaps in our understanding of genome function and evolutionary dynamics are explicable by a model of sparse sequence elements directly encoding for function, embedded into structural sequences that help to define the local and global epigenomic context of such functional elements. A key challenge in functional genomics is to predict evolutionary dynamics from functional annotation of the genome and vice versa. Modern epigenomic studies helped assign function to numerous new sequence elements, but left most of the genome essentially uncharacterized. Evolutionary genomics, on the other hand, consistently suggests that a much larger fraction of the un-annotated genome evolves under selective pressure. 
We hypothesize that this function-selection gap can be attributed to sequences that facilitate the physical organization of functional elements, such as transcription factor binding sites, within chromosomes. We exemplify this by studying in detail the sequences embedding small conserved elements (CEs) in Drosophila. We show that, while CEs have typically high AT content, high GC content levels around them are maintained by a non-neutral evolutionary balance between gain and loss of GC nucleotides. This non-uniform pattern is highly correlated with nucleosome organization around CEs, potentially imposing an evolutionary constraint on as much as one quarter of the genome. We suggest this can at least partly explain the above function-selection gap. Weak evolutionary constraints on “structural” sequences (at scales ranging from one nucleosome to recently described multi-megabase topological domains) may affect genome evolution just like structural motifs shape protein evolution.
Affiliation(s)
- Ephraim Kenigsberg
- Department of Computer Science and Applied Mathematics and Department of Biological Regulation, Weizmann Institute, Rehovot, Israel
- Amos Tanay
- Department of Computer Science and Applied Mathematics and Department of Biological Regulation, Weizmann Institute, Rehovot, Israel
|
28
|
Mutational signatures of de-differentiation in functional non-coding regions of melanoma genomes. PLoS Genet 2012; 8:e1002871. [PMID: 22912592 PMCID: PMC3415438 DOI: 10.1371/journal.pgen.1002871] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 12/16/2011] [Accepted: 06/11/2012] [Indexed: 11/23/2022] Open
Abstract
Much emphasis has been placed on the identification, functional characterization, and therapeutic potential of somatic variants in tumor genomes. However, the majority of somatic variants lie outside coding regions and their role in cancer progression remains to be determined. In order to establish a system to test the functional importance of non-coding somatic variants in cancer, we created a low-passage cell culture of a metastatic melanoma tumor sample. As a foundation for interpreting functional assays, we performed whole-genome sequencing and analysis of this cell culture, the metastatic tumor from which it was derived, and the patient-matched normal genomes. When comparing somatic mutations identified in the cell culture and tissue genomes, we observe concordance at the majority of single nucleotide variants, whereas copy number changes are more variable. To understand the functional impact of non-coding somatic variation, we leveraged functional data generated by the ENCODE Project Consortium. We analyzed regulatory regions derived from multiple different cell types and found that melanocyte-specific regions are among the most depleted for somatic mutation accumulation. Significant depletion in other cell types suggests the metastatic melanoma cells de-differentiated to a more basal regulatory state. Experimental identification of genome-wide regulatory sites in two different melanoma samples supports this observation. Together, these results show that mutation accumulation in metastatic melanoma is nonrandom across the genome and that a de-differentiated regulatory architecture is common among different samples. Our findings enable identification of the underlying genetic components of melanoma and define the differences between a tissue-derived tumor sample and the cell culture created from it. Such information helps establish a broader mechanistic understanding of the linkage between non-coding genomic variations and the cellular evolution of cancer. 
Here we investigate the relationship between somatic variants and non-coding regulatory regions. To do this, we develop a new algorithm for identifying single nucleotide somatic variants in whole-genome sequencing data and apply it to a metastatic melanoma sample and a cell culture derived from this sample. Our results show that the two genomes are similar at the level of single nucleotide changes and more variable at larger copy number changes. We further observe that patterns of somatic mutation accumulation in non-coding regulatory regions suggest that the metastatic melanoma cells de-differentiated into a more basal regulatory state. That is, by simply looking at mutation accumulation across cell-type-specific non-coding functional regions, one can clearly see patterns that are indicative of cell state de-differentiation. Results from genome-wide functional regulatory region experimental mapping support this observation.
|
29
|
Capriotti E, Nehrt NL, Kann MG, Bromberg Y. Bioinformatics for personal genome interpretation. Brief Bioinform 2012; 13:495-512. [PMID: 22247263 DOI: 10.1093/bib/bbr070] [Citation(s) in RCA: 55] [Impact Index Per Article: 4.6] [Indexed: 01/02/2023] Open
Abstract
An international consortium released the first draft sequence of the human genome 10 years ago. Although the analysis of this data has suggested the genetic underpinnings of many diseases, we have not yet been able to fully quantify the relationship between genotype and phenotype. Thus, a major current effort of the scientific community focuses on evaluating individual predispositions to specific phenotypic traits given their genetic backgrounds. Many resources aim to identify and annotate the specific genes responsible for the observed phenotypes. Some of these use intra-species genetic variability as a means for better understanding this relationship. In addition, several online resources are now dedicated to collecting single nucleotide variants and other types of variants, and annotating their functional effects and associations with phenotypic traits. This information has enabled researchers to develop bioinformatics tools to analyze the rapidly increasing amount of newly extracted variation data and to predict the effect of uncharacterized variants. In this work, we review the most important developments in the field--the databases and bioinformatics tools that will be of utmost importance in our concerted effort to interpret the human variome.
Affiliation(s)
- Emidio Capriotti
- Department of Mathematics and Computer Science, University of Balearic Islands, ctra. de Valldemossa Km 7.5, Palma de Mallorca, 07122 Spain.
|
30
|
Young JM, Luche RM, Trask BJ. Rigorous and thorough bioinformatic analyses of olfactory receptor promoters confirm enrichment of O/E and homeodomain binding sites but reveal no new common motifs. BMC Genomics 2011; 12:561. [PMID: 22085861 PMCID: PMC3247239 DOI: 10.1186/1471-2164-12-561] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 07/22/2011] [Accepted: 11/15/2011] [Indexed: 12/02/2022] Open
Abstract
Background: Mammalian olfactory receptors (ORs) are subject to a remarkable but poorly understood regime of transcriptional regulation, whereby individual olfactory neurons each express only one allele of a single member of the large OR gene family. Results: We performed a rigorous search for enriched sequence motifs in the largest dataset of OR promoter regions analyzed to date. We combined measures of cross-species conservation with databases of known transcription factor binding sites and ab initio motif-finding algorithms. We found strong enrichment of binding sites for the O/E family of transcription factors and for homeodomain factors, both already known to be involved in the transcriptional control of ORs, but did not identify any novel enriched sequences. We also found that TATA-boxes are present in at least a subset of OR promoters. Conclusions: Our rigorous approach provides a template for the analysis of the regulation of large gene families and demonstrates some of the difficulties and pitfalls of such analyses. Although currently available bioinformatics methods cannot detect all transcriptional regulatory elements, our thorough analysis of OR promoters shows that in the case of this gene family, experimental approaches have probably already identified all the binding factors common to large fractions of OR promoters.
Affiliation(s)
- Janet M Young
- Division of Human Biology, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA.
|
31
|
Sadri J, Diallo AB, Blanchette M. Predicting site-specific human selective pressure using evolutionary signatures. Bioinformatics 2011; 27:i266-74. [PMID: 21685080 PMCID: PMC3117352 DOI: 10.1093/bioinformatics/btr241] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/23/2022] Open
Abstract
Motivation: The identification of non-coding functional regions of the human genome remains one of the main challenges of genomics. By observing how a given region evolved over time, one can detect signs of negative or positive selection hinting that the region may be functional. With the quickly increasing number of vertebrate genomes to compare with our own, this type of approach is set to become extremely powerful, provided the right analytical tools are available. Results: A large number of approaches have been proposed to measure signs of past selective pressure, usually in the form of reduced mutation rate. Here, we propose a radically different approach to the detection of non-coding functional regions: instead of measuring past evolutionary rates, we build a machine learning classifier to predict current substitution rates in humans based on the inferred evolutionary events that affected the region during vertebrate evolution. We show that different types of evolutionary events, occurring along different branches of the phylogenetic tree, bring very different amounts of information. We propose a number of simple machine learning classifiers and show that a Support-Vector Machine (SVM) predictor clearly outperforms existing tools at predicting human non-coding functional sites. Comparison to external evidence of selection and regulatory function confirms that these SVM predictions are more accurate than those of other approaches. Availability: The predictor and predictions made are available at http://www.mcb.mcgill.ca/~blanchem/sadri. Contact: blanchem@mcb.mcgill.ca
Affiliation(s)
- Javad Sadri
- School of Computer Science, McGill University, 3630 University, Montreal, QC, Canada H3A 2B2
|
32
|
Ponting CP, Nellåker C, Meader S. Rapid turnover of functional sequence in human and other genomes. Annu Rev Genomics Hum Genet 2011; 12:275-99. [PMID: 21721940 DOI: 10.1146/annurev-genom-090810-183115] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Indexed: 11/09/2022]
Abstract
The amount of a genome's sequence that is functional has been surprisingly difficult to estimate accurately. This has severely hindered analyses asking whether the amount of functional genomic sequence correlates with organismal complexity. Most studies estimate these amounts by considering nucleotide substitution rates within aligned sequences. These approaches show reduced power to identify sequence that is aligned, functional, and constrained only within narrowly defined phyla. The neutral indel model exploits insertions or deletions (indels) rather than substitutions in predicting functional sequence. Surprisingly, this method indicates that half of all functional sequence is specific to individual eutherian lineages. This review considers the rates at which coding or noncoding and functional or nonfunctional sequence changes among mammalian genomes. In contrast to the slow rate at which protein-coding sequence changes, functional noncoding sequence appears to change or be turned over at rapid rates in mammals.
Affiliation(s)
- Chris P Ponting
- Medical Research Council Functional Genomics Unit, Department of Physiology, Anatomy, and Genetics, University of Oxford, Oxford OX1 3QX, United Kingdom.
|
33
|
Stitziel NO, Kiezun A, Sunyaev S. Computational and statistical approaches to analyzing variants identified by exome sequencing. Genome Biol 2011; 12:227. [PMID: 21920052 PMCID: PMC3308043 DOI: 10.1186/gb-2011-12-9-227] [Citation(s) in RCA: 99] [Impact Index Per Article: 7.6] [Indexed: 01/30/2023] Open
Abstract
New sequencing technology has enabled the identification of thousands of single nucleotide polymorphisms in the exome, and many computational and statistical approaches to identify disease-association signals have emerged.
Affiliation(s)
- Nathan O Stitziel
- Division of Cardiovascular Medicine, Brigham and Women’s Hospital, Harvard Medical School, 75 Francis Street, Boston, MA 02115, USA
|
34
|
Abstract
Many evolutionary studies over the past decade have estimated α(sel), the proportion of all nucleotides in the human genome that are subject to purifying selection because of their biological function. Most of these studies have estimated the nucleotide substitution rates from genome sequence alignments across many diverse mammals. Some α(sel) estimates will be affected by the heterogeneity of substitution rates in neutral sequence across the genome. Most will also be inaccurate if change in the functional sequence repertoire occurs rapidly relative to the separation of lineages that are being compared. Evidence gathered from both evolutionary and experimental analyses now indicates that rates of "turnover" of functional, predominantly noncoding, sequence are, indeed, high. They are sufficiently high that an estimated 50% of mouse constrained noncoding sequence is predicted not to be shared with rat, a closely related rodent. The rapidity of turnover results in, at least, a twofold underestimate of α(sel) by analyses that measure constraint across the eutherian phylogeny. Approaches that take account of turnover estimate that the steady-state value of α(sel) lies between 10% and 15%. Experimental studies corroborate the predicted rates of loss and gain of noncoding functional sites. These studies show the limitations inherent in the use of deep sequence conservation for identifying functional sequence. Experimental investigations focusing on lineage-specific, noncoding, and functional sequence are now essential if we are to appreciate the complete functional repertoire of the human genome.
Affiliation(s)
- Chris P Ponting
- MRC Functional Genomics Unit, Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford OX1 3QX, United Kingdom.
|
35
|
Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nat Rev Genet 2011; 12:628-40. [PMID: 21850043 DOI: 10.1038/nrg3046] [Citation(s) in RCA: 397] [Impact Index Per Article: 30.5] [Indexed: 02/07/2023]
Abstract
Genome and exome sequencing yield extensive catalogues of human genetic variation. However, pinpointing the few phenotypically causal variants among the many variants present in human genomes remains a major challenge, particularly for rare and complex traits wherein genetic information alone is often insufficient. Here, we review approaches to estimate the deleteriousness of single nucleotide variants (SNVs), which can be used to prioritize disease-causal variants. We describe recent advances in comparative and functional genomics that enable systematic annotation of both coding and non-coding variants. Application and optimization of these methods will be essential to find the genetic answers that sequencing promises to hide in plain sight.
|
36
|
Kiryu H. Sufficient statistics and expectation maximization algorithms in phylogenetic tree models. Bioinformatics 2011; 27:2346-53. [PMID: 21757463 DOI: 10.1093/bioinformatics/btr420] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Measuring evolutionary conservation is a routine step in the identification of functional elements in genome sequences. Although a number of studies have proposed methods that use the continuous time Markov models (CTMMs) to find evolutionarily constrained elements, their probabilistic structures have been less frequently investigated. RESULTS: In this article, we investigate a sufficient statistic for CTMMs. The statistic is composed of the fractional duration of nucleotide characters over evolutionary time, F(d), and the number of substitutions occurring in phylogenetic trees, N(s). We first derive basic properties of the sufficient statistic. Then, we derive an expectation maximization (EM) algorithm for estimating the parameters of a phylogenetic model, which iteratively computes the expectation values of the sufficient statistic. We show that the EM algorithm exhibits much faster convergence than other optimization methods that use numerical gradient descent algorithms. Finally, we investigate the genome-wide distribution of fractional duration time F(d) which, unlike the number of substitutions N(s), has rarely been investigated. We show that F(d) has evolutionary information that is distinct from that in N(s), which may be useful for detecting novel types of evolutionary constraints existing in the human genome. AVAILABILITY: The C++ source code of the 'Fdur' software is available at http://www.ncrna.org/software/fdur/ CONTACT: kiryu-h@k.u-tokyo.ac.jp SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hisanori Kiryu
- Department of Computational Biology, Faculty of Frontier Science, The University of Tokyo, Kashiwa, Chiba 277-8561, Japan.
|
37
|
Pertea M, Pertea GM, Salzberg SL. Detection of lineage-specific evolutionary changes among primate species. BMC Bioinformatics 2011; 12:274. [PMID: 21726447 PMCID: PMC3143108 DOI: 10.1186/1471-2105-12-274] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Received: 05/25/2011] [Accepted: 07/04/2011] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND: Comparison of the human genome with other primates offers the opportunity to detect evolutionary events that created the diverse phenotypes among the primate species. Because the primate genomes are highly similar to one another, methods developed for analysis of more divergent species do not always detect signs of evolutionary selection. RESULTS: We have developed a new method, called DivE, specifically designed to find regions that have evolved either more or less rapidly than expected, for any clade within a set of very closely related species. Unlike some previous methods, DivE does not rely on rates of synonymous and nonsynonymous substitution, which enables it to detect evolutionary events in noncoding regions. We demonstrate using simulated data that DivE compares favorably to alternative methods, and we then apply DivE to the ENCODE regions in 14 primate species. We identify thousands of regions in these primates, ranging from 50 to >10000 bp in length, that appear to have experienced either constrained or accelerated rates of evolution. In particular, we detected 4942 regions that have potentially undergone positive selection in one or more primate species. Most of these regions occur outside of protein-coding genes, although we identified 20 proteins that have experienced positive selection. CONCLUSIONS: DivE provides an easy-to-use method to predict both positive and negative selection in noncoding DNA and is particularly well-suited to detecting lineage-specific selection in large genomes.
Affiliation(s)
- Mihaela Pertea
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, USA
- Geo M Pertea
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, USA
- Steven L Salzberg
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
|
38
|
A user's guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol 2011; 9:e1001046. [PMID: 21526222 PMCID: PMC3079585 DOI: 10.1371/journal.pbio.1001046] [Citation(s) in RCA: 1082] [Impact Index Per Article: 83.2] [Received: 09/23/2010] [Accepted: 03/10/2011] [Indexed: 12/18/2022] Open
Abstract
The mission of the Encyclopedia of DNA Elements (ENCODE) Project is to enable the scientific and medical communities to interpret the human genome sequence and apply it to understand human biology and improve health. The ENCODE Consortium is integrating multiple technologies and approaches in a collective effort to discover and define the functional elements encoded in the human genome, including genes, transcripts, and transcriptional regulatory regions, together with their attendant chromatin states and DNA methylation patterns. In the process, standards to ensure high-quality data have been implemented, and novel algorithms have been developed to facilitate analysis. Data and derived results are made available through a freely accessible database. Here we provide an overview of the project and the resources it is generating and illustrate the application of ENCODE data to interpret the human genome.
|
39
|
A genome-wide comparison of the functional properties of rare and common genetic variants in humans. Am J Hum Genet 2011; 88:458-68. [PMID: 21457907 DOI: 10.1016/j.ajhg.2011.03.008] [Citation(s) in RCA: 80] [Impact Index Per Article: 6.2] [Received: 01/05/2011] [Revised: 03/01/2011] [Accepted: 03/14/2011] [Indexed: 01/31/2023] Open
Abstract
One of the longest running debates in evolutionary biology concerns the kind of genetic variation that is primarily responsible for phenotypic variation in species. Here, we address this question for humans specifically from the perspective of population allele frequency of variants across the complete genome, including both coding and noncoding regions. We establish simple criteria to assess the likelihood that variants are functional based on their genomic locations and then use whole-genome sequence data from 29 subjects of European origin to assess the relationship between the functional properties of variants and their population allele frequencies. We find that for all criteria used to assess the likelihood that a variant is functional, the rarer variants are significantly more likely to be functional than the more common variants. Strikingly, these patterns disappear when we focus on only those variants in which the major alleles are derived. These analyses indicate that the majority of the genetic variation in terms of phenotypic consequence may result from a mutation-selection balance, as opposed to balancing selection, and have direct relevance to the study of human disease.
|
40
|
Hubisz MJ, Pollard KS, Siepel A. PHAST and RPHAST: phylogenetic analysis with space/time models. Brief Bioinform 2010; 12:41-51. [PMID: 21278375 DOI: 10.1093/bib/bbq072] [Citation(s) in RCA: 321] [Impact Index Per Article: 22.9] [Indexed: 11/13/2022] Open
Abstract
The PHylogenetic Analysis with Space/Time models (PHAST) software package consists of a collection of command-line programs and supporting libraries for comparative genomics. PHAST is best known as the engine behind the Conservation tracks in the University of California, Santa Cruz (UCSC) Genome Browser. However, it also includes several other tools for phylogenetic modeling and functional element identification, as well as utilities for manipulating alignments, trees and genomic annotations. PHAST has been in development since 2002 and has now been downloaded more than 1000 times, but so far it has been released only as provisional ('beta') software. Here, we describe the first official release (v1.0) of PHAST, with improved stability, portability and documentation and several new features. We outline the components of the package and detail recent improvements. In addition, we introduce a new interface to the PHAST libraries from the R statistical computing environment, called RPHAST, and illustrate its use in a series of vignettes. We demonstrate that RPHAST can be particularly useful in applications involving both large-scale phylogenomics and complex statistical analyses. The R interface also makes the PHAST libraries accessible to non-C programmers, and is useful for rapid prototyping. PHAST v1.0 and RPHAST v1.0 are available for download at http://compgen.bscb.cornell.edu/phast, under the terms of an unrestrictive BSD-style license. RPHAST can also be obtained from the Comprehensive R Archive Network (CRAN; http://cran.r-project.org).
Affiliation(s)
- Melissa J Hubisz
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY 14853, USA.
|
41
|
Zhang L, Pei YF, Li J, Papasian CJ, Deng HW. Improved detection of rare genetic variants for diseases. PLoS One 2010; 5:e13857. [PMID: 21079782 PMCID: PMC2975623 DOI: 10.1371/journal.pone.0013857] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 06/10/2010] [Accepted: 09/30/2010] [Indexed: 11/18/2022] Open
Abstract
Technological advances have promoted gene-based sequencing studies with the aim of identifying rare mutations responsible for complex diseases. A complication in these types of association studies is that the vast majority of non-synonymous mutations are believed to be neutral to phenotypes. It is thus critical to distinguish potential causative variants from neutral variation before performing association tests. In this study, we used existing prediction algorithms to predict functional amino acid substitutions, and incorporated that information into association tests. Using simulations, we comprehensively studied the effects of several influential factors, including the sensitivity and specificity of functional variant predictions, number of variants, and proportion of causative variants, on the performance of association tests. Our results showed that incorporating information regarding functional variants obtained from existing prediction algorithms improves statistical power under certain conditions, particularly when the proportion of causative variants is moderate. The application of the proposed tests to a real sequencing study confirms our conclusions. Our work may help investigators who are planning to pursue gene-based sequencing studies.
Affiliation(s)
- Lei Zhang
- Center of System Biomedical Sciences, University of Shanghai for Science and Technology, Shanghai, People's Republic of China
- Key Laboratory of Biomedical Information Engineering, School of Life Science and Technology, Ministry of Education and Institute of Molecular Genetics, Xi'an Jiaotong University, Xi'an, Shaanxi, People's Republic of China
- Yu-Fang Pei
- Center of System Biomedical Sciences, University of Shanghai for Science and Technology, Shanghai, People's Republic of China
- Key Laboratory of Biomedical Information Engineering, School of Life Science and Technology, Ministry of Education and Institute of Molecular Genetics, Xi'an Jiaotong University, Xi'an, Shaanxi, People's Republic of China
- Jian Li
- School of Medicine, University of Missouri-Kansas City, Kansas City, Missouri, United States of America
- Christopher J. Papasian
- School of Medicine, University of Missouri-Kansas City, Kansas City, Missouri, United States of America
- Hong-Wen Deng
- Center of System Biomedical Sciences, University of Shanghai for Science and Technology, Shanghai, People's Republic of China
- School of Medicine, University of Missouri-Kansas City, Kansas City, Missouri, United States of America
- College of Life Sciences and Engineering, Beijing Jiao Tong University, Beijing, People's Republic of China
|
42
|
Meader S, Ponting CP, Lunter G. Massive turnover of functional sequence in human and other mammalian genomes. Genome Res 2010; 20:1335-43. [PMID: 20693480 DOI: 10.1101/gr.108795.110] [Citation(s) in RCA: 80] [Impact Index Per Article: 5.7] [Indexed: 02/06/2023]
Abstract
Despite the availability of dozens of animal genome sequences, two key questions remain unanswered: First, what fraction of any species' genome confers biological function, and second, are apparent differences in organismal complexity reflected in an objective measure of genomic complexity? Here, we address both questions by applying, across the mammalian phylogeny, an evolutionary model that estimates the amount of functional DNA that is shared between two species' genomes. Our main findings are, first, that as the divergence between mammalian species increases, the predicted amount of pairwise shared functional sequence drops off dramatically. We show by simulations that this is not an artifact of the method, but rather indicates that functional (and mostly noncoding) sequence is turning over at a very high rate. We estimate that between 200 and 300 Mb (∼6.5%-10%) of the human genome is under functional constraint, which includes five to eight times as many constrained noncoding bases as bases that code for protein. In contrast, in D. melanogaster we estimate only 56-66 Mb to be constrained, implying a ratio of noncoding to coding constrained bases of about 2. This suggests that, rather than genome size or protein-coding gene complement, it is the number of functional bases that might best mirror our naïve preconceptions of organismal complexity.
Affiliation(s)
- Stephen Meader
- MRC Functional Genomics Unit, Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford OX1 3QX, United Kingdom
|
43
|
Goode DL, Cooper GM, Schmutz J, Dickson M, Gonzales E, Tsai M, Karra K, Davydov E, Batzoglou S, Myers RM, Sidow A. Evolutionary constraint facilitates interpretation of genetic variation in resequenced human genomes. Genome Res 2010; 20:301-10. [PMID: 20067941 PMCID: PMC2840986 DOI: 10.1101/gr.102210.109] [Citation(s) in RCA: 73] [Impact Index Per Article: 5.2] [Received: 10/20/2009] [Accepted: 01/08/2010] [Indexed: 01/22/2023]
Abstract
Here, we demonstrate how comparative sequence analysis facilitates genome-wide base-pair-level interpretation of individual genetic variation and address two questions of importance for human personal genomics: first, whether an individual's functional variation comes mostly from noncoding or coding polymorphisms; and, second, whether population-specific or globally-present polymorphisms contribute more to functional variation in any given individual. Neither has been definitively answered by analyses of existing variation data because of a focus on coding polymorphisms, ascertainment biases in favor of common variation, and a lack of base-pair-level resolution for identifying functional variants. We resequenced 575 amplicons within 432 individuals at genomic sites enriched for evolutionary constraint and also analyzed variation within three published human genomes. We find that single-site measures of evolutionary constraint derived from mammalian multiple sequence alignments are strongly predictive of reductions in modern-day genetic diversity across a range of annotation categories and across the allele frequency spectrum from rare (<1%) to high frequency (>10% minor allele frequency). Furthermore, we show that putatively functional variation in an individual genome is dominated by polymorphisms that do not change protein sequence and that originate from our shared ancestral population and commonly segregate in human populations. These observations show that common, noncoding alleles contribute substantially to human phenotypes and that constraint-based analyses will be of value to identify phenotypically relevant variants in individual genomes.
Affiliation(s)
- David L Goode
- Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
|
44
|
Jaeger SA, Chan ET, Berger MF, Stottmann R, Hughes TR, Bulyk ML. Conservation and regulatory associations of a wide affinity range of mouse transcription factor binding sites. Genomics 2010; 95:185-95. [PMID: 20079828 DOI: 10.1016/j.ygeno.2010.01.002] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.9] [Received: 01/06/2010] [Accepted: 01/08/2010] [Indexed: 12/29/2022]
Abstract
Sequence-specific binding by transcription factors (TFs) interprets regulatory information encoded in the genome. Using recently published universal protein binding microarray (PBM) data on the in vitro DNA binding preferences of these proteins for all possible 8-base-pair sequences, we examined the evolutionary conservation and enrichment within putative regulatory regions of the binding sequences of a diverse library of 104 nonredundant mouse TFs spanning 22 different DNA-binding domain structural classes. We found that not only high affinity binding sites, but also numerous moderate and low affinity binding sites, are under negative selection in the mouse genome. These 8-mers occur preferentially in putative regulatory regions of the mouse genome, including CpG islands and non-exonic ultraconserved elements (UCEs). Of TFs whose PBM "bound" 8-mers are enriched within sets of tissue-specific UCEs, many are expressed in the same tissue(s) as the UCE-driven gene expression. Phylogenetically conserved motif occurrences of various TFs were also enriched in the noncoding sequence surrounding numerous gene sets corresponding to Gene Ontology categories and tissue-specific gene expression clusters, suggesting involvement in transcriptional regulation of those genes. Altogether, our results indicate that many of the sequences bound by these proteins in vitro, including lower affinity DNA sequences, are likely to be functionally important in vivo. This study not only provides an initial analysis of the potential regulatory associations of 104 mouse TFs, but also presents an approach for the functional analysis of TFs from any other metazoan genome as their DNA binding preferences are determined by PBMs or other technologies.
Affiliation(s)
- Savina A Jaeger
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
|
45
|
Oldmeadow C, Mengersen K, Mattick JS, Keith JM. Multiple evolutionary rate classes in animal genome evolution. Mol Biol Evol 2009; 27:942-53. [PMID: 19955480 DOI: 10.1093/molbev/msp299] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 12/19/2022] Open
Abstract
The proportion of functional sequence in the human genome is currently a subject of debate. The most widely accepted figure is that approximately 5% is under purifying selection. In Drosophila, estimates are an order of magnitude higher, though this corresponds to a similar quantity of sequence. These estimates depend on the difference between the distribution of genomewide evolutionary rates and that observed in a subset of sequences presumed to be neutrally evolving. Motivated by the widening gap between these estimates and experimental evidence of genome function, especially in mammals, we developed a sensitive technique for evaluating such distributions and found that they are much more complex than previously apparent. We found strong evidence for at least nine well-resolved evolutionary rate classes in an alignment of four Drosophila species and at least seven classes in an alignment of four mammals, including human. We also identified at least three rate classes in human ancestral repeats. By positing that the largest of these ancestral repeat classes is neutrally evolving, we estimate that the proportion of nonneutrally evolving sequence is 30% of human ancestral repeats and 45% of the aligned portion of the genome. However, we also question whether any of the classes represent neutrally evolving sequences and argue that a plausible alternative is that they reflect variable structure-function constraints operating throughout the genomes of complex organisms.
Affiliation(s)
- Christopher Oldmeadow
- School of Mathematical Sciences, Queensland University of Technology, Brisbane, QLD, Australia
|
46
|
Meireles-Filho ACA, Stark A. Comparative genomics of gene regulation-conservation and divergence of cis-regulatory information. Curr Opin Genet Dev 2009; 19:565-70. [PMID: 19913403 DOI: 10.1016/j.gde.2009.10.006] [Citation(s) in RCA: 68] [Impact Index Per Article: 4.5] [Received: 08/24/2009] [Revised: 10/06/2009] [Accepted: 10/06/2009] [Indexed: 01/13/2023]
Abstract
We recently witnessed a tremendous increase in genomics studies on gene regulation and in entirely sequenced genomes from closely related species. This has triggered analyses that suggest a wide range of evolutionary dynamics of gene regulation, from rapid turnover of transcription-factor binding sites to conservation of enhancer function across large evolutionary distances. Many examples show that enhancers can evolve beyond recognizable sequence similarity while retaining function. However, bioinformatics approaches are increasingly able to detect conserved regulatory elements through characteristic evolutionary sequence signatures. Cis-regulatory changes are also a major source of morphological evolution, which might be facilitated by many biochemically functional elements that are selectively neutral and by the buffering function of redundant enhancers and 'shadow' enhancers.
|
47
|
Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res 2009; 20:110-21. [PMID: 19858363 DOI: 10.1101/gr.097857.109] [Citation(s) in RCA: 1515] [Impact Index Per Article: 101.0] [Indexed: 11/25/2022]
Abstract
Methods for detecting nucleotide substitution rates that are faster or slower than expected under neutral drift are widely used to identify candidate functional elements in genomic sequences. However, most existing methods consider either reductions (conservation) or increases (acceleration) in rate but not both, or assume that selection acts uniformly across the branches of a phylogeny. Here we examine the more general problem of detecting departures from the neutral rate of substitution in either direction, possibly in a clade-specific manner. We consider four statistical, phylogenetic tests for addressing this problem: a likelihood ratio test, a score test, a test based on exact distributions of numbers of substitutions, and the genomic evolutionary rate profiling (GERP) test. All four tests have been implemented in a freely available program called phyloP. Based on extensive simulation experiments, these tests are remarkably similar in statistical power. With 36 mammalian species, they all appear to be capable of fairly good sensitivity with low false-positive rates in detecting strong selection at individual nucleotides, moderate selection in 3-bp elements, and weaker or clade-specific selection in longer elements. By applying phyloP to mammalian multiple alignments from the ENCODE project, we shed light on patterns of conservation/acceleration in known and predicted functional elements, approximate fractions of sites subject to constraint, and differences in clade-specific selection in the primate and glires clades. We also describe new "Conservation" tracks in the UCSC Genome Browser that display both phyloP and phastCons scores for genome-wide alignments of 44 vertebrate species.
Affiliation(s)
- Katherine S Pollard
- Gladstone Institutes, University of California, San Francisco, San Francisco, California 94158, USA.
|
48
|
Garber M, Guttman M, Clamp M, Zody MC, Friedman N, Xie X. Identifying novel constrained elements by exploiting biased substitution patterns. Bioinformatics 2009; 25:i54-62. [PMID: 19478016 PMCID: PMC2687944 DOI: 10.1093/bioinformatics/btp190] [Citation(s) in RCA: 248] [Impact Index Per Article: 16.5] [Indexed: 02/01/2023]
Abstract
Motivation: Comparing the genomes from closely related species provides a powerful tool to identify functional elements in a reference genome. Many methods have been developed to identify conserved sequences across species; however, existing methods only model conservation as a decrease in the rate of mutation and have ignored selection acting on the pattern of mutations. Results: We present a new approach that takes advantage of deeply sequenced clades to identify evolutionary selection by uncovering not only signatures of rate-based conservation but also substitution patterns characteristic of sequence undergoing natural selection. We describe a new statistical method for modeling biased nucleotide substitutions, a learning algorithm for inferring site-specific substitution biases directly from sequence alignments and a hidden Markov model for detecting constrained elements characterized by biased substitutions. We show that the new approach can identify significantly more degenerate constrained sequences than rate-based methods. Applying it to the ENCODE regions, we find that as much as 10.2% of these regions are under selection. Availability: The algorithms are implemented in a Java software package, called SiPhy, freely available at http://www.broadinstitute.org/science/software/. Contact: xhx@ics.uci.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Manuel Garber
- Department of Biology, Broad Institute of MIT and Harvard, 7 Cambridge Center, MIT, Cambridge, MA 02142, USA
|
49
|
Abstract
Each human carries a large number of deleterious mutations. Together, these mutations make a significant contribution to human disease. Identification of deleterious mutations within individual genome sequences could substantially impact an individual's health through personalized prevention and treatment of disease. Yet, distinguishing deleterious mutations from the massive number of nonfunctional variants that occur within a single genome is a considerable challenge. Using a comparative genomics data set of 32 vertebrate species we show that a likelihood ratio test (LRT) can accurately identify a subset of deleterious mutations that disrupt highly conserved amino acids within protein-coding sequences, which are likely to be unconditionally deleterious. The LRT is also able to identify known human disease alleles and performs as well as two commonly used heuristic methods, SIFT and PolyPhen. Application of the LRT to three human genomes reveals 796-837 deleterious mutations per individual, approximately 40% of which are estimated to be at <5% allele frequency. However, the overlap between predictions made by the LRT, SIFT, and PolyPhen is low; 76% of predictions are unique to one of the three methods, and only 5% of predictions are shared across all three methods. Our results indicate that only a small subset of deleterious mutations can be reliably identified, but that this subset provides the raw material for personalized medicine.
|
50
|
|