1
Rogozin IB, Pavlov YI, Goncearenco A, De S, Lada AG, Poliakov E, Panchenko AR, Cooper DN. Mutational signatures and mutable motifs in cancer genomes. Brief Bioinform 2019; 19:1085-1101. [PMID: 28498882 DOI: 10.1093/bib/bbx049] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 01/25/2017] [Indexed: 12/22/2022]
Abstract
Cancer is a genetic disorder, meaning that a plethora of different mutations, whether somatic or germ line, underlie the etiology of the 'Emperor of Maladies'. Point mutations, chromosomal rearrangements and copy number changes, whether they have occurred spontaneously in predisposed individuals or have been induced by intrinsic or extrinsic (environmental) mutagens, lead to the activation of oncogenes and inactivation of tumor suppressor genes, thereby promoting malignancy. This scenario has now been recognized and experimentally confirmed in a wide range of different contexts. Over the past decade, a surge in available sequencing technologies has allowed the sequencing of whole genomes from liquid malignancies and solid tumors belonging to different types and stages of cancer, giving birth to the new field of cancer genomics. One of the most striking discoveries has been that cancer genomes are highly enriched with mutations of specific kinds. It has been suggested that these mutations can be classified into 'families' based on their mutational signatures. A mutational signature may be regarded as a type of base substitution (e.g. C:G to T:A) within a particular context of neighboring nucleotide sequence (the bases upstream and/or downstream of the mutation). These mutational signatures, supplemented by mutable motifs (a wider mutational context), promise to help us to understand the nature of the mutational processes that operate during tumor evolution because they represent the footprints of interactions between DNA, mutagens and the enzymes of the repair/replication/modification pathways.
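The definition of a mutational signature given in this abstract (a base substitution within its neighboring sequence context) can be illustrated with a minimal sketch. It assumes a three-base context and the common convention of reporting each substitution relative to the pyrimidine strand, so that C:G to T:A changes seen on either strand collapse into one class. The function names, the label format, and the example mutations are hypothetical and are not taken from the paper.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def classify_substitution(context, alt):
    """Label a substitution by its trinucleotide context.

    context: three reference bases centered on the mutated position.
    alt: the base observed after mutation.
    Purine references are flipped to the opposite strand (assumed
    convention), so the label always has C or T as the reference base.
    """
    context, alt = context.upper(), alt.upper()
    if len(context) != 3 or any(b not in COMPLEMENT for b in context + alt):
        raise ValueError("expected a 3-base context and one base, A/C/G/T only")
    if context[1] == alt:
        raise ValueError("reference and mutated base are identical")
    if context[1] in "AG":
        context = reverse_complement(context)
        alt = COMPLEMENT[alt]
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"


def signature_counts(mutations):
    """Count substitution classes over (context, alt) pairs."""
    return Counter(classify_substitution(ctx, alt) for ctx, alt in mutations)


# Invented example: the same C:G to T:A change observed on both strands.
example = [("TCG", "T"), ("CGA", "A"), ("ACA", "G")]
print(signature_counts(example))
```

A wider mutable motif would extend the context beyond one flanking base on each side; the strand-collapsing step stays the same.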
Affiliation(s)
- Igor B Rogozin
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, USA
- Youri I Pavlov
  - Eppley Institute for Cancer Research, University of Nebraska Medical Center, USA
- Artem G Lada
  - Department of Microbiology and Molecular Genetics, University of California, Davis, USA
- Eugenia Poliakov
  - Laboratory of Retinal Cell and Molecular Biology, National Eye Institute, National Institutes of Health, USA
- Anna R Panchenko
  - National Center for Biotechnology Information, National Institutes of Health, USA
2
Mabrouk MS, Naeem SM, Eldosoky MA. Different genomic signal processing methods for eukaryotic gene prediction: a systematic review. Biomedical Engineering: Applications, Basis and Communications 2017. [DOI: 10.4015/s1016237217300012] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 11/07/2022]
Abstract
Bioinformatics has firmly established itself as a discipline within molecular biology and spans a wide range of subject areas, from structural biology and genomics to gene expression studies. Bioinformatics is the application of computer technology to the management of biological information. Genomic signal processing (GSP) techniques have been applied widely in bioinformatics and will continue to play an essential part in the investigation of biomedical problems. GSP refers to the use of digital signal processing (DSP) methods to analyze genomic data (e.g. DNA sequences). Recently, applications of GSP in bioinformatics have received considerable attention, including the identification of protein-coding regions in DNA, the identification of reading frames, and cancer detection. Cancer, known medically as malignant neoplasm, is one of the most dangerous diseases the world faces and has raised the death rate in recent years, so detecting it at an early stage is a promising way to enable timely treatment. GSP can be used to detect cancerous cells, which often arise from genetic abnormalities. This systematic review discusses some of the applications of GSP in bioinformatics in general. It then presents the GSP techniques used specifically for cancer detection, collecting recent results and the current state of the art as a basis for further research.
Affiliation(s)
- Mai S. Mabrouk
  - Biomedical Engineering Department, Faculty of Engineering, Misr University for Science and Technology (MUST University), Cairo, Egypt
- Safaa M. Naeem
  - Biomedical Engineering Department, Faculty of Engineering, Helwan University, Cairo, Egypt
- Mohamed A. Eldosoky
  - Biomedical Engineering Department, Faculty of Engineering, Helwan University, Cairo, Egypt
3
Nounou MN, Nounou HN, Meskin N, Datta A, Dougherty ER. Multiscale denoising of biological data: a comparative analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2012; 9:1539-1544. [PMID: 22566476 DOI: 10.1109/tcbb.2012.67] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 05/31/2023]
Abstract
Measured microarray genomic and metabolic data are a rich source of information about the biological systems they represent. For example, time-series biological data can be used to construct dynamic genetic regulatory network models, which can be used to design intervention strategies to cure or manage major diseases. Also, copy number data can be used to determine the locations and extent of aberrations in chromosome sequences. Unfortunately, measured biological data are usually contaminated with errors that mask the important features in the data. Therefore, these noisy measurements need to be filtered to enhance their usefulness in practice. Wavelet-based multiscale filtering has been shown to be a powerful denoising tool. In this work, different batch as well as online multiscale filtering techniques are used to denoise biological data contaminated with white or colored noise. The performances of these techniques are demonstrated and compared to those of some conventional low-pass filters using two case studies. The first case study uses simulated dynamic metabolic data, while the second case study uses real copy number data. Simulation results show that significant improvement can be achieved using multiscale filtering over conventional filtering techniques.
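To make the idea of wavelet-based multiscale filtering concrete, the following is a minimal batch sketch. It is not the method evaluated in the paper: it assumes an orthonormal Haar wavelet, a fixed soft threshold applied to the detail coefficients at every level, and a signal whose length is divisible by 2 to the power of the number of levels. All function names, parameters, and the example values are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(signal):
    """One level of the orthonormal Haar transform (even-length input)."""
    approx = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert one level of the Haar transform."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out


def soft_threshold(value, threshold):
    """Shrink a coefficient toward zero by `threshold`."""
    return math.copysign(max(abs(value) - threshold, 0.0), value)


def denoise(signal, levels, threshold):
    """Decompose, shrink the detail coefficients, and reconstruct."""
    if levels < 1 or len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_forward(approx)
        details.append([soft_threshold(d, threshold) for d in detail])
    # Rebuild from the coarsest level back to the original resolution.
    for detail in reversed(details):
        approx = haar_inverse(approx, detail)
    return approx


# Invented step-like signal with small fluctuations around two levels.
noisy = [1.1, 0.9, 1.0, 1.2, 3.1, 2.9, 3.0, 3.2]
print(denoise(noisy, levels=2, threshold=0.2))
```

In practice the threshold is usually estimated from the data rather than fixed by hand, and smoother wavelets than Haar are often preferred; both choices are simplified here.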
Affiliation(s)
- M N Nounou
  - Chemical Engineering Program, Texas A&M University at Qatar, Doha, Qatar.
4
5
Chua GH, Krishnan A, Li KB, Tomita M. Multiresolution analysis uncovers hidden conservation of properties in structurally and functionally similar proteins. J Bioinform Comput Biol 2011; 4:1245-67. [PMID: 17245813 DOI: 10.1142/s0219720006002442] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/29/2006] [Revised: 09/13/2006] [Accepted: 09/13/2006] [Indexed: 11/18/2022]
Abstract
Physicochemical properties of amino acids are important factors in determining protein structure and function. Most approaches make use of averaged properties over entire domains or even proteins to analyze their structure or function. This level of coarseness tends to hide the richness of the variability in the different properties across functional domains. This paper studies the conservation of physicochemical properties in a functionally similar family of proteins using a novel wavelet-based technique known as multiresolution analysis. Such an analysis can help uncover characteristics that can otherwise remain hidden. We have studied the protein kinase family of sequences and our findings are as follows: (a) a number of different properties are conserved over the functional catalytic domain irrespective of the sequence identities; (b) conservation of properties can be observed at different frequency levels and they agree well with the known structural/functional properties of the subdomains for the protein kinase family; (c) structural differences between the different kinase family members are reflected in the waveforms; and (d) functionally important mutations show distortions in the waveforms of conserved properties. The potential usefulness of the above findings in identifying functionally similar sequences in the twilight and midnight zones is demonstrated through a simple prediction model for the protein kinase family which achieved a recall of 93.7% and a precision of 96.75% in cross-validation tests.
Affiliation(s)
- Gek-Huey Chua
  - Bioinformatics Institute, 30, Biopolis Street, #07-01, Matrix, Singapore
6
Spencer CCA, Deloukas P, Hunt S, Mullikin J, Myers S, Silverman B, Donnelly P, Bentley D, McVean G. The influence of recombination on human genetic diversity. PLoS Genet 2006; 2:e148. [PMID: 17044736 PMCID: PMC1575889 DOI: 10.1371/journal.pgen.0020148] [Citation(s) in RCA: 203] [Impact Index Per Article: 11.3] [Received: 05/15/2006] [Accepted: 07/31/2006] [Indexed: 11/25/2022]
Abstract
In humans, the rate of recombination, as measured on the megabase scale, is positively associated with the level of genetic variation, as measured at the genic scale. Despite considerable debate, it is not clear whether these factors are causally linked or, if they are, whether this is driven by the repeated action of adaptive evolution or molecular processes such as double-strand break formation and mismatch repair. We introduce three innovations to the analysis of recombination and diversity: fine-scale genetic maps estimated from genotype experiments that identify recombination hotspots at the kilobase scale, analysis of an entire human chromosome, and the use of wavelet techniques to identify correlations acting at different scales. We show that recombination influences genetic diversity only at the level of recombination hotspots. Hotspots are also associated with local increases in GC content and the relative frequency of GC-increasing mutations but have no effect on substitution rates. Broad-scale association between recombination and diversity is explained through covariance of both factors with base composition. To our knowledge, these results are the first evidence of a direct and local influence of recombination hotspots on genetic variation and the fate of individual mutations. However, that hotspots have no influence on substitution rates suggests that they are too ephemeral on an evolutionary time scale to have a strong influence on broader scale patterns of base composition and long-term molecular evolution. Patterns of genetic variation in the human genome provide a history of the evolutionary forces that have shaped our species. The role of one factor, recombination, in shaping variation is much debated. 
The observation is that regions of the genome with high recombination also have high levels of genetic variation, but this pattern can be interpreted as evidence for either repeated, widespread adaptive evolution or correlation through neutral factors such as base composition. To resolve this issue, the authors constructed a genetic map of human Chromosome 20 that has a resolution more than three orders of magnitude greater than previous maps. By comparing the location of recombination hotspots with patterns of genetic variation, evolution, and base composition, the authors show that recombination has only a very local influence on diversity, which suggests that molecular mechanisms, such as mismatch-associated repair or double-strand break formation, not adaptive evolution, drive the association.
Affiliation(s)
- Panos Deloukas
  - Wellcome Trust Sanger Institute, Hinxton, United Kingdom
- Sarah Hunt
  - Wellcome Trust Sanger Institute, Hinxton, United Kingdom
- Jim Mullikin
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- Simon Myers
  - Department of Statistics, University of Oxford, Oxford, United Kingdom
  - Broad Institute, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
- Bernard Silverman
  - Department of Statistics, University of Oxford, Oxford, United Kingdom
- Peter Donnelly
  - Department of Statistics, University of Oxford, Oxford, United Kingdom
- Gil McVean
  - Department of Statistics, University of Oxford, Oxford, United Kingdom
7
Abstract
Advocates of maximum likelihood (ML) approaches to phylogenetics commonly cite as one of their primary advantages the use of objective statistical criteria for model selection. Currently, a particular implementation of the likelihood ratio test (LRT) is the most commonly used model-selection criterion in phylogenetics. This approach requires the choice of a starting point and a parameter addition (or removal) sequence that can affect all ML inferences (i.e., topology, model, and all evolutionary parameters). Here, several alternative starting points and parameter sequences are tested in empirical data sets to assess their influence on model selection and optimal topology. In the studied data sets, varying model-selection protocols leads to selection of different models that, in some cases, lead to different ML trees. Given the sensitivity of the LRT, some possible solutions to model selection (within the hypothesis testing approach) are outlined, and alternative model-selection criteria are discussed. Some of the suggested alternatives seem to lack these problems, although their behavior and adequacy for phylogenetics need to be further explored.
Affiliation(s)
- Diego Pol
  - Division of Paleontology, American Museum of Natural History, New York, NY 10024, USA.
8
Posada D, Buckley TR. Model selection and model averaging in phylogenetics: advantages of Akaike information criterion and Bayesian approaches over likelihood ratio tests. Syst Biol 2005; 53:793-808. [PMID: 15545256 DOI: 10.1080/10635150490522304] [Citation(s) in RCA: 2289] [Impact Index Per Article: 120.5] [Indexed: 10/26/2022]
Abstract
Model selection is a topic of special relevance in molecular phylogenetics that affects many, if not all, stages of phylogenetic inference. Here we discuss some fundamental concepts and techniques of model selection in the context of phylogenetics. We start by reviewing different aspects of the selection of substitution models in phylogenetics from a theoretical, philosophical and practical point of view, and summarize this comparison in table format. We argue that the most commonly implemented model selection approach, the hierarchical likelihood ratio test, is not the optimal strategy for model selection in phylogenetics, and that approaches like the Akaike Information Criterion (AIC) and Bayesian methods offer important advantages. In particular, the latter two methods are able to simultaneously compare multiple nested or nonnested models, assess model selection uncertainty, and allow for the estimation of phylogenies and model parameters using all available models (model-averaged inference or multimodel inference). We also describe how the relative importance of the different parameters included in substitution models can be depicted. To illustrate some of these points, we have applied AIC-based model averaging to 37 mitochondrial DNA sequences from the subgenus Ohomopterus (genus Carabus) ground beetles described by Sota and Vogler (2001).
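The AIC-based comparison and model averaging described in this abstract can be sketched as follows. The sketch uses the standard definitions AIC = 2k - 2 ln L and Akaike weights proportional to exp(-ΔAIC/2). The model names, log-likelihoods, parameter counts, and parameter estimates are invented for illustration and are not taken from the paper.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def akaike_weights(models):
    """Map each model name to its Akaike weight.

    models: dict of name -> (maximized log-likelihood, free parameters).
    The models need not be nested.
    """
    scores = {name: aic(lnl, k) for name, (lnl, k) in models.items()}
    best = min(scores.values())
    relative = {name: math.exp(-0.5 * (s - best)) for name, s in scores.items()}
    total = sum(relative.values())
    return {name: r / total for name, r in relative.items()}


def model_averaged_estimate(estimates, weights):
    """Weighted average of a parameter estimated under every model."""
    return sum(weights[name] * estimates[name] for name in weights)


# Hypothetical fits of three substitution models to one alignment.
fits = {"JC69": (-1050.0, 0), "HKY85": (-1020.0, 4), "GTR": (-1018.5, 8)}
weights = akaike_weights(fits)
# Hypothetical estimates of one shared parameter under each model.
tree_length = {"JC69": 0.80, "HKY85": 0.95, "GTR": 0.97}
print(weights)
print(model_averaged_estimate(tree_length, weights))
```

Because the weights sum to one, they can be read as relative support for each candidate, which is what allows uncertainty in model selection to be carried into the averaged estimate.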
Affiliation(s)
- David Posada
  - Departamento de Bioquímica, Genética e Inmunología, Facultad de Biología, Universidad de Vigo, Vigo 36200, Spain.
9
Pond SLK, Frost SDW. A simple hierarchical approach to modeling distributions of substitution rates. Mol Biol Evol 2004; 22:223-34. [PMID: 15483327 DOI: 10.1093/molbev/msi009] [Citation(s) in RCA: 53] [Impact Index Per Article: 2.7] [Indexed: 11/14/2022]
Abstract
Genetic sequence data typically exhibit variability in substitution rates across sites. In practice, there is often too little variation to fit a different rate for each site in the alignment, but the distribution of rates across sites may not be well modeled using simple parametric families. Mixtures of different distributions can capture more complex patterns of rate variation, but are often parameter-rich and difficult to fit. We present a simple hierarchical model in which a baseline rate distribution, such as a gamma distribution, is discretized into several categories, the quantiles of which are estimated using a discretized beta distribution. Although this approach involves adding only two extra parameters to a standard distribution, a wide range of rate distributions can be captured. Using simulated data, we demonstrate that a "beta-" model can reproduce the moments of the rate distribution more accurately than the distribution used to simulate the data, even when the baseline rate distribution is misspecified. Using hepatitis C virus and mammalian mitochondrial sequences, we show that a beta- model can fit as well or better than a model with multiple discrete rate categories, and compares favorably with a model which fits a separate rate category to each site. We also demonstrate this discretization scheme in the context of codon models specifically aimed at identifying individual sites undergoing adaptive or purifying evolution.
10
Benner SA. Interpretive proteomics--finding biological meaning in genome and proteome databases. Advances in Enzyme Regulation 2004; 43:271-359. [PMID: 12791396 DOI: 10.1016/s0065-2571(02)00024-9] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Indexed: 10/27/2022]
Affiliation(s)
- Steven A Benner
  - Department of Chemistry, University of Florida, Gainesville FL 32611, USA.
11
Abstract
Chemokine receptors represent a prime target for the development of novel therapeutic strategies in a variety of disease processes, including inflammation, allergy and neoplasia. Here we use maximum likelihood methods and bootstrap methods to investigate both the phylogenetic relationships in a large set of human chemokine receptor sequences and the relationships between chemokine receptors and their nearest neighbors. We found that CCR and CXCR families are not homogeneous. We also provide evidence that angiotensin receptors are the closest neighbors. Other close neighbors include opioid, somatostatin and melanin-concentrating hormone receptors. The phylogenetic analysis suggests ancient paralogous relationships and establishes a link between immune, metabolic and neural systems modulation. We complement our findings with a structural analysis, based on wavelet methods, of the major branches of the chemokine receptor phylogeny. We hypothesize that receptors very close in the tree can form heterodimers. Our analyses reveal different characteristics of amino acid hydrophobicity and volume propensity in the different subfamilies. We also found that the second extra-cytoplasmic loop has higher rates of evolution than the internal loops and transmembrane segments, suggesting that selection, shifting, reassignments and broadening of receptor binding specificities involve mainly this loop.
Affiliation(s)
- Pietro Liò
  - Department of Zoology, University of Cambridge, Cambridge, UK.
12
Rogozin IB, Pavlov YI. Theoretical analysis of mutation hotspots and their DNA sequence context specificity. Mutat Res 2003; 544:65-85. [PMID: 12888108 DOI: 10.1016/s1383-5742(03)00032-2] [Citation(s) in RCA: 123] [Impact Index Per Article: 5.9] [Indexed: 10/27/2022]
Abstract
Mutation frequencies vary significantly along nucleotide sequences such that mutations often concentrate at certain positions called hotspots. Mutation hotspots in DNA reflect intrinsic properties of the mutation process, such as sequence specificity, that manifests itself at the level of interaction between mutagens, DNA, and the action of the repair and replication machineries. The hotspots might also reflect structural and functional features of the respective DNA sequences. When mutations in a gene are identified using a particular experimental system, resulting hotspots could reflect the properties of the gene product and the mutant selection scheme. Analysis of the nucleotide sequence context of hotspots can provide information on the molecular mechanisms of mutagenesis. However, the determinants of mutation frequency and specificity are complex, and there are many analytical methods for their study. Here we review computational approaches for analyzing mutation spectra (distribution of mutations along the target genes) that include many mutable (detectable) positions. The following methods are reviewed: derivation of a consensus sequence, application of regression approaches to correlate nucleotide sequence features with mutation frequency, mutation hotspot prediction, analysis of oligonucleotide composition of regions containing mutations, pairwise comparison of mutation spectra, analysis of multiple spectra, and analysis of "context-free" characteristics. The advantages and pitfalls of these methods are discussed and illustrated by examples from the literature. The most reliable analyses were obtained when several methods were combined and information from theoretical analysis and experimental observations was considered simultaneously. Simple, robust approaches should be used with small samples of mutations, whereas combinations of simple and complex approaches may be required for large samples. 
We discuss several well-documented studies where analysis of mutation spectra has substantially contributed to the current understanding of molecular mechanisms of mutagenesis. The nucleotide sequence context of mutational hotspots is a fingerprint of interactions between DNA and DNA repair, replication, and modification enzymes, and the analysis of hotspot context provides evidence of such interactions.
Affiliation(s)
- Igor B Rogozin
  - Institute of Cytology and Genetics, Russian Academy of Sciences, Novosibirsk, Russia
13
Arbogast BS, Edwards SV, Wakeley J, Beerli P, Slowinski JB. Estimating Divergence Times from Molecular Data on Phylogenetic and Population Genetic Timescales. Annu Rev Ecol Syst 2002. [DOI: 10.1146/annurev.ecolsys.33.010802.150500] [Citation(s) in RCA: 471] [Impact Index Per Article: 21.4] [Indexed: 11/09/2022]
Affiliation(s)
- Brian S. Arbogast
  - Department of Biological Sciences, Humboldt State University, Arcata, California 95521
- Scott V. Edwards
  - Department of Zoology, University of Washington, Seattle, Washington 98195
- John Wakeley
  - Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, Massachusetts 02138
- Peter Beerli
  - Department of Genome Sciences, University of Washington, Seattle, Washington 98195
14
Gaucher EA, Gu X, Miyamoto MM, Benner SA. Predicting functional divergence in protein evolution by site-specific rate shifts. Trends Biochem Sci 2002; 27:315-21. [PMID: 12069792 DOI: 10.1016/s0968-0004(02)02094-7] [Citation(s) in RCA: 121] [Impact Index Per Article: 5.5] [Indexed: 11/22/2022]
Abstract
Most modern tools that analyze protein evolution allow individual sites to mutate at constant rates over the history of the protein family. However, Walter Fitch observed in the 1970s that, if a protein changes its function, the mutability of individual sites might also change. This observation is captured in the "non-homogeneous gamma model", which extracts functional information from gene families by examining the different rates at which individual sites evolve. This model has recently been coupled with structural and molecular biology to identify sites that are likely to be involved in changing function within the gene family. Applying this to multiple gene families highlights the widespread divergence of functional behavior among proteins to generate paralogs and orthologs.
Affiliation(s)
- Eric A Gaucher
  - NASA Astrobiology Institute, University of Florida, Gainesville, FL 32611, USA
15
Gaucher EA, Miyamoto MM, Benner SA. Function-structure analysis of proteins using covarion-based evolutionary approaches: Elongation factors. Proc Natl Acad Sci U S A 2001; 98:548-52. [PMID: 11209054 PMCID: PMC14624 DOI: 10.1073/pnas.98.2.548] [Citation(s) in RCA: 83] [Impact Index Per Article: 3.6] [Indexed: 11/18/2022]
Abstract
The divergent evolution of protein sequences from genomic databases can be analyzed by the use of different mathematical models. The most common treat all sites in a protein sequence as equally variable. More sophisticated models acknowledge the fact that purifying selection generally tolerates variable amounts of amino acid replacement at different positions in a protein sequence. In their "stationary" versions, such models assume that the replacement rate at individual positions remains constant throughout evolutionary history. "Nonstationary" covarion versions, however, allow the replacement rate at a position to vary in different branches of the evolutionary tree. Recently, statistical methods have been developed that highlight this type of variation in replacement rates. Here, we show how positions that have variable rates of divergence in different regions of a tree ("covarion behavior"), coupled with analyses of experimental three-dimensional structures, can provide experimentally testable hypotheses that relate individual amino acid residues to specific functional differences in those branches. We illustrate this in the elongation factor family of proteins as a paradigm for applications of this type of analysis in functional genomics generally.
Affiliation(s)
- E A Gaucher
  - Department of Chemistry and Molecular Cell Biology Program, College of Medicine, University of Florida, Gainesville, FL 32611-7200, USA.