1
|
Yang Y, Braga MV, Dean MD. Insertion-Deletion Events Are Depleted in Protein Regions with Predicted Secondary Structure. Genome Biol Evol 2024; 16:evae093. [PMID: 38735759 PMCID: PMC11102076 DOI: 10.1093/gbe/evae093] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2024] [Revised: 04/16/2024] [Accepted: 04/21/2024] [Indexed: 05/14/2024]
Abstract
A fundamental goal in evolutionary biology and population genetics is to understand how selection shapes the fate of new mutations. Here, we test the null hypothesis that insertion-deletion (indel) events in protein-coding regions occur randomly with respect to secondary structures. We identified indels across 11,444 sequence alignments in mouse, rat, human, chimp, and dog genomes and then quantified their overlap with four different types of secondary structure (alpha helices, beta strands, protein bends, and protein turns) predicted by the deep-learning methods of AlphaFold2. Indels overlapped secondary structures 54% as much as expected and were especially underrepresented over beta strands, which tend to form internal, stable regions of proteins. In contrast, indels were enriched by 155% over regions without any predicted secondary structures. These skews were stronger in the rodent lineages compared to the primate lineages, consistent with population genetic theory predicting that natural selection will be more efficient in species with larger effective population sizes. Nonsynonymous substitutions were also less common in regions of protein secondary structure, although not as strongly reduced as in indels. In a complementary analysis of thousands of human genomes, we showed that indels overlapping secondary structure segregated at significantly lower frequency than indels outside of secondary structure. Taken together, our study shows that indels are selected against if they overlap secondary structure, presumably because they disrupt the tertiary structure and function of a protein.
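The "54% as much as expected" figure is an observed-to-expected ratio. The sketch below shows one simple way such a ratio could be computed for a single protein. It is not the authors' pipeline: the function name, the 0-based site indexing, and the null model (indels placed uniformly along the protein) are all assumptions made for illustration.

```python
def overlap_ratio(indel_sites, structured_sites, protein_length):
    """Observed / expected share of indel sites inside structured positions.

    Assumed null model: indels fall uniformly along the protein, so the
    expected share equals the fraction of positions that are structured.
    All names here are hypothetical and not taken from the paper.
    """
    indel_sites = set(indel_sites)
    structured_sites = set(structured_sites)
    if not indel_sites or protein_length <= 0:
        raise ValueError("need at least one indel and a positive length")
    expected = len(structured_sites) / protein_length
    if expected == 0:
        raise ValueError("no structured positions, ratio is undefined")
    observed = len(indel_sites & structured_sites) / len(indel_sites)
    return observed / expected


# Toy protein: 100 residues, the first 50 in predicted secondary structure.
ratio = overlap_ratio({10, 60, 70, 80}, range(50), 100)
print(ratio)
```

A ratio below 1 indicates depletion relative to the uniform null and a ratio above 1 indicates enrichment.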
Affiliation(s)
- Yi Yang
- Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
| | - Matthew V Braga
- Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
| | - Matthew D Dean
- Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
| |
|
2
|
Caswell B, Summers TJ, Licup GL, Cantu DC. Mutation Space of Spatially Conserved Amino Acid Sites in Proteins. ACS Omega 2023; 8:24302-24310. [PMID: 37457482 PMCID: PMC10339398 DOI: 10.1021/acsomega.3c01473] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2023] [Accepted: 06/14/2023] [Indexed: 07/18/2023]
Abstract
The mutation space of spatially conserved (MSSC) amino acid residues is a protein structural quantity developed and described in this work. The MSSC quantifies how many mutations and which different mutations, i.e., the mutation space, occur in each amino acid site in a protein. The MSSC calculates the mutation space of amino acids in a target protein from the spatially conserved residues in a group of multiple protein structures. Spatially conserved amino acid residues are identified based on their relative positions in the protein structure. The MSSC examines each residue in a target protein, compares it to the residues present in the same relative position in other protein structures, and uses physicochemical criteria of mutations found in each conserved spatial site to quantify the mutation space of each amino acid in the target protein. The MSSC is analogous to scoring each site in a multiple sequence alignment but in three-dimensional space considering the spatial location of residues instead of solely the order in which they appear in a protein sequence. MSSC analysis was performed on example cases, and it reproduces the well-known observation that, regardless of secondary structure, solvent-exposed residues are more likely to be mutated than internal ones. The MSSC code is available on GitHub: "https://github.com/Cantu-Research-Group/Mutation_Space".
|
3
|
Banerjee A, Bahar I. Structural Dynamics Predominantly Determine the Adaptability of Proteins to Amino Acid Deletions. Int J Mol Sci 2023; 24:8450. [PMID: 37176156 PMCID: PMC10179678 DOI: 10.3390/ijms24098450] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2023] [Revised: 05/01/2023] [Accepted: 05/06/2023] [Indexed: 05/15/2023]
Abstract
The insertion or deletion (indel) of amino acids has a variety of effects on protein function, ranging from disease-forming changes to gaining new functions. Despite their importance, indels have not been systematically characterized towards protein engineering or modification goals. In the present work, we focus on deletions composed of multiple contiguous amino acids (mAA-dels) and their effects on the protein (mutant) folding ability. Our analysis reveals that the mutant retains the native fold when the mAA-del obeys well-defined structural dynamics properties: localization in intrinsically flexible regions, showing low resistance to mechanical stress, and separation from allosteric signaling paths. Motivated by the possibility of distinguishing the features that underlie the adaptability of proteins to mAA-dels, and by the rapid evaluation of these features using elastic network models, we developed a positive-unlabeled learning-based classifier that can be adopted for protein design purposes. Trained on a consolidated set of features, including those reflecting the intrinsic dynamics of the regions where the mAA-dels occur, the new classifier yields a high recall of 84.3% for identifying mAA-dels that are stably tolerated by the protein. The comparative examination of the relative contribution of different features to the prediction reveals the dominant role of structural dynamics in enabling the adaptation of the mutant to mAA-del without disrupting the native fold.
Affiliation(s)
- Anupam Banerjee
- Laufer Center for Physical and Quantitative Biology, Stony Brook University, Stony Brook, NY 11794, USA
| | - Ivet Bahar
- Laufer Center for Physical and Quantitative Biology, Stony Brook University, Stony Brook, NY 11794, USA
- Department of Biochemistry and Cell Biology, Stony Brook University, Stony Brook, NY 11794, USA
| |
|
4
|
Sallah SR, Sergouniotis PI, Hardcastle C, Ramsden S, Lotery AJ, Lench N, Lovell SC, Black GCM. Assessing the Pathogenicity of In-Frame CACNA1F Indel Variants Using Structural Modeling. J Mol Diagn 2022; 24:1232-1239. [PMID: 36191840 DOI: 10.1016/j.jmoldx.2022.09.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2021] [Revised: 08/20/2022] [Accepted: 09/09/2022] [Indexed: 01/13/2023]
Abstract
Small in-frame insertion-deletion (indel) variants are a common form of genomic variation whose impact on rare disease phenotypes has been understudied. The prediction of the pathogenicity of such variants remains challenging. X-linked incomplete congenital stationary night blindness type 2 (CSNB2) is a nonprogressive, inherited retinal disorder caused by variants in CACNA1F, encoding the Cav1.4α1 channel protein. Here, structural analysis was used through homology modeling to interpret 10 disease-correlated and 10 putatively benign CACNA1F in-frame indel variants. CSNB2-correlated changes were found to be more highly conserved compared with putative benign variants. Notably, all 10 disease-correlated variants but none of the benign changes were within modeled regions of the protein. Structural analysis revealed that disease-correlated variants are predicted to destabilize the structure and function of the Cav1.4α1 channel protein. Overall, the use of structural information to interpret the consequences of in-frame indel variants provides an important adjunct that can improve the diagnosis for individuals with CSNB2.
Affiliation(s)
- Shalaw R Sallah
- Division of Evolution and Genomic Sciences, School of Biological Sciences, Faculty of Biology, Medicines and Health, University of Manchester, Manchester Academic Health Science Centre, Manchester, United Kingdom; Manchester Centre for Genomic Medicine, Manchester University NHS Foundation Trust, Manchester Academic Health Science Centre, St. Mary's Hospital, Manchester, United Kingdom.
| | - Panagiotis I Sergouniotis
- Division of Evolution and Genomic Sciences, School of Biological Sciences, Faculty of Biology, Medicines and Health, University of Manchester, Manchester Academic Health Science Centre, Manchester, United Kingdom; Manchester Centre for Genomic Medicine, Manchester University NHS Foundation Trust, Manchester Academic Health Science Centre, St. Mary's Hospital, Manchester, United Kingdom
| | - Claire Hardcastle
- Manchester Centre for Genomic Medicine, Manchester University NHS Foundation Trust, Manchester Academic Health Science Centre, St. Mary's Hospital, Manchester, United Kingdom
| | - Simon Ramsden
- Manchester Centre for Genomic Medicine, Manchester University NHS Foundation Trust, Manchester Academic Health Science Centre, St. Mary's Hospital, Manchester, United Kingdom
| | - Andrew J Lotery
- Faculty of Medicine, University of Southampton, Southampton, United Kingdom
| | - Nick Lench
- Congenica Ltd., BioData Innovation Centre, Wellcome Genome Campus, Hinxton, Cambridge, United Kingdom
| | - Simon C Lovell
- Division of Evolution and Genomic Sciences, School of Biological Sciences, Faculty of Biology, Medicines and Health, University of Manchester, Manchester Academic Health Science Centre, Manchester, United Kingdom
| | - Graeme C M Black
- Division of Evolution and Genomic Sciences, School of Biological Sciences, Faculty of Biology, Medicines and Health, University of Manchester, Manchester Academic Health Science Centre, Manchester, United Kingdom; Manchester Centre for Genomic Medicine, Manchester University NHS Foundation Trust, Manchester Academic Health Science Centre, St. Mary's Hospital, Manchester, United Kingdom.
| |
|
5
|
Rao RSP, Ahsan N, Xu C, Su L, Verburgt J, Fornelli L, Kihara D, Xu D. Evolutionary Dynamics of Indels in SARS-CoV-2 Spike Glycoprotein. Evol Bioinform Online 2021; 17:11769343211064616. [PMID: 34898980 PMCID: PMC8655444 DOI: 10.1177/11769343211064616] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 08/07/2021] [Accepted: 11/12/2021] [Indexed: 01/28/2023]
Abstract
SARS-CoV-2, responsible for the current COVID-19 pandemic that claimed over 5.0 million lives, belongs to a class of enveloped viruses that undergo quick evolutionary adjustments under selection pressure. Numerous variants have emerged in SARS-CoV-2, posing a serious challenge to the global vaccination effort and COVID-19 management. The evolutionary dynamics of this virus are only beginning to be explored. In this work, we have analysed 1.79 million spike glycoprotein sequences of SARS-CoV-2 and found that the virus is fine-tuning the spike with numerous amino acid insertions and deletions (indels). Indels seem to have a selective advantage as the proportions of sequences with indels steadily increased over time, currently at over 89%, with similar trends across countries/variants. There were as many as 420 unique indel positions and 447 unique combinations of indels. Despite their high frequency, indels resulted in only minimal alteration of N-glycosylation sites, including both gain and loss. As indels and point mutations are positively correlated and sequences with indels have significantly more point mutations, they have implications in the evolutionary dynamics of the SARS-CoV-2 spike glycoprotein.
Affiliation(s)
- R Shyama Prasad Rao
- Biostatistics and Bioinformatics Division, Yenepoya Research Center, Yenepoya University, Mangaluru, Karnataka, India
| | - Nagib Ahsan
- Department of Chemistry and Biochemistry, University of Oklahoma, Norman, OK, USA
- Mass Spectrometry, Proteomics and Metabolomics Core Facility, Stephenson Life Sciences Research Center, University of Oklahoma, Norman, OK, USA
| | - Chunhui Xu
- Department of Electrical Engineering and Computer Science, Informatics Institute, and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
| | - Lingtao Su
- Department of Electrical Engineering and Computer Science, Informatics Institute, and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
| | - Jacob Verburgt
- Department of Biological Sciences, Purdue University, West Lafayette, IN, USA
| | - Luca Fornelli
- Department of Chemistry and Biochemistry, University of Oklahoma, Norman, OK, USA
- Department of Biology, University of Oklahoma, Norman, OK, USA
| | - Daisuke Kihara
- Department of Biological Sciences, Purdue University, West Lafayette, IN, USA
- Department of Computer Science, Purdue University, West Lafayette, IN, USA
| | - Dong Xu
- Department of Electrical Engineering and Computer Science, Informatics Institute, and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
| |
|
6
|
Li DD, Wang JL, Liu Y, Li YZ, Zhang Z. Expanded analyses of the functional correlations within structural classifications of glycoside hydrolases. Comput Struct Biotechnol J 2021; 19:5931-5942. [PMID: 34849197 PMCID: PMC8602953 DOI: 10.1016/j.csbj.2021.10.039] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/01/2021] [Revised: 10/30/2021] [Accepted: 10/30/2021] [Indexed: 01/01/2023]
Abstract
Glycoside hydrolases (GHs) are greatly diverse in sequences and functions, but systematic studies of GH relationships based on structural information are lacking. Here, we report that GHs have multiple evolutionary origins and are structurally derived from 27 homologous superfamilies and 16 folds, but GHs are highly biased to distribute in a few superfamilies and folds. Six of these superfamilies are widely encoded by archaea, bacteria, and eukaryotes, indicating that they may be the most ancient in origin. Most superfamilies vary in enzyme function, and some, such as the superfamilies of (β/α)8-barrel and (α/α)6-barrel structures, exhibit extreme functional diversity; this is highly positively correlated with sequence diversity. More than one-third of glycosidase activities show a phenomenon of convergent evolution, especially the degradation functions of GHs on polysaccharides. The GHs of most superfamilies have relatively narrow environmental distributions, normally with the highest abundance in host-associated environments and a distribution preference for moderate low-temperature and acidic environments. Overall, our expanded analysis facilitates an understanding of complex GH sequence-structure-function relationships and may guide our screening and engineering of GHs.
Affiliation(s)
- Dan-Dan Li
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, Qingdao 266237, China
| | - Jin-Lan Wang
- National Administration of Health Data, Jinan 250002, China
| | - Ya Liu
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, Qingdao 266237, China
| | - Yue-Zhong Li
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, Qingdao 266237, China
| | - Zheng Zhang
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, Qingdao 266237, China.,Suzhou Research Institute, Shandong University, Suzhou 215123, China
| |
|
7
|
Zhao VY, Rodrigues JV, Lozovsky ER, Hartl DL, Shakhnovich EI. Switching an active site helix in dihydrofolate reductase reveals limits to subdomain modularity. Biophys J 2021; 120:4738-4750. [PMID: 34571014 PMCID: PMC8595743 DOI: 10.1016/j.bpj.2021.09.032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2021] [Revised: 09/14/2021] [Accepted: 09/22/2021] [Indexed: 11/23/2022]
Abstract
To what degree are individual structural elements within proteins modular such that similar structures from unrelated proteins can be interchanged? We study subdomain modularity by creating 20 chimeras of an enzyme, Escherichia coli dihydrofolate reductase (DHFR), in which a catalytically important, 10-residue α-helical sequence is replaced by α-helical sequences from a diverse set of proteins. The chimeras stably fold but have a range of diminished thermal stabilities and catalytic activities. Evolutionary coupling analysis indicates that the residues of this α-helix are under selection pressure to maintain catalytic activity in DHFR. Reversion to phenylalanine at key position 31 was found to partially restore catalytic activity, which could be explained by evolutionary coupling values. We performed molecular dynamics simulations using replica exchange with solute tempering. Chimeras with low catalytic activity exhibit nonhelical conformations that block the binding site and disrupt the positioning of the catalytically essential residue D27. Simulation observables and in vitro measurements of thermal stability and substrate-binding affinity are strongly correlated. Several E. coli strains with chromosomally integrated chimeric DHFRs can grow, with growth rates that follow predictions from a kinetic flux model that depends on the intracellular abundance and catalytic activity of DHFR. Our findings show that although α-helices are not universally substitutable, the molecular and fitness effects of modular segments can be predicted by the biophysical compatibility of the replacement segment.
Affiliation(s)
- Victor Y Zhao
- Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts
| | - João V Rodrigues
- Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts
| | - Elena R Lozovsky
- Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, Massachusetts
| | - Daniel L Hartl
- Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, Massachusetts
| | - Eugene I Shakhnovich
- Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts.
| |
|
8
|
Banerjee A, Kumar A, Ghosh KK, Mitra P. Estimating Change in Foldability Due to Multipoint Deletions in Protein Structures. J Chem Inf Model 2020; 60:6679-6690. [PMID: 33225697 DOI: 10.1021/acs.jcim.0c00802] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/29/2022]
Abstract
Insertions/deletions of amino acids in the protein backbone potentially result in altered structural/functional specifications. They can either contribute positively to the evolutionary process or can result in disease conditions. Despite being the second most prevalent form of protein modification, there are no databases or computational frameworks that delineate harmful multipoint deletions (MPD) from beneficial ones. We introduce a positive unlabeled learning-based prediction framework (PROFOUND) that utilizes fold-level attributes, environment-specific properties, and deletion site-specific properties to predict the change in foldability arising from such MPDs, both in the non-loop and loop regions of protein structures. In the absence of any protein structure dataset to study MPDs, we introduce a dataset with 153 MPD instances that lead to native-like folded structures and 7650 unlabeled MPD instances whose effect on the foldability of the corresponding proteins is unknown. PROFOUND on 10-fold cross-validation on our newly introduced dataset reports a recall of 82.2% (86.6%) and a fall-out rate (FR) of 14.2% (20.6%), corresponding to MPDs in the protein loop (non-loop) region. The low FR suggests that the foldability in proteins subject to MPDs is not random and necessitates unique specifications of the deleted region. In addition, we find that additional evolutionary attributes contribute to higher recall and lower FR. The first-of-its-kind foldability prediction system owing to MPD instances and the newly introduced dataset will potentially aid in novel protein engineering endeavors.
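For readers unfamiliar with the two metrics quoted above, the following sketch computes recall and fall-out rate from binary labels and predictions using their textbook definitions. Treating unlabeled instances as negatives (label 0) when computing fall-out is an assumption made here for illustration; the abstract does not spell out how PROFOUND handles them, and the function name is hypothetical.

```python
def recall_and_fallout(labels, predictions):
    """Return (recall, fall-out) for binary labels and predictions.

    recall   = TP / (TP + FN)
    fall-out = FP / (FP + TN)
    Assumption: unlabeled instances are encoded as 0 and counted as negatives.
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one positive and one negative label")
    return tp / (tp + fn), fp / (fp + tn)


# Toy data: three known positives followed by four unlabeled instances.
recall, fallout = recall_and_fallout([1, 1, 1, 0, 0, 0, 0],
                                     [1, 1, 0, 1, 0, 0, 0])
print(recall, fallout)
```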
Affiliation(s)
- Anupam Banerjee
- Advanced Technology Development Centre, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal 721302, India
| | - Amit Kumar
- Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal 721302, India
| | - Kushal Kanti Ghosh
- Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India
| | - Pralay Mitra
- Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal 721302, India
| |
|
9
|
Zhang Z, Wang J, Gong Y, Li Y. Contributions of substitutions and indels to the structural variations in ancient protein superfamilies. BMC Genomics 2018; 19:771. [PMID: 30355304 PMCID: PMC6201574 DOI: 10.1186/s12864-018-5178-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 05/02/2018] [Accepted: 10/16/2018] [Indexed: 11/10/2022]
Abstract
Background: Quantitative evaluation of protein structural evolution is important for our understanding of protein biological functions and their evolutionary adaptation, and is useful in guiding protein engineering. However, compared to the models for sequence evolution, the quantitative models for protein structural evolution received less attention. Ancient protein superfamilies are often considered versatile, allowing genetic and functional diversifications during long-term evolution. In this study, we investigated the quantitative impacts of sequence variations on the structural evolution of homologues in 68 ancient protein superfamilies that exist widely in sequenced eukaryotic, bacterial and archaeal genomes. Results: We found that the accumulated structural variations within ancient superfamilies could be explained largely by a bilinear model that simultaneously considers amino acid substitution and insertion/deletion (indel). Both substitutions and indels are essential for explaining the structural variations within ancient superfamilies. For those ancient superfamilies with high bilinear multiple correlation coefficients, the influence of each unit of substitution or indel on structural variations is almost constant within each superfamily, but varies greatly among different superfamilies. The influence of each unit indel on structural variations is always larger than that of each unit substitution within each superfamily, but the accumulated contributions of indels to structural variations are lower than those of substitutions in most superfamilies. The total contributions of sequence indels and substitutions (46% and 54%, respectively) to the structural variations that result from sequence variations are slightly different in ancient superfamilies. Conclusions: Structural variations within ancient protein superfamilies accumulated under the significantly bilinear influence of amino acid substitutions and indels in sequences. Both substitutions and indels are essential for explaining the structural variations within ancient superfamilies. For those structural variations resulting from sequence variations, the total contribution of indels is slightly lower than that of amino acid substitutions. The regular clock exists not only in protein sequences, but also probably in protein structures.
Affiliation(s)
- Zheng Zhang
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, Qingdao, 266237, China
| | - Jinlan Wang
- Physical Examination Office of Shandong Province, Health and Family Planning Commission of Shandong Province, Jinan, 250014, China
| | - Ya Gong
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, Qingdao, 266237, China
| | - Yuezhong Li
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, Qingdao, 266237, China.
| |
|
10
|
Jackson EL, Spielman SJ, Wilke CO. Computational prediction of the tolerance to amino-acid deletion in green-fluorescent protein. PLoS One 2017; 12:e0164905. [PMID: 28369116 PMCID: PMC5378326 DOI: 10.1371/journal.pone.0164905] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 10/07/2016] [Accepted: 03/21/2017] [Indexed: 01/29/2023]
Abstract
Proteins evolve through two primary mechanisms: substitution, where mutations alter a protein's amino-acid sequence, and insertions and deletions (indels), where amino acids are either added to or removed from the sequence. Protein structure has been shown to influence the rate at which substitutions accumulate across sites in proteins, but whether structure similarly constrains the occurrence of indels has not been rigorously studied. Here, we investigate the extent to which structural properties known to covary with protein evolutionary rates might also predict protein tolerance to indels. Specifically, we analyze a publicly available dataset of single-amino-acid deletion mutations in enhanced green fluorescent protein (eGFP) to assess how well the functional effect of deletions can be predicted from protein structure. We find that weighted contact number (WCN), which measures how densely packed a residue is within the protein's three-dimensional structure, provides the best single predictor for whether eGFP will tolerate a given deletion. We additionally find that using protein design to explicitly model deletions results in improved predictions of functional status when combined with other structural predictors. Our work suggests that structure plays a fundamental role in constraining deletions at sites in proteins, and further that similar biophysical constraints influence both substitutions and deletions. This study therefore provides a solid foundation for future work to examine how protein structure influences tolerance of more complex indel events, such as insertions or large deletions.
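Weighted contact number is commonly defined as the sum of inverse squared distances from a residue to every other residue in the structure. The sketch below uses that common definition on a list of 3D coordinates. Which atom represents each residue (for example C-alpha or the side-chain centroid) is not stated in the abstract, so the coordinates here are simply assumed to be one point per residue, and the function name is hypothetical.

```python
import math


def weighted_contact_number(coords):
    """WCN_i = sum over j != i of 1 / r_ij**2, one value per residue.

    `coords` holds one (x, y, z) point per residue. The choice of
    representative atom is an assumption left to the caller. Two residues
    at identical coordinates would cause a division by zero.
    """
    wcn = []
    for i, a in enumerate(coords):
        total = 0.0
        for j, b in enumerate(coords):
            if i != j:
                total += 1.0 / math.dist(a, b) ** 2
        wcn.append(total)
    return wcn


# Three residues on a line, 1 unit apart: the middle one is most densely packed.
print(weighted_contact_number([(0, 0, 0), (1, 0, 0), (2, 0, 0)]))
```

Higher values correspond to residues with many close neighbours, which is the sense of "densely packed" used in the abstract.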
Affiliation(s)
- Eleisha L. Jackson
- Department of Integrative Biology, The University of Texas at Austin, Austin, Texas, United States of America
- Center for Computational Biology and Bioinformatics, The University of Texas at Austin, Austin, Texas, United States of America
- Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, Texas, United States of America
| | - Stephanie J. Spielman
- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, Pennsylvania, United States of America
| | - Claus O. Wilke
- Department of Integrative Biology, The University of Texas at Austin, Austin, Texas, United States of America
- Center for Computational Biology and Bioinformatics, The University of Texas at Austin, Austin, Texas, United States of America
- Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, Texas, United States of America
| |
|
11
|
Measuring Accelerated Rates of Insertions and Deletions Independent of Rates of Nucleotide Substitution. J Mol Evol 2016; 83:137-146. [PMID: 27770175 PMCID: PMC5080320 DOI: 10.1007/s00239-016-9761-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/06/2016] [Accepted: 10/11/2016] [Indexed: 11/16/2022]
Abstract
Evolutionary constraint for insertions and deletions (indels) is not necessarily equal to constraint for nucleotide substitutions for any given region of a genome. Knowing the variation in indel-specific evolutionary rates across the sequence will aid our understanding of evolutionary constraints on indels, and help us infer how indels have contributed to the evolution of the sequence. However, unlike for nucleotide substitutions, there has been no phylogenetic method that can statistically infer significantly different rates of indels across the sequence space independent of substitution rates. Here, we have developed software that finds sites with accelerated evolutionary rates specific to indels, by introducing a scaling parameter that only applies to the indel rates and not to the nucleotide substitution rates. Using the software, we show that we can find regions of accelerated rates of indels in the protein alignments of primate genomes. We also confirm that the sites that have high rates of indels are different from the sites that have high rates of nucleotide substitutions within the protein sequences. By identifying regions with accelerated rates of indels independent of nucleotide substitutions, we will be able to better understand the impact of indel mutations on protein sequence evolution.
|
12
|
Al-Shatnawi M, Ahmad MO, Swamy MNS. MSAIndelFR: a scheme for multiple protein sequence alignment using information on indel flanking regions. BMC Bioinformatics 2015; 16:393. [PMID: 26597571 PMCID: PMC4657235 DOI: 10.1186/s12859-015-0826-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 05/09/2015] [Accepted: 11/14/2015] [Indexed: 11/16/2022]
Abstract
Background: The alignment of multiple protein sequences is one of the most commonly performed tasks in bioinformatics. In spite of considerable research and efforts that have been recently deployed for improving the performance of multiple sequence alignment (MSA) algorithms, finding a highly accurate alignment between multiple protein sequences is still a challenging problem. Results: We propose a novel and efficient algorithm, called MSAIndelFR, for multiple sequence alignment using the information on the predicted locations of indel flanking regions (IndelFRs) and the computed average log-loss values obtained from IndelFR predictors, each of which is designed for a different protein fold. We demonstrate that the introduction of a new variable gap penalty function based on the predicted locations of the IndelFRs and the computed average log-loss values into the proposed algorithm substantially improves the protein alignment accuracy. This is illustrated by evaluating the performance of the algorithm in aligning sequences belonging to the protein folds for which the IndelFR predictors already exist and by using the reference alignments of the four popular benchmarks, BAliBASE 3.0, OXBENCH, PREFAB 4.0, and SABRE (SABmark 1.65). Conclusions: We have proposed a novel and efficient algorithm, the MSAIndelFR algorithm, for multiple protein sequence alignment incorporating a new variable gap penalty function. It is shown that the performance of the proposed algorithm is superior to that of the most widely used alignment algorithms, Clustal W2, Clustal Omega, Kalign2, MSAProbs, MAFFT, MUSCLE, ProbCons and Probalign, in terms of both the sum-of-pairs and total column metrics.
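The sum-of-pairs (SP) and total column (TC) metrics mentioned above compare a test alignment against a reference alignment. The sketch below follows their usual definitions: SP is the fraction of residue pairs aligned in the reference that are also aligned in the test, and TC is the fraction of reference columns reproduced exactly. It is an illustrative simplification, not the scoring code of any benchmark: it scores every reference column rather than only annotated core blocks, assumes `-` as the gap character, and uses hypothetical function names.

```python
from itertools import combinations


def columns(alignment):
    """Turn gapped rows into columns of per-sequence residue indices (None = gap)."""
    counters = [0] * len(alignment)
    cols = []
    for k in range(len(alignment[0])):
        col = []
        for s, row in enumerate(alignment):
            if row[k] == "-":
                col.append(None)
            else:
                col.append(counters[s])
                counters[s] += 1
        cols.append(tuple(col))
    return cols


def residue_pairs(cols):
    """Collect every pair of residues that share a column."""
    pairs = set()
    for col in cols:
        for (s, i), (t, j) in combinations(enumerate(col), 2):
            if i is not None and j is not None:
                pairs.add((s, i, t, j))
    return pairs


def sp_and_tc(reference, test):
    """Return (sum-of-pairs score, total column score) of test vs. reference."""
    ref_cols, test_cols = columns(reference), columns(test)
    ref_pairs = residue_pairs(ref_cols)  # must be non-empty
    sp = len(ref_pairs & residue_pairs(test_cols)) / len(ref_pairs)
    test_col_set = set(test_cols)
    tc = sum(1 for c in ref_cols if c in test_col_set) / len(ref_cols)
    return sp, tc


# The test alignment misplaces the final G of the first sequence.
print(sp_and_tc(["AC-G", "ACTG"], ["ACG-", "ACTG"]))
```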
Affiliation(s)
- Mufleh Al-Shatnawi
- Department of Electrical and Computer Engineering, Concordia University, 1455 De Maisonneuve Blvd. W., Montreal, H3G 1M8, Quebec, Canada.
- M Omair Ahmad
- Department of Electrical and Computer Engineering, Concordia University, 1455 De Maisonneuve Blvd. W., Montreal, H3G 1M8, Quebec, Canada.
- M N S Swamy
- Department of Electrical and Computer Engineering, Concordia University, 1455 De Maisonneuve Blvd. W., Montreal, H3G 1M8, Quebec, Canada.
13
Substrate-binding specificity of chitinase and chitosanase as revealed by active-site architecture analysis. Carbohydr Res 2015; 418:50-56. [PMID: 26545262 DOI: 10.1016/j.carres.2015.10.002] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 07/10/2015] [Revised: 10/03/2015] [Accepted: 10/06/2015] [Indexed: 11/21/2022]
Abstract
Chitinases and chitosanases, referred to as chitinolytic enzymes, are two important categories of glycoside hydrolases (GH) that play a key role in degrading chitin and chitosan, two naturally abundant polysaccharides. Here, we investigate the active site architecture of the major chitosanase (GH8, GH46) and chitinase families (GH18, GH19). Both charged (Glu, His, Arg, Asp) and aromatic amino acids (Tyr, Trp, Phe) are observed with higher frequency within chitinolytic active sites as compared to elsewhere in the enzyme structure, indicating significant roles related to enzyme function. Hydrogen bonds between chitinolytic enzymes and the substrate C2 functional groups, i.e. amino groups and N-acetyl groups, drive substrate recognition, while non-specific CH-π interactions between aromatic residues and substrate mainly contribute to tighter binding and enhanced processivity evident in GH8 and GH18 enzymes. For different families of chitinolytic enzymes, the number, type, and position of substrate atoms bound in the active site vary, resulting in different substrate-binding specificities. The data presented here explain the synergistic action of multiple enzyme families at a molecular level and provide a more reasonable method for functional annotation, which can be further applied toward the practical engineering of chitinases and chitosanases.
14
Al-Shatnawi M, Ahmad MO, Swamy MNS. Prediction of Indel flanking regions in protein sequences using a variable-order Markov model. Bioinformatics 2015; 31:40-7. [PMID: 25178462 DOI: 10.1093/bioinformatics/btu556] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION: Insertion/deletion (indel) and amino acid substitution are two common events that lead to the evolution of and variations in protein sequences. Further, many human diseases and functional divergences between homologous proteins are more related to indel mutations, even though they occur less often than substitution mutations do. A reliable identification of indels and their flanking regions is a major challenge in research related to protein evolution, structures and functions. RESULTS: In this article, we propose a novel scheme to predict indel flanking regions in a protein sequence for a given protein fold, based on a variable-order Markov model. The proposed indel flanking region (IndelFR) predictors are designed based on prediction by partial match (PPM) and probabilistic suffix tree (PST), which are referred to as the PPM IndelFR and PST IndelFR predictors, respectively. The overall performance evaluation results show that the proposed predictors are able to predict IndelFRs in the protein sequences with a high accuracy and F1 measure. In addition, the results show that if one is interested only in predicting IndelFRs in protein sequences, it would be preferable to use the proposed predictors instead of HMMER 3.0 in view of the substantially superior performance of the former.
Affiliation(s)
- Mufleh Al-Shatnawi
- Department of Electrical and Computer Engineering, Concordia University, QC H3G 2W1, Canada
- M Omair Ahmad
- Department of Electrical and Computer Engineering, Concordia University, QC H3G 2W1, Canada
- M N S Swamy
- Department of Electrical and Computer Engineering, Concordia University, QC H3G 2W1, Canada
15
Nelson ED, Grishin NV. Structural evolution of proteinlike heteropolymers. Phys Rev E Stat Nonlin Soft Matter Phys 2014; 90:062715. [PMID: 25615137 DOI: 10.1103/physreve.90.062715] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/24/2014] [Indexed: 06/04/2023]
Abstract
The biological function of a protein often depends on the formation of an ordered structure in order to support a smaller, chemically active configuration of amino acids against thermal fluctuations. Here we explore the development of proteins evolving to satisfy this requirement using an off-lattice polymer model in which monomers interact as low resolution amino acids. To evolve the model, we construct a Markov process in which sequences are subjected to random replacements, insertions, and deletions and are selected to recover a predefined minimum number of solid-ordered monomers using the Lindemann melting criterion. We show that polymers generated by this process consistently fold into soluble, ordered globules of similar length and complexity to small protein motifs. To compare the evolution of the globules with proteins, we analyze the statistics of amino acid replacements, the dependence of site mutation rates on solvent exposure, and the dependence of structural distance on sequence distance for homologous alignments. Despite the simplicity of the model, the results display a surprisingly close correspondence with protein data.
Affiliation(s)
- Erik D Nelson
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, 6001 Forest Park Boulevard, Room ND10.124, Dallas, Texas 75235-9050, USA
- Nick V Grishin
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, 6001 Forest Park Boulevard, Room ND10.124, Dallas, Texas 75235-9050, USA
16
Mutt E, Mathew OK, Sowdhamini R. LenVarDB: database of length-variant protein domains. Nucleic Acids Res 2013; 42:D246-50. [PMID: 24194591 PMCID: PMC3964994 DOI: 10.1093/nar/gkt1014] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/14/2022] Open
Abstract
Protein domains are functionally and structurally independent modules, which add to the functional variety of proteins. This array of functional diversity has been enabled by evolutionary changes, such as amino acid substitutions or insertions or deletions, occurring in these protein domains. Length variations (indels) can introduce changes at structural, functional and interaction levels. LenVarDB (freely available at http://caps.ncbs.res.in/lenvardb/) traces these length variations, starting from structure-based sequence alignments in our Protein Alignments organized as Structural Superfamilies (PASS2) database, across 731 structural classification of proteins (SCOP)-based protein domain superfamilies connected to 2,730,625 sequence homologues. Alignment of sequence homologues corresponding to a structural domain is available, starting from a structure-based sequence alignment of the superfamily. Orientation of the length-variant (indel) regions in protein domains can be visualized by mapping them on the structure and on the alignment. Knowledge about location of length variations within protein domains and their visual representation will be useful in predicting changes within structurally or functionally relevant sites, which may ultimately regulate protein function. Non-technical summary: Evolutionary changes bring about natural changes to proteins that may be found in many organisms. Such changes could be reflected as amino acid substitutions or insertions–deletions (indels) in protein sequences. LenVarDB is a database that provides an early overview of observed length variations that were set among 731 protein families and after examining >2 million sequences. Indels are followed up to observe if they are close to the active site such that they can affect the activity of proteins. Inclusion of such information can aid the design of bioengineering experiments.
Affiliation(s)
- Eshita Mutt
- International Institute of Information Technology-Hyderabad, Gachibowli, Hyderabad 500032, Andhra Pradesh, India, National Centre for Biological Sciences (TIFR), UAS-GKVK Campus, Bellary Road, Bangalore 560065, Karnataka, India and SASTRA University, Tirumalaisamudram, Thanjavur 613401, Tamil Nadu, India
17
Wang Y, Tan X, Paterson AH. Different patterns of gene structure divergence following gene duplication in Arabidopsis. BMC Genomics 2013; 14:652. [PMID: 24063813 PMCID: PMC3848917 DOI: 10.1186/1471-2164-14-652] [Citation(s) in RCA: 58] [Impact Index Per Article: 5.3] [Received: 05/02/2013] [Accepted: 09/20/2013] [Indexed: 01/10/2023] Open
Abstract
BACKGROUND: Divergence in gene structure following gene duplication is not well understood. Gene duplication can occur via whole-genome duplication (WGD) and single-gene duplications including tandem, proximal and transposed duplications. Different modes of gene duplication may be associated with different types, levels, and patterns of structural divergence. RESULTS: In Arabidopsis thaliana, we denote levels of structural divergence between duplicated genes by differences in coding-region lengths and average exon lengths, and the number of insertions/deletions (indels) and maximum indel length in their protein sequence alignment. Among recent duplicates of different modes, transposed duplicates diverge most dramatically in gene structure. In transposed duplications, parental loci tend to have longer coding regions and exons, and smaller numbers of indels and maximum indel lengths than transposed loci, reflecting biased structural changes in transposed duplications. Structural divergence increases with evolutionary time for WGDs, but not transposed duplications, possibly because of biased gene losses following transposed duplications. Structural divergence has heterogeneous relationships with nucleotide substitution rates, but is consistently positively correlated with gene expression divergence. The NBS-LRR gene family shows higher-than-average levels of structural divergence. CONCLUSIONS: Our study suggests that structural divergence between duplicated genes is greatly affected by the mechanisms of gene duplication and may not be proportional to evolutionary time, and that certain gene families are under selection for rapid evolution of gene structure.
Affiliation(s)
- Yupeng Wang
- Plant Genome Mapping Laboratory, University of Georgia, Athens, GA 30602, USA.
18
Residue mutations and their impact on protein structure and function: detecting beneficial and pathogenic changes. Biochem J 2013; 449:581-94. [DOI: 10.1042/bj20121221] [Citation(s) in RCA: 131] [Impact Index Per Article: 11.9] [Indexed: 12/29/2022]
Abstract
The present review focuses on the evolution of proteins and the impact of amino acid mutations on function from a structural perspective. Proteins evolve under the law of natural selection and undergo alternating periods of conservative evolution and of relatively rapid change. The likelihood of mutations being fixed in the genome depends on various factors, such as the fitness of the phenotype or the position of the residues in the three-dimensional structure. For example, co-evolution of residues located close together in three-dimensional space can occur to preserve global stability. Whereas point mutations can fine-tune the protein function, residue insertions and deletions (‘decorations’ at the structural level) can sometimes modify functional sites and protein interactions more dramatically. We discuss recent developments and tools to identify such episodic mutations, and examine their applications in medical research. Such tools have been tested on simulated data and applied to real data such as viruses or animal sequences. Traditionally, there has been little if any cross-talk between the fields of protein biophysics, protein structure–function and molecular evolution. However, the last several years have seen some exciting developments in combining these approaches to obtain an in-depth understanding of how proteins evolve. For example, a better understanding of how structural constraints affect protein evolution will greatly help us to optimize our models of sequence evolution. The present review explores this new synthesis of perspectives.
19
Challis CJ, Schmidler SC. A stochastic evolutionary model for protein structure alignment and phylogeny. Mol Biol Evol 2012; 29:3575-87. [PMID: 22723302 DOI: 10.1093/molbev/mss167] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Indexed: 11/14/2022] Open
Abstract
We present a stochastic process model for the joint evolution of protein primary and tertiary structure, suitable for use in alignment and estimation of phylogeny. Indels arise from a classic Links model, and mutations follow a standard substitution matrix, whereas backbone atoms diffuse in three-dimensional space according to an Ornstein-Uhlenbeck process. The model allows for simultaneous estimation of evolutionary distances, indel rates, structural drift rates, and alignments, while fully accounting for uncertainty. The inclusion of structural information enables phylogenetic inference on time scales not previously attainable with sequence evolution models. The model also provides a tool for testing evolutionary hypotheses and improving our understanding of protein structural evolution.
20
Guo B, Zou M, Wagner A. Pervasive indels and their evolutionary dynamics after the fish-specific genome duplication. Mol Biol Evol 2012; 29:3005-22. [PMID: 22490820 DOI: 10.1093/molbev/mss108] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.8] [Indexed: 01/18/2023] Open
Abstract
Insertions and deletions (indels) in protein-coding genes are important sources of genetic variation. Their role in creating new proteins may be especially important after gene duplication. However, little is known about how indels affect the divergence of duplicate genes. We here study thousands of duplicate genes in five fish (teleost) species with completely sequenced genomes. The ancestor of these species has been subject to a fish-specific genome duplication (FSGD) event that occurred approximately 350 Ma. We find that duplicate genes contain at least 25% more indels than single-copy genes. These indels accumulated preferentially in the first 40 my after the FSGD. A lack of widespread asymmetric indel accumulation indicates that both members of a duplicate gene pair typically experience relaxed selection. Strikingly, we observe a 30-80% excess of deletions over insertions that is consistent for indels of various lengths and across the five genomes. We also find that indels preferentially accumulate inside loop regions of protein secondary structure and in regions where amino acids are exposed to solvent. We show that duplicate genes with high indel density also show high DNA sequence divergence. Indel density, but not amino acid divergence, can explain a large proportion of the tertiary structure divergence between proteins encoded by duplicate genes. Our observations are consistent across all five fish species. Taken together, they suggest a general pattern of duplicate gene evolution in which indels are important driving forces of evolutionary change.
Affiliation(s)
- Baocheng Guo
- Institute of Evolutionary Biology and Environmental Studies, University of Zurich, Zurich, Switzerland
21
Zhang Z, Xing C, Wang L, Gong B, Liu H. IndelFR: a database of indels in protein structures and their flanking regions. Nucleic Acids Res 2011; 40:D512-8. [PMID: 22127860 PMCID: PMC3245007 DOI: 10.1093/nar/gkr1107] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Indexed: 11/15/2022] Open
Abstract
Insertion/deletion (indel) is one of the most common methods of protein sequence variation. Recent studies showed that indels could affect their flanking regions and they are important for protein function and evolution. Here, we describe the Indel Flanking Region Database (IndelFR, http://indel.bioinfo.sdu.edu.cn), which provides sequence and structure information about indels and their flanking regions in known protein domains. The indels were obtained through the pairwise alignment of homologous structures in SCOP superfamilies. The IndelFR database contains 2,925,017 indels with flanking regions extracted from 373,402 structural alignment pairs of 12,573 non-redundant domains from 1053 superfamilies. IndelFR provides access to information about indels and their flanking regions, including amino acid sequences, lengths, locations, secondary structure constitutions, hydrophilicity/hydrophobicity, domain information, 3D structures and so on. IndelFR has already been used for molecular evolution studies and may help to promote future functional studies of indels and their flanking regions.
Affiliation(s)
- Zheng Zhang
- State Key Laboratory of Microbial Technology, Shandong University, Jinan 250100, China