1. Mesdaghi S, Price RM, Madine J, Rigden DJ. Deep Learning-based structure modelling illuminates structure and function in uncharted regions of β-solenoid fold space. J Struct Biol 2023; 215:108010. [PMID: 37544372 DOI: 10.1016/j.jsb.2023.108010] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 04/26/2023] [Revised: 07/19/2023] [Accepted: 08/03/2023] [Indexed: 08/08/2023]
Abstract
Repeat proteins are common in all domains of life and exhibit a wide range of functions. One class of repeat protein contains solenoid folds where the repeating unit consists of β-strands separated by tight turns. β-solenoids have distinguishing structural features such as handedness, twist, oligomerisation state, coil shape and size which give rise to their diversity. Characterised β-solenoid repeat proteins are known to form regions in bacterial and viral virulence factors, antifreeze proteins and functional amyloids. For many of these proteins, the experimental structure has not been solved, as they are difficult to crystallise or model. Here we use various deep learning-based structure-modelling methods to discover novel predicted β-solenoids, perform structural database searches to mine further structural neighbours and relate their predicted structure to possible functions. We find both eukaryotic and prokaryotic adhesins, confirming a known functional linkage between adhesin function and the β-solenoid fold. We further identify exceptionally long, flat β-solenoid folds as possible structures of mucin tandem repeat regions and unprecedentedly small β-solenoid structures. Additionally, we characterise a novel β-solenoid coil shape, the FapC Greek key β-solenoid as well as plausible complexes between it and other proteins involved in Pseudomonas functional amyloid fibres.
Affiliation(s)
- Shahram Mesdaghi
  - The University of Liverpool, Institute of Systems, Molecular & Integrative Biology, Biosciences Building, Crown Street, Liverpool L69 7ZB, United Kingdom
  - Computational Biology Facility, MerseyBio, University of Liverpool, Crown Street, Liverpool L69 7ZB, United Kingdom
- Rebecca M Price
  - The University of Liverpool, Institute of Systems, Molecular & Integrative Biology, Biosciences Building, Crown Street, Liverpool L69 7ZB, United Kingdom
- Jillian Madine
  - The University of Liverpool, Institute of Systems, Molecular & Integrative Biology, Biosciences Building, Crown Street, Liverpool L69 7ZB, United Kingdom
- Daniel J Rigden
  - The University of Liverpool, Institute of Systems, Molecular & Integrative Biology, Biosciences Building, Crown Street, Liverpool L69 7ZB, United Kingdom

2. Manasra S, Kajava AV. Why does the first protein repeat often become the only one? J Struct Biol 2023; 215:108014. [PMID: 37567371 DOI: 10.1016/j.jsb.2023.108014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2023] [Revised: 08/06/2023] [Accepted: 08/09/2023] [Indexed: 08/13/2023]
Abstract
Proteins with two similar motifs in tandem are one of the most common cases of tandem repeat proteins. The question arises: why is the first emerged repeat frequently fixed in the process of evolution, despite the ample opportunities to continue its multiplication at the DNA level? To answer this question, we systematically analyzed the structure and function of these proteins. Our analysis showed that, in the vast majority of cases, the structural repetitive units have a two-fold (C2) internal symmetry. These closed structures provide an internal structural limitation for the subsequent growth of the repeat number. Frequently, the units "swap" their secondary structure elements with each other. Moreover, the duplicated domains, in contrast to other tandem repeat proteins, form binding sites for small molecules around the axis of C2 symmetry. Thus, the closure of the C2 structures and the emergence of new functional sites around the axis of C2 symmetry provide plausible explanations for why a repeat, once appeared, becomes fixed in the evolutionary process. We have placed these structures within the general structural classification of tandem repeat proteins, classifying them as either Class IV or V depending on the size of the repetitive unit.
Affiliation(s)
- Simona Manasra
  - Institute of Bioengineering, ITMO University, Kronverksky Pr. 49, 197101 Saint Petersburg, Russia
- Andrey V Kajava
  - Centre de Recherche en Biologie cellulaire de Montpellier (CRBM), UMR 5237 CNRS, Université Montpellier, 1919 Route de Mende, Cedex 5, 34293 Montpellier, France

3. Accurate contact-based modelling of repeat proteins predicts the structure of new repeats protein families. PLoS Comput Biol 2021; 17:e1008798. [PMID: 33857128 PMCID: PMC8078820 DOI: 10.1371/journal.pcbi.1008798] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 08/03/2020] [Revised: 04/27/2021] [Accepted: 02/15/2021] [Indexed: 12/18/2022]
Abstract
Repeat proteins are abundant in eukaryotic proteomes. They are involved in many eukaryote-specific functions, including signalling. For many of these proteins, the structure is not known, as they are difficult to crystallise. Today, using direct coupling analysis and deep learning it is often possible to predict a protein’s structure. However, the unique sequence features present in repeat proteins have made it challenging to use direct coupling analysis for predicting contacts. Here, we show that deep learning-based methods (trRosetta, DeepMetaPsicov (DMP) and PconsC4) overcome this problem and can predict intra- and inter-unit contacts in repeat proteins. In a benchmark dataset of 815 repeat proteins, about 90% can be correctly modelled. Further, among 48 PFAM families lacking a protein structure, we produce models of forty-one families with estimated high accuracy.

Repeat proteins are widespread among organisms and particularly abundant in eukaryotic proteomes. Their primary sequences contain repeated amino acid patterns that give rise to structures with repeated folds or domains. Although the repeated units can often be recognised from the sequence alone, structural information is often missing. Here, we used contact prediction to predict the structure of repeat proteins directly from their primary sequences. We benchmark the methods on a dataset comprising all known repeat structures. We evaluate the contact predictions and the obtained models for different classes of repeat proteins. Further, we develop and benchmark a quality assessment (QA) method specific for repeat proteins. Finally, we used the prediction pipeline for all PFAM repeat families without resolved structures and found that forty-one of them could be modelled with high accuracy.

4. Paladin L, Bevilacqua M, Errigo S, Piovesan D, Mičetić I, Necci M, Monzon AM, Fabre ML, Lopez JL, Nilsson JF, Rios J, Menna PL, Cabrera M, Buitron MG, Kulik MG, Fernandez-Alberti S, Fornasari MS, Parisi G, Lagares A, Hirsh L, Andrade-Navarro MA, Kajava AV, Tosatto SCE. RepeatsDB in 2021: improved data and extended classification for protein tandem repeat structures. Nucleic Acids Res 2021; 49:D452-D457. [PMID: 33237313 PMCID: PMC7778985 DOI: 10.1093/nar/gkaa1097] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 09/15/2020] [Revised: 10/17/2020] [Accepted: 11/19/2020] [Indexed: 11/21/2022]
Abstract
The RepeatsDB database (URL: https://repeatsdb.org/) provides annotations and classification for protein tandem repeat structures from the Protein Data Bank (PDB). Protein tandem repeats are ubiquitous in all branches of the tree of life. The accumulation of solved repeat structures provides new possibilities for classification and detection, but also increases the need for annotation. Here we present RepeatsDB 3.0, which addresses these challenges and presents an extended classification scheme. The major conceptual change compared to the previous version is the hierarchical classification combining top levels based solely on structural similarity (Class > Topology > Fold) with two new levels (Clan > Family) requiring sequence similarity and describing repeat motifs in collaboration with Pfam. Data growth has been addressed with improved mechanisms for browsing the classification hierarchy. A new UniProt-centric view unifies the increasingly frequent annotation of structures from identical or similar sequences. This update of RepeatsDB aligns with our commitment to develop a resource that extracts, organizes and distributes specialized information on tandem repeat protein structures.
Affiliation(s)
- Lisanna Paladin
  - Dept. of Biomedical Sciences, University of Padua, Via Ugo Bassi 58/B, Padua 35121, Italy
- Martina Bevilacqua
  - Dept. of Biomedical Sciences, University of Padua, Via Ugo Bassi 58/B, Padua 35121, Italy
- Sara Errigo
  - Dept. of Biomedical Sciences, University of Padua, Via Ugo Bassi 58/B, Padua 35121, Italy
- Damiano Piovesan
  - Dept. of Biomedical Sciences, University of Padua, Via Ugo Bassi 58/B, Padua 35121, Italy
- Ivan Mičetić
  - Dept. of Biomedical Sciences, University of Padua, Via Ugo Bassi 58/B, Padua 35121, Italy
- Marco Necci
  - Dept. of Biomedical Sciences, University of Padua, Via Ugo Bassi 58/B, Padua 35121, Italy
- Maria Laura Fabre
  - IBBM-CONICET, Dept. of Biological Sciences, La Plata National University, 49 y 115, 1900 La Plata, Argentina
- Jose Luis Lopez
  - IBBM-CONICET, Dept. of Biological Sciences, La Plata National University, 49 y 115, 1900 La Plata, Argentina
- Juliet F Nilsson
  - IBBM-CONICET, Dept. of Biological Sciences, La Plata National University, 49 y 115, 1900 La Plata, Argentina
- Javier Rios
  - Dept. of Science and Technology, National University of Quilmes, Roque Sáenz Peña 352, Bernal, Buenos Aires, Argentina
- Pablo Lorenzano Menna
  - Dept. of Science and Technology, National University of Quilmes, Roque Sáenz Peña 352, Bernal, Buenos Aires, Argentina
- Maia Cabrera
  - Dept. of Science and Technology, National University of Quilmes, Roque Sáenz Peña 352, Bernal, Buenos Aires, Argentina
- Martin Gonzalez Buitron
  - Dept. of Science and Technology, National University of Quilmes, Roque Sáenz Peña 352, Bernal, Buenos Aires, Argentina
- Mariane Gonçalves Kulik
  - Institute of Organismic and Molecular Evolution, Faculty of Biology, Johannes Gutenberg University of Mainz, Hans-Dieter-Hüsch-Weg 15, 55128 Mainz, Germany
- Sebastian Fernandez-Alberti
  - Dept. of Science and Technology, National University of Quilmes, Roque Sáenz Peña 352, Bernal, Buenos Aires, Argentina
- Maria Silvina Fornasari
  - Dept. of Science and Technology, National University of Quilmes, Roque Sáenz Peña 352, Bernal, Buenos Aires, Argentina
- Gustavo Parisi
  - Dept. of Science and Technology, National University of Quilmes, Roque Sáenz Peña 352, Bernal, Buenos Aires, Argentina
- Antonio Lagares
  - IBBM-CONICET, Dept. of Biological Sciences, La Plata National University, 49 y 115, 1900 La Plata, Argentina
- Layla Hirsh
  - Dept. of Engineering, Faculty of Science and Engineering, Pontifical Catholic University of Peru, Av. Universitaria 1801 San Miguel, Lima 32, Lima, Peru
- Miguel A Andrade-Navarro
  - Institute of Organismic and Molecular Evolution, Faculty of Biology, Johannes Gutenberg University of Mainz, Hans-Dieter-Hüsch-Weg 15, 55128 Mainz, Germany
- Andrey V Kajava
  - Centre de Recherche en Biologie cellulaire de Montpellier, UMR 5237, CNRS, Univ. Montpellier, Montpellier, France
- Silvio C E Tosatto
  - Dept. of Biomedical Sciences, University of Padua, Via Ugo Bassi 58/B, Padua 35121, Italy

5. Tørresen OK, Star B, Mier P, Andrade-Navarro MA, Bateman A, Jarnot P, Gruca A, Grynberg M, Kajava AV, Promponas VJ, Anisimova M, Jakobsen KS, Linke D. Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases. Nucleic Acids Res 2019; 47:10994-11006. [PMID: 31584084 PMCID: PMC6868369 DOI: 10.1093/nar/gkz841] [Citation(s) in RCA: 159] [Impact Index Per Article: 31.8] [Received: 06/07/2019] [Revised: 09/03/2019] [Accepted: 10/01/2019] [Indexed: 12/13/2022]
Abstract
The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with 'ready-to-use' deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others.
Affiliation(s)
- Ole K Tørresen
  - Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, NO-0316 Oslo, Norway
- Bastiaan Star
  - Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, NO-0316 Oslo, Norway
- Pablo Mier
  - Faculty of Biology, Johannes Gutenberg University Mainz, Hans-Dieter-Husch-Weg 15, 55128 Mainz, Germany
- Miguel A Andrade-Navarro
  - Faculty of Biology, Johannes Gutenberg University Mainz, Hans-Dieter-Husch-Weg 15, 55128 Mainz, Germany
- Alex Bateman
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SD, UK
- Patryk Jarnot
  - Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
- Aleksandra Gruca
  - Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
- Marcin Grynberg
  - Institute of Biochemistry and Biophysics PAS, Pawińskiego 5A, 02-106 Warsaw, Poland
- Andrey V Kajava
  - Centre de Recherche en Biologie cellulaire de Montpellier, UMR 5237 CNRS, Universite Montpellier 1919 Route de Mende, CEDEX 5, 34293 Montpellier, France
  - Institut de Biologie Computationnelle, 34095 Montpellier, France
- Vasilis J Promponas
  - Bioinformatics Research Laboratory, Department of Biological Sciences, University of Cyprus, PO Box 20537, CY 1678 Nicosia, Cyprus
- Maria Anisimova
  - Institute of Applied Simulations, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), Wädenswil, Switzerland
  - Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Kjetill S Jakobsen
  - Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, NO-0316 Oslo, Norway
- Dirk Linke
  - Section for Genetics and Evolutionary Biology, Department of Biosciences, University of Oslo, NO-0316 Oslo, Norway

6. Purcell O, Cao J, Müller IE, Chen YC, Lu TK. Artificial Repeat-Structured siRNA Precursors as Tunable Regulators for Saccharomyces cerevisiae. ACS Synth Biol 2018; 7:2403-2412. [PMID: 30176724 DOI: 10.1021/acssynbio.8b00185] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 01/19/2023]
Abstract
RNA interference (RNAi) is widely used as a research tool for studying biological systems and implementing artificial genetic circuits that function by modulating RNA concentrations. Here we engineered Saccharomyces cerevisiae containing a heterologous Saccharomyces castellii RNAi system as a test-bed for RNAi-based circuits. Unlike prior approaches, we describe a strategy that leverages repeat-structured siRNA precursors with incrementally sized stems formed from 23 bp-repeats to achieve modular RNAi-based gene regulation. These enable repression strength to be tuned in a systematic manner by changing the size of the siRNA precursor hairpin stem, without modifying the number or sequence of target sites in the target RNA. We demonstrate that this hairpin-based regulation is able to target both cytoplasmic and nuclear localized RNAs and is stable over extended growth periods. This platform enables the targeting of cellular RNAs as a tunable regulatory layer for sophisticated gene circuits in Saccharomyces cerevisiae.
Affiliation(s)
- Oliver Purcell
  - Synthetic Biology Center, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- Jicong Cao
  - Synthetic Biology Center, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- Isaak E. Müller
  - Synthetic Biology Center, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
  - Microbiology Program, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- Ying-Chou Chen
  - Synthetic Biology Center, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- Timothy K. Lu
  - Synthetic Biology Center, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
  - Department of Biological Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
  - Research Laboratory of Electronics, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States

7. Jelovic AM, Mitic NS, Eshafah S, Beljanski MV. Finding Statistically Significant Repeats in Nucleic Acids and Proteins. J Comput Biol 2017; 25:375-387. [PMID: 29272145 DOI: 10.1089/cmb.2017.0046] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/13/2022]
Abstract
DNA repeats have great importance for biological research and a large number of tools for determining repeats have been developed. Herein we define a method for extracting a statistically significant subset of a determined set of repeats. Our aim was to identify the subset of repeats in the input sequences that would not be expected to occur as many times in a random sequence of the same length. It is expected that results obtained in this manner would reduce the quantity of processed material and could thereby represent a more important biological signal. With DNA, RNA, and protein sequences serving as input material, we also examined the possibility of statistical filtering of repeats in sequences over an arbitrary alphabet. A new method for selecting statistically significant repeats from a set of determined repeats has been defined. The proposed method was tested on a large number of randomly generated sequences. The application of the method on biological sequences revealed that for some viruses, shorter repeats are more statistically significant than longer ones because of their frequent appearance, whereas for bacteria, the majority of identified repeats are statistically significant.
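The idea of comparing a repeat's observed count with what a random sequence of the same length would produce can be illustrated with a minimal sketch. It is not the method of the cited paper: it assumes a uniform, independent random sequence and a Poisson approximation for the count (which ignores word self-overlap), and the function names `expected_count` and `poisson_tail` are hypothetical.

```python
from math import exp

def expected_count(word_len: int, seq_len: int, alphabet_size: int) -> float:
    """Expected occurrences of one specific word in a uniform random sequence.

    Assumption: every symbol is drawn independently with equal probability.
    """
    positions = seq_len - word_len + 1  # number of possible start positions
    if positions <= 0:
        return 0.0
    return positions / alphabet_size ** word_len

def poisson_tail(observed: int, mean: float) -> float:
    """P(X >= observed) for X ~ Poisson(mean); a rough significance estimate."""
    term = exp(-mean)  # P(X = 0)
    cdf = 0.0
    for i in range(observed):
        cdf += term             # accumulate P(X = i)
        term *= mean / (i + 1)  # advance to P(X = i + 1)
    return max(0.0, 1.0 - cdf)

# Example: a 10-nucleotide word seen 5 times in a 100 kb DNA sequence.
mean = expected_count(10, 100_000, 4)
p_value = poisson_tail(5, mean)
```

A small tail probability suggests the word occurs more often than chance alone would explain under these simplifying assumptions; a real analysis would also need a background composition model and a multiple-testing correction.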
Affiliation(s)
- Ana M Jelovic
  - Faculty of Transport and Traffic Engineering, University of Belgrade, Belgrade, Serbia
  - Faculty of Mathematics, University of Belgrade, Belgrade, Serbia
- Nenad S Mitic
  - Faculty of Mathematics, University of Belgrade, Belgrade, Serbia
- Samira Eshafah
  - Faculty of Mathematics, University of Belgrade, Belgrade, Serbia
- Milos V Beljanski
  - Institute of General and Physical Chemistry, Bio-Lab, Belgrade, Serbia

8. Islam Z, Nagampalli RSK, Fatima MT, Ashraf GM. New paradigm in ankyrin repeats: Beyond protein-protein interaction module. Int J Biol Macromol 2017; 109:1164-1173. [PMID: 29157912 DOI: 10.1016/j.ijbiomac.2017.11.101] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Received: 10/08/2017] [Revised: 11/13/2017] [Accepted: 11/16/2017] [Indexed: 01/06/2023]
Abstract
Classically, ankyrin repeat (ANK) proteins are built from tandems of two or more repeats and form curved solenoid structures that are associated with protein-protein interactions. The ANK repeat is a short, widespread structural motif of around 33 amino acids that is repeated in tandem, has a canonical helix-loop-helix fold, and is found individually or in combination with other domains. The multiplicity of this structural pattern enables it to form assemblies of diverse sizes, as required for the multiple binding and structural roles of these proteins. Three-dimensional structures of these repeats determined to date reveal a degree of structural variability that translates into the considerable functional versatility of this protein superfamily. Recent work on the ANK has yielded novel structural information, especially on protein-lipid, protein-sugar and protein-protein interactions. Self-assembly of these repeats was also shown to prevent the associated protein from forming filaments. In this review, we summarize the latest findings and how the new structural information has increased our understanding of the structural determinants of ANK proteins. We discuss the latest findings on how these proteins participate in various interactions to diversify the ANK roles in numerous biological processes, and explore the emerging and evolving field of designer ankyrins and its framework for protein engineering, emphasizing biotechnological applications.
Affiliation(s)
- Zeyaul Islam
  - Laboratório Nacional de Biociências, Centro Nacional de Pesquisa em Energia e Materiais, Campinas, SP, 13083-100, Brazil
- Munazza Tamkeen Fatima
  - Department of Biochemistry and Tissue Biology, Institute of Biology, State University of Campinas (UNICAMP), Campinas, SP, 13083-862, Brazil
- Ghulam Md Ashraf
  - King Fahd Medical Research Center, King Abdulaziz University, P.O. Box 80216, Jeddah, 21589, Saudi Arabia

9. Kharrat N, Belmabrouk S, Abdelhedi R, Benmarzoug R, Assidi M, Al Qahtani MH, Rebai A. Screening for clusters of charge in human virus proteomes. BMC Genomics 2016; 17:758. [PMID: 27766959 PMCID: PMC5073957 DOI: 10.1186/s12864-016-3086-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/30/2022]
Abstract
Background: The identification of charge clusters (runs of charged residues) in proteins and their mapping within the protein structure sequence is an important step toward a comprehensive analysis of how these particular motifs mediate, via electrostatic interactions, various molecular processes such as protein sorting, translocation, docking, orientation and binding to DNA and to other proteins. Few algorithms that specifically identify these charge clusters have been designed and described in the literature. In this study, 197 distinctive human viral proteomes were screened for the occurrence of charge clusters (CC) using a new computational approach.

Results: Three hundred and seventy three CC have been identified within the 2549 viral protein sequences screened. The number of protein sequences that are CC-free is 2176 (85.3 %) while 150 and 180 proteins contained positive charge (PCC) and negative charge clusters (NCC), respectively. The NCCs (211 detected) were more prevalent than PCC (162). PCC-containing proteins are significantly longer than those having NCCs (p = 2 × 10⁻¹⁶). The most prevalent virus families having PCC and NCC were Herpesviridae followed by Papillomaviridae. However, the single-strand RNA group has on average three times more NCC than PCC. According to the functional domain classification, a significant difference in distribution was observed between PCC and NCC (p = 2 × 10⁻⁸), with NCCs occurring more frequently in the C-terminal region while PCC more often fall within functional domains. Only 29 protein sequences contained both NCC and PCC. Moreover, 101 NCC were conserved in 84 proteins while only 62 PCC were conserved in 60 protein sequences. To understand the mechanism by which the membrane translocation functionalities are embedded in viral proteins, we screened our PCC for sequences corresponding to cell-penetrating peptides (CPPs) using two online databases: CellPPd and CPPpred. We found that all our PCCs, with lengths varying from 7 to 30 amino acids, were predicted as CPPs. Experimental validation is required to improve our understanding of the role of these PCCs in the viral infection process.

Conclusions: Screening distinctive cluster charges in viral proteomes suggested a functional role of these protein regions and might provide potential clues to improve the current understanding of viral diseases in order to tailor better preventive and therapeutic approaches.

Electronic supplementary material: The online version of this article (doi:10.1186/s12864-016-3086-3) contains supplementary material, which is available to authorized users.
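To make the notion of a charge cluster concrete, the following minimal sketch finds uninterrupted runs of same-sign charged residues in a protein sequence. It is only an illustration and not the computational approach of the cited study, which is not described in the abstract. The residue sets (K/R as positive, D/E as negative, histidine excluded), the default minimum run length of 4, and the function name `charge_runs` are all assumptions.

```python
import re

POSITIVE = "KR"  # assumption: histidine is not counted as positively charged
NEGATIVE = "DE"

def charge_runs(seq: str, min_len: int = 4):
    """Return (start, end, kind, run) for each uninterrupted same-sign charged run.

    start/end are 0-based, end-exclusive indices into seq; min_len is a
    hypothetical threshold, not a value taken from the cited study.
    """
    pattern = re.compile(
        rf"[{POSITIVE}]{{{min_len},}}|[{NEGATIVE}]{{{min_len},}}"
    )
    runs = []
    for match in pattern.finditer(seq.upper()):
        kind = "PCC" if match.group()[0] in POSITIVE else "NCC"
        runs.append((match.start(), match.end(), kind, match.group()))
    return runs

clusters = charge_runs("MAKKRRKLDEEDEDGS")
```

Published charge-cluster methods generally tolerate some uncharged residues within a window and assess significance statistically, so strict runs like these should be read as a simplified lower bound.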
Affiliation(s)
- Najla Kharrat
  - Centre of Biotechnology of Sfax, Laboratory of Molecular and Cellular Screening Processes, Bioinformatics Group, PO. Box:1177, 3018, Sfax, Tunisia
- Sabrine Belmabrouk
  - Centre of Biotechnology of Sfax, Laboratory of Molecular and Cellular Screening Processes, Bioinformatics Group, PO. Box:1177, 3018, Sfax, Tunisia
- Rania Abdelhedi
  - Centre of Biotechnology of Sfax, Laboratory of Molecular and Cellular Screening Processes, Bioinformatics Group, PO. Box:1177, 3018, Sfax, Tunisia
- Riadh Benmarzoug
  - Centre of Biotechnology of Sfax, Laboratory of Molecular and Cellular Screening Processes, Bioinformatics Group, PO. Box:1177, 3018, Sfax, Tunisia
- Mourad Assidi
  - Center of Excellence in Genomic Medicine Research, King Abdulaziz University, Jeddah, Saudi Arabia
  - Center of Innovation in Personalized Medicine, King Abdulaziz University, Jeddah, Saudi Arabia
- Mohammed H Al Qahtani
  - Center of Excellence in Genomic Medicine Research, King Abdulaziz University, Jeddah, Saudi Arabia
- Ahmed Rebai
  - Centre of Biotechnology of Sfax, Laboratory of Molecular and Cellular Screening Processes, Bioinformatics Group, PO. Box:1177, 3018, Sfax, Tunisia

10. In search of the boundary between repetitive and non-repetitive protein sequences. Biochem Soc Trans 2016; 43:807-11. [PMID: 26517886 DOI: 10.1042/bst20150073] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 11/17/2022]
Abstract
Tandem repeats (TRs) are frequently not perfect, containing a number of mutations accumulated during evolution. One of the main problems is to distinguish between the sequences that contain highly imperfect TRs and the aperiodic sequences. The majority of proteins with TRs in sequences have repetitive arrangements in their 3D structures. Therefore, the 3D structures of proteins can be used as a benchmarking criterion for TR detection in sequences. Different TR detection tools use their own scoring procedures to determine the boundary between repetitive and non-repetitive protein sequences. Here we describe these scoring functions and benchmark them by using known structural TRs. Our survey shows that none of the existing scoring procedures are able to achieve an appropriate separation between genuine structural TRs and non-TR regions. This suggests that if we want to obtain a collection of structurally and functionally meaningful TRs from a large scale analysis of proteomes, the TR scoring metrics need to be improved.

11. Richard FD, Alves R, Kajava AV. Tally: a scoring tool for boundary determination between repetitive and non-repetitive protein sequences. Bioinformatics 2016; 32:1952-8. [PMID: 27153701 DOI: 10.1093/bioinformatics/btw118] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/26/2015] [Accepted: 02/25/2016] [Indexed: 12/23/2022]
Abstract
MOTIVATION: Tandem Repeats (TRs) are abundant in proteins, having a variety of fundamental functions. In many cases, evolution has blurred their repetitive patterns. This leads to the problem of distinguishing between sequences that contain highly imperfect TRs, and the sequences without TRs. The 3D structure of proteins can be used as a benchmarking criterion for TR detection in sequences, because the vast majority of proteins having TRs in sequences are built of repetitive 3D structural blocks. According to our benchmark, none of the existing scoring methods are able to clearly distinguish, based on the sequence analysis, between structures with and without 3D TRs.

RESULTS: We developed a scoring tool called Tally, which is based on a machine learning approach. Tally is able to achieve a better separation between sequences with structural TRs and sequences of aperiodic structures, than existing scoring procedures. It performs at a level of 81% sensitivity, while achieving a high specificity of 74% and an Area Under the Receiver Operating Characteristic Curve of 86%. Tally can be used to select a set of structurally and functionally meaningful TRs from all TRs detected in proteomes. The generated dataset is available for benchmarking purposes.

AVAILABILITY AND IMPLEMENTATION: Source code is available upon request. Tool and dataset can be accessed through our website: http://bioinfo.montp.cnrs.fr/?r=Tally

CONTACT: andrey.kajava@crbm.cnrs.fr

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
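The three performance figures quoted above (sensitivity, specificity and area under the ROC curve) can be computed for any binary classifier as in the sketch below. It does not reproduce Tally or its benchmark; the toy labels and scores are invented for illustration, the 0.5 decision threshold is an assumption, and the function names are hypothetical.

```python
def sensitivity_specificity(labels, predictions):
    """Return (sensitivity, specificity) from true labels and binary predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)          # true positives
    fn = sum(1 for y, p in pairs if y and not p)      # false negatives
    tn = sum(1 for y, p in pairs if not y and not p)  # true negatives
    fp = sum(1 for y, p in pairs if not y and p)      # false positives
    return tp / (tp + fn), tn / (tn + fp)

def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: 1 = sequence with a structural TR, 0 = aperiodic sequence.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.1]
predictions = [s >= 0.5 for s in scores]  # assumed decision threshold

sens, spec = sensitivity_specificity(labels, predictions)
auc = roc_auc(labels, scores)
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC summarises ranking quality over all thresholds, which is why a tool can report all three together.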
Affiliation(s)
- François D Richard
  - Centre de Recherche en Biologie cellulaire de Montpellier (CRBM), UMR 5237 CNRS, Université Montpellier 1919 Route de Mende, Cedex 5, Montpellier 34293, France
  - Institut de Biologie Computationnelle (IBC), Montpellier 34095, France
- Ronnie Alves
  - Institut de Biologie Computationnelle (IBC), Montpellier 34095, France
  - Pós-Graduação em Ciência da Computação (PPGCC), Universidade Federal do Pará, Belém, Brazil
- Andrey V Kajava
  - Centre de Recherche en Biologie cellulaire de Montpellier (CRBM), UMR 5237 CNRS, Université Montpellier 1919 Route de Mende, Cedex 5, Montpellier 34293, France
  - Institut de Biologie Computationnelle (IBC), Montpellier 34095, France
  - University ITMO, Institute of Bioengineering, St. Petersburg 197101, Russia
Collapse
|
12
|
Do Viet P, Roche DB, Kajava AV. TAPO: A combined method for the identification of tandem repeats in protein structures. FEBS Lett 2015; 589:2611-9. [PMID: 26320412 DOI: 10.1016/j.febslet.2015.08.025] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 06/29/2015] [Revised: 08/10/2015] [Accepted: 08/13/2015] [Indexed: 10/23/2022]
Abstract
In recent years, there has been an emergence of new 3D structures of proteins containing tandem repeats (TRs), as a result of improved expression and crystallization strategies. Databases focused on structure classifications (PDB, SCOP, CATH) do not provide an easy solution for selection of these structures from PDB. Several approaches have been developed, but no best approach exists to identify the whole range of 3D TRs. Here we describe the TAndem PrOtein detector (TAPO) that uses periodicities of atomic coordinates and other types of structural representation, including strings generated by conformational alphabets, residue contact maps, and arrangements of vectors of secondary structure elements. The benchmarking shows the superior performance of TAPO over the existing programs. In accordance with our analysis of PDB using TAPO, 19% of proteins contain 3D TRs. This analysis allowed us to identify new families of 3D TRs, suggesting that TAPO can be used to regularly update the collection and classification of existing repetitive structures.
Affiliation(s)
- Phuong Do Viet
- Centre de Recherche de Biochimie Macromoléculaire, UMR 5237 CNRS, Université Montpellier, 1919, Route de Mende, 34293 Montpellier Cedex 5, France; Institut de Biologie Computationnelle, Université Montpellier, Bat. 5, 860, rue St Priest, 34095 Montpellier Cedex 5, France
- Daniel B Roche
- Centre de Recherche de Biochimie Macromoléculaire, UMR 5237 CNRS, Université Montpellier, 1919, Route de Mende, 34293 Montpellier Cedex 5, France; Institut de Biologie Computationnelle, Université Montpellier, Bat. 5, 860, rue St Priest, 34095 Montpellier Cedex 5, France
- Andrey V Kajava
- Centre de Recherche de Biochimie Macromoléculaire, UMR 5237 CNRS, Université Montpellier, 1919, Route de Mende, 34293 Montpellier Cedex 5, France; Institut de Biologie Computationnelle, Université Montpellier, Bat. 5, 860, rue St Priest, 34095 Montpellier Cedex 5, France
|
13
|
Jernigan KK, Bordenstein SR. Tandem-repeat protein domains across the tree of life. PeerJ 2015; 3:e732. [PMID: 25653910 PMCID: PMC4304861 DOI: 10.7717/peerj.732] [Citation(s) in RCA: 46] [Impact Index Per Article: 5.1] [Received: 11/02/2014] [Accepted: 12/29/2014] [Indexed: 12/19/2022] Open
Abstract
Tandem-repeat protein domains, composed of repeated units of conserved stretches of 20–40 amino acids, are required for a wide array of biological functions. Despite their diverse and fundamental functions, there has been no comprehensive assessment of their taxonomic distribution, incidence, and associations with organismal lifestyle and phylogeny. In this study, we assess for the first time the abundance of armadillo (ARM) and tetratricopeptide (TPR) repeat domains across all three domains in the tree of life and compare the results to our previous analysis on ankyrin (ANK) repeat domains in this journal. All eukaryotes and a majority of the bacterial and archaeal genomes analyzed have a minimum of one TPR and ARM repeat. In eukaryotes, the fraction of ARM-containing proteins is approximately double that of TPR and ANK-containing proteins, whereas bacteria and archaea are enriched in TPR-containing proteins relative to ARM- and ANK-containing proteins. We show in bacteria that phylogenetic history, rather than lifestyle or pathogenicity, is a predictor of TPR repeat domain abundance, while neither phylogenetic history nor lifestyle predicts ARM repeat domain abundance. Surprisingly, pathogenic bacteria were not enriched in TPR-containing proteins, which have been associated with virulence factors in certain species. Taken together, this comparative analysis provides a newly appreciated view of the prevalence and diversity of multiple types of tandem-repeat protein domains across the tree of life. A central finding of this analysis is that tandem repeat domain-containing proteins are prevalent not just in eukaryotes, but also in bacterial and archaeal species.
Affiliation(s)
- Kristin K Jernigan
- Department of Cell and Developmental Biology, Vanderbilt University, Nashville, TN, USA
- Seth R Bordenstein
- Department of Biological Sciences, Vanderbilt University, Nashville, TN, USA; Department of Pathology, Microbiology, and Immunology, Vanderbilt University, Nashville, TN, USA
|
14
|
Richard FD, Kajava AV. TRDistiller: A rapid filter for enrichment of sequence datasets with proteins containing tandem repeats. J Struct Biol 2014; 186:386-91. [DOI: 10.1016/j.jsb.2014.03.013] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 12/10/2013] [Revised: 03/14/2014] [Accepted: 03/17/2014] [Indexed: 10/25/2022]
|
15
|
Detection, characterization and evolution of internal repeats in Chitinases of known 3-D structure. PLoS One 2014; 9:e91915. [PMID: 24637574 PMCID: PMC3956812 DOI: 10.1371/journal.pone.0091915] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 08/11/2013] [Accepted: 02/17/2014] [Indexed: 11/24/2022] Open
Abstract
Chitinase proteins have evolved and diversified almost in all organisms ranging from prokaryotes to eukaryotes. During evolution, internal repeats may appear in amino acid sequences of proteins which alter the structural and functional features. Here we deciphered the internal repeats from Chitinase and characterized the structural similarities between them. Out of 24 diverse Chitinase sequences selected, six sequences (2CJL, 2DSK, 2XVP, 2Z37, 3EBV and 3HBE) did not contain any internal repeats of amino acid sequences. Ten sequences contained repeats of length <50, and the remaining 8 sequences contained repeat length between 50 and 100 residues. Two Chitinase sequences, 1ITX and 3SIM, were found to be structurally similar when analyzed using secondary structure of Chitinase from secondary and 3-Dimensional structure database of Protein Data Bank. Internal repeats of 3N17 and 1O6I were also involved in the ligand-binding site of those Chitinase proteins, respectively. Our analyses enhance our understanding towards the identification of structural characteristics of internal repeats in Chitinase proteins.
|
16
|
María Velasco A, Becerra A, Hernández-Morales R, Delaye L, Jiménez-Corona ME, Ponce-de-Leon S, Lazcano A. Low complexity regions (LCRs) contribute to the hypervariability of the HIV-1 gp120 protein. J Theor Biol 2013; 338:80-6. [PMID: 24021867 DOI: 10.1016/j.jtbi.2013.08.039] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 01/10/2013] [Revised: 08/01/2013] [Accepted: 08/31/2013] [Indexed: 01/27/2023]
Abstract
Low complexity regions (LCRs) are sequences of nucleic acids or proteins defined by a compositional bias. Their occurrence has been confirmed in sequences of the three cellular lineages (Bacteria, Archaea and Eucarya), and has also been reported in viral genomes. We present here the results of a detailed computer analysis of the LCRs present in the HIV-1 glycoprotein 120 (gp120) encoded by the viral gene env. The analysis was performed using a sample of 3637 Env polyprotein sequences derived from 4117 completely sequenced and translated HIV-1 genomes available in public databases as of December 2012. We have identified 1229 LCRs located in four different regions of the gp120 protein that correspond to four of the five regions that have been identified as hypervariable (V1, V2, V4 and V5). The remaining 29 LCRs are found in the signal peptide and in the conserved regions C2, C3, C4 and C5. No LCR has been identified in the hypervariable region V3. The LCRs detected in the V1, V2, V4, and V5 hypervariable regions exhibit a high Asn content in their amino acid composition, which very likely correspond to glycosylation sites, which may contribute to the retroviral ability to avoid the immune system. In sharp contrast with what is observed in gp120 proteins lacking LCRs, the glycosylation sites present in LCRs tend to be clustered towards the center of the region forming well-defined islands. The results presented here suggest that LCRs represent a hitherto undescribed source of genomic variability in lentivirus, and that these repeats may represent an important source of antigenic variation in HIV-1 populations. The results reported here may exemplify the evolutionary processes that may have increased the size of primitive cellular RNA genomes and the role of LCRs as a source of raw material during the processes of evolutionary acquisition of new functions.
Affiliation(s)
- Ana María Velasco
- Facultad de Ciencias, UNAM, Ciudad Universitaria, Apdo. Postal 70-407, México D. F. 04510, Mexico; Laboratorios de Biológicos y Reactivos de México, Amores 1240, Colonia Del Valle, México D. F. 03100, Mexico
|
17
|
Kajava AV. Tandem repeats in proteins: from sequence to structure. J Struct Biol 2011; 179:279-88. [PMID: 21884799 DOI: 10.1016/j.jsb.2011.08.009] [Citation(s) in RCA: 159] [Impact Index Per Article: 12.2] [Received: 06/20/2011] [Revised: 08/15/2011] [Accepted: 08/17/2011] [Indexed: 10/17/2022]
Abstract
The bioinformatics analysis of proteins containing tandem repeats requires special computer programs and databases, since the conventional approaches predominantly developed for globular domains have limited success. Here, I survey bioinformatics tools which have been developed recently for identification and proteome-wide analysis of protein repeats. The last few years have also been marked by an emergence of new 3D structures of these proteins. Appraisal of the known structures and their classification uncovers a straightforward relationship between their architecture and the length of the repetitive units. This relationship and the repetitive character of structural folds suggest rules for better prediction of the 3D structures of such proteins. Furthermore, bioinformatics approaches combined with low resolution structural data, from biophysical techniques, especially, the recently emerged cryo-electron microscopy, lead to reliable prediction of the protein repeat structures and their mode of binding with partners within molecular complexes. This hybrid approach can actively be used for structural and functional annotations of proteomes.
Affiliation(s)
- Andrey V Kajava
- Centre de Recherches de Biochimie Macromoléculaire, CNRS, Université Montpellier 1 et 2, 1919 Route de Mende, 34293 Montpellier, Cedex 5, France.
|
18
|
Babu V, Uthayakumar M, Kirti Vaishnavi M, Senthilkumar R, Shankar M, Archana C, Sathya Priya S, Sekar K. RPS: Repeats in Protein Sequences. J Appl Crystallogr 2011. [DOI: 10.1107/s0021889811009393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022] Open
Abstract
Repeats are two or more contiguous segments of amino acid residues that are believed to have arisen as a result of intragenic duplication, recombination and mutation events. These repeats can be utilized for protein structure prediction and can provide insights into the protein evolution and phylogenetic relationship. Therefore, to aid structural biologists and phylogeneticists in their research, a computing resource (a web server and a database), Repeats in Protein Sequences (RPS), has been created. Using RPS, users can obtain useful information regarding identical, similar and distant repeats (of varying lengths) in protein sequences. In addition, users can check the frequency of occurrence of the repeats in sequence databases such as the Genome Database, PIR and SWISS-PROT and among the protein sequences available in the Protein Data Bank archive. Furthermore, users can view the three-dimensional structure of the repeats using the Java visualization plug-in Jmol. The proposed computing resource can be accessed over the World Wide Web at http://bioserver1.physics.iisc.ernet.in/rps/.
|
19
|
Jorda J, Xue B, Uversky VN, Kajava AV. Protein tandem repeats - the more perfect, the less structured. FEBS J 2010; 277:2673-82. [PMID: 20553501 DOI: 10.1111/j.1742-464x.2010.07684.x] [Citation(s) in RCA: 60] [Impact Index Per Article: 4.3] [Indexed: 11/30/2022]
Abstract
We analysed the structural properties of protein regions containing arrays of perfect and nearly perfect tandem repeats. Naturally occurring proteins with perfect repeats are practically absent among the proteins with known 3D structures. The great majority of such regions in the Protein Data Bank are found in the proteins designed de novo. The abundance of natural structured proteins with tandem repeats is inversely correlated with the repeat perfection: the chance of finding natural structured proteins in the Protein Data Bank increases with a decrease in the level of repeat perfection. Prediction of intrinsic disorder within the tandem repeats in the SwissProt proteins supports the conclusion that the level of repeat perfection correlates with their tendency to be unstructured. This correlation is valid across the various species and subcellular localizations, although the level of disordered tandem repeats varies significantly between these datasets. On average, in prokaryotes, tandem repeats of cytoplasmic proteins were predicted to be the most structured, whereas in eukaryotes, the most structured portion of the repeats was found in the membrane proteins. Our study supports the hypothesis that, in general, the repeat perfection is a sign of recent evolutionary events rather than of exceptional structural and (or) functional importance of the repeat residues.
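The abstract does not define how "repeat perfection" is measured. As a rough illustration only, the sketch below assumes one plausible definition: the fraction of residues in an aligned, gap-free set of repeat units that match the column-wise consensus. The function name and this definition are assumptions of the sketch, not the authors' measure.

```python
from collections import Counter


def repeat_perfection(units):
    """Fraction of residues agreeing with the per-column consensus.

    units: list of equal-length repeat units (already aligned, no gaps).
    Returns 1.0 for a perfect repeat and lower values for diverged repeats.
    """
    if not units or len({len(u) for u in units}) != 1:
        raise ValueError("units must be non-empty and of equal length")
    matches = 0
    for column in zip(*units):
        # Count of the most frequent residue in this alignment column.
        matches += Counter(column).most_common(1)[0][1]
    return matches / (len(units) * len(units[0]))


# Hypothetical repeat units, for illustration only.
perfect = ["PEPTIDE", "PEPTIDE", "PEPTIDE"]
diverged = ["ABCD", "ABCE", "ABFD", "AGCD"]

print(repeat_perfection(perfect))
print(repeat_perfection(diverged))
```

With a score of this kind, the correlation described in the abstract could be examined by plotting perfection against a predicted disorder fraction for each repeat region.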
Affiliation(s)
- Julien Jorda
- Centre de Recherches de Biochimie Macromoléculaire, CNRS UMR-5237, University of Montpellier 1 and 2, France
|
20
|
|
21
|
Naamati G, Fromer M, Linial M. Expansion of tandem repeats in sea anemone Nematostella vectensis proteome: A source for gene novelty? BMC Genomics 2009; 10:593. [PMID: 20003297 PMCID: PMC2805694 DOI: 10.1186/1471-2164-10-593] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 08/15/2009] [Accepted: 12/10/2009] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: The complete proteome of the starlet sea anemone, Nematostella vectensis, provides insights into gene invention dating back to the Cnidarian-Bilaterian ancestor. With the addition of the complete proteomes of Hydra magnipapillata and Monosiga brevicollis, the investigation of proteins having unique features in early metazoan life has become practical. We focused on the properties and the evolutionary trends of tandem repeat (TR) sequences in Cnidaria proteomes. RESULTS: We found that 11-16% of N. vectensis proteins contain tandem repeats. Most TRs cover 150 amino acid segments that are comprised of basic units of 5-20 amino acids. In total, the N. vectensis proteome has about 3300 unique TR-units, but only a small fraction of them are shared with H. magnipapillata, M. brevicollis, or mammalian proteomes. The overall abundance of these TRs stands out relative to that of 14 proteomes representing the diversity among eukaryotes and within the metazoan world. TR-units are characterized by a unique composition of amino acids, with cysteine and histidine being over-represented. Structurally, most TR-segments are associated with coiled and disordered regions. Interestingly, 80% of the TR-segments can be read in more than one open reading frame. For over 100 of them, translation of the alternative frames would result in long proteins. Most domain families that are characterized as repeats in eukaryotes are found in the TR-proteomes from Nematostella and Hydra. CONCLUSIONS: While most TR-proteins have originated from prediction tools and are still awaiting experimental validations, supportive evidence exists for hundreds of TR-units in Nematostella. The existence of TR-proteins in early metazoan life may have served as a robust mode for novel genes with previously overlooked structural and functional characteristics.
|
22
|
Sandhya S, Rani SS, Pankaj B, Govind MK, Offmann B, Srinivasan N, Sowdhamini R. Length variations amongst protein domain superfamilies and consequences on structure and function. PLoS One 2009; 4:e4981. [PMID: 19333395 PMCID: PMC2659687 DOI: 10.1371/journal.pone.0004981] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.0] [Received: 12/14/2008] [Accepted: 02/26/2009] [Indexed: 11/24/2022] Open
Abstract
Background: Related protein domains of a superfamily can be specified by proteins of diverse lengths. The structural and functional implications of indels in a domain scaffold have been examined. Methodology: In this study, domain superfamilies with large length variations (more than 30% difference from average domain size, referred to as ‘length-deviant’ superfamilies) and ‘length-rigid’ domain superfamilies (<10% length difference from average domain size) were analyzed for the functional impact of such structural differences. Our delineated dataset, derived from an objective algorithm, enables us to address indel roles in the presence of peculiar structural repeats, functional variation, protein-protein interactions and to examine ‘domain contexts’ of proteins tolerant to large length variations. Amongst the top-10 length-deviant superfamilies analyzed, we found that 80% of length-deviant superfamilies possess distant internal structural repeats and nearly half of them acquired diverse biological functions. In general, length-deviant superfamilies have a higher chance than length-rigid superfamilies to be engaged in internal structural repeats. We also found that ∼40% of length-deviant domains exist as multi-domain proteins involving interactions with domains from the same or other superfamilies. Indels, in diverse domain superfamilies, were found to participate in the accretion of structural and functional features amongst related domains. With specific examples, we discuss how indels are involved directly or indirectly in the generation of oligomerization interfaces, introduction of substrate specificity, regulation of protein function and stability. Conclusions: Our data suggest a multitude of roles for indels that are specialized for domain members of different domain superfamilies. These specialist roles that we observe and trends in the extent of length variation could influence decision making in modeling of new superfamily members. Likewise, the observed limits of length variation, specific for each domain superfamily, would be particularly relevant in the choice of alignment length search filters commonly applied in protein sequence analysis.
Affiliation(s)
- Sankaran Sandhya
- National Centre for Biological Sciences (TIFR), GKVK Campus, Bangalore, India
- Saane Sudha Rani
- National Centre for Biological Sciences (TIFR), GKVK Campus, Bangalore, India
- Barah Pankaj
- National Centre for Biological Sciences (TIFR), GKVK Campus, Bangalore, India
- Bernard Offmann
- Laboratoire de Biochimie et Génétique Moléculaire BP 7151, Université de La Réunion, La Réunion, France
- Ramanathan Sowdhamini
- National Centre for Biological Sciences (TIFR), GKVK Campus, Bangalore, India
|
23
|
Sarani R, Udayaprakash NA, Subashini R, Mridula P, Yamane T, Sekar K. Large cryptic internal sequence repeats in protein structures from Homo sapiens. J Biosci 2009; 34:103-12. [DOI: 10.1007/s12038-009-0012-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/30/2022]
|
24
|
Wagner H, Morgenstern B, Dress A. Stability of multiple alignments and phylogenetic trees: an analysis of ABC-transporter proteins family. Algorithms Mol Biol 2008; 3:15. [PMID: 18990223 PMCID: PMC2637874 DOI: 10.1186/1748-7188-3-15] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 04/30/2008] [Accepted: 11/06/2008] [Indexed: 11/17/2022] Open
Abstract
Background: Sequence-based phylogeny reconstruction is a fundamental task in Bioinformatics. Practically all methods for phylogeny reconstruction are based on multiple alignments. The quality and stability of the underlying alignments is therefore crucial for phylogenetic analysis. Results: In this short report, we investigate alignments and alignment-based phylogenies constructed for a set of 22 ABC transporters using CLUSTAL W and DIALIGN. Comparing the 22 "one-out phylogenies" one can obtain for this sequence set, some intrinsic phylogenetic instability is observed, even if attention is restricted to branches with high bootstrapping frequencies, the so-called safe branches. We show that this instability is caused by the fact that both CLUSTAL W and DIALIGN apparently get "confused" by sequence repeats in some of the ABC transporters. To deal with such problems, two new DIALIGN options are introduced that prove helpful in our context, the "exclude-fragment" (or "xfr") and the "self-comparison" (or "sc") option. Conclusion: "One-out strategies", known to be a useful tool for testing the stability of all sorts of data-analysis procedures, can successfully be used also in testing alignment stability. In case instabilities are observed, the sequences under consideration should be carefully checked for putative causes. In case one suspects sequence repeats to be the cause, the new "sc" option can be used to detect such repeats, and the "xfr" option can help to resolve the resulting problems.
|
25
|
Simossis V, Kleinjung J, Heringa J. An overview of multiple sequence alignment. CURRENT PROTOCOLS IN BIOINFORMATICS 2008; Chapter 3:3.7.1-3.7.26. [PMID: 18428699 DOI: 10.1002/0471250953.bi0307s03] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Indexed: 11/07/2022]
Abstract
Multiple sequence alignment is perhaps the most commonly applied bioinformatics technique. It often leads to fundamental biological insight into sequence-structure-function relationships of nucleotide or protein sequence families. In this unit, an overview of multiple sequence alignment techniques is presented, covering a history of nearly 30 years from the early pioneering methods to the current state-of-the-art techniques. Methodological and biological issues and end-user considerations, as well as alignment evaluation issues, are discussed.
Affiliation(s)
- Victor Simossis
- Integrative Bioinformatics Institute (IBIVU), Free University, Amsterdam, The Netherlands
|
26
|
Cheng H, Kim BH, Grishin NV. MALIDUP: a database of manually constructed structure alignments for duplicated domain pairs. Proteins 2008; 70:1162-6. [PMID: 17932926 DOI: 10.1002/prot.21783] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Indexed: 11/11/2022]
Abstract
We describe MALIDUP (manual alignments of duplicated domains), a database of 241 pairwise structure alignments for homologous domains originated by internal duplication within the same polypeptide chain. Since duplicated domains within a protein frequently diverge in function and thus in sequence, this would be the first database of structurally similar homologs that is not strongly biased by sequence or functional similarity. Our manual alignments in most cases agree with the automatic structural alignments generated by several commonly used programs. This carefully constructed database could be used in studies on protein evolution and as a reference for testing structure alignment programs. The database is available at http://prodata.swmed.edu/malidup.
Affiliation(s)
- Hua Cheng
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas 75390-9050, USA
|
27
|
Barney BM. Classification of proteins based on minimal modular repeats: lessons from nature in protein design. J Proteome Res 2007; 5:473-82. [PMID: 16512661 DOI: 10.1021/pr050103m] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/30/2022]
Abstract
Proteins containing internal repeats within their primary sequence have received increased attention recently, as the extent of their presence in various organisms is recognized more fully, and their role in evolution is more thoroughly studied. Presented here is a technique used to detect and classify proteins based on a modular evolutionary phenomenon that results in a series of small internal repeats. The parameters chosen are based on a minimum segment of seven residues that result in simple functional scaffolds. The genomes and corresponding proteomes of a variety of eubacteria and archaea have been analyzed using an algorithm that searches prokaryotic genomes for proteins containing small conserved repeats assembled in a modular fashion similar to a recently characterized protein from the organism Nitrosomonas europaea. This analysis has revealed additional proteins present in N. europaea with similar modular characteristics. A further survey of a variety of organisms demonstrates that this evolutionary pathway has been utilized in other organisms as well, to yield a broad assortment of small modular proteins. A thorough description of the sequential characteristics of these modular proteins follows, along with a selection and discussion of the various proteins uncovered through this expanded search and analysis. Several databases of the proteins uncovered from this work and the program used to perform the search are available.
Affiliation(s)
- Brett M Barney
- Department of Chemistry and Biochemistry, 0300 Old Main Hill, Utah State University, Logan, Utah 84322, USA.
|
28
|
Laskin AA, Skryabin KG, Korotkov EV. Latent Periodicity of Protein Families, Identified with the Indel-Aware Algorithm. J Proteome Res 2007; 6:862-8. [PMID: 17269743 DOI: 10.1021/pr0603203] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/28/2022]
Abstract
Latent amino acid repeats seem to be widespread in genetic sequences and to reflect their structure, function, and evolution. We have recently identified latent periodicity in more than 150 protein families including protein kinases and various nucleotide-binding proteins. The latent repeats in these families were correlated to their structure and evolution. However, a majority of known protein families were not identified with our latent periodicity search algorithm. The main presumable reason for this was the inability of our techniques to identify periodicities interspersed with insertions and deletions. We designed the new latent periodicity search algorithm, which is capable of taking into account insertions and deletions. As a result, we identified many novel cases of latent periodicity peculiar to protein families. Possible origins of the periodic structure of these families are discussed. Summarizing, we presume that latent periodicity is present in a substantial portion of known protein families. The latent periodicity matrices and the results of Swiss-Prot scans are available from http://bioinf.narod.ru/del/.
Affiliation(s)
- Andrew A Laskin
- Bioengineering Center of Russian Academy of Sciences, Prospect 60-tya Oktyabrya 7/1, 117312 Moscow, Russia
|
29
|
Achmüller C, Werther F, Wechner P, Auer B. Synthesis of genes with multiple identical domains. Biotechniques 2007; 42:43-4, 46. [PMID: 17269484 DOI: 10.2144/000112313] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/23/2022] Open
|
30
|
Turutina VP, Laskin AA, Kudryashov NA, Skryabin KG, Korotkov EV. Identification of amino acid latent periodicity within 94 protein families. J Comput Biol 2006; 13:946-64. [PMID: 16761920 DOI: 10.1089/cmb.2006.13.946] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Indexed: 11/13/2022] Open
Abstract
Here, we have applied information decomposition, cyclic profile alignment, and noise decomposition techniques to search for latent repeats within protein families of various functions. We have identified 94 protein families with a family-specific periodicity. In each case, the periodic element was found in greater than 70% of family members. Latent periodicity profiles with specific length and signature were obtained in each case. The possible relationship between the periodic elements thus identified and the evolutionary development of the protein families are discussed with specific reference to the possibility that there is a correlation between the periodic elements and protein function.
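The abstract names the techniques but does not describe them. As a loose illustration of how a periodicity signal can be quantified in a sequence, the sketch below computes the mutual information between each residue and its position modulo a trial period. This is a simplified stand-in chosen for this note, not the authors' implementation: it handles no insertions or deletions, applies no significance test, and the function name is hypothetical.

```python
import math
from collections import Counter


def period_mutual_information(seq, period):
    """Mutual information (bits) between residue identity and position mod period.

    High values mean that knowing the phase within the period helps to
    predict the residue, which is a signature of periodic composition.
    """
    n = len(seq)
    joint = Counter((i % period, aa) for i, aa in enumerate(seq))
    phase = Counter(i % period for i in range(n))
    residue = Counter(seq)
    mi = 0.0
    for (p, aa), count in joint.items():
        # p(phase, aa) * log2( p(phase, aa) / (p(phase) * p(aa)) )
        mi += (count / n) * math.log2(count * n / (phase[p] * residue[aa]))
    return mi


# Toy sequence with an exact period of 3 (illustration only).
seq = "ABC" * 10
for k in (2, 3, 4):
    print(k, round(period_mutual_information(seq, k), 3))
```

On real sequences the raw value grows with the trial period because of finite-sample bias, so any practical use would need a comparison against shuffled sequences before calling a period significant.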
Affiliation(s)
- Vera P Turutina
- Bioengineering Center of Russian Academy of Sciences, Prospect 60-tya Oktyabrya, Moscow
|
31
|
Morgenstern B, Prohaska SJ, Pöhler D, Stadler PF. Multiple sequence alignment with user-defined anchor points. Algorithms Mol Biol 2006; 1:6. [PMID: 16722533 PMCID: PMC1481597 DOI: 10.1186/1748-7188-1-6] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.6] [Received: 02/15/2006] [Accepted: 04/19/2006] [Indexed: 11/15/2022] Open
Abstract
Background: Automated software tools for multiple alignment often fail to produce biologically meaningful results. In such situations, expert knowledge can help to improve the quality of alignments. Results: Herein, we describe a semi-automatic version of the alignment program DIALIGN that can take pre-defined constraints into account. It is possible for the user to specify parts of the sequences that are assumed to be homologous and should therefore be aligned to each other. Our software program can use these sites as anchor points by creating a multiple alignment respecting these constraints. This way, our alignment method can produce alignments that are biologically more meaningful than alignments produced by fully automated procedures. As a demonstration of how our method works, we apply our approach to genomic sequences around the Hox gene cluster and to a set of DNA-binding proteins. As a by-product, we obtain insights about the performance of the greedy algorithm that our program uses for multiple alignment and about the underlying objective function. This information will be useful for the further development of DIALIGN. The described alignment approach has been integrated into the TRACKER software system.
Affiliation(s)
- Burkhard Morgenstern
- Universität Göttingen, Institut für Mikrobiologie und Genetik, Abteilung für Bioinformatik, Goldschmidtstrasse. 1, D-37077 Göttingen, Germany
- Sonja J Prohaska
- Universität Leipzig, Institut für Informatik und Interdisziplinäres Zentrum für Bioinformatik, Kreuzstrasse 7b, D-04103 Leipzig, Germany
- Dirk Pöhler
- Universität Göttingen, Institut für Mikrobiologie und Genetik, Abteilung für Bioinformatik, Goldschmidtstrasse. 1, D-37077 Göttingen, Germany
- Peter F Stadler
- Universität Leipzig, Institut für Informatik und Interdisziplinäres Zentrum für Bioinformatik, Kreuzstrasse 7b, D-04103 Leipzig, Germany
|
32
|
Turutina VP, Laskin AA, Kudryashov NA, Skryabin KG, Korotkov EV. Identification of latent periodicity in amino acid sequences of protein families. BIOCHEMISTRY. BIOKHIMIIA 2006; 71:18-31. [PMID: 16457614 DOI: 10.1134/s0006297906010032] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/06/2023]
Abstract
For detection of the latent periodicity of the protein families responsible for various biological functions, methods of information decomposition, cyclic profile alignment, and the method of noise decomposition have been used. The latent periodicity, being specific to a particular family, is recognized in 94 of 110 analyzed protein families. Family specific periodicity was found for more than 70% of amino acid sequences in each of these families. Based on such sequences the characteristic profile of the latent periodicity has been deduced for each family. Possible relationship between the recognized latent periodicity, evolution of proteins, and their structural organization is discussed.
Affiliation(s)
- V P Turutina
- Bioengineering Center, Russian Academy of Sciences, Moscow, Russia
|
33
|
Fadiel A, Eichenbaum KD, Hamza A. 'Genomemark': detecting word periodicity in biological sequences. J Biomol Struct Dyn 2005; 23:457-64. [PMID: 16363880 DOI: 10.1080/07391102.2006.10507071] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/28/2022]
Abstract
Identifying and predicting the structural characteristics of novel repeats throughout the genome can lend insight into biological function. Specific repeats are believed to have biological significance as a function of their distribution patterns. We have developed 'GenomeMark,' a computer program that detects and statistically analyzes candidate repeats. Specifically, 'GenomeMark' identifies the periodic distribution of unique words, calculating their χ² and Z-score values. Using 'GenomeMark,' we identified novel sequence words present in tandem throughout genomes. We found that these sequences have remarkable spacer sequence distributions and many were genome specific, validating the genome signature theory. Further analysis confirmed that many of these sequences have a specific biological function. The program is available from the authors upon request and is freely available for non-commercial and academic entities.
Affiliation(s)
- A Fadiel
- Yale University School of Medicine, Yale Center for Research On Reproductive Biology, New Haven, CT 06511, USA.
|
34
|
Sivaraja V, Kumar TKS, Leena PST, Chang AN, Vidya C, Goforth RL, Rajalingam D, Arvind K, Ye JL, Chou J, Henry R, Yu C. Three-dimensional solution structures of the chromodomains of cpSRP43. J Biol Chem 2005; 280:41465-71. [PMID: 16183644 DOI: 10.1074/jbc.m507077200] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022] Open
Abstract
Chloroplasts contain a unique signal recognition particle (cpSRP). Unlike the cytoplasmic forms, the cpSRP lacks RNA but contains a conserved 54-kDa GTPase and a novel 43-kDa subunit (cpSRP43). Recently, three functionally distinct chromodomains (CDs) have been identified in cpSRP43. In the present study, we report the three-dimensional solution structures of the three CDs (CD1, CD2, and CD3) using a variety of triple resonance NMR experiments. The structure of CD1 consists of a triple-stranded beta-sheet segment. The C-terminal helical segment typically found in the nuclear chromodomains is absent in CD1. The secondary structural elements in CD2 and CD3 include a triple-stranded antiparallel beta-sheet and a C-terminal helix. Interestingly, the orientation of the C-terminal helix is significantly different in the structures of CD2 and CD3. Critical comparison of the structures of the chromodomains of cpSRP43 with those found in nuclear chromodomain proteins revealed that the diverse protein-protein interactions mediated by the CDs appear to stem from the differences that exist in the surface charge potentials of each CD. Results of isothermal titration calorimetry experiments confirmed that only CD2 is involved in binding to cpSRP54. The negatively charged C-terminal helix in CD2 possibly plays a crucial role in the cpSRP54-cpSRP43 interaction.
Affiliation(s)
- Vaithiyalingam Sivaraja
- Department of Chemistry and Biochemistry, University of Arkansas, Fayetteville, Arkansas 72701, USA
|
35
|
Laskin AA, Kudryashov NA, Skryabin KG, Korotkov EV. Latent periodicity of serine-threonine and tyrosine protein kinases and other protein families. Comput Biol Chem 2005; 29:229-43. [PMID: 15979043 DOI: 10.1016/j.compbiolchem.2005.04.003] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/27/2004] [Revised: 04/18/2005] [Accepted: 04/18/2005] [Indexed: 11/22/2022]
Abstract
We identified latent periodicity in catalytic domains of approximately 85% of annotated serine-threonine and tyrosine protein kinases. Similar results were obtained for 22 other protein families and domains. We also designed the method of noise decomposition, which aims to distinguish between different periodicity types of the same period length. The method is to be used in conjunction with the method of cyclic profile alignment, and this combination is able to reveal structure-related or function-related patterns of latent periodicity. Possible origins of the periodic structure of protein kinase active sites are discussed. In summary, we presume that latent periodicity is a common property of many catalytic protein domains.
Affiliation(s)
- Andrew A Laskin
- Bioengineering Center of Russian Academy of Sciences, Prospect 60-tya Oktyabrya, 7/1, 117312 Moscow, Russia.
|
36
|
Cheng H, Grishin NV. DOM-fold: a structure with crossing loops found in DmpA, ornithine acetyltransferase, and molybdenum cofactor-binding domain. Protein Sci 2005; 14:1902-10. [PMID: 15937278 PMCID: PMC2253344 DOI: 10.1110/ps.051364905] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/25/2022]
Abstract
Understanding relationships between sequence, structure, and evolution is important for functional characterization of proteins. Here, we define a novel DOM-fold as a consensus structure of the domains in DmpA (L-aminopeptidase D-Ala-esterase/amidase), OAT (ornithine acetyltransferase), and MocoBD (molybdenum cofactor-binding domain), and discuss possible evolutionary scenarios of its origin. As shown by a comprehensive structure similarity search, DOM-fold distinguished by a two-layered beta/alpha architecture of a particular topology with unusual crossing loops is unique to those three protein families. DmpA and OAT are evolutionarily related as indicated by their sequence, structural, and functional similarities. Structural similarity between the DmpA/OAT superfamily and the MocoBD domains has not been reported before. Contrary to previous reports, we conclude that functional similarities between DmpA/OAT proteins and N-terminal nucleophile (Ntn) hydrolases are convergent and are unlikely to be inherited from a common ancestor.
Affiliation(s)
- Hua Cheng
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, 75390-9050, USA
|
37
|
Murray KB, Taylor WR, Thornton JM. Toward the detection and validation of repeats in protein structure. Proteins 2005; 57:365-80. [PMID: 15340924 DOI: 10.1002/prot.20202] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
Abstract
We present a method called DAVROS to detect, localize, and validate repeating motifs in protein structure allowing for insertions and deletions. DAVROS uses the score matrix from a structural alignment program (SAP) to search for repeating motifs using an algorithm based on concepts from signal processing and the statistical properties of the alignments. The method was tested against a nonredundant Protein Data Bank, and each chain was assigned a score. For the top 50 chains ranked by score, 70% contain repeating motifs detected without error. These represent 14 types of fold covering alpha, beta, and alpha/beta protein classes. A second data set comprising protein chains in different sequence families for triosephosphate isomerase (TIM) barrel, leucine-rich repeat (LRR), trefoil, and alpha-alpha barrel folds was used to assess the ability of DAVROS to detect all motifs within a specific fold. For the second test set, the percentage of motifs detected was highest for the LRR chains (88.7%) and lowest for the TIM barrels (60%). This variability results from the regularity of the LRR motif compared to the alpha/beta units of the TIM barrel, which generally have many more indels. These reduce the strength of the repeat signal in the SAP matrix, making repeat detection more difficult.
Affiliation(s)
- Kevin B Murray
- European Bioinformatics Institute, European Molecular Biology Laboratory, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
|
38
|
Fadiel A, Lithwick S, Ganji G, Scherer SW. Remarkable sequence signatures in archaeal genomes. ARCHAEA-AN INTERNATIONAL MICROBIOLOGICAL JOURNAL 2005; 1:185-90. [PMID: 15803664 PMCID: PMC2685567 DOI: 10.1155/2003/458235] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/27/2022]
Abstract
Complete archaeal genomes were probed for the presence of long (≥ 25 bp) oligonucleotide repeats (words). We detected the presence of many words distributed in tandem with narrow ranges of periodicity (i.e., spacer length between repeats). Similar words were not identified in genomes of non-archaeal species, namely Escherichia coli, Bacillus subtilis, Haemophilus influenzae, Mycoplasma genitalium and Mycoplasma pneumoniae. BLAST similarity searches against the GenBank nucleotide sequence database revealed that these words were archaeal species-specific, indicating that they are of a signature character. Sequence analysis and genome viewing tools showed these repeats to be restricted to non-coding regions. Thus, archaea appear to possess a non-coding genomic signature that is absent in bacterial species. The identification of a species-specific genomic signature would be of great value to archaeal genome mapping, evolutionary studies and analyses of genome complexity.
Affiliation(s)
- Ahmed Fadiel
- The Center for Applied Genomics, Hospital for Sick Children, Toronto, Ontario M5G 1Z8, Canada.
|
39
|
Laskin AA, Kudryashov NA, Skryabin KG, Korotkov EV. Latent Periodicity of Serine/Threonine and Tyrosine Protein Kinases and Other Protein Families. Mol Biol 2005. [DOI: 10.1007/s11008-005-0052-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/25/2022]
|
40
|
Wong JL, Wessel GM. Major components of a sea urchin block to polyspermy are structurally and functionally conserved. Evol Dev 2005; 6:134-53. [PMID: 15099301 DOI: 10.1111/j.1525-142x.2004.04019.x] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
One sperm fusing with one egg is requisite for successful fertilization; additional sperm fusions are lethal to the embryo. Because sperm usually outnumber eggs, evolution has selected for mechanisms that prevent this polyspermy by immediately modifying the egg extracellular matrix. We focus here on the contribution of cortical granule contents in the sea urchin block to polyspermy to begin to understand how well this process is conserved. We identified each of the major constituents of the fertilization envelope in two species of sea urchins, Strongylocentrotus purpuratus and Lytechinus variegatus, that diverged 30 to 50 million years ago. Our results show that the five major structural components of the fertilization envelope, derived from the egg cortical granules, are semiconserved. Most of these orthologs share sequence identity and encode multiple low-density lipoprotein receptor type A repeats or CUB domains, but at least two contain radically different carboxy-terminal repeats. Using a new association assay, we also show that these major structural components are functionally conserved during fertilization envelope construction. Thus, it seems that this population of female reproductive proteins has retained functional motifs while gaining significant sequence diversity: two opposing paths that may reflect cooperativity among the proteins that compose the fertilization envelope.
Affiliation(s)
- Julian L Wong
- Department of Molecular Biology, Cellular Biology, and Biochemistry, Box G-J4, Brown University, Providence, RI 02912, USA
|
41
|
Mosavi LK, Cammett TJ, Desrosiers DC, Peng ZY. The ankyrin repeat as molecular architecture for protein recognition. Protein Sci 2005; 13:1435-48. [PMID: 15152081 PMCID: PMC2279977 DOI: 10.1110/ps.03554604] [Citation(s) in RCA: 638] [Impact Index Per Article: 33.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/19/2022]
Abstract
The ankyrin repeat is one of the most frequently observed amino acid motifs in protein databases. This protein-protein interaction module is involved in a diverse set of cellular functions, and consequently, defects in ankyrin repeat proteins have been found in a number of human diseases. Recent biophysical, crystallographic, and NMR studies have been used to measure the stability and define the various topological features of this motif in an effort to understand the structural basis of ankyrin repeat-mediated protein-protein interactions. Characterization of the folding and assembly pathways suggests that ankyrin repeat domains generally undergo a two-state folding transition despite their modular structure. Also, the large number of available sequences has allowed the ankyrin repeat to be used as a template for consensus-based protein design. Such projects have been successful in revealing positions responsible for structure and function in the ankyrin repeat as well as creating a potential universal scaffold for molecular recognition.
Affiliation(s)
- Leila K Mosavi
- MC3305, Department of Molecular, Microbial, and Structural Biology, University of Connecticut Health Center, 263 Farmington Avenue, Farmington, CT 06032, USA
|
42
|
Abstract
Comparison of two protein structures often results in not only a global alignment but also a number of distinct local alignments; the latter, referred to as alternative alignments, are, however, usually ignored in existing protein structure comparison analyses. Here, we used a novel method of protein structure comparison to extensively identify and characterize the alternative alignments obtained for structure pairs of a fold classification database. We showed that every alternative alignment can be classified into one of just a few types, and used this classification to illustrate the potential of alternative alignments for identifying recurring protein substructures, including the internal structural repeats of a protein. Furthermore, we showed that among the alternative alignments obtained, permuted alignments, which included both circular and scrambled permutations, are as prevalent as topological alignments. These results demonstrated that the so far largely neglected alternative alignments of protein structures have implications and applications for research on protein classification and evolution.
Affiliation(s)
- Edward S C Shih
- Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan
|
43
|
Freiberg A, Machner MP, Pfeil W, Schubert WD, Heinz DW, Seckler R. Folding and stability of the leucine-rich repeat domain of internalin B from Listeria monocytogenes. J Mol Biol 2004; 337:453-61. [PMID: 15003459 DOI: 10.1016/j.jmb.2004.01.044] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2003] [Revised: 01/08/2004] [Accepted: 01/23/2004] [Indexed: 11/26/2022]
Abstract
Internalin B (InlB), a surface protein of the human pathogen Listeria monocytogenes, promotes invasion into various host cell types by inducing phagocytosis of the entire bacterium. The N-terminal half of InlB (residues 36-321, InlB321), which is sufficient for this process, contains a central leucine-rich repeat (LRR) domain that is flanked by a small alpha-helical cap and an immunoglobulin (Ig)-like domain. Here we investigated the spectroscopic properties, stability and folding of InlB321 and of a shorter variant lacking the Ig-like domain (InlB248). The circular dichroism spectra of both protein variants in the far ultraviolet region are very similar, with a characteristic minimum found at approximately 200 nm, possibly resulting from the high 3(10)-helical content in the LRR domain. Upon addition of chemical denaturants, both variants unfold in single transitions with unusually high cooperativity that are fully reversible and best described by two-state equilibria. The free energies of GdmCl-induced unfolding determined from transitions at 20 °C are 9.9 (±0.8) kcal/mol for InlB321 and 5.4 (±0.4) kcal/mol for InlB248. InlB321 is also more stable against thermal denaturation, as observed by scanning calorimetry. This suggests that the Ig-like domain, which presumably does not directly interact with the host cell receptor during bacterial invasion, plays a critical role in the in vivo stability of InlB.
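To put the reported unfolding free energies in perspective, the following minimal sketch converts them into an equilibrium fraction of unfolded protein under the standard two-state relation K_u = exp(-ΔG_u/RT). The function name and the assumption that the values apply in the absence of denaturant at 20 °C are illustrative and not taken from the paper.

```python
import math

# Gas constant in kcal/(mol*K); illustrative constant, not from the abstract.
R_KCAL = 1.987204e-3


def fraction_unfolded(delta_g_unfold_kcal, temp_celsius=20.0):
    """Two-state model: K_u = exp(-dG_u / RT), f_u = K_u / (1 + K_u)."""
    temp_kelvin = temp_celsius + 273.15
    k_unfold = math.exp(-delta_g_unfold_kcal / (R_KCAL * temp_kelvin))
    return k_unfold / (1.0 + k_unfold)


# Free energies of unfolding quoted in the abstract (kcal/mol).
for name, dg in (("InlB321", 9.9), ("InlB248", 5.4)):
    print(f"{name}: fraction unfolded ~ {fraction_unfolded(dg):.2e}")
```

Under these assumptions, the larger free energy of InlB321 corresponds to a population of unfolded molecules several orders of magnitude smaller than that of InlB248.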
Affiliation(s)
- Alexander Freiberg
- Potsdam University, Physical Biochemistry, Karl-Liebknecht-Str. 24-25, Haus 25, D-14476 Potsdam-Golm, Germany
|
44
|
Abstract
TBP functions in transcription initiation in all eukaryotes and in Archaebacteria. Although the 181-amino acid (aa) carboxyl (C-) terminal core of the protein is highly conserved, TBP proteins from different phyla exhibit diverse sequences in their amino (N-) terminal region. In mice, the TBP N-terminus plays a role in protecting the placenta from maternal rejection; however, the presence of similar TBP N-termini in nontherian tetrapods suggests that this domain also has more primitive functions. To gain insights into the pretherian functions of the N-terminus, we investigated its phylogenetic distribution. TBP cDNAs were isolated from representative nontetrapod jawed vertebrates (zebrafish and shark), from more primitive jawless vertebrates (lamprey and hagfish), and from a prevertebrate cephalochordate (amphioxus). Results showed that the tetrapod N-terminus likely arose coincident with the earliest vertebrates. The primary structures of vertebrate N-termini indicate that, historically, this domain has undergone events involving intragenic duplication and modification of short oligopeptide-encoding DNA sequences, which might have provided a mechanism of de novo evolution of this polypeptide.
Affiliation(s)
- Alla A Bondareva
- Veterinary Molecular Biology, Marsh Labs, Montana State University, USA
|
45
|
Mosavi LK, Williams S, Peng ZY. Equilibrium folding and stability of myotrophin: a model ankyrin repeat protein. J Mol Biol 2002; 320:165-70. [PMID: 12079376 DOI: 10.1016/s0022-2836(02)00441-2] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/27/2022]
Abstract
Proteins containing stretches of repeating amino acid sequences are prevalent throughout nature, yet little is known about the general folding and assembly mechanisms of these systems. Here we propose myotrophin as a model system to study the folding of ankyrin repeat proteins. Myotrophin is folded over a large pH range and is soluble at high concentrations. Thermal and urea denaturation studies show that the protein displays cooperative two-state folding properties despite its modular nature. Taken together with previous studies on other ankyrin repeat proteins, our data suggest that the two-state folding pathway may be characteristic of ankyrin repeat proteins and other integrated alpha-helical repeat proteins in general.
Affiliation(s)
- Leila K Mosavi
- Department of Biochemistry, University of Connecticut Health Center, MC-3305, 263 Farmington Avenue, Farmington, CT 06030, USA
|
46
|
Abstract
This paper describes three weighting schemes for improving the accuracy of progressive multiple sequence alignment methods: (1) global profile pre-processing, to capture for each sequence information about other sequences in a profile before the actual multiple alignment takes place; (2) local pre-processing, which incorporates a new protocol to only use non-overlapping local sequence regions to construct the pre-processed profiles; and (3) local-global alignment, a weighting scheme based on the double dynamic programming (DDP) technique to softly bias global alignment to local sequence motifs. The first two schemes allow the compilation of residue-specific multiple alignment reliability indices, which can be used in an iterative fashion. The schemes have been implemented with associated iterative modes in the PRALINE multiple sequence alignment method, and have been evaluated using the BAliBASE benchmark alignment database. These tests indicate that PRALINE is a toolbox able to build alignments of very high quality. We found that local profile pre-processing raises the alignment quality by 5.5% compared to PRALINE alignments generated under default conditions. Iteration enhances the quality by a further percentage point. The implications of multiple alignment scoring functions and iteration in relation to alignment quality and benchmarking are discussed.
Affiliation(s)
- Jaap Heringa
- Division of Mathematical Biology, MRC National Institute for Medical Research, London, UK.
|
47
|
George RA, Heringa J. SnapDRAGON: a method to delineate protein structural domains from sequence data. J Mol Biol 2002; 316:839-51. [PMID: 11866536 DOI: 10.1006/jmbi.2001.5387] [Citation(s) in RCA: 69] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
We describe a method to identify protein domain boundaries from sequence information alone based on the assumption that hydrophobic residues cluster together in space. SnapDRAGON is a suite of programs developed to predict domain boundaries based on the consistency observed in a set of alternative ab initio three-dimensional (3D) models generated for a given protein multiple sequence alignment. This is achieved by running a distance geometry-based folding technique in conjunction with a 3D-domain assignment algorithm. The overall accuracy of our method in predicting the number of domains for a non-redundant data set of 414 multiple alignments, representing 185 single and 231 multiple-domain proteins, is 72.4 %. Using domain linker regions observed in the tertiary structures associated with each query alignment as the standard of truth, inter-domain boundary positions are delineated with an accuracy of 63.9 % for proteins comprising continuous domains only, and 35.4 % for proteins with discontinuous domains. Overall, domain boundaries are delineated with an accuracy of 51.8 %. The prediction accuracy values are independent of the pair-wise sequence similarities within each of the alignments. These results demonstrate the capability of our method to delineate domains in protein sequences associated with a wide variety of structural domain organisation.
Affiliation(s)
- Richard A George
- Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, UK
|
48
|
Murray KB, Gorse D, Thornton JM. Wavelet transforms for the characterization and detection of repeating motifs. J Mol Biol 2002; 316:341-63. [PMID: 11851343 DOI: 10.1006/jmbi.2001.5332] [Citation(s) in RCA: 75] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
The role of repeating motifs in protein structures is thought to be as modular building blocks which allow an economic way of constructing complex proteins. In this work, novel wavelet transform analysis techniques are used to detect and characterize repeating motifs in protein sequence and structure data, where the Kyte-Doolittle hydrophobicity scale (HΦ) and relative accessible surface area (rASA) data provide residue information about the protein sequence and structure, respectively. We analyze a variety of repeating protein motifs: TIM barrels, propeller blades, coiled coils and leucine-rich repeat structures. Detection and characterization of these motifs is performed using techniques based on the continuous wavelet transform (CWT). Results indicate that the wavelet transform techniques developed herein are a promising approach for the detection and characterization of repeating motifs for both structural and in some instances sequence data.
Affiliation(s)
- Kevin B Murray
- Department of Biochemistry and Molecular Biology, University College London, UK
|
49
|
Abstract
The relationship between the amino acid sequence and the three-dimensional structure of proteins with internal repeats is discussed. In particular, correlations between the amino acid composition and the ability to fold in a unique structure, as well as classification of the structures based on their repeat length, are described. This analysis suggests rules that can be used for the structural prediction of repeat-containing proteins. The paper is focused on prediction and modeling of solenoid-like proteins with the repeat length ranging between 5 and 40 residues. The models of leucine-rich repeat proteins and bacterial proteins with pentapeptide repeats are examined in light of the recently solved structures of the related molecules.
Affiliation(s)
- A V Kajava
- Center for Molecular Modeling, Bethesda, Maryland 20892-5626, USA
|
50
|
|