1
|
Gullotto D. Fine tuned exploration of evolutionary relationships within the protein universe. Stat Appl Genet Mol Biol 2021; 20:17-36. [PMID: 33594839 DOI: 10.1515/sagmb-2019-0039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2019] [Accepted: 01/12/2021] [Indexed: 11/15/2022]
Abstract
In the regime of domain classifications, the protein universe unveils a discrete set of folds connected by hierarchical relationships. Instead, at sub-domain-size resolution and because of physical constraints not necessarily requiring evolution to shape polypeptide chains, networks of protein motifs depict a continuous view that lies beyond the extent of hierarchical classification schemes. A number of studies, however, suggest that universal sub-sequences could be the descendants of peptides emerged in an ancient pre-biotic world. Should this be the case, evolutionary signals retained by structurally conserved motifs, along with hierarchical features of ancient domains, could sew relationships among folds that diverged beyond the point where homology is discernable. In view of the aforementioned, this paper provides a rationale where a network with hierarchical and continuous levels of the protein space, together with sequence profiles that probe the extent of sequence similarity and contacting residues that capture the transition from pre-biotic to domain world, has been used to explore relationships between ancient folds. Statistics of detected signals have been reported. As a result, an example of an emergent sub-network that makes sense from an evolutionary perspective, where conserved signals retrieved from the assessed protein space have been co-opted, has been discussed.
Collapse
Affiliation(s)
- Danilo Gullotto
- Advanced Computational Biostructural Research Collaboratory, I-95019, Zafferana Etnea, Italy
| |
Collapse
|
2
|
Prediction of Protein Tertiary Structure via Regularized Template Classification Techniques. Molecules 2020; 25:molecules25112467. [PMID: 32466409 PMCID: PMC7321371 DOI: 10.3390/molecules25112467] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/23/2020] [Revised: 05/21/2020] [Accepted: 05/22/2020] [Indexed: 11/24/2022] Open
Abstract
We discuss the use of the regularized linear discriminant analysis (LDA) as a model reduction technique combined with particle swarm optimization (PSO) in protein tertiary structure prediction, followed by structure refinement based on singular value decomposition (SVD) and PSO. The algorithm presented in this paper corresponds to the category of template-based modeling. The algorithm performs a preselection of protein templates before constructing a lower dimensional subspace via a regularized LDA. The protein coordinates in the reduced spaced are sampled using a highly explorative optimization algorithm, regressive–regressive PSO (RR-PSO). The obtained structure is then projected onto a reduced space via singular value decomposition and further optimized via RR-PSO to carry out a structure refinement. The final structures are similar to those predicted by best structure prediction tools, such as Rossetta and Zhang servers. The main advantage of our methodology is that alleviates the ill-posed character of protein structure prediction problems related to high dimensional optimization. It is also capable of sampling a wide range of conformational space due to the application of a regularized linear discriminant analysis, which allows us to expand the differences over a reduced basis set.
Collapse
|
3
|
In Silico Characterization and Structural Modeling of Dermacentor andersoni p36 Immunosuppressive Protein. Adv Bioinformatics 2018; 2018:7963401. [PMID: 29849611 PMCID: PMC5911333 DOI: 10.1155/2018/7963401] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2017] [Accepted: 02/14/2018] [Indexed: 01/13/2023] Open
Abstract
Ticks cause approximately $17–19 billion economic losses to the livestock industry globally. Development of recombinant antitick vaccine is greatly hindered by insufficient knowledge and understanding of proteins expressed by ticks. Ticks secrete immunosuppressant proteins that modulate the host's immune system during blood feeding; these molecules could be a target for antivector vaccine development. Recombinant p36, a 36 kDa immunosuppressor from the saliva of female Dermacentor andersoni, suppresses T-lymphocytes proliferation in vitro. To identify potential unique structural and dynamic properties responsible for the immunosuppressive function of p36 proteins, this study utilized bioinformatic tool to characterize and model structure of D. andersoni p36 protein. Evaluation of p36 protein family as suitable vaccine antigens predicted a p36 homolog in Rhipicephalus appendiculatus, the tick vector of East Coast fever, with an antigenicity score of 0.7701 that compares well with that of Bm86 (0.7681), the protein antigen that constitute commercial tick vaccine Tickgard™. Ab initio modeling of the D. andersoni p36 protein yielded a 3D structure that predicted conserved antigenic region, which has potential of binding immunomodulating ligands including glycerol and lactose, found located within exposed loop, suggesting a likely role in immunosuppressive function of tick p36 proteins. Laboratory confirmation of these preliminary results is necessary in future studies.
Collapse
|
4
|
Abstract
Structural proteomics aims to understand the structural basis of protein interactions and functions. A prerequisite for this is the availability of 3D protein structures that mediate the biochemical interactions. The explosion in the number of available gene sequences set the stage for the next step in genome-scale projects -- to obtain 3D structures for each protein. To achieve this ambitious goal, the slow and costly structure determination experiments are supplemented with theoretical approaches. The current state and recent advances in structure modeling approaches are reviewed here, with special emphasis on comparative protein structure modeling techniques.
Collapse
Affiliation(s)
- András Fiser
- Department of Biochemistry, Seaver Foundation Center for Bioinformatics, Albert Einstein College of Medicine, 1300 Morris Park Ave., Bronx, NY 10461, USA.
| |
Collapse
|
5
|
Ravella D, Kumar MU, Sherlin D, Shankar M, Vaishnavi MK, Sekar K. SMS 2.0: an updated database to study the structural plasticity of short peptide fragments in non-redundant proteins. GENOMICS, PROTEOMICS & BIOINFORMATICS 2012; 10:44-50. [PMID: 22449400 PMCID: PMC5054494 DOI: 10.1016/s1672-0229(11)60032-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/11/2011] [Accepted: 12/19/2011] [Indexed: 11/21/2022]
Abstract
The function of a protein molecule is greatly influenced by its three-dimensional (3D) structure and therefore structure prediction will help identify its biological function. We have updated Sequence, Motif and Structure (SMS), the database of structurally rigid peptide fragments, by combining amino acid sequences and the corresponding 3D atomic coordinates of non-redundant (25%) and redundant (90%) protein chains available in the Protein Data Bank (PDB). SMS 2.0 provides information pertaining to the peptide fragments of length 5-14 residues. The entire dataset is divided into three categories, namely, same sequence motifs having similar, intermediate or dissimilar 3D structures. Further, options are provided to facilitate structural superposition using the program structural alignment of multiple proteins (STAMP) and the popular JAVA plug-in (Jmol) is deployed for visualization. In addition, functionalities are provided to search for the occurrences of the sequence motifs in other structural and sequence databases like PDB, Genome Database (GDB), Protein Information Resource (PIR) and Swiss-Prot. The updated database along with the search engine is available over the World Wide Web through the following URL http://cluster.physics.iisc.ernet.in/sms/.
Collapse
Affiliation(s)
| | | | | | | | | | - Kanagaraj Sekar
- Bioinformatics Centre (Centre of Excellence in Structural Biology and Bio-computing), Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore 560012, India
| |
Collapse
|
6
|
Reddy BVB, Kaznessis YN. A QUANTITATIVE ANALYSIS OF INTERFACIAL AMINO ACID CONSERVATION IN PROTEIN-PROTEIN HETERO COMPLEXES. J Bioinform Comput Biol 2011; 3:1137-50. [PMID: 16278951 DOI: 10.1142/s0219720005001429] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/01/2004] [Revised: 03/18/2005] [Accepted: 03/28/2005] [Indexed: 11/18/2022]
Abstract
A long-standing question in molecular biology is whether interfaces of protein-protein complexes are more conserved than the rest of the protein surfaces. Although it has been reported that conservation can be used as an indicator for predicting interaction sites on proteins, there are recent reports stating that the interface regions are only slightly more conserved than the rest of the protein surfaces, with conservation signals not being statistically significant enough for predicting protein-protein binding sites. In order to properly address these controversial reports we have studied a set of 28 well resolved hetero complex structures of proteins that consists of transient and non-transient complexes. The surface positions were classified into four conservation classes and the conservation index of the surface positions was quantitatively analyzed. The results indicate that the surface density of highly conserved positions is significantly higher in the protein-protein interface regions compared with the other regions of the protein surface. However, the average conservation index of the patches in the interface region is not significantly higher compared with other surface regions of the protein structures. This finding demonstrates that the number of conserved residue positions is a more appropriate indicator for predicting protein-protein binding sites than the average conservation index in the interacting region. We have further validated our findings on a set of 59 benchmark complex structures. Furthermore, an analysis of 19 complexes of antigen-antibody interactions shows that there is no conservation of amino acid positions in the interacting regions of these complexes, as expected, with the variable region of the immunoglobulins interacting mostly with the antigens. Interestingly, antigen interacting regions also have a higher number of non-conserved residue positions in the interacting region than the rest of the protein surface.
Collapse
Affiliation(s)
- Boojala V B Reddy
- Digital Technology Center and Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis MN 55455, USA.
| | | |
Collapse
|
7
|
Abstract
Functional characterization of a protein is often facilitated by its 3D structure. However, the fraction of experimentally known 3D models is currently less than 1% due to the inherently time-consuming and complicated nature of structure determination techniques. Computational approaches are employed to bridge the gap between the number of known sequences and that of 3D models. Template-based protein structure modeling techniques rely on the study of principles that dictate the 3D structure of natural proteins from the theory of evolution viewpoint. Strategies for template-based structure modeling will be discussed with a focus on comparative modeling, by reviewing techniques available for all the major steps involved in the comparative modeling pipeline.
Collapse
Affiliation(s)
- Andras Fiser
- Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, NY, USA
| |
Collapse
|
8
|
Chowriappa P, Dua S, Kanno J, Thompson HW. Protein structure classification based on conserved hydrophobic residues. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2009; 6:639-651. [PMID: 19875862 DOI: 10.1109/tcbb.2008.77] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/28/2023]
Abstract
Protein folding is frequently guided by local residue interactions that form clusters in the protein core. The interactions between residue clusters serve as potential nucleation sites in the folding process. Evidence postulates that the residue interactions are governed by the hydrophobic propensities that the residues possess. An array of hydrophobicity scales has been developed to determine the hydrophobic propensities of residues under different environmental conditions. In this work, we propose a graph-theory-based data mining framework to extract and isolate protein structural features that sustain invariance in evolutionary-related proteins, through the integrated analysis of five well-known hydrophobicity scales over the 3D structure of proteins. We hypothesize that proteins of the same homology contain conserved hydrophobic residues and exhibit analogous residue interaction patterns in the folded state. The results obtained demonstrate that discriminatory residue interaction patterns shared among proteins of the same family can be employed for both the structural and the functional annotation of proteins. We obtained on the average 90 percent accuracy in protein classification with a significantly small feature vector compared to previous results in the area. This work presents an elaborate study, as well as validation evidence, to illustrate the efficacy of the method and the correctness of results reported.
Collapse
Affiliation(s)
- Pradeep Chowriappa
- Data Mining Research Laboratory and the Department of Computer Science, College of Engineering and Science, Louisiana Tech University, PO Box 10348, Nethken Hall, Ruston, LA 71272, USA.
| | | | | | | |
Collapse
|
9
|
Roca AI, Almada AE, Abajian AC. ProfileGrids as a new visual representation of large multiple sequence alignments: a case study of the RecA protein family. BMC Bioinformatics 2008; 9:554. [PMID: 19102758 PMCID: PMC2663765 DOI: 10.1186/1471-2105-9-554] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/30/2008] [Accepted: 12/22/2008] [Indexed: 01/12/2023] Open
Abstract
Background Multiple sequence alignments are a fundamental tool for the comparative analysis of proteins and nucleic acids. However, large data sets are no longer manageable for visualization and investigation using the traditional stacked sequence alignment representation. Results We introduce ProfileGrids that represent a multiple sequence alignment as a matrix color-coded according to the residue frequency occurring at each column position. JProfileGrid is a Java application for computing and analyzing ProfileGrids. A dynamic interaction with the alignment information is achieved by changing the ProfileGrid color scheme, by extracting sequence subsets at selected residues of interest, and by relating alignment information to residue physical properties. Conserved family motifs can be identified by the overlay of similarity plot calculations on a ProfileGrid. Figures suitable for publication can be generated from the saved spreadsheet output of the colored matrices as well as by the export of conservation information for use in the PyMOL molecular visualization program. We demonstrate the utility of ProfileGrids on 300 bacterial homologs of the RecA family – a universally conserved protein involved in DNA recombination and repair. Careful attention was paid to curating the collected RecA sequences since ProfileGrids allow the easy identification of rare residues in an alignment. We relate the RecA alignment sequence conservation to the following three topics: the recently identified DNA binding residues, the unexplored MAW motif, and a unique Bacillus subtilis RecA homolog sequence feature. Conclusion ProfileGrids allow large protein families to be visualized more effectively than the traditional stacked sequence alignment form. This new graphical representation facilitates the determination of the sequence conservation at residue positions of interest, enables the examination of structural patterns by using residue physical properties, and permits the display of rare sequence features within the context of an entire alignment. JProfileGrid is free for non-commercial use and is available from . Furthermore, we present a curated RecA protein collection that is more diverse than previous data sets; and, therefore, this RecA ProfileGrid is a rich source of information for nanoanatomy analysis.
Collapse
Affiliation(s)
- Alberto I Roca
- Department of Molecular Biology and Biochemistry, 560 Steinhaus Hall, University of California, Irvine, California 92697-3900, USA.
| | | | | |
Collapse
|
10
|
Scheeff ED, Bourne PE. Application of protein structure alignments to iterated hidden Markov model protocols for structure prediction. BMC Bioinformatics 2006; 7:410. [PMID: 16970830 PMCID: PMC1622756 DOI: 10.1186/1471-2105-7-410] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2006] [Accepted: 09/14/2006] [Indexed: 11/30/2022] Open
Abstract
Background One of the most powerful methods for the prediction of protein structure from sequence information alone is the iterative construction of profile-type models. Because profiles are built from sequence alignments, the sequences included in the alignment and the method used to align them will be important to the sensitivity of the resulting profile. The inclusion of highly diverse sequences will presumably produce a more powerful profile, but distantly related sequences can be difficult to align accurately using only sequence information. Therefore, it would be expected that the use of protein structure alignments to improve the selection and alignment of diverse sequence homologs might yield improved profiles. However, the actual utility of such an approach has remained unclear. Results We explored several iterative protocols for the generation of profile hidden Markov models. These protocols were tailored to allow the inclusion of protein structure alignments in the process, and were used for large-scale creation and benchmarking of structure alignment-enhanced models. We found that models using structure alignments did not provide an overall improvement over sequence-only models for superfamily-level structure predictions. However, the results also revealed that the structure alignment-enhanced models were complimentary to the sequence-only models, particularly at the edge of the "twilight zone". When the two sets of models were combined, they provided improved results over sequence-only models alone. In addition, we found that the beneficial effects of the structure alignment-enhanced models could not be realized if the structure-based alignments were replaced with sequence-based alignments. Our experiments with different iterative protocols for sequence-only models also suggested that simple protocol modifications were unable to yield equivalent improvements to those provided by the structure alignment-enhanced models. Finally, we found that models using structure alignments provided fold-level structure assignments that were superior to those produced by sequence-only models. Conclusion When attempting to predict the structure of remote homologs, we advocate a combined approach in which both traditional models and models incorporating structure alignments are used.
Collapse
Affiliation(s)
- Eric D Scheeff
- San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Dr., La Jolla, CA 92093-0537, USA
- Present address: Razavi-Newman Center for Bioinformatics, The Salk Institute for Biological Studies, 10010 North Torrey Pines Rd., La Jolla, CA 92037, USA
| | - Philip E Bourne
- San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Dr., La Jolla, CA 92093-0537, USA
- Department of Pharmacology, University of California, San Diego, 9500 Gilman Dr., La Jolla, CA 92093, USA
| |
Collapse
|
11
|
Sacile R, Ruggiero C. Hunting for "key residues" in the modeling of globular protein folding: an artificial neural network-based approach. IEEE Trans Nanobioscience 2006; 1:85-91. [PMID: 16689212 DOI: 10.1109/tnb.2002.806914] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
An approach to modeling globular protein folding based on artificial neural networks (ANNs) is presented. This approach, that can be regarded as an inverse protein folding problem, investigates whether and when a protein fragment needs a specific residue in the center of its primary structure as a necessary condition to fold as observed. To perform this analysis, an ANN has been trained on a set of 55 proteins, searching for a relation between protein fragments modeled by 13alpha torsion angles and the residue corresponding to the central alpha torsion angle of the fragment. The results obtained show that only Asp, Gly, Pro, Ser and Val residues are often a necessary, even though not sufficient, condition to obtain a specific folded fragment structure, playing therefore, the role of "key residue" of this fragment.
Collapse
Affiliation(s)
- Roberto Sacile
- Department of Communication, Computer and System Sciences (DIST), University of Genoa, via Opera Pia 13, 16145 Genoa, Italy.
| | | |
Collapse
|
12
|
Blades MJ, Ison JC, Ranasinghe R, Findlay JBC. Automatic generation and evaluation of sparse protein signatures for families of protein structural domains. Protein Sci 2005; 14:13-23. [PMID: 15608116 PMCID: PMC2253312 DOI: 10.1110/ps.04929005] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/26/2022]
Abstract
We identified key residues from the structural alignment of families of protein domains from SCOP which we represented in the form of sparse protein signatures. A signature-generating algorithm (SigGen) was developed and used to automatically identify key residues based on several structural and sequence-based criteria. The capacity of the signatures to detect related sequences from the SWISSPROT database was assessed by receiver operator characteristic (ROC) analysis and jack-knife testing. Test signatures for families from each of the main SCOP classes are described in relation to the quality of the structural alignments, the SigGen parameters used, and their diagnostic performance. We show that automatically generated signatures are potently diagnostic for their family (ROC50 scores typically >0.8), consistently outperform random signatures, and can identify sequence relationships in the "twilight zone" of protein sequence similarity (<40%). Signatures based on 15%-30% of alignment positions occurred most frequently among the best-performing signatures. When alignment quality is poor, sparser signatures perform better, whereas signatures generated from higher-quality alignments of fewer structures require more positions to be diagnostic. Our validation of signatures from the Globin family shows that when sequences from the structural alignment are removed and new signatures generated, the omitted sequences are still detected. The positions highlighted by the signature often correspond (alignment specificity >0.7) to the key positions in the original (non-jack-knifed) alignment. We discuss potential applications of sparse signatures in sequence annotation and homology modeling.
Collapse
Affiliation(s)
- Matthew J Blades
- AstraZeneca R&D Charnwood, Bakewell Road, Loughborough, Leicestershire LE11 5RH, England.
| | | | | | | |
Collapse
|
13
|
Bhaduri A, Pugalenthi G, Gupta N, Sowdhamini R. iMOT: an interactive package for the selection of spatially interacting motifs. Nucleic Acids Res 2004; 32:W602-5. [PMID: 15215459 PMCID: PMC441513 DOI: 10.1093/nar/gkh375] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Functional selection and three-dimensional structural constraints of proteins relate to the retention of significant sequence similarity between proteins of similar fold and function despite poor overall sequence identity and evolutionary pressures. We report the availability of 'iMOT' (interacting MOTif) server, an interactive package for the automatic identification of spatially interacting motifs among distantly related proteins sharing similar folds and possessing common ancestral lineage. Spatial interactions between conserved stretches of a protein are evaluated by calculations of pseudo-potentials that describe the strength of interactions. Such an evaluation permits the automatic identification of highly interacting conserved regions of a protein. Interacting motifs have been shown to be useful in searching for distant homologues and establishing remote homologies among the largely unassigned sequences in genome databases. Information on such motifs should also be of value in protein folding, modelling and engineering experiments. The iMOT server can be accessed from http://www.ncbs.res.in/~faculty/mini/imot/iMOTserver.html. Supplementary Material can be accessed from: http://www.ncbs.res.in/~faculty/mini/imot/supplementary.html.
Collapse
Affiliation(s)
- A Bhaduri
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, UAS-GKVK Campus, Bellary Road, Bangalore 560 065, Karnataka, India
| | | | | | | |
Collapse
|
14
|
Bhaduri A, Ravishankar R, Sowdhamini R. Conserved spatially interacting motifs of protein superfamilies: application to fold recognition and function annotation of genome data. Proteins 2004; 54:657-70. [PMID: 14997562 DOI: 10.1002/prot.10638] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Abstract
Limitations in techniques for the elucidation of protein function have led to an increasing gap between the annotated proteins and those encoded in a genome. The functional selection and three-dimensional structural constraints of proteins in nature often relate to the retention of significant sequence similarity between proteins of similar fold and function despite poor sequence identity. We identify spatially interacting conserved regions, or motifs, within protein superfamilies that are critical for structure and/or function. A search in sequence databases using these descriptors as additional constraints is an approach to identifying putative additional members of superfamilies. Such constrained searches have been tested against proteins of known structure to demonstrate high percentage specificity (93) with a low error rate of 0.0004. This approach has been compared with other sensitive sequence search methods (e.g., PSI-BLAST, HMMsearch, and IMPALA). It has been extended to analyze the distribution of 11 superfamilies in 93 genomes, including the human genome.
Collapse
Affiliation(s)
- Anirban Bhaduri
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, UAS-GKVK Campus, Bangalore, India
| | | | | |
Collapse
|
15
|
Bhaduri A, Pugalenthi G, Sowdhamini R. PASS2: an automated database of protein alignments organised as structural superfamilies. BMC Bioinformatics 2004; 5:35. [PMID: 15059245 PMCID: PMC407847 DOI: 10.1186/1471-2105-5-35] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/28/2003] [Accepted: 04/02/2004] [Indexed: 12/02/2022] Open
Abstract
Background The functional selection and three-dimensional structural constraints of proteins in nature often relates to the retention of significant sequence similarity between proteins of similar fold and function despite poor sequence identity. Organization of structure-based sequence alignments for distantly related proteins, provides a map of the conserved and critical regions of the protein universe that is useful for the analysis of folding principles, for the evolutionary unification of protein families and for maximizing the information return from experimental structure determination. The Protein Alignment organised as Structural Superfamily (PASS2) database represents continuously updated, structural alignments for evolutionary related, sequentially distant proteins. Description An automated and updated version of PASS2 is, in direct correspondence with SCOP 1.63, consisting of sequences having identity below 40% among themselves. Protein domains have been grouped into 628 multi-member superfamilies and 566 single member superfamilies. Structure-based sequence alignments for the superfamilies have been obtained using COMPARER, while initial equivalencies have been derived from a preliminary superposition using LSQMAN or STAMP 4.0. The final sequence alignments have been annotated for structural features using JOY4.0. The database is supplemented with sequence relatives belonging to different genomes, conserved spatially interacting and structural motifs, probabilistic hidden markov models of superfamilies based on the alignments and useful links to other databases. Probabilistic models and sensitive position specific profiles obtained from reliable superfamily alignments aid annotation of remote homologues and are useful tools in structural and functional genomics. PASS2 presents the phylogeny of its members both based on sequence and structural dissimilarities. Clustering of members allows us to understand diversification of the family members. The search engine has been improved for simpler browsing of the database. Conclusions The database resolves alignments among the structural domains consisting of evolutionarily diverged set of sequences. Availability of reliable sequence alignments of distantly related proteins despite poor sequence identity and single-member superfamilies permit better sampling of structures in libraries for fold recognition of new sequences and for the understanding of protein structure-function relationships of individual superfamilies. PASS2 is accessible at
Collapse
Affiliation(s)
- Anirban Bhaduri
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, UAS-GKVK campus, Bellary Road, Bangalore, Karnataka 560 065, India
| | - Ganesan Pugalenthi
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, UAS-GKVK campus, Bellary Road, Bangalore, Karnataka 560 065, India
| | - Ramanathan Sowdhamini
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, UAS-GKVK campus, Bellary Road, Bangalore, Karnataka 560 065, India
| |
Collapse
|
16
|
Tang CL, Xie L, Koh IYY, Posy S, Alexov E, Honig B. On the Role of Structural Information in Remote Homology Detection and Sequence Alignment: New Methods Using Hybrid Sequence Profiles. J Mol Biol 2003; 334:1043-62. [PMID: 14643665 DOI: 10.1016/j.jmb.2003.10.025] [Citation(s) in RCA: 71] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
Abstract
Structural alignments often reveal relationships between proteins that cannot be detected using sequence alignment alone. However, profile search methods based entirely on structural alignments alone have not been found to be effective in finding remote homologs. Here, we explore the role of structural information in remote homolog detection and sequence alignment. To this end, we develop a series of hybrid multidimensional alignment profiles that combine sequence, secondary and tertiary structure information into hybrid profiles. Sequence-based profiles are profiles whose position-specific scoring matrix is derived from sequence alignment alone; structure-based profiles are those derived from multiple structure alignments. We compare pure sequence-based profiles to pure structure-based profiles, as well as to hybrid profiles that use combined sequence-and-structure-based profiles, where sequence-based profiles are used in loop/motif regions and structural information is used in core structural regions. All of the hybrid methods offer significant improvement over simple profile-to-profile alignment. We demonstrate that both sequence-based and structure-based profiles contribute to remote homology detection and alignment accuracy, and that each contains some unique information. We discuss the implications of these results for further improvements in amino acid sequence and structural analysis.
Collapse
Affiliation(s)
- Christopher L Tang
- Department of Biochemistry and Molecular Biophysics, Howard Hughes Medical Institute, Columbia University, New York, NY 10032, USA
| | | | | | | | | | | |
Collapse
|
17
|
Abstract
The success of structural genomics initiatives requires the development and application of tools for structure analysis, prediction, and annotation. In this paper we review recent developments in these areas; specifically structure alignment, the detection of remote homologs and analogs, homology modeling and the use of structures to predict function. We also discuss various rationales for structural genomics initiatives. These include the structure-based clustering of sequence space and genome-wide function assignment. It is also argued that structural genomics can be integrated into more traditional biological research if specific biological questions are included in target selection strategies.
Collapse
Affiliation(s)
- Sharon Goldsmith-Fischman
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA
| | | |
Collapse
|
18
|
Mihalek I, Res I, Yao H, Lichtarge O. Combining inference from evolution and geometric probability in protein structure evaluation. J Mol Biol 2003; 331:263-79. [PMID: 12875851 DOI: 10.1016/s0022-2836(03)00663-6] [Citation(s) in RCA: 40] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
Abstract
Starting from the hypothesis that evolutionarily important residues form a spatially limited cluster in a protein's native fold, we discuss the possibility of detecting a non-native structure based on the absence of such clustering. The relevant residues are determined using the Evolutionary Trace method. We propose a quantity to measure clustering of the selected residues on the structure and show that the exact values for its average and variance over several ensembles of interest can be found. This enables us to study the behavior of the associated z-scores. Since our approach rests on an analytic result, it proves to be general, customizable, and computationally fast. We find that clustering is indeed detectable in a large representative protein set. Furthermore, we show that non-native structures tend to achieve lower residue-clustering z-scores than those attained by the native folds. The most important conclusion that we draw from this work is that consistency between structural and evolutionary information, manifested in clustering of key residues, imposes powerful constraints on the conformational space of a protein.
Collapse
Affiliation(s)
- I Mihalek
- Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA
| | | | | | | |
Collapse
|
19
|
Ortiz AR, Strauss CEM, Olmea O. MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci 2002; 11:2606-21. [PMID: 12381844 PMCID: PMC2373724 DOI: 10.1110/ps.0215902] [Citation(s) in RCA: 320] [Impact Index Per Article: 14.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/27/2022]
Abstract
Advances in structural genomics and protein structure prediction require the design of automatic, fast, objective, and well benchmarked methods capable of comparing and assessing the similarity of low-resolution three-dimensional structures, via experimental or theoretical approaches. Here, a new method for sequence-independent structural alignment is presented that allows comparison of an experimental protein structure with an arbitrary low-resolution protein tertiary model. The heuristic algorithm is given and then used to show that it can describe random structural alignments of proteins with different folds with good accuracy by an extreme value distribution. From this observation, a structural similarity score between two proteins or two different conformations of the same protein is derived from the likelihood of obtaining a given structural alignment by chance. The performance of the derived score is then compared with well established, consensus manual-based scores and data sets. We found that the new approach correlates better than other tools with the gold standard provided by a human evaluator. Timings indicate that the algorithm is fast enough for routine use with large databases of protein models. Overall, our results indicate that the new program (MAMMOTH) will be a good tool for protein structure comparisons in structural genomics applications. MAMMOTH is available from our web site at http://physbio.mssm.edu/~ortizg/.
Collapse
Affiliation(s)
- Angel R Ortiz
- Department of Physiology and Biophysics, Mount Sinai School of Medicine, New York University, New York, New York 10029, USA.
| | | | | |
Collapse
|
20
|
Reddy BVB, Li WW, Bourne PE. Use of conserved key amino acid positions to morph protein folds. Biopolymers 2002; 64:139-45. [PMID: 12012349 DOI: 10.1002/bip.10152] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
Abstract
By using three-dimensional (3D) structure alignments and a previously published method to determine Conserved Key Amino Acid Positions (CKAAPs) we propose a theoretical method to design mutations that can be used to morph the protein folds. The original Paracelsus challenge, met by several groups, called for the engineering of a stable but different structure by modifying less than 50% of the amino acid residues. We have used the sequences from the Protein Data Bank (PDB) identifiers 1ROP, and 2CRO, which were previously used in the Paracelsus challenge by those groups, and suggest mutation to CKAAPs to morph the protein fold. The total number of mutations suggested is less than 40% of the starting sequence theoretically improving the challenge results. From secondary structure prediction experiments of the proposed mutant sequence structures, we observe that each of the suggested mutant protein sequences likely folds to a different, non-native potentially stable target structure. These results are an early indicator that analyses using structure alignments leading to CKAAPs of a given structure are of value in protein engineering experiments.
Collapse
Affiliation(s)
- Boojala V B Reddy
- San Diego Supercomputer Center, University of California-San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
| | | | | |
Collapse
|
21
|
Abstract
The genomes of over 60 organisms from all three kingdoms of life are now entirely sequenced. In many respects, the inventory of proteins used in different kingdoms appears surprisingly similar. However, eukaryotes differ from other kingdoms in that they use many long proteins, and have more proteins with coiled-coil helices and with regions abundant in regular secondary structure. Particular structural domains are used in many pathways. Nevertheless, one domain tends to occur only once in one particular pathway. Many proteins do not have close homologues in different species (orphans) and there could even be folds that are specific to one species. This view implies that protein fold space is discrete. An alternative model suggests that structure space is continuous and that modern proteins evolved by aggregating fragments of ancient proteins. Either way, after having harvested proteomes by applying standard tools, the challenge now seems to be to develop better methods for comparative proteomics.
Collapse
Affiliation(s)
- Burkhard Rost
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street, BB217, New York, NY 10032, USA.
| |
Collapse
|
22
|
Burley SK, Bonanno JB. Structural genomics of proteins from conserved biochemical pathways and processes. Curr Opin Struct Biol 2002; 12:383-91. [PMID: 12127459 DOI: 10.1016/s0959-440x(02)00330-5] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/24/2023]
Abstract
During the past year, X-ray crystallographers and solution NMR spectroscopists have made significant progress towards the complete structural characterization of conserved biochemical pathways and processes. Some of these advances were made in the context of nascent structural genomics programs, which promise to accelerate structural studies of biologically and medically important proteins. The results of high-throughput protein production, crystallization, structure determination, homology modeling and functional annotation published by two such programs have provided insight into the evolution and function of enzymes in the isoprenoid biosynthesis and ribulose monophosphate pathways.
Collapse
Affiliation(s)
- Stephen K Burley
- Howard Hughes Medical Institute, Laboratories of Molecular Biophysics, The Rockefeller University, 1230 York Avenue, New York, New York 10021, USA.
| | | |
Collapse
|
23
|
Madabushi S, Yao H, Marsh M, Kristensen DM, Philippi A, Sowa ME, Lichtarge O. Structural clusters of evolutionary trace residues are statistically significant and common in proteins. J Mol Biol 2002; 316:139-54. [PMID: 11829509 DOI: 10.1006/jmbi.2001.5327] [Citation(s) in RCA: 159] [Impact Index Per Article: 7.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Given the massive increase in the number of new sequences and structures, a critical problem is how to integrate these raw data into meaningful biological information. One approach, the Evolutionary Trace, or ET, uses phylogenetic information to rank the residues in a protein sequence by evolutionary importance and then maps those ranked at the top onto a representative structure. If these residues form structural clusters, they can identify functional surfaces such as those involved in molecular recognition. Now that a number of examples have shown that ET can identify binding sites and focus mutational studies on their relevant functional determinants, we ask whether the method can be improved so as to be applicable on a large scale. To address this question, we introduce a new treatment of gaps resulting from insertions and deletions, which streamlines the selection of sequences used as input. We also introduce objective statistics to assess the significance of the total number of clusters and of the size of the largest one. As a result of the novel treatment of gaps, ET performance improves measurably. We find evolutionarily privileged clusters that are significant at the 5% level in 45 out of 46 (98%) proteins drawn from a variety of structural classes and biological functions. In 37 of the 38 proteins for which a protein-ligand complex is available, the dominant cluster contacts the ligand. We conclude that spatial clustering of evolutionarily important residues is a general phenomenon, consistent with the cooperative nature of residues that determine structure and function. In practice, these results suggest that ET can be applied on a large scale to identify functional sites in a significant fraction of the structures in the protein databank (PDB). This approach to combining raw sequences and structure to obtain detailed insights into the molecular basis of function should prove valuable in the context of the Structural Genomics Initiative.
Collapse
Affiliation(s)
- Srinivasan Madabushi
- Structural and Computational Biology and Molecular Biophysics Program, Baylor College of Medicine, Houston, TX 77030, USA
| | | | | | | | | | | | | |
Collapse
|
24
|
Friedberg I, Margalit H. Persistently conserved positions in structurally similar, sequence dissimilar proteins: roles in preserving protein fold and function. Protein Sci 2002; 11:350-60. [PMID: 11790845 PMCID: PMC2373454 DOI: 10.1110/ps.18602] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/17/2022]
Abstract
Many protein pairs that share the same fold do not have any detectable sequence similarity, providing a valuable source of information for studying sequence-structure relationship. In this study, we use a stringent data set of structurally similar, sequence-dissimilar protein pairs to characterize residues that may play a role in the determination of protein structure and/or function. For each protein in the database, we identify amino-acid positions that show residue conservation within both close and distant family members. These positions are termed "persistently conserved". We then proceed to determine the "mutually" persistently conserved (MPC) positions: those structurally aligned positions in a protein pair that are persistently conserved in both pair mates. Because of their intra- and interfamily conservation, these positions are good candidates for determining protein fold and function. We find that 45% of the persistently conserved positions are mutually conserved. A significant fraction of them are located in critical positions for secondary structure determination, they are mostly buried, and many of them form spatial clusters within their protein structures. A substitution matrix based on the subset of MPC positions shows two distinct characteristics: (i) it is different from other available matrices, even those that are derived from structural alignments; (ii) its relative entropy is high, emphasizing the special residue restrictions imposed on these positions. Such a substitution matrix should be valuable for protein design experiments.
Collapse
Affiliation(s)
- Iddo Friedberg
- Department of Molecular Genetics and Biotechnology, The Hebrew University-Hadassah Medical School, Jerusalem 91120, Israel
| | | |
Collapse
|
25
|
Li WW, Reddy BVB, Tate JG, Shindyalov IN, Bourne PE. CKAAPs DB: a Conserved Key Amino Acid Positions DataBase. Nucleic Acids Res 2002; 30:409-11. [PMID: 11752351 PMCID: PMC99066 DOI: 10.1093/nar/30.1.409] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
The Conserved Key Amino Acid Positions DataBase (CKAAPs DB) provides access to an analysis of structurally similar proteins with dissimilar sequences where key residues within a common fold are identified. CKAAPs may be important in protein folding and structural stability and function, and hence useful for protein engineering studies. This paper provides an update to the initial report of CKAAPs DB [Li et al. (2001) Nucleic Acids Res., 29, 329-331]. CKAAPs DB contains CKAAPs for the representative set of polypeptide chains derived from the CE and FSSP databases, as well as subdomains (conserved regions of the order of 100 residues within a domain) identified by CE. The new version now offers different perspectives on the CKAAPs. First, CKAAPs are mapped onto their respective Protein Data Bank (PDB) structures rendered by Molscript, providing a spatial context for the CKAAPs. Secondly, CKAAPs may be highlighted within a structure-based sequence alignment, as well as secondary structure alignment. Thirdly, the resulting sequence homologs from the structure alignment may be viewed in alignments colorized based on identities and property groups using Mview. New search capabilities have also been provided for searching by keyword combinations, PDB IDs, EC numbers, GI numbers, LocusLink ID, taxonomy, gene ontology and pathways. A new custom CKAAPs analysis interface has been implemented where a user may change the criteria for inclusion of chains, initiate CKAAPs analysis and retrieve results. CKAAPs DB is accessible through the web at http://ckaaps.sdsc.edu/. Plain text analysis results are available by FTP at ftp://ftp.sdsc.edu/pub/sdsc/biology/ckaap.
Collapse
Affiliation(s)
- Wilfred W Li
- San Diego Supercomputer Center and Department of Pharmacology, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0505, USA
| | | | | | | | | |
Collapse
|
26
|
Abstract
In the post-genomic era, the new discipline of functional genomics is now facing the challenge of associating a function (as well as estimating its relevance to industrial applications) to about 100,000 microbial, plant or animal genes of known sequence but unknown function. Besides the design of databases, computational methods are increasingly becoming intimately linked with the various experimental approaches. Consequently, bioinformatics is rapidly evolving into independent fields addressing the specific problems of interpreting i) genomic sequences, ii) protein sequences and 3D-structures, as well as iii) transcriptome and macromolecular interaction data. It is thus increasingly difficult for the biologist to choose the computational approaches that perform best in these various areas. This paper attempts to review the most useful developments of the last 2 years.
Collapse
Affiliation(s)
- J M Claverie
- Structural and Genetic Information Laboratory,UMR 1889 CNRS-AVENTIS, 31 Chemin Joseph Aiguier, 13402 Marseille Cedex 20, France.
| | | | | | | |
Collapse
|
27
|
Li WW, Reddy BV, Shindyalov IN, Bourne PE. CKAAPs DB: a conserved key amino acid positions database. Nucleic Acids Res 2001; 29:329-31. [PMID: 11125128 PMCID: PMC29811 DOI: 10.1093/nar/29.1.329] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
The Conserved Key Amino Acid Positions DataBase (CKAAPs DB) provides access to an analysis of structurally similar proteins with dissimilar sequences where key residues within a common fold are identified. The derivation and significance of CKAAPs starting from pairwise structure alignments is described fully in Reddy et al. [Reddy,B.V.B., Li,W.W., Shindyalov,I.N. and Bourne,P.E. (2000) PROTEINS:, in press]. The CKAAPs identified from this theoretical analysis are provided to experimentalists and theoreticians for potential use in protein engineering and modeling. It has been suggested that CKAAPs may be crucial features for protein folding, structural stability and function. Over 170 substructures, as defined by the Combinatorial Extension (CE) database, which are found in approximately 3000 representative polypeptide chains have been analyzed and are available in the CKAAPs DB. CKAAPs DB also provides CKAAPs of the representative set of proteins derived from the CE and FSSP databases. Thus the database contains over 5000 representative poly-peptide chains, covering all known structures in the PDB. A web interface to a relational database permits fast retrieval of structure-sequence alignments, CKAAPs and associated statistics. Users may query by PDB ID, protein name, function and Enzyme Classification number. Users may also submit protein alignments of their own to obtain CKAAPs. An interface to display CKAAPs on each structure from a web browser is also being implemented. CKAAPs DB is maintained by the San Diego Supercomputer Center and accessible at the URL http://ckaaps.sdsc.edu.
Collapse
Affiliation(s)
- W W Li
- San Diego Supercomputer Center and Department of Pharmacology, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0505, USA
| | | | | | | |
Collapse
|