1
|
Complex evolutionary footprints revealed in an analysis of reused protein segments of diverse lengths. Proc Natl Acad Sci U S A 2017; 114:11703-11708. [PMID: 29078314 PMCID: PMC5676897 DOI: 10.1073/pnas.1707642114] [Citation(s) in RCA: 55] [Impact Index Per Article: 7.9] [Indexed: 01/13/2023]
Abstract
We question a central paradigm: namely, that the protein domain is the “atomic unit” of evolution. In conflict with the current textbook view, our results unequivocally show that duplication of protein segments happens both above and below the domain level among amino acid segments of diverse lengths. Indeed, we show that significant evolutionary information is lost when the protein is approached as a string of domains. Our finer-grained approach reveals a far more complicated picture, where reused segments often intertwine and overlap with each other. Our results are consistent with a recursive model of evolution, in which segments of various lengths, typically smaller than domains, “hop” between environments. The fit segments remain, leaving traces that can still be detected. Proteins share similar segments with one another. Such “reused parts”—which have been successfully incorporated into other proteins—are likely to offer an evolutionary advantage over de novo evolved segments, as most of the latter will not even have the capacity to fold. To systematically explore the evolutionary traces of segment “reuse” across proteins, we developed an automated methodology that identifies reused segments from protein alignments. We search for “themes”—segments of at least 35 residues of similar sequence and structure—reused within representative sets of 15,016 domains [Evolutionary Classification of Protein Domains (ECOD) database] or 20,398 chains [Protein Data Bank (PDB)]. We observe that theme reuse is highly prevalent and that reuse is more extensive when the length threshold for identifying a theme is lower. Structural domains, the best characterized form of reuse in proteins, are just one of many complex and intertwined evolutionary traces. Others include long themes shared among a few proteins, which encompass and overlap with shorter themes that recur in numerous proteins. 
The observed complexity is consistent with evolution by duplication and divergence, and some of the themes might include descendants of ancestral segments. The observed recursive footprints, where the same amino acid can simultaneously participate in several intertwined themes, could be a useful concept for protein design. Data are available at http://trachel-srv.cs.haifa.ac.il/rachel/ppi/themes/.
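The idea that one residue can belong to several overlapping themes can be illustrated with a small sketch. It assumes the reused segments of a protein are already available as half-open `(start, end)` residue ranges; the function name `theme_coverage` and the example ranges are hypothetical, and only the 35-residue threshold comes from the abstract. It is not the authors' pipeline.

```python
MIN_THEME_LENGTH = 35  # length threshold quoted in the abstract


def theme_coverage(protein_length, segments, min_length=MIN_THEME_LENGTH):
    """Count, for each residue, how many sufficiently long segments cover it.

    segments: iterable of half-open (start, end) residue index ranges
    (a hypothetical input format chosen for this illustration).
    """
    coverage = [0] * protein_length
    for start, end in segments:
        if end - start < min_length:
            continue  # too short to count as a theme under this threshold
        for i in range(max(start, 0), min(end, protein_length)):
            coverage[i] += 1
    return coverage


# Two overlapping long segments and one short segment that is filtered out.
segments = [(0, 60), (20, 58), (70, 90)]
coverage = theme_coverage(100, segments)
```

Residues with a coverage above 1 are those that take part in more than one theme at once. Lowering `min_length` admits more segments, in line with the abstract's remark that reuse looks more extensive at lower thresholds.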
|
2
|
Kubrycht J, Sigler K, Souček P, Hudeček J. Structures composing protein domains. Biochimie 2013; 95:1511-24. [DOI: 10.1016/j.biochi.2013.04.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 01/22/2013] [Accepted: 04/02/2013] [Indexed: 12/21/2022]
|
3
|
Abstract
Functional characterization of genes and their protein products is essential to biological and clinical research. Yet, there is still no reliable way of assigning functional annotations to proteins in a high-throughput manner. In this article, the authors provide an introduction to the task of automated protein function prediction. They discuss the motivation for automated protein function prediction, the challenges faced in this task, and some of the approaches that are currently available. In particular, they take a closer look at methods that use protein-protein interaction for protein function prediction, elaborating on their underlying techniques and assumptions, as well as their strengths and limitations.
|
4
|
Rorick M. Quantifying protein modularity and evolvability: a comparison of different techniques. Biosystems 2012; 110:22-33. [PMID: 22796584 DOI: 10.1016/j.biosystems.2012.06.006] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 12/15/2011] [Revised: 06/20/2012] [Accepted: 06/27/2012] [Indexed: 10/28/2022]
Abstract
Modularity increases evolvability by reducing constraints on adaptation and by allowing preexisting parts to function in new contexts for novel uses. Protein evolution provides an excellent context to study the causes and consequences of biological modularity. In order to address such questions, however, an index for protein modularity is necessary. This paper proposes a simple index for protein modularity-"module density"-which is the number of evolutionarily independent modules that compose a protein divided by the number of amino acids in the protein. The decomposition of proteins into constituent modules can be accomplished by either of two classes of methods. The first class of methods relies on "suppositional" criteria to assign amino acids to modules, whereas the second class of methods relies on "coevolutionary" criteria for this task. One simple and practical method from the first class consists of approximating the number of modules in a protein as the number of regular secondary structure elements (i.e., helices and sheets). Methods based on coevolutionary criteria require more elaborate data, but they have the advantage of being able to specify modules without prior assumptions about why they exist. Given the increasing availability of datasets sampling protein mutational spectra (e.g., from comparative genomics, experimental evolution, and computational prediction), methods based on coevolutionary criteria will likely become more promising in the near future. The ability to meaningfully quantify protein modularity via simple indices has the potential to aid future efforts to understand protein evolutionary rate determinants, improve molecular evolution models and engineer novel proteins.
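The "module density" index described above can be sketched in a few lines. The sketch assumes a per-residue secondary structure string in a three-state alphabet (`H` helix, `E` strand, `C` coil), which is an illustrative input format rather than anything specified in the abstract. It counts each maximal run of `H` or `E` as one element, following the abstract's "suppositional" approximation of modules by regular secondary structure elements. The function names are hypothetical.

```python
from itertools import groupby


def count_secondary_structure_elements(ss):
    """Count maximal runs of helix ('H') or strand ('E') in a per-residue string."""
    return sum(1 for state, _ in groupby(ss) if state in ("H", "E"))


def module_density(n_modules, n_residues):
    """Number of modules divided by the number of amino acids in the protein."""
    if n_residues <= 0:
        raise ValueError("n_residues must be positive")
    return n_modules / n_residues


# Toy 28-residue assignment: helix, strand, strand, helix.
ss = "CCHHHHHHCCEEEECCCEEEECCHHHHC"
n_elements = count_secondary_structure_elements(ss)
density = module_density(n_elements, len(ss))
```

For the toy string, four elements over 28 residues give a density of 4/28. A coevolution-based decomposition would supply `n_modules` from a different source, but the division step would be the same.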
Affiliation(s)
- Mary Rorick
- University of Michigan, Department of Ecology and Evolutionary Biology, Ann Arbor, MI 48109-1048, United States.
|
5
|
Rorick MM, Wagner GP. Protein structural modularity and robustness are associated with evolvability. Genome Biol Evol 2011; 3:456-75. [PMID: 21602570 PMCID: PMC3134980 DOI: 10.1093/gbe/evr046] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.2] [Indexed: 01/21/2023]
Abstract
Theory suggests that biological modularity and robustness allow for maintenance of fitness under mutational change, and when this change is adaptive, for evolvability. Empirical demonstrations that these traits promote evolvability in nature remain scant however. This is in part because modularity, robustness, and evolvability are difficult to define and measure in real biological systems. Here, we address whether structural modularity and/or robustness confer evolvability at the level of proteins by looking for associations between indices of protein structural modularity, structural robustness, and evolvability. We propose a novel index for protein structural modularity: the number of regular secondary structure elements (helices and strands) divided by the number of residues in the structure. We index protein evolvability as the proportion of sites with evidence of being under positive selection multiplied by the average rate of adaptive evolution at these sites, and we measure this as an average over a phylogeny of 25 mammalian species. We use contact density as an index of protein designability, and thus, structural robustness. We find that protein evolvability is positively associated with structural modularity as well as structural robustness and that the effect of structural modularity on evolvability is independent of the structural robustness index. We interpret these associations to be the result of reduced constraints on amino acid substitutions in highly modular and robust protein structures, which results in faster adaptation through natural selection.
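The evolvability index defined above is a product of two quantities, which a short sketch makes concrete. It assumes per-site inputs that are already computed: a flag for evidence of positive selection and a rate of adaptive evolution. How those are estimated, and the averaging over the 25-species phylogeny, are outside this sketch. The function name and the input format are hypothetical.

```python
from statistics import mean


def evolvability_index(site_rates, selected):
    """Proportion of positively selected sites times their mean adaptive rate.

    site_rates: per-site rate of adaptive evolution (hypothetical input).
    selected:   per-site booleans, True where positive selection is inferred.
    """
    if not site_rates or len(site_rates) != len(selected):
        raise ValueError("inputs must be non-empty and of equal length")
    rates_at_selected = [r for r, s in zip(site_rates, selected) if s]
    if not rates_at_selected:
        return 0.0  # no selected sites, so the proportion term is zero
    proportion = len(rates_at_selected) / len(site_rates)
    return proportion * mean(rates_at_selected)


# Four sites, two of them flagged as under positive selection.
site_rates = [0.0, 2.0, 0.0, 4.0]
selected = [False, True, False, True]
index = evolvability_index(site_rates, selected)
```

Here half of the sites are selected and their mean rate is 3.0, so the index is 1.5.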
Affiliation(s)
- Mary M Rorick
- Department of Genetics, Yale University, New Haven, Connecticut, USA.
|
6
|
Bernardes JS, Carbone A, Zaverucha G. A discriminative method for family-based protein remote homology detection that combines inductive logic programming and propositional models. BMC Bioinformatics 2011; 12:83. [PMID: 21429187 PMCID: PMC3078102 DOI: 10.1186/1471-2105-12-83] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 09/21/2010] [Accepted: 03/23/2011] [Indexed: 11/23/2022]
Abstract
BACKGROUND Remote homology detection is a hard computational problem. Most approaches have trained computational models by using either full protein sequences or multiple sequence alignments (MSA), including all positions. However, when we deal with proteins in the "twilight zone" we can observe that only some segments of sequences (motifs) are conserved. We introduce a novel logical representation that allows us to represent physico-chemical properties of sequences, conserved amino acid positions and conserved physico-chemical positions in the MSA. From this, Inductive Logic Programming (ILP) finds the most frequent patterns (motifs) and uses them to train propositional models, such as decision trees and support vector machines (SVM). RESULTS We use the SCOP database to perform our experiments by evaluating protein recognition within the same superfamily. Our results show that our methodology, when using SVM, performs significantly better than some of the state-of-the-art methods, and comparably to others. However, our method provides a comprehensible set of logical rules that can help to understand what determines a protein function. CONCLUSIONS The strategy of selecting only the most frequent patterns is effective for remote homology detection. This is possible through a suitable first-order logical representation of homologous properties, and through a set of frequent patterns, found by an ILP system, that summarizes essential features of protein functions.
Affiliation(s)
- Juliana S Bernardes
- COPPE, Programa de Engenharia de Sistemas e Computação, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil
- Université Pierre et Marie Curie, UMR7238, Génomique Analytique, 15 rue de l'Ecole de Médecine, F-75006 Paris, France
- Alessandra Carbone
- Université Pierre et Marie Curie, UMR7238, Génomique Analytique, 15 rue de l'Ecole de Médecine, F-75006 Paris, France
- CNRS, UMR7238, Laboratoire de Génomique des Microorganismes, F-75006 Paris, France
- Gerson Zaverucha
- COPPE, Programa de Engenharia de Sistemas e Computação, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil
|
7
|
Nelson KJ, Knutson ST, Soito L, Klomsiri C, Poole LB, Fetrow JS. Analysis of the peroxiredoxin family: using active-site structure and sequence information for global classification and residue analysis. Proteins 2011; 79:947-64. [PMID: 21287625 PMCID: PMC3065352 DOI: 10.1002/prot.22936] [Citation(s) in RCA: 136] [Impact Index Per Article: 10.5] [Received: 05/28/2010] [Revised: 10/13/2010] [Accepted: 10/25/2010] [Indexed: 12/25/2022]
Abstract
Peroxiredoxins (Prxs) are a widespread and highly expressed family of cysteine-based peroxidases that react very rapidly with H₂O₂, organic peroxides, and peroxynitrite. Correct subfamily classification has been problematic because Prx subfamilies are frequently not correlated with phylogenetic distribution and diverge in their preferred reductant, oligomerization state, and tendency toward overoxidation. We have developed a method that uses the Deacon Active Site Profiler (DASP) tool to extract functional-site profiles from structurally characterized proteins to computationally define subfamilies and to identify new Prx subfamily members from GenBank(nr). For the 58 literature-defined Prx test proteins, 57 were correctly assigned, and none were assigned to the incorrect subfamily. The >3500 putative Prx sequences identified were then used to analyze residue conservation in the active site of each Prx subfamily. Our results indicate that the existence and location of the resolving cysteine vary in some subfamilies (e.g., Prx5) to a greater degree than previously appreciated and that interactions at the A interface (common to Prx5, Tpx, and higher order AhpC/Prx1 structures) are important for stabilization of the correct active-site geometry. Interestingly, this method also allows us to further divide the AhpC/Prx1 into four groups that are correlated with functional characteristics. The DASP method provides more accurate subfamily classification than PSI-BLAST for members of the Prx family and can now readily be applied to other large protein families.
Affiliation(s)
- Kimberly J. Nelson
- Department of Biochemistry, Wake Forest University Health Sciences, Medical Center Blvd., Winston-Salem NC 27157
- Stacy T. Knutson
- Departments of Physics and Computer Science, Wake Forest University, Winston-Salem, NC 27109
- Laura Soito
- Department of Biochemistry, Wake Forest University Health Sciences, Medical Center Blvd., Winston-Salem NC 27157
- Chananat Klomsiri
- Department of Biochemistry, Wake Forest University Health Sciences, Medical Center Blvd., Winston-Salem NC 27157
- Leslie B. Poole
- Department of Biochemistry, Wake Forest University Health Sciences, Medical Center Blvd., Winston-Salem NC 27157
- Jacquelyn S. Fetrow
- Departments of Physics and Computer Science, Wake Forest University, Winston-Salem, NC 27109
|
8
|
Trifonov EN, Frenkel ZM. Evolution of protein modularity. Curr Opin Struct Biol 2009; 19:335-40. [PMID: 19386484 DOI: 10.1016/j.sbi.2009.03.007] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.8] [Received: 02/05/2009] [Accepted: 03/16/2009] [Indexed: 10/20/2022]
Abstract
Proteins appear to pass through several discrete stages in their evolution, which is reflected in their modular organization. The sequences of the protein modules are highly variable, while their functions and structures are rather conserved. The relatedness of the variable sequences is well represented by networks in natural protein sequence space, which also suggest evolutionary connections.
Affiliation(s)
- Edward N Trifonov
- Genome Diversity Center, Institute of Evolution, University of Haifa, Haifa 31905, Israel.
|
9
|
Rajasekaran S, Balla S, Gradie P, Gryk MR, Kadaveru K, Kundeti V, Maciejewski MW, Mi T, Rubino N, Vyas J, Schiller MR. Minimotif miner 2nd release: a database and web system for motif search. Nucleic Acids Res 2009; 37:D185-90. [PMID: 18978024 PMCID: PMC2686579 DOI: 10.1093/nar/gkn865] [Citation(s) in RCA: 54] [Impact Index Per Article: 3.6] [Received: 08/15/2008] [Accepted: 10/16/2008] [Indexed: 11/24/2022]
Abstract
Minimotif Miner (MnM) consists of a minimotif database and a web-based application that enables prediction of motif-based functions in user-supplied protein queries. We have revised MnM by expanding the database more than 10-fold to approximately 5000 motifs and standardized the motif function definitions. The web-application user interface has been redeveloped with new features including improved navigation, screencast-driven help, support for alias names and expanded SNP analysis. A sample analysis of prion shows how MnM 2 can be used. Weblink: http://mnm.engr.uconn.edu, weblink for version 1 is http://sms.engr.uconn.edu.
Affiliation(s)
- Sanguthevar Rajasekaran
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06029-2155, Department of Molecular, Microbial, and Structural Biology, Biological System Modeling Group, University of Connecticut Health Center, 263 Farmington Ave. Farmington, CT 06030-3305 and Memorial Sloan-Kettering Cancer Center, NY 10021, USA
- Sudha Balla
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06029-2155, Department of Molecular, Microbial, and Structural Biology, Biological System Modeling Group, University of Connecticut Health Center, 263 Farmington Ave. Farmington, CT 06030-3305 and Memorial Sloan-Kettering Cancer Center, NY 10021, USA
- Patrick Gradie
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06029-2155, Department of Molecular, Microbial, and Structural Biology, Biological System Modeling Group, University of Connecticut Health Center, 263 Farmington Ave. Farmington, CT 06030-3305 and Memorial Sloan-Kettering Cancer Center, NY 10021, USA
- Michael R. Gryk
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06029-2155, Department of Molecular, Microbial, and Structural Biology, Biological System Modeling Group, University of Connecticut Health Center, 263 Farmington Ave. Farmington, CT 06030-3305 and Memorial Sloan-Kettering Cancer Center, NY 10021, USA
- Krishna Kadaveru
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06029-2155, Department of Molecular, Microbial, and Structural Biology, Biological System Modeling Group, University of Connecticut Health Center, 263 Farmington Ave. Farmington, CT 06030-3305 and Memorial Sloan-Kettering Cancer Center, NY 10021, USA
- Vamsi Kundeti
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06029-2155, Department of Molecular, Microbial, and Structural Biology, Biological System Modeling Group, University of Connecticut Health Center, 263 Farmington Ave. Farmington, CT 06030-3305 and Memorial Sloan-Kettering Cancer Center, NY 10021, USA
- Mark W. Maciejewski
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06029-2155, Department of Molecular, Microbial, and Structural Biology, Biological System Modeling Group, University of Connecticut Health Center, 263 Farmington Ave. Farmington, CT 06030-3305 and Memorial Sloan-Kettering Cancer Center, NY 10021, USA
- Tian Mi
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06029-2155, Department of Molecular, Microbial, and Structural Biology, Biological System Modeling Group, University of Connecticut Health Center, 263 Farmington Ave. Farmington, CT 06030-3305 and Memorial Sloan-Kettering Cancer Center, NY 10021, USA
- Nicholas Rubino
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06029-2155, Department of Molecular, Microbial, and Structural Biology, Biological System Modeling Group, University of Connecticut Health Center, 263 Farmington Ave. Farmington, CT 06030-3305 and Memorial Sloan-Kettering Cancer Center, NY 10021, USA
- Jay Vyas
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06029-2155, Department of Molecular, Microbial, and Structural Biology, Biological System Modeling Group, University of Connecticut Health Center, 263 Farmington Ave. Farmington, CT 06030-3305 and Memorial Sloan-Kettering Cancer Center, NY 10021, USA
- Martin R. Schiller
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06029-2155, Department of Molecular, Microbial, and Structural Biology, Biological System Modeling Group, University of Connecticut Health Center, 263 Farmington Ave. Farmington, CT 06030-3305 and Memorial Sloan-Kettering Cancer Center, NY 10021, USA
|
10
|
Liu B, Wang X, Lin L, Dong Q, Wang X. A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis. BMC Bioinformatics 2008; 9:510. [PMID: 19046430 PMCID: PMC2613933 DOI: 10.1186/1471-2105-9-510] [Citation(s) in RCA: 109] [Impact Index Per Article: 6.8] [Received: 03/23/2008] [Accepted: 12/01/2008] [Indexed: 11/23/2022]
Abstract
BACKGROUND Protein remote homology detection and fold recognition are central problems in bioinformatics. Currently, discriminative methods based on support vector machine (SVM) are the most effective and accurate methods for solving these problems. A key step to improve the performance of the SVM-based methods is to find a suitable representation of protein sequences. RESULTS In this paper, a novel building block of proteins called Top-n-grams is presented, which contains the evolutionary information extracted from the protein sequence frequency profiles. The protein sequence frequency profiles are calculated from the multiple sequence alignments output by PSI-BLAST and converted into Top-n-grams. The protein sequences are transformed into fixed-dimension feature vectors by counting the occurrences of each Top-n-gram. The training vectors are evaluated by SVM to train classifiers, which are then used to classify the test protein sequences. We demonstrate that the prediction performance of remote homology detection and fold recognition can be improved by combining Top-n-grams and latent semantic analysis (LSA), which is an efficient feature extraction technique from natural language processing. When tested on superfamily and fold benchmarks, the method combining Top-n-grams and LSA gives significantly better results compared to related methods. CONCLUSION The method based on Top-n-grams significantly outperforms the methods based on many other building blocks including N-grams, patterns, motifs and binary profiles. Therefore, Top-n-gram is a good building block of protein sequences and can be widely used in many tasks in computational biology, such as sequence alignment, the prediction of domain boundaries, the designation of knowledge-based potentials and the prediction of protein binding sites.
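The conversion from a frequency profile to a fixed-dimension count vector can be sketched as follows. The abstract does not spell out how a Top-n-gram is built, so this sketch assumes it is the n most frequent amino acids of one profile column, written in descending order of frequency. That construction, the dictionary-based profile format and all function names are assumptions made for illustration. The PSI-BLAST, SVM and LSA steps are not shown.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def top_n_gram(column, n):
    """Return the n most frequent residues of one profile column as a string.

    column: dict mapping amino acid -> frequency (assumed input format).
    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(AMINO_ACIDS, key=lambda aa: (-column.get(aa, 0.0), aa))
    return "".join(ranked[:n])


def vocabulary(n):
    """All 20**n strings of length n over the amino acid alphabet."""
    return ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]


def feature_vector(profile, n):
    """Count how often each possible Top-n-gram occurs along the profile."""
    counts = Counter(top_n_gram(column, n) for column in profile)
    return [counts[word] for word in vocabulary(n)]


# Toy three-column profile; frequencies not listed are treated as zero.
profile = [
    {"A": 0.6, "G": 0.3, "S": 0.1},
    {"L": 0.5, "I": 0.4, "V": 0.1},
    {"A": 0.7, "G": 0.2, "T": 0.1},
]
vector = feature_vector(profile, 2)
```

With n = 2 the vector has 400 entries regardless of sequence length, which is what makes it usable as classifier input. In this toy profile the entry for `AG` is 2 and the entry for `LI` is 1.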
Affiliation(s)
- Bin Liu
- Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, PR China
- Xiaolong Wang
- Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, PR China
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, PR China
- Lei Lin
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, PR China
- Qiwen Dong
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, PR China
- Xuan Wang
- Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, PR China
|
11
|
Ben-Hur A, Brutlag D. Sequence Motifs: Highly Predictive Features of Protein Function. Feature Extraction 2008. [DOI: 10.1007/978-3-540-35488-8_32] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 02/11/2023]
|
12
|
Wase NV, Wright PC. Systems biology of cyanobacterial secondary metabolite production and its role in drug discovery. Expert Opin Drug Discov 2008; 3:903-29. [DOI: 10.1517/17460441.3.8.903] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.4] [Indexed: 11/05/2022]
Affiliation(s)
- Nishikant V Wase
- The University of Sheffield, Biological and Environmental Systems Group, Department of Chemical and Process Engineering, Mappin St., Sheffield, S1 3JD, UK
- Phillip C Wright
- The University of Sheffield, Biological and Environmental Systems Group, Department of Chemical and Process Engineering, Mappin St., Sheffield, S1 3JD, UK
|
13
|
Wang H, Segal E, Ben-Hur A, Li QR, Vidal M, Koller D. InSite: a computational method for identifying protein-protein interaction binding sites on a proteome-wide scale. Genome Biol 2008; 8:R192. [PMID: 17868464 PMCID: PMC2375030 DOI: 10.1186/gb-2007-8-9-r192] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.9] [Received: 03/07/2007] [Revised: 07/25/2007] [Accepted: 09/14/2007] [Indexed: 12/30/2022]
Abstract
We propose InSite, a computational method that integrates high-throughput protein and sequence data to infer the specific binding regions of interacting protein pairs. We compared our predictions with binding sites in Protein Data Bank and found significantly more binding events occur at sites we predicted. Several regions containing disease-causing mutations or cancer polymorphisms in human are predicted to be binding for protein pairs related to the disease, which suggests novel mechanistic hypotheses for several diseases.
Affiliation(s)
- Haidong Wang
- Computer Science Department, Stanford University, Serra Mall, Stanford, CA 94305, USA
- Eran Segal
- Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 76100, Israel
- Asa Ben-Hur
- Computer Science Department, Colorado State University, South Howes Street, Fort Collins, CO 80523, USA
- Qian-Ru Li
- Center for Cancer Systems Biology (CCSB) and Department of Cancer Biology, Dana-Farber Cancer Institute, and Department of Genetics, Harvard Medical School, Binney Street, Boston, MA 02115, USA
- Marc Vidal
- Center for Cancer Systems Biology (CCSB) and Department of Cancer Biology, Dana-Farber Cancer Institute, and Department of Genetics, Harvard Medical School, Binney Street, Boston, MA 02115, USA
- Daphne Koller
- Computer Science Department, Stanford University, Serra Mall, Stanford, CA 94305, USA
|
14
|
Hsu CM, Chen CY, Liu BJ. MAGIIC-PRO: detecting functional signatures by efficient discovery of long patterns in protein sequences. Nucleic Acids Res 2008; 36:1400-6. [PMID: 18314547 PMCID: PMC3143912 DOI: 10.1093/nar/gkm717] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/12/2022]
Abstract
This paper presents a web service named MAGIIC-PRO, which aims to discover functional signatures of a query protein by sequential pattern mining. Automatic discovery of patterns from unaligned biological sequences is an important problem in molecular biology. MAGIIC-PRO is different from several previously established methods performing similar tasks in two major ways. The first remarkable feature of MAGIIC-PRO is its efficiency in delivering long patterns. With incorporating a new type of gap constraints and some of the state-of-the-art data mining techniques, MAGIIC-PRO usually identifies satisfied patterns within an acceptable response time. The efficiency of MAGIIC-PRO enables the users to quickly discover functional signatures of which the residues are not from only one region of the protein sequences or are only conserved in few members of a protein family. The second remarkable feature of MAGIIC-PRO is its effort in refining the mining results. Considering large flexible gaps improves the completeness of the derived functional signatures. The users can be directly guided to the patterns with as many blocks as that are conserved simultaneously. In this paper, we show by experiments that MAGIIC-PRO is efficient and effective in identifying ligand-binding sites and hot regions in protein-protein interactions directly from sequences. The web service is available at http://biominer.bime.ntu.edu.tw/magiicpro and a mirror site at http://biominer.cse.yzu.edu.tw/magiicpro.
Affiliation(s)
- Chen-Ming Hsu
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 320, Taiwan, Republic of China
|
15
|
Håndstad T, Hestnes AJH, Sætrom P. Motif kernel generated by genetic programming improves remote homology and fold detection. BMC Bioinformatics 2007; 8:23. [PMID: 17254344 PMCID: PMC1794419 DOI: 10.1186/1471-2105-8-23] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.8] [Received: 07/07/2006] [Accepted: 01/25/2007] [Indexed: 11/10/2022]
Abstract
BACKGROUND Protein remote homology detection is a central problem in computational biology. Most recent methods train support vector machines to discriminate between related and unrelated sequences and these studies have introduced several types of kernels. One successful approach is to base a kernel on shared occurrences of discrete sequence motifs. Still, many protein sequences fail to be classified correctly for lack of a suitable set of motifs for these sequences. RESULTS We introduce the GPkernel, which is a motif kernel based on discrete sequence motifs where the motifs are evolved using genetic programming. All proteins can be grouped according to evolutionary relations and structure, and the method uses this inherent structure to create groups of motifs that discriminate between different families of evolutionary origin. When tested on two SCOP benchmarks, the superfamily and fold recognition problems, the GPkernel gives significantly better results compared to related methods of remote homology detection. CONCLUSION The GPkernel gives particularly good results on the more difficult fold recognition problem compared to the other methods. This is mainly because the method creates motif sets that describe similarities among subgroups of both the related and unrelated proteins. This rich set of motifs gives a better description of the similarities and differences between different folds than do previous motif-based methods.
Affiliation(s)
- Tony Håndstad
- Department of Computer and Information Science, Norwegian University of Science and Technology, NO-7052, Trondheim, Norway
- Arne JH Hestnes
- Department of Computer and Information Science, Norwegian University of Science and Technology, NO-7052, Trondheim, Norway
- Pål Sætrom
- Department of Computer and Information Science, Norwegian University of Science and Technology, NO-7052, Trondheim, Norway
- Interagon AS, Laboratoriesenteret, NO-7006 Trondheim, Norway
|
16
|
Hsu CM, Chen CY, Liu BJ. MAGIIC-PRO: detecting functional signatures by efficient discovery of long patterns in protein sequences. Nucleic Acids Res 2006; 34:W356-61. [PMID: 16845025 PMCID: PMC1538832 DOI: 10.1093/nar/gkl309] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.0] [Indexed: 11/12/2022]
Abstract
This paper presents a web service named MAGIIC-PRO, which aims to discover functional signatures of a query protein by sequential pattern mining. Automatic discovery of patterns from unaligned biological sequences is an important problem in molecular biology. MAGIIC-PRO is different from several previously established methods performing similar tasks in two major ways. The first remarkable feature of MAGIIC-PRO is its efficiency in delivering long patterns. With incorporating a new type of gap constraints and some of the state-of-the-art data mining techniques, MAGIIC-PRO usually identifies satisfied patterns within an acceptable response time. The efficiency of MAGIIC-PRO enables the users to quickly discover functional signatures of which the residues are not from only one region of the protein sequences or are only conserved in few members of a protein family. The second remarkable feature of MAGIIC-PRO is its effort in refining the mining results. Considering large flexible gaps improves the completeness of the derived functional signatures. The users can be directly guided to the patterns with as many blocks as that are conserved simultaneously. In this paper, we show by experiments that MAGIIC-PRO is efficient and effective in identifying ligand-binding sites and hot regions in protein–protein interactions directly from sequences. The web service is available at and a mirror site at .
Affiliation(s)
- Chien-Yu Chen
- Department of Bio-Industrial Mechatronics Engineering, National Taiwan University, Taipei, 106, Taiwan, Republic of China
- To whom correspondence should be addressed. Tel: +886 2 33665334; Fax: +886 2 23627620;
17
Abstract
The protein-protein interaction networks of even well-studied model organisms are sketchy at best, highlighting the continued need for computational methods to help direct experimentalists in the search for novel interactions. This need has prompted the development of a number of methods for predicting protein-protein interactions based on various sources of data and methodologies. The common method for choosing negative examples for training a predictor of protein-protein interactions is based on annotations of cellular localization, and the observation that pairs of proteins that have different localization patterns are unlikely to interact. While this method leads to high-quality sets of non-interacting proteins, we find that this choice can lead to biased estimates of prediction accuracy, because the constraints placed on the distribution of the negative examples make the task easier. The effects of this bias are demonstrated in the context of both sequence-based and non-sequence-based features used for predicting protein-protein interactions.
Affiliation(s)
- Asa Ben-Hur
- Department of Computer Science, Colorado State University, Fort Collins CO, USA
- Department of Computer Science, University of Colorado, Boulder CO, USA
- William Stafford Noble
- Department of Genome Sciences, University of Washington, Seattle WA, USA
- Department of Computer Science and Engineering, University of Washington, Seattle WA, USA
18
Gutman R, Berezin C, Wollman R, Rosenberg Y, Ben-Tal N. QuasiMotiFinder: protein annotation by searching for evolutionarily conserved motif-like patterns. Nucleic Acids Res 2005; 33:W255-61. [PMID: 15980465 PMCID: PMC1160256 DOI: 10.1093/nar/gki496] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.6] [Indexed: 11/13/2022]
Abstract
Sequence signature databases such as PROSITE, which include amino acid segments that are indicative of a protein's function, are useful for protein annotation. Lamentably, the annotation is not always accurate. A signature may be falsely detected in a protein that does not carry out the associated function (false positive prediction, FP) or may be overlooked in a protein that does carry out the function (false negative prediction, FN). A new approach has emerged in which a signature is replaced with a sequence profile, calculated based on multiple sequence alignment (MSA) of homologous proteins that share the same function. This approach, which is superior to the simple pattern search, essentially searches with the sequence of the query protein against an MSA library. We suggest here an alternative approach, implemented in the QuasiMotiFinder web server (http://quasimotifinder.tau.ac.il/), which is based on a search with an MSA of homologous query proteins against the original PROSITE signatures. The explicit use of the average evolutionary conservation of the signature in the query proteins significantly reduces the rate of FP prediction compared with the simple pattern search. QuasiMotiFinder also has a reduced rate of FN prediction compared with simple pattern searches, since the traditional search for precise signatures has been replaced by a permissive search for signature-like patterns that are physicochemically similar to known signatures. Overall, QuasiMotiFinder and the profile search are comparable to each other in terms of performance. They are also complementary to each other in that signatures that are falsely detected in (or overlooked by) one may be correctly detected by the other.
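For orientation, the "simple pattern search" that the abstract uses as its baseline can be sketched in a few lines of Python. This is not QuasiMotiFinder's method: it ignores evolutionary conservation and physicochemical similarity entirely. The helper names `prosite_to_regex` and `find_sites` and the example sequence are hypothetical, and only the common subset of PROSITE pattern syntax is handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles the common subset of the syntax: 'x' (any residue), [..] (allowed
    residues), {..} (forbidden residues), (n) / (n,m) repeats, and the
    '<' / '>' terminal anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):
            prefix, element = "^", element[1:]
        if element.endswith(">"):
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":
            core = "."
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_sites(pattern: str, sequence: str):
    """Return (start, matched_segment) pairs, including overlapping matches."""
    # The lookahead lets matches overlap; the inner group captures the segment.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


# PROSITE's N-glycosylation site pattern, searched in a made-up sequence.
print(prosite_to_regex("N-{P}-[ST]-{P}."))
print(find_sites("N-{P}-[ST]-{P}.", "MKNGSAQNPTL"))
```

An exact search of this kind reports every literal occurrence of the signature. That is why short patterns such as the one above tend to produce false positives, and why near-miss variants go undetected.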
Affiliation(s)
- Nir Ben-Tal
- To whom correspondence should be addressed. Tel: +972 3 640 6709; Fax: +972 3 640 6834;