1
|
Bernardes J, Zaverucha G, Vaquero C, Carbone A. Improvement in Protein Domain Identification Is Reached by Breaking Consensus, with the Agreement of Many Profiles and Domain Co-occurrence. PLoS Comput Biol 2016; 12:e1005038. [PMID: 27472895 PMCID: PMC4966962 DOI: 10.1371/journal.pcbi.1005038] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2015] [Accepted: 06/28/2016] [Indexed: 11/30/2022] Open
Abstract
Traditional protein annotation methods describe known domains with probabilistic models representing consensus among homologous domain sequences. However, when relevant signals become too weak to be identified by a global consensus, attempts for annotation fail. Here we address the fundamental question of domain identification for highly divergent proteins. By using high performance computing, we demonstrate that the limits of state-of-the-art annotation methods can be bypassed. We design a new strategy based on the observation that many structural and functional protein constraints are not globally conserved through all species but might be locally conserved in separate clades. We propose a novel exploitation of the large amount of data available: 1. for each known protein domain, several probabilistic clade-centered models are constructed from a large and differentiated panel of homologous sequences, 2. a decision-making protocol combines outcomes obtained from multiple models, 3. a multi-criteria optimization algorithm finds the most likely protein architecture. The method is evaluated for domain and architecture prediction over several datasets and statistical testing hypotheses. Its performance is compared against HMMScan and HHblits, two widely used search methods based on sequence-profile and profile-profile comparison. Due to their closeness to actual protein sequences, clade-centered models are shown to be more specific and functionally predictive than the broadly used consensus models. Based on them, we improved annotation of Plasmodium falciparum protein sequences on a scale not previously possible. We successfully predict at least one domain for 72% of P. falciparum proteins against 63% achieved previously, corresponding to 30% of improvement over the total number of Pfam domain predictions on the whole genome. The method is applicable to any genome and opens new avenues to tackle evolutionary questions such as the reconstruction of ancient domain duplications, the reconstruction of the history of protein architectures, and the estimation of protein domain age. Website and software: http://www.lcqb.upmc.fr/CLADE. Current sequence databases contain hundreds of billions of nucleotides coding for genes and a classification of these sequences is a primary problem in genomics. A reasonable way to organize these sequences is through their predicted domains, but the identification of domains in very divergent sequences, spanning the entire phylogenetic tree of species, is a difficult problem. By generating multiple probabilistic models for a domain, describing the spread of evolutionary patterns in different phylogenetic clades, we can effectively explore domains that are likely to be coded in gene sequences. Through a machine learning approach and optimization techniques, coding for expected evolutionary constraints, we filter the many possibilities of domain identification found for a gene and propose the most likely domain architecture associated to it. The application of this novel approach to the full genome of Plasmodium falciparum, to a dataset of sequences from three SCOP datasets highlights the interest of exploring multiple pathways of domain evolution in the aim of extracting biological information from genomic sequences. Our new computational approach was developed with the hope of providing a novel tier of accurate and precise tools that complement existing tools such as HMMer, HHblits and PSI-BLAST, by exploring in a novel way the large amount of sequence data available. The existence of powerful databases for sequences, domains and architectures help make this hope a reality.
Collapse
Affiliation(s)
- Juliana Bernardes
- Sorbonne Universités, UPMC Univ-Paris 6, CNRS, UMR 7238, Laboratoire de Biologie Computationnelle et Quantitative, Paris, France
- * E-mail: (JB); (AC)
| | - Gerson Zaverucha
- COPPE, Programa de Engenharia de Sistemas e Computação, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil
| | - Catherine Vaquero
- Sorbonne Universités, UPMC Univ-Paris 6, INSERM U1135, CNRS ERL 8255, Centre d’Immunologie et des Maladies Infectieuses (CIMI-Paris), Paris, France
| | - Alessandra Carbone
- Sorbonne Universités, UPMC Univ-Paris 6, CNRS, UMR 7238, Laboratoire de Biologie Computationnelle et Quantitative, Paris, France
- Institut Universitaire de France, Paris, France
- * E-mail: (JB); (AC)
| |
Collapse
|
2
|
Oh Brother, Where Art Thou? Finding Orthologs in the Twilight and Midnight Zones of Sequence Similarity. Evol Biol 2016. [DOI: 10.1007/978-3-319-41324-2_22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/21/2022]
|
3
|
König C, Cárdenas MI, Giraldo J, Alquézar R, Vellido A. Label noise in subtype discrimination of class C G protein-coupled receptors: A systematic approach to the analysis of classification errors. BMC Bioinformatics 2015; 16:314. [PMID: 26415951 PMCID: PMC4587730 DOI: 10.1186/s12859-015-0731-9] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2015] [Accepted: 08/31/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The characterization of proteins in families and subfamilies, at different levels, entails the definition and use of class labels. When the adscription of a protein to a family is uncertain, or even wrong, this becomes an instance of what has come to be known as a label noise problem. Label noise has a potentially negative effect on any quantitative analysis of proteins that depends on label information. This study investigates class C of G protein-coupled receptors, which are cell membrane proteins of relevance both to biology in general and pharmacology in particular. Their supervised classification into different known subtypes, based on primary sequence data, is hampered by label noise. The latter may stem from a combination of expert knowledge limitations and the lack of a clear correspondence between labels that mostly reflect GPCR functionality and the different representations of the protein primary sequences. RESULTS In this study, we describe a systematic approach, using Support Vector Machine classifiers, to the analysis of G protein-coupled receptor misclassifications. As a proof of concept, this approach is used to assist the discovery of labeling quality problems in a curated, publicly accessible database of this type of proteins. We also investigate the extent to which physico-chemical transformations of the protein sequences reflect G protein-coupled receptor subtype labeling. The candidate mislabeled cases detected with this approach are externally validated with phylogenetic trees and against further trusted sources such as the National Center for Biotechnology Information, Universal Protein Resource, European Bioinformatics Institute and Ensembl Genome Browser information repositories. CONCLUSIONS In quantitative classification problems, class labels are often by default assumed to be correct. Label noise, though, is bound to be a pervasive problem in bioinformatics, where labels may be obtained indirectly through complex, many-step similarity modelling processes. In the case of G protein-coupled receptors, methods capable of singling out and characterizing those sequences with consistent misclassification behaviour are required to minimize this problem. A systematic, Support Vector Machine-based method has been proposed in this study for such purpose. The proposed method enables a filtering approach to the label noise problem and might become a support tool for database curators in proteomics.
Collapse
Affiliation(s)
- Caroline König
- Dept. of Computer Science, Univ. Politècnica de Catalunya, C. Jordi Girona, 1-3, Barcelona, 08034, Spain.
| | - Martha I Cárdenas
- Dept. of Computer Science, Univ. Politècnica de Catalunya, C. Jordi Girona, 1-3, Barcelona, 08034, Spain. .,Institut de Neurociències, Unitat de Bioestadística, Univ. Autònoma de Barcelona, Cerdanyola del Vallès, Barcelona, 08193, Spain.
| | - Jesús Giraldo
- Institut de Neurociències, Unitat de Bioestadística, Univ. Autònoma de Barcelona, Cerdanyola del Vallès, Barcelona, 08193, Spain.
| | - René Alquézar
- Dept. of Computer Science, Univ. Politècnica de Catalunya, C. Jordi Girona, 1-3, Barcelona, 08034, Spain. .,Institut de Robòtica i Informàtica Industrial, CSIC-UPC, Barcelona, 08034, Spain.
| | - Alfredo Vellido
- Dept. of Computer Science, Univ. Politècnica de Catalunya, C. Jordi Girona, 1-3, Barcelona, 08034, Spain. .,Centro de Investigación Biomédica en Red en Bioingeniería, Biomateriales y Nanomedicina (CIBER-BBN), Cerdanyola del Vallès, Barcelona, 08193, Spain.
| |
Collapse
|
4
|
Remote homology detection incorporating the context of physicochemical properties. Comput Biol Med 2014; 45:43-50. [PMID: 24480162 DOI: 10.1016/j.compbiomed.2013.11.012] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/21/2013] [Revised: 11/10/2013] [Accepted: 11/18/2013] [Indexed: 11/22/2022]
Abstract
A new method for remote protein homology detection, called support vector machine incorporating the context of physicochemical properties (SVM-CP), is presented. Recent discriminative methods are based on concatenating information extracted from each protein by considering several physicochemical properties. We show that there are physicochemical properties that reflect the functional or structural characteristics of each specific protein family, but there are also some physicochemical properties that affect the accuracy of the classification techniques. The research highlights the importance of the selection of physicochemical properties in remote homology detection. Most of the methods slide a window over every protein sequence to extract physicochemical information. This extraction is usually performed by giving the same importance to every value in the window, i.e., averaging the physicochemical values in the observation window. SVM-CP takes into account that every residue in a sliding window has a different weight, which reflects the importance or contribution to the representative value of the window. The SVM-CP method reaches a receiver operating characteristic (ROC) score of 0.93462, which is the highest value for a remote homology detection method based on the sequence composition information.
Collapse
|
5
|
Gullotto D, Nolassi MS, Bernini A, Spiga O, Niccolai N. Probing the protein space for extending the detection of weak homology folds. J Theor Biol 2013; 320:152-8. [DOI: 10.1016/j.jtbi.2012.12.005] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2012] [Revised: 11/03/2012] [Accepted: 12/05/2012] [Indexed: 12/19/2022]
|
6
|
Liu B, Wang X, Chen Q, Dong Q, Lan X. Using amino acid physicochemical distance transformation for fast protein remote homology detection. PLoS One 2012; 7:e46633. [PMID: 23029559 PMCID: PMC3460876 DOI: 10.1371/journal.pone.0046633] [Citation(s) in RCA: 81] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2012] [Accepted: 09/03/2012] [Indexed: 11/18/2022] Open
Abstract
Protein remote homology detection is one of the most important problems in bioinformatics. Discriminative methods such as support vector machines (SVM) have shown superior performance. However, the performance of SVM-based methods depends on the vector representations of the protein sequences. Prior works have demonstrated that sequence-order effects are relevant for discrimination, but little work has explored how to incorporate the sequence-order information along with the amino acid physicochemical properties into the prediction. In order to incorporate the sequence-order effects into the protein remote homology detection, the physicochemical distance transformation (PDT) method is proposed. Each protein sequence is converted into a series of numbers by using the physicochemical property scores in the amino acid index (AAIndex), and then the sequence is converted into a fixed length vector by PDT. The sequence-order information can be efficiently included into the feature vector with little computational cost by this approach. Finally, the feature vectors are input into a support vector machine classifier to detect the protein remote homologies. Our experiments on a well-known benchmark show the proposed method SVM-PDT achieves superior or comparable performance with current state-of-the-art methods and its computational cost is considerably superior to those of other methods. When the evolutionary information extracted from the frequency profiles is combined with the PDT method, the profile-based PDT approach can improve the performance by 3.4% and 11.4% in terms of ROC score and ROC50 score respectively. The local sequence-order information of the protein can be efficiently captured by the proposed PDT and the physicochemical properties extracted from the amino acid index are incorporated into the prediction. The physicochemical distance transformation provides a general framework, which would be a valuable tool for protein-level study.
Collapse
Affiliation(s)
- Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, People's Republic of China.
| | | | | | | | | |
Collapse
|