1
Bernardes JS, Carbone A, Zaverucha G. A discriminative method for family-based protein remote homology detection that combines inductive logic programming and propositional models. BMC Bioinformatics 2011; 12:83. [PMID: 21429187 PMCID: PMC3078102 DOI: 10.1186/1471-2105-12-83] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 09/21/2010] [Accepted: 03/23/2011] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND Remote homology detection is a hard computational problem. Most approaches have trained computational models by using either full protein sequences or multiple sequence alignments (MSA), including all positions. However, for proteins in the "twilight zone", only some segments of the sequences (motifs) are conserved. We introduce a novel logical representation that allows us to represent physico-chemical properties of sequences, conserved amino acid positions and conserved physico-chemical positions in the MSA. From this, Inductive Logic Programming (ILP) finds the most frequent patterns (motifs) and uses them to train propositional models, such as decision trees and support vector machines (SVM). RESULTS We use the SCOP database to perform our experiments by evaluating protein recognition within the same superfamily. Our results show that our methodology, when using SVM, performs significantly better than some state-of-the-art methods and comparably to others. However, our method provides a comprehensible set of logical rules that can help to understand what determines a protein function. CONCLUSIONS The strategy of selecting only the most frequent patterns is effective for remote homology detection. This is possible through a suitable first-order logical representation of homologous properties, and through a set of frequent patterns, found by an ILP system, that summarizes essential features of protein functions.
Affiliation(s)
- Juliana S Bernardes
- COPPE, Programa de Engenharia de Sistemas e Computação, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil
- Université Pierre et Marie Curie, UMR7238, Génomique Analytique, 15 rue de l'Ecole de Médecine, F-75006 Paris, France
- Alessandra Carbone
- Université Pierre et Marie Curie, UMR7238, Génomique Analytique, 15 rue de l'Ecole de Médecine, F-75006 Paris, France
- CNRS, UMR7238, Laboratoire de Génomique des Microorganismes, F-75006 Paris, France
- Gerson Zaverucha
- COPPE, Programa de Engenharia de Sistemas e Computação, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil
2
Gianazza E, Eberini I, Sensi C, Barile M, Vergani L, Vanoni MA. Energy matters: mitochondrial proteomics for biomedicine. Proteomics 2011; 11:657-74. [PMID: 21241019 DOI: 10.1002/pmic.201000412] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 07/05/2010] [Revised: 09/22/2010] [Accepted: 11/03/2010] [Indexed: 12/16/2022]
Abstract
This review compiles results of medical relevance from mitochondrial proteomics, grouped either according to the type of disease (genetic or degenerative) or to the mechanism involved (oxidative stress or apoptosis). The findings are discussed in light of our current understanding of uniformity and variability in cell responses to different stimuli. Specificities in the conceptual and technical approaches to human mitochondrial proteomics are also outlined.
Affiliation(s)
- Elisabetta Gianazza
- Dipartimento di Scienze Farmacologiche, Università degli Studi di Milano, Milano, Italy.
3
Latino DARS, Aires-de-Sousa J. Assignment of EC numbers to enzymatic reactions with MOLMAP reaction descriptors and random forests. J Chem Inf Model 2009; 49:1839-46. [PMID: 19588957 DOI: 10.1021/ci900104b] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.5] [Indexed: 11/29/2022]
Abstract
The MOLMAP descriptor relies on a Kohonen SOM that defines types of covalent bonds on the basis of their physicochemical and topological properties. The MOLMAP descriptor of a molecule represents the types of bonds available in that molecule. The MOLMAP descriptor of a reaction is defined as the difference between the MOLMAPs of the products and the reactants and numerically encodes the pattern of changes in bonds during a chemical reaction. In this study, a genome-scale data set of enzymatic reactions available in the KEGG database was encoded by the MOLMAP descriptors and was explored for the assignment of the official EC number from the reaction equation with Random Forests as the machine learning algorithm. EC numbers were correctly assigned in 95%, 90%, and 85% of the cases (for independent test sets) at the class, subclass, and subsubclass EC number level, respectively, with training sets including one reaction from each available full EC number. Increasing differences between training and test sets were explored, leading to decreased percentages of correct assignments. The classification of reactions only from the main reactants and products was obtained at the class, subclass, and subsubclass level with accuracies of 78%, 74%, and 63%, respectively.
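As a rough illustration of the reaction descriptor described above, the following sketch treats a molecule's MOLMAP as a vector of bond counts per SOM neuron and the reaction descriptor as products minus reactants. The neuron indices, the grid size, and all function names are hypothetical; training the Kohonen SOM and any additional weighting the authors may apply are omitted.

```python
from collections import Counter


def molmap(bond_neurons, grid_size):
    """Count how many bonds of one molecule fall on each SOM neuron.

    bond_neurons: list of neuron indices (assumed already assigned by a
    trained SOM, which is not implemented here).
    """
    counts = Counter(bond_neurons)
    return [counts.get(i, 0) for i in range(grid_size)]


def sum_maps(maps, grid_size):
    """Add up the MOLMAP vectors of several molecules element-wise."""
    total = [0] * grid_size
    for m in maps:
        for i, value in enumerate(m):
            total[i] += value
    return total


def reaction_descriptor(reactants, products, grid_size):
    """Difference between the summed MOLMAPs of products and reactants."""
    r = sum_maps([molmap(b, grid_size) for b in reactants], grid_size)
    p = sum_maps([molmap(b, grid_size) for b in products], grid_size)
    return [pi - ri for pi, ri in zip(p, r)]


# Toy example: one reactant with three bonds, two products.
descriptor = reaction_descriptor([[0, 0, 1]], [[0, 2], [1]], grid_size=3)
print(descriptor)
```

Negative entries mark bond types that disappear in the reaction and positive entries mark bond types that are formed, which is the "pattern of changes in bonds" the abstract refers to.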
Affiliation(s)
- Diogo A R S Latino
- CQFB, REQUIMTE, Departamento de Quimica, Faculdade de Ciencias e Tecnologia, Universidade Nova de Lisboa, 2829-516 Caparica, Portugal
4
Weber GW, Ozöğür-Akyüz S, Kropat E. A review on data mining and continuous optimization applications in computational biology and medicine. Birth Defects Res C Embryo Today 2009; 87:165-81. [PMID: 19530130 DOI: 10.1002/bdrc.20151] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Indexed: 11/07/2022]
Abstract
An emerging research area in computational biology and biotechnology is devoted to mathematical modeling and prediction of gene-expression patterns; it now calls on mathematics for a deeper understanding of its foundations. This article surveys data mining and machine learning methods for the analysis of complex systems in computational biology. It mathematically deepens recent advances in modeling and prediction by rigorously introducing the environment and aspects of errors and uncertainty into the genetic context within the framework of matrix and interval arithmetics. Given the data from DNA microarray experiments and environmental measurements, we extract nonlinear ordinary differential equations which contain parameters that are to be determined. This is done by a generalized Chebychev approximation and generalized semi-infinite optimization. Then, time-discretized dynamical systems are studied. By a combinatorial algorithm which constructs and follows polyhedra sequences, the region of parametric stability is detected. In addition, we analyze the topological landscape of gene-environment networks in terms of structural stability. As a second strategy, we review recent model selection and kernel learning methods for binary classification which can be used to classify microarray data for cancerous cells or to discriminate between other kinds of diseases. This review is practically motivated and theoretically elaborated; it is devoted to a contribution to better health care, progress in medicine, a better education, and more healthy living conditions.
Affiliation(s)
- Gerhard-Wilhelm Weber
- Institute of Applied Mathematics, Middle East Technical University, Ankara 06531, Turkey
5
Latino DARS, Zhang QY, Aires-de-Sousa J. Genome-scale classification of metabolic reactions and assignment of EC numbers with self-organizing maps. Bioinformatics 2008; 24:2236-44. [PMID: 18676416 DOI: 10.1093/bioinformatics/btn405] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.9] [Indexed: 11/14/2022]
Abstract
MOTIVATION The automatic perception of chemical similarities between metabolic reactions is required for a variety of applications ranging from the computer-aided validation of classification systems, to genome-scale reconstruction (or comparison) of metabolic pathways, to the classification of enzymatic mechanisms. Comparison of metabolic reactions has been mostly based on Enzyme Commission (EC) numbers, which are extremely useful and widespread, but not always straightforward to apply, and often problematic when an enzyme catalyzes several reactions, when the same reaction is catalyzed by different enzymes, when official full EC numbers are unavailable or when reactions are not catalyzed by enzymes. Different methods should be available to compare metabolic reactions. Simultaneously, methods are required for the automatic assignment of EC numbers to reactions still not officially classified. RESULTS We have proposed the MOLMAP reaction descriptors to numerically encode the structural transformations resulting from a chemical reaction. Here, such descriptors are applied to the mapping of a genome-scale database of almost 4000 metabolic reactions by Kohonen self-organizing maps (SOMs), and its screening for inconsistencies in EC numbers. This approach allowed for the SOMs to assign EC numbers at the class, subclass and sub-subclass levels for reactions of independent test sets with accuracies up to 92, 80 and 70%, respectively. Different levels of similarity between training and test sets were explored. The approach also led to the identification of a number of similar reactions bearing differences at the EC class level. AVAILABILITY The programs to generate MOLMAP descriptors from atomic properties included in SDF files are available upon request for evaluation.
Affiliation(s)
- Diogo A R S Latino
- CQFB, REQUIMTE, Departamento de Química, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, 2829-516 Caparica, Portugal
6
Exarchos KP, Papaloukas C, Exarchos TP, Troganis AN, Fotiadis DI. Prediction of cis/trans isomerization using feature selection and support vector machines. J Biomed Inform 2008; 42:140-9. [PMID: 18586558 DOI: 10.1016/j.jbi.2008.05.006] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Received: 02/26/2008] [Revised: 04/26/2008] [Accepted: 05/12/2008] [Indexed: 10/22/2022]
Abstract
In protein structures the peptide bond is found in the trans conformation in the majority of cases. Only a small fraction of peptide bonds in proteins is reported to be in the cis conformation. Most of these instances (>90%) occur when the peptide bond is an imide (X-Pro) rather than an amide bond (X-nonPro). Due to the implication of cis/trans isomerization in many biologically significant processes, the accurate prediction of the peptide bond conformation is of high interest. In this study, we evaluate the effect of a wide range of features on the reliable prediction of both proline and non-proline cis/trans isomerization. We use evolutionary profiles, secondary structure information, real-valued solvent accessibility predictions for each amino acid and the physicochemical properties of the surrounding residues. We also explore the predictive impact of a modified feature vector, which consists of condensed position-specific scoring matrices (PSSMX), secondary structure and solvent accessibility. The best discriminating ability is achieved using the first feature vector combined with a wrapper feature selection algorithm and a support vector machine (SVM). The proposed method results in 70% accuracy, 75% sensitivity and 71% positive predictive value (PPV) in the prediction of the peptide bond conformation between any two amino acids. The output of the feature selection stage is investigated in order to identify discriminatory features as well as the contribution of each neighboring residue to the formation of the peptide bond, thus advancing our knowledge of cis/trans isomerization.
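For readers less familiar with the three figures reported above, the sketch below shows how accuracy, sensitivity and PPV are conventionally computed from a binary confusion matrix. The function name and the counts in the usage example are hypothetical and are not taken from the paper.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    tp/fp/tn/fn: true positives, false positives, true negatives,
    false negatives.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # fraction of actual positives recovered
    ppv = tp / (tp + fp)           # fraction of positive calls that are right
    return {"accuracy": accuracy, "sensitivity": sensitivity, "ppv": ppv}


# Illustrative counts only.
metrics = binary_metrics(tp=3, fp=1, tn=4, fn=2)
print(metrics)
```

Reporting sensitivity and PPV alongside accuracy matters here because cis peptide bonds are rare, so accuracy alone could look high for a classifier that almost never predicts cis.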
Affiliation(s)
- Konstantinos P Exarchos
- Unit of Medical Technology and Intelligent Information Systems, Department of Computer Science, University of Ioannina, P.O. Box 1186, GR 45110 Ioannina, Greece
7
Shah AR, Oehmen CS, Webb-Robertson BJ. SVM-HUSTLE--an iterative semi-supervised machine learning approach for pairwise protein remote homology detection. Bioinformatics 2008; 24:783-90. [DOI: 10.1093/bioinformatics/btn028] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.9] [Indexed: 11/14/2022] Open
8
Sarac OS, Gürsoy-Yüzügüllü O, Cetin-Atalay R, Atalay V. Subsequence-based feature map for protein function classification. Comput Biol Chem 2007; 32:122-30. [PMID: 18243801 DOI: 10.1016/j.compbiolchem.2007.11.004] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Received: 08/09/2007] [Accepted: 11/30/2007] [Indexed: 11/19/2022]
Abstract
Automated classification of proteins is indispensable for further in vivo investigation of the excessive number of unknown sequences generated by large-scale molecular biology techniques. This study describes a discriminative system based on feature space mapping, called subsequence profile map (SPMap), for functional classification of protein sequences. SPMap takes into account the information coming from the subsequences of a protein. A group of protein sequences that belong to the same level of classification is decomposed into fixed-length subsequences, which are clustered to obtain a representative feature space mapping. Mapping is defined as the distribution of the subsequences of a protein sequence over these clusters. The resulting feature space representation is used to train discriminative classifiers for functional families. The aim of this approach is to incorporate information coming from important subregions that are conserved over a family of proteins while avoiding the difficult task of explicit motif identification. The performance of the method was assessed through tests on various protein classification tasks. Our results showed that SPMap is capable of high accuracy classification in most of these tasks. Furthermore, SPMap is fast and scalable enough to handle large datasets.
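The mapping step described above (subsequences distributed over clusters) can be sketched as follows. This is only an illustration under stated assumptions: the cluster representatives are given as plain strings, similarity is a simple Hamming distance, and all names are hypothetical; the abstract does not specify the clustering or scoring actually used by SPMap.

```python
def subsequences(seq, k):
    """All overlapping fixed-length subsequences of a protein sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def feature_vector(seq, centers, k):
    """Fraction of the sequence's subsequences assigned to each cluster.

    centers: one representative string of length k per cluster (assumed to
    come from a separate clustering step that is not shown here).
    """
    counts = [0] * len(centers)
    for sub in subsequences(seq, k):
        nearest = min(range(len(centers)), key=lambda j: hamming(sub, centers[j]))
        counts[nearest] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(centers)


# Toy example: three 2-mers (AA, AA, AB) mapped onto two clusters.
vector = feature_vector("AAAB", centers=["AA", "AB"], k=2)
print(vector)
```

The resulting fixed-length vector is what a discriminative classifier such as an SVM would be trained on, regardless of the original sequence length.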
Affiliation(s)
- Omer Sinan Sarac
- Department of Computer Engineering, Middle East Technical University, 06531 Ankara, Turkey
9
Ballabio D, Consonni V, Todeschini R. Classification of multiway analytical data based on MOLMAP approach. Anal Chim Acta 2007; 605:134-46. [PMID: 18036376 DOI: 10.1016/j.aca.2007.10.029] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Received: 08/16/2007] [Revised: 10/17/2007] [Accepted: 10/18/2007] [Indexed: 11/26/2022]
Abstract
A new method for the study of molecular chemical information organized into three-way data structures (MOLMAP) was recently proposed in the literature. Basically, MOLMAP molecular fingerprints are calculated by projecting bond properties of molecules into Kohonen networks and are used to generate molecular descriptors for QSAR modeling. Since this technique has never been applied to other kinds of chemical multiway data, in this study classification models were built by means of the MOLMAP approach on three-way analytical datasets of electronic nose and fluorescence data. For comparison purposes, other classification methods were applied to the same datasets: Discriminant Analysis on the PARAFAC scores and Partial Least Squares-Discriminant Analysis (PLS-DA) on the unfolded data. Since the MOLMAP approach provided good results for the analyzed datasets, we propose that it be used as a general technique for the classification of multiway datasets. Besides the good classification performance, other advantages emerged: (a) the MOLMAP scores appeared to be effective fingerprints for data characterization; (b) the role and importance of each portion of the multiway data can be analyzed in a comprehensive way; (c) it is possible to understand which variables have greater discriminant power and consequently apply data reduction.
Affiliation(s)
- Davide Ballabio
- Milano Chemometrics and QSAR Research Group, Department of Environmental Sciences, University of Milano-Bicocca, P.za della Scienza 1, 20126 Milano, Italy.
10
Oğul H, Mumcuoğu EU. Subcellular localization prediction with new protein encoding schemes. IEEE/ACM Trans Comput Biol Bioinform 2007; 4:227-32. [PMID: 17473316 DOI: 10.1109/tcbb.2007.070209] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Indexed: 05/15/2023]
Abstract
Subcellular localization is one of the key properties in functional annotation of proteins. Support vector machines (SVMs) have been widely used for automated prediction of subcellular localizations. Existing methods differ in the protein encoding schemes used. In this study, we present two methods for protein encoding to be used for SVM-based subcellular localization prediction: n-peptide compositions with reduced amino acid alphabets for larger values of n, and pairwise sequence similarity scores based on the whole sequence and the N-terminal sequence. We tested the methods on a common benchmarking data set that consists of 2,427 eukaryotic proteins with four localization sites. In 5-fold cross-validation tests, the encoding with n-peptide compositions provided accuracies of 84.5, 88.9, 66.3, and 94.3 percent for cytoplasmic, extracellular, mitochondrial, and nuclear proteins, with an overall accuracy of 87.1 percent. The second method provided 83.6, 87.7, 87.9, and 90.5 percent accuracies for individual locations and 87.8 percent overall accuracy. A hybrid system, which we called PredLOC, makes a final decision based on the results of the two presented methods and achieved an overall accuracy of 91.3 percent, which is better than that of many of the existing methods. The new system also outperformed the recent methods in the experiments conducted on a new-unique SWISSPROT test set.
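The first encoding mentioned above, n-peptide composition over a reduced amino acid alphabet, might look roughly like this. The four-group alphabet below is an illustrative assumption, not the grouping used in the paper, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Hypothetical reduced alphabet: hydrophobic, polar, basic, acidic.
GROUPS = {"h": "AVLIMFWC", "p": "STNQYGP", "b": "KRH", "a": "DE"}
TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Rewrite a protein sequence in the reduced alphabet.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    return "".join(TO_GROUP[aa] for aa in seq)


def npeptide_composition(seq, n):
    """Relative frequency of every possible n-peptide in the reduced alphabet."""
    reduced = reduce_sequence(seq)
    grams = [reduced[i:i + n] for i in range(len(reduced) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    keys = ["".join(p) for p in product(sorted(GROUPS), repeat=n)]
    return {k: (counts[k] / total if total else 0.0) for k in keys}


# Toy example: "AKDE" becomes "hbaa", giving dipeptides hb, ba, aa.
composition = npeptide_composition("AKDE", 2)
print({k: v for k, v in composition.items() if v > 0})
```

With 20 letters the feature space grows as 20^n, while a four-letter alphabet grows as 4^n, which is why a reduced alphabet makes larger values of n practical.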
Affiliation(s)
- Hasan Oğul
- Department of Computer Engineering, Baskent University, Ankara, Turkey.
11
Abstract
Nuclear localization of proteins is a crucial element in the dynamic life of the cell. It is complicated by the massive diversity of targeting signals and the existence of proteins that shuttle between the nucleus and cytoplasm. Nevertheless, a majority of subcellular localization tools that predict nuclear proteins have been developed without involving dual localized proteins in the data sets. Hence, in general, the existing models are focused on predicting statically nuclear proteins, rather than nuclear localization itself. We present an independent analysis of existing nuclear localization predictors, using a nonredundant data set extracted from Swiss-Prot R50.0. We demonstrate that accuracy on truly novel proteins is lower than previously estimated, and that existing models generalize poorly to dual localized proteins. We have developed a model trained to identify nuclear proteins including dual localized proteins. The results suggest that using more recent data and including dual localized proteins improves the overall prediction. The final predictor NUCLEO operates with a realistic success rate of 0.70 and a correlation coefficient of 0.38, as established on the independent test set. (NUCLEO is available at http://pprowler.itee.uq.edu.au.)
Affiliation(s)
- John Hawkins
- ARC Centre for Complex Systems, School of Information Technology and Electrical Engineering, University of Queensland, QLD 4072, Australia.
12
Shah AR, Oehmen CS, Harper J, Webb-Robertson BJM. Integrating subcellular location for improving machine learning models of remote homology detection in eukaryotic organisms. Comput Biol Chem 2007; 31:138-42. [PMID: 17416337 DOI: 10.1016/j.compbiolchem.2007.02.012] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 02/15/2007] [Accepted: 02/20/2007] [Indexed: 11/30/2022]
Abstract
A significant challenge in homology detection is to identify sequences that share a common evolutionary ancestor, despite significant primary sequence divergence. Remote homologs will often have less than 30% sequence identity, yet still retain common structural and functional properties. We demonstrate a novel method for identifying remote homologs using a support vector machine (SVM) classifier trained by fusing sequence similarity scores and subcellular location prediction. SVMs have been shown to perform well in a variety of applications where binary classification of data is the goal. At the same time, data fusion methods have been shown to be highly effective in enhancing discriminative power of data. Combining these two approaches in the application SVM-SimLoc resulted in identification of significantly more remote homologs (p-value<0.006) than using either sequence similarity or subcellular location independently.
Affiliation(s)
- Anuj R Shah
- Computational Biology & Bioinformatics, Pacific Northwest National Laboratory, Richland, WA 99352, USA.
13
Kuznetsov IB, Gou Z, Li R, Hwang S. Using evolutionary and structural information to predict DNA‐binding sites on DNA‐binding proteins. Proteins 2006; 64:19-27. [PMID: 16568445 DOI: 10.1002/prot.20977] [Citation(s) in RCA: 106] [Impact Index Per Article: 5.6] [Indexed: 11/11/2022]
Abstract
Proteins that interact with DNA are involved in a number of fundamental biological activities such as DNA replication, transcription, and repair. A reliable identification of DNA-binding sites in DNA-binding proteins is important for functional annotation, site-directed mutagenesis, and modeling protein-DNA interactions. We apply Support Vector Machine (SVM), a supervised pattern recognition method, to predict DNA-binding sites in DNA-binding proteins using the following features: amino acid sequence, profile of evolutionary conservation of sequence positions, and low-resolution structural information. We use a rigorous statistical approach to study the performance of predictors that utilize different combinations of features and how this performance is affected by structural and sequence properties of proteins. Our results indicate that an SVM predictor based on a properly scaled profile of evolutionary conservation in the form of a position-specific scoring matrix (PSSM) significantly outperforms a PSSM-based neural network predictor. The highest accuracy is achieved by an SVM predictor that combines the profile of evolutionary conservation with low-resolution structural information. Our results also show that knowledge-based predictors of DNA-binding sites perform significantly better on proteins from the mainly-alpha structural class and that the performance of these predictors is significantly correlated with certain structural and sequence properties of proteins. These observations suggest that it may be possible to assign a reliability index to the overall accuracy of the prediction of DNA-binding sites in any given protein using its sequence and structural properties. A web-server implementation of the predictors is freely available online at http://lcg.rit.albany.edu/dp-bind/.
Affiliation(s)
- Igor B Kuznetsov
- Gen*NY*sis Center for Excellence in Cancer Genomics, Department of Epidemiology and Biostatistics, University at Albany, Rensselaer, New York 12144, USA.