1
|
Agüero-Chapin G, Galpert D, Molina-Ruiz R, Ancede-Gallardo E, Pérez-Machado G, De la Riva GA, Antunes A. Graph Theory-Based Sequence Descriptors as Remote Homology Predictors. Biomolecules 2019; 10:E26. [PMID: 31878100 PMCID: PMC7022958 DOI: 10.3390/biom10010026] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2019] [Revised: 12/16/2019] [Accepted: 12/18/2019] [Indexed: 12/23/2022] Open
Abstract
Alignment-free (AF) methodologies have increased in popularity in the last decades as alternative tools to alignment-based (AB) algorithms for performing comparative sequence analyses. They have been especially useful to detect remote homologs within the twilight zone of highly diverse gene/protein families and superfamilies. The most popular alignment-free methodologies, as well as their applications to classification problems, have been described in previous reviews. Despite a new set of graph theory-derived sequence/structural descriptors that have been gaining relevance in the detection of remote homology, they have been omitted as AF predictors when the topic is addressed. Here, we first go over the most popular AF approaches used for detecting homology signals within the twilight zone and then bring out the state-of-the-art tools encoding graph theory-derived sequence/structure descriptors and their success for identifying remote homologs. We also highlight the tendency of integrating AF features/measures with the AB ones, either into the same prediction model or by assembling the predictions from different algorithms using voting/weighting strategies, for improving the detection of remote signals. Lastly, we briefly discuss the efforts made to scale up AB and AF features/measures for the comparison of multiple genomes and proteomes. Alongside the achieved experiences in remote homology detection by both the most popular AF tools and other less known ones, we provide our own using the graphical-numerical methodologies, MARCH-INSIDE, TI2BioP, and ProtDCal. We also present a new Python-based tool (SeqDivA) with a friendly graphical user interface (GUI) for delimiting the twilight zone by using several similar criteria.
Collapse
Affiliation(s)
- Guillermin Agüero-Chapin
- CIIMAR/CIMAR, Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Terminal de Cruzeiros do Porto de Leixões, Av. General Norton de Matos s/n 4450-208 Porto, Portugal
- Department of Biology, Faculty of Sciences, University of Porto, Rua do Campo Alegre, 4169-007 Porto, Portugal
| | - Deborah Galpert
- Departamento de Ciencia de la Computación. Universidad Central ¨Marta Abreu¨ de Las Villas (UCLV), Santa Clara 54830, Cuba;
| | - Reinaldo Molina-Ruiz
- Centro de Bioactivos Químicos (CBQ), Universidad Central ¨Marta Abreu¨ de Las Villas (UCLV), Santa Clara 54830, Cuba;
| | - Evys Ancede-Gallardo
- Programa de Doctorado en Fisicoquímica Molecular, Facultad de Ciencias Exactas, Universidad Andrés Bello, Av. República 239, Santiago 8370146, Chile;
| | - Gisselle Pérez-Machado
- EpiDisease S.L. Spin-Off of Centro de Investigación Biomédica en Red de Enfermedades Raras (CIBERER), 46980 Valencia, Spain;
| | - Gustavo A. De la Riva
- Laboratorio de Biotecnología Aplicada S. de R.L. de C.V., GRECA Inc., Carretera La Piedad-Carapán, km 3.5, La Piedad, Michoacán 59300, Mexico;
- Tecnológico Nacional de México, Instituto Tecnológico de la Piedad, Av. Ricardo Guzmán Romero, Santa Fe, La Piedad de Cavadas, Michoacán 59370, Mexico
| | - Agostinho Antunes
- CIIMAR/CIMAR, Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Terminal de Cruzeiros do Porto de Leixões, Av. General Norton de Matos s/n 4450-208 Porto, Portugal
- Department of Biology, Faculty of Sciences, University of Porto, Rua do Campo Alegre, 4169-007 Porto, Portugal
| |
Collapse
|
2
|
Beier R, Labudde D. Numeric promoter description - A comparative view on concepts and general application. J Mol Graph Model 2015; 63:65-77. [PMID: 26655334 DOI: 10.1016/j.jmgm.2015.11.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2015] [Revised: 11/12/2015] [Accepted: 11/17/2015] [Indexed: 11/25/2022]
Abstract
Nucleic acid molecules play a key role in a variety of biological processes. Starting from storage and transfer tasks, this also comprises the triggering of biological processes, regulatory effects and the active influence gained by target binding. Based on the experimental output (in this case promoter sequences), further in silico analyses aid in gaining new insights into these processes and interactions. The numerical description of nucleic acids thereby constitutes a bridge between the concrete biological issues and the analytical methods. Hence, this study compares 26 descriptor sets obtained by applying well-known numerical description concepts to an established dataset of 38 DNA promoter sequences. The suitability of the description sets was evaluated by computing partial least squares regression models and assessing the model accuracy. We conclude that the major importance regarding the descriptive power is attached to positional information rather than to explicitly incorporated physico-chemical information, since a sufficient amount of implicit physico-chemical information is already encoded in the nucleobase classification. The regression models especially benefited from employing the information that is encoded in the sequential and structural neighborhood of the nucleobases. Thus, the analyses of n-grams (short fragments of length n) suggested that they are valuable descriptors for DNA target interactions. A mixed n-gram descriptor set thereby yielded the best description of the promoter sequences. The corresponding regression model was checked and found to be plausible as it was able to reproduce the characteristic binding motifs of promoter sequences in a reasonable degree. As most functional nucleic acids are based on the principle of molecular recognition, the findings are not restricted to promoter sequences, but can rather be transferred to other kinds of functional nucleic acids. Thus, the concepts presented in this study could provide advantages for future nucleic acid-based technologies, like biosensoring, therapeutics and molecular imaging.
Collapse
Affiliation(s)
- Rico Beier
- University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany.
| | - Dirk Labudde
- University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany.
| |
Collapse
|
3
|
Fernandez-Lozano C, Cuiñas RF, Seoane JA, Fernández-Blanco E, Dorado J, Munteanu CR. Classification of signaling proteins based on molecular star graph descriptors using Machine Learning models. J Theor Biol 2015; 384:50-8. [DOI: 10.1016/j.jtbi.2015.07.038] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2015] [Revised: 07/20/2015] [Accepted: 07/27/2015] [Indexed: 12/11/2022]
|
4
|
Markov mean properties for cell death-related protein classification. J Theor Biol 2014; 349:12-21. [DOI: 10.1016/j.jtbi.2014.01.033] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2013] [Revised: 01/21/2014] [Accepted: 01/24/2014] [Indexed: 11/18/2022]
|
5
|
Fernandez-Lozano C, Fernández-Blanco E, Dave K, Pedreira N, Gestal M, Dorado J, Munteanu CR. Improving enzyme regulatory protein classification by means of SVM-RFE feature selection. MOLECULAR BIOSYSTEMS 2014; 10:1063-71. [DOI: 10.1039/c3mb70489k] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/23/2023]
|
6
|
Niu B, Zhang Y, Ding J, Lu Y, Wang M, Lu W, Yuan X, Yin J. Predicting network of drug-enzyme interaction based on machine learning method. BIOCHIMICA ET BIOPHYSICA ACTA-PROTEINS AND PROTEOMICS 2013; 1844:214-23. [PMID: 23907006 DOI: 10.1016/j.bbapap.2013.07.008] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/01/2012] [Revised: 07/16/2013] [Accepted: 07/18/2013] [Indexed: 12/11/2022]
Abstract
It is important to correctly and efficiently map drugs and enzymes to their possible interaction network in modern drug research. In this work, a novel approach was introduced to encode drug and enzyme molecules with physicochemical molecular descriptors and pseudo amino acid composition, respectively. Based on this encoding method, Random Forest was adopted to build the drug-enzyme interaction network. After selecting the optimal features that are able to represent the main factors of drug-enzyme interaction in our prediction, a total of 129 features were attained which can be clustered into nine categories: Elemental Analysis, Geometry, Chemistry, Amino Acid Composition, Secondary Structure, Polarity, Molecular Volume, Codon Diversity and Electrostatic Charge. It is further found that Geometry features were the most important of all the features. As a result, our predicting model achieved an MCC of 0.915 and a sensitivity of 87.9% at the specificity level of 99.8% for 10-fold cross-validation test, and achieved an MCC of 0.895 and a sensitivity of 95.7% at the specificity level of 95.4% for independent set test. This article is part of a Special Issue entitled: Computational Proteomics, Systems Biology & Clinical Implications. Guest Editor: Yudong Cai.
Collapse
Affiliation(s)
- Bing Niu
- College of Life Science, Shanghai University, 99 Shang-Da Road, Shanghai 200072, China
| | | | | | | | | | | | | | | |
Collapse
|
7
|
Prediction of subcellular location of mycobacterial protein using feature selection techniques. Mol Divers 2009; 14:667-71. [DOI: 10.1007/s11030-009-9205-1] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2009] [Accepted: 10/20/2009] [Indexed: 11/26/2022]
|
8
|
Multi-target spectral moment: QSAR for antiviral drugs vs. different viral species. Anal Chim Acta 2009; 651:159-64. [PMID: 19782806 DOI: 10.1016/j.aca.2009.08.022] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2009] [Revised: 08/05/2009] [Accepted: 08/18/2009] [Indexed: 11/23/2022]
Abstract
The antiviral QSAR models have an important limitation today. They predict the biological activity of drugs against only one viral species. This is determined by the fact that most of the current reported molecular descriptors encode only information about the molecular structure. As a result, predicting the probability with which a drug is active against different viral species with a single unifying model is a goal of major importance. In this work, we use Markov Chain theory to calculate new multi-target spectral moments to fit a QSAR model for drugs active against 40 viral species. The model is based on 500 drugs (including active and non-active compounds) tested as antiviral agents in the recent literature; not all drugs were predicted against all viruses, but only those with experimental values. The database also contains 207 well-known compounds (not as recent as the previous ones) reported in the Merck Index with other activities that do not include antiviral action against any virus species. We used Linear Discriminant Analysis (LDA) to classify all these drugs into two classes as active or non-active against the different viral species tested, whose data we processed. The model correctly classifies 5129 out of 5594 non-active compounds (91.69%) and 412 out of 422 active compounds (97.63%). Overall training predictability was 92.34%. The validation of the model was carried out by means of external predicting series, the model classifying, thus, 2568 out of 2779 non-active compounds and 224 out of 229 active compounds. Overall training predictability was 92.82%. The present work reports the first attempts to calculate within a unified framework the probabilities of antiviral drugs against different virus species based on a spectral moment analysis.
Collapse
|
9
|
Shen HB, Chou KC. Identification of proteases and their types. Anal Biochem 2008; 385:153-60. [PMID: 19007742 DOI: 10.1016/j.ab.2008.10.020] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/09/2009] [Revised: 10/13/2008] [Accepted: 10/14/2008] [Indexed: 10/21/2022]
Abstract
Called by many as biology's version of Swiss army knives, proteases cut long sequences of amino acids into fragments and regulate most physiological processes. They are vitally important in the life cycle. Different types of proteases have different action mechanisms and biological processes. With the avalanche of protein sequences generated during the postgenomic age, it is highly desirable for both basic research and drug design to develop a fast and reliable method for identifying the types of proteases according to their sequences or even just for whether they are proteases or not. In this article, three recently developed identification methods in this regard are discussed: (i) FunD-PseAAC, (ii) GO-PseAAC, and (iii) FunD-PsePSSM. The first two were established by hybridizing the FunD (functional domain) approach and the GO (gene ontology) approach, respectively, with the PseAAC (pseudo amino acid composition) approach. The third method was established by fusing the FunD approach with the PsePSSM (pseudo position-specific scoring matrix) approach. Of these three methods, only FunD-PsePSSM has provided a server called ProtIdent (protease identifier), which is freely accessible to the public via the website at http://www.csbio.sjtu.edu.cn/bioinf/Protease. For the convenience of users, a step-by-step guide on how to use ProtIdent is illustrated. Meanwhile, the caveat in using ProtIdent and how to understand the success expectancy rate of a statistical predictor are discussed. Finally, the essence of why ProtIdent can yield a high success rate in identifying proteases and their types is elucidated.
Collapse
Affiliation(s)
- Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, Shanghai 200240, China.
| | | |
Collapse
|
10
|
Perez-Bello A, Munteanu CR, Ubeira FM, De Magalhães AL, Uriarte E, González-Díaz H. Alignment-free prediction of mycobacterial DNA promoters based on pseudo-folding lattice network or star-graph topological indices. J Theor Biol 2008; 256:458-66. [PMID: 18992259 PMCID: PMC7126577 DOI: 10.1016/j.jtbi.2008.09.035] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/06/2008] [Revised: 09/23/2008] [Accepted: 09/25/2008] [Indexed: 12/01/2022]
Abstract
The importance of the promoter sequences in the function regulation of several important mycobacterial pathogens creates the necessity to design simple and fast theoretical models that can predict them. This work proposes two DNA promoter QSAR models based on pseudo-folding lattice network (LN) and star-graphs (SG) topological indices. In addition, a comparative study with the previous RNA electrostatic parameters of thermodynamically-driven secondary structure folding representations has been carried out. The best model of this work was obtained with only two LN stochastic electrostatic potentials and it is characterized by accuracy, selectivity and specificity of 90.87%, 82.96% and 92.95%, respectively. In addition, we pointed out the SG result dependence on the DNA sequence codification and we proposed a QSAR model based on codons and only three SG spectral moments.
Collapse
Affiliation(s)
- Alcides Perez-Bello
- Department of Microbiology and Parasitology, University of Santiago de Compostela, Santiago de Compostela 15782, Spain.
| | | | | | | | | | | |
Collapse
|
11
|
Xiao X, Lin WZ, Chou KC. Using grey dynamic modeling and pseudo amino acid composition to predict protein structural classes. J Comput Chem 2008; 29:2018-24. [PMID: 18381630 DOI: 10.1002/jcc.20955] [Citation(s) in RCA: 64] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Abstract
Using the pseudo amino acid (PseAA) composition to represent the sample of a protein can incorporate a considerable amount of sequence pattern information so as to improve the prediction quality for its structural or functional classification. However, how to optimally formulate the PseAA composition is an important problem yet to be solved. In this article the grey modeling approach is introduced that is particularly efficient in coping with complicated systems such as the one consisting of many proteins with different sequence orders and lengths. On the basis of the grey model, four coefficients derived from each of the protein sequences concerned are adopted for its PseAA components. The PseAA composition thus formulated is called the "grey-PseAA" composition that can catch the essence of a protein sequence and better reflect its overall pattern. In our study we have demonstrated that introduction of the grey-PseAA composition can remarkably enhance the success rates in predicting the protein structural class. It is anticipated that the concept of grey-PseAA composition can be also used to predict many other protein attributes, such as subcellular localization, membrane protein type, enzyme functional class, GPCR type, protease type, among many others.
Collapse
Affiliation(s)
- Xuan Xiao
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333000, China.
| | | | | |
Collapse
|
12
|
Prediction of protein structural classes by Chou’s pseudo amino acid composition: approached using continuous wavelet transform and principal component analysis. Amino Acids 2008; 37:415-25. [DOI: 10.1007/s00726-008-0170-2] [Citation(s) in RCA: 66] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2008] [Accepted: 08/03/2008] [Indexed: 10/21/2022]
|
13
|
Dai Q, Wang T. Use of linear regression model to compare RNA secondary structures. J Theor Biol 2008; 253:854-60. [DOI: 10.1016/j.jtbi.2008.04.023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2008] [Revised: 04/17/2008] [Accepted: 04/17/2008] [Indexed: 11/25/2022]
|
14
|
Munteanu CR, González-Díaz H, Magalhães AL. Enzymes/non-enzymes classification model complexity based on composition, sequence, 3D and topological indices. J Theor Biol 2008; 254:476-82. [PMID: 18606172 DOI: 10.1016/j.jtbi.2008.06.003] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2008] [Revised: 05/15/2008] [Accepted: 06/06/2008] [Indexed: 10/21/2022]
Abstract
The huge amount of new proteins that need a fast enzymatic activity characterization creates demands of protein QSAR theoretical models. The protein parameters that can be used for an enzyme/non-enzyme classification includes the simpler indices such as composition, sequence and connectivity, also called topological indices (TIs) and the computationally expensive 3D descriptors. A comparison of the 3D versus lower dimension indices has not been reported with respect to the power of discrimination of proteins according to enzyme action. A set of 966 proteins (enzymes and non-enzymes) whose structural characteristics are provided by PDB/DSSP files was analyzed with Python/Biopython scripts, STATISTICA and Weka. The list of indices includes, but it is not restricted to pure composition indices (residue fractions), DSSP secondary structure protein composition and 3D indices (surface and access). We also used mixed indices such as composition-sequence indices (Chou's pseudo-amino acid compositions or coupling numbers), 3D-composition (surface fractions) and DSSP secondary structure amino acid composition/propensities (obtained with our Prot-2S Web tool). In addition, we extend and test for the first time several classic TIs for the Randic's protein sequence Star graphs using our Sequence to Star Graph (S2SG) Python application. All the indices were processed with general discriminant analysis models (GDA), neural networks (NN) and machine learning (ML) methods and the results are presented versus complexity, average of Shannon's information entropy (Sh) and data/method type. This study compares for the first time all these classes of indices to assess the ratios between model accuracy and indices/model complexity in enzyme/non-enzyme discrimination. The use of different methods and complexity of data shows that one cannot establish a direct relation between the complexity and the accuracy of the model.
Collapse
Affiliation(s)
- Cristian Robert Munteanu
- REQUIMTE/Faculty of Science, Chemistry Department, University of Porto, Porto 4169-007, Portugal.
| | | | | |
Collapse
|
15
|
Prediction of protein structure class by coupling improved genetic algorithm and support vector machine. Amino Acids 2008; 35:581-90. [DOI: 10.1007/s00726-008-0084-z] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2007] [Accepted: 01/31/2008] [Indexed: 10/22/2022]
|
16
|
González-Díaz H, González-Díaz Y, Santana L, Ubeira FM, Uriarte E. Proteomics, networks and connectivity indices. Proteomics 2008; 8:750-78. [DOI: 10.1002/pmic.200700638] [Citation(s) in RCA: 170] [Impact Index Per Article: 10.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/19/2022]
|
17
|
An ensemble of reduced alphabets with protein encoding based on grouped weight for predicting DNA-binding proteins. Amino Acids 2008; 36:167-75. [DOI: 10.1007/s00726-008-0044-7] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2007] [Accepted: 02/07/2008] [Indexed: 10/22/2022]
|
18
|
Cruz-Monteagudo M, González-Díaz H, Borges F, Dominguez ER, Cordeiro MNDS. 3D-MEDNEs: an alternative "in silico" technique for chemical research in toxicology. 2. quantitative proteome-toxicity relationships (QPTR) based on mass spectrum spiral entropy. Chem Res Toxicol 2008; 21:619-32. [PMID: 18257557 DOI: 10.1021/tx700296t] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Abstract
Low range mass spectra (MS) characterization of serum proteome offers the best chance of discovering proteome-(early drug-induced cardiac toxicity) relationships, called here Pro-EDICToRs. However, due to the thousands of proteins involved, finding the single disease-related protein could be a hard task. The search for a model based on general MS patterns becomes a more realistic choice. In our previous work ( González-Díaz, H. , et al. Chem. Res. Toxicol. 2003, 16, 1318- 1327 ), we introduced the molecular structure information indices called 3D-Markovian electronic delocalization entropies (3D-MEDNEs). In this previous work, quantitative structure-toxicity relationship (QSTR) techniques allowed us to link 3D-MEDNEs with blood toxicological properties of drugs. In this second part, we extend 3D-MEDNEs to numerically encode biologically relevant information present in MS of the serum proteome for the first time. Using the same idea behind QSTR techniques, we can seek now by analogy a quantitative proteome-toxicity relationship (QPTR). The new QPTR models link MS 3D-MEDNEs with drug-induced toxicological properties from blood proteome information. We first generalized Randic's spiral graph and lattice networks of protein sequences to represent the MS of 62 serum proteome samples with more than 370 100 intensity ( I i ) signals with m/ z bandwidth above 700-12000 each. Next, we calculated the 3D-MEDNEs for each MS using the software MARCH-INSIDE. After that, we developed several QPTR models using different machine learning and MS representation algorithms to classify samples as control or positive Pro-EDICToRs samples. The best QPTR proposed showed accuracy values ranging from 83.8% to 87.1% and leave-one-out (LOO) predictive ability of 77.4-85.5%. This work demonstrated that the idea behind classic drug QSTR models may be extended to construct QPTRs with proteome MS data.
Collapse
Affiliation(s)
- Maykel Cruz-Monteagudo
- Physico-Chemical Molecular Research Unit, Department of Organic Chemistry, Faculty of Pharmacy, University of Porto, 4150-047 Porto, Portugal
| | | | | | | | | |
Collapse
|
19
|
Nanni L, Lumini A. Combing ontologies and dipeptide composition for predicting DNA-binding proteins. Amino Acids 2008; 34:635-41. [PMID: 18175049 DOI: 10.1007/s00726-007-0016-3] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2007] [Accepted: 12/06/2007] [Indexed: 12/11/2022]
Abstract
Given a novel protein it is very important to know if it is a DNA-binding protein, because DNA-binding proteins participate in the fundamental role to regulate gene expression. In this work, we propose a parallel fusion between a classifier trained using the features extracted from the gene ontology database and a classifier trained using the dipeptide composition of the protein. As classifiers the support vector machine (SVM) and the 1-nearest neighbour are used. Matthews's correlation coefficient obtained by our fusion method is approximately 0.97 when the jackknife cross-validation is used; this result outperforms the best performance obtained in the literature (0.924) using the same dataset where the SVM is trained using only the Chou's pseudo amino acid based features. In this work also the area under the ROC-curve (AUC) is reported and our results show that the fusion permits to obtain a very interesting 0.995 AUC. In particular we want to stress that our fusion obtains a 5% false negative with a 0% of false positive. Matthews's correlation coefficient obtained using the single best GO-number is only 0.7211 and hence it is not possible to use the gene ontology database as a simple lookup table. Finally, we test the complementarity of the two tested feature extraction methods using the Q-statistic. We obtain the very interesting result of 0.58, which means that the features extracted from the gene ontology database and the features extracted from the amino acid sequence are partially independent and that their parallel fusion should be studied more.
Collapse
Affiliation(s)
- Loris Nanni
- DEIS, IEIIT-CNR, Università di Bologna, Viale Risorgimento 2, 40136 Bologna, Italy.
| | | |
Collapse
|
20
|
Cruz-Monteagudo M, González-Díaz H, Agüero-Chapín G, Santana L, Borges F, Domínguez ER, Podda G, Uriarte E. Computational chemistry development of a unified free energy Markov model for the distribution of 1300 chemicals to 38 different environmental or biological systems. J Comput Chem 2007; 28:1909-23. [PMID: 17405109 DOI: 10.1002/jcc.20730] [Citation(s) in RCA: 52] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
Predicting tissue and environmental distribution of chemicals is of major importance for environmental and life sciences. Most of the molecular descriptors used in computational prediction of chemicals partition behavior consider molecular structure but ignore the nature of the partition system. Consequently, computational models derived up-to-date are restricted to the specific system under study. Here, a free energy-based descriptor (DeltaG(k)) is introduced, which circumvent this problem. Based on DeltaG(k), we developed for the first time a single linear classification model to predict the partition behavior of a broad number of structurally diverse drugs and other chemicals (1300) for 38 different partition systems of biological and environmental significance. The model presented training/predicting set accuracies of 91.79/88.92%. Parametrical assumptions were checked. Desirability analysis was used to explore the levels of the predictors that produce the most desirable partition properties. Finally, inversion of the partition direction for each one of the 38 partition systems evidences that our models correctly classified 89.08% of compounds with an uncertainty of only +/-0.17% independently of the direction of the partition process used to seek the model. Other 10 different classification models (linear, neural networks, and genetic algorithms) were also tested for the same purposes. None of these computational models favorably compare with respect to the linear model indicating that our approach capture the main aspects that govern chemicals partition in different systems.
Collapse
Affiliation(s)
- Maykel Cruz-Monteagudo
- Physico-Chemical Molecular Research Unit, Department of Organic Chemistry, Faculty of Pharmacy, University of Porto 4050-047, Porto, Portugal
| | | | | | | | | | | | | | | |
Collapse
|
21
|
González-Díaz H, Agüero-Chapin G, Varona J, Molina R, Delogu G, Santana L, Uriarte E, Podda G. 2D-RNA-coupling numbers: A new computational chemistry approach to link secondary structure topology with biological function. J Comput Chem 2007; 28:1049-56. [PMID: 17279496 DOI: 10.1002/jcc.20576] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
Abstract
Methods for prediction of proteins, DNA, or RNA function and mapping it onto sequence often rely on bioinformatics alignment approach instead of chemical structure. Consequently, it is interesting to develop computational chemistry approaches based on molecular descriptors. In this sense, many researchers used sequence-coupling numbers and our group extended them to 2D proteins representations. However, no coupling numbers have been reported for 2D-RNA topology graphs, which are highly branched and contain useful information. Here, we use a computational chemistry scheme: (a) transforming sequences into RNA secondary structures, (b) defining and calculating new 2D-RNA-coupling numbers, (c) seek a structure-function model, and (d) map biological function onto the folded RNA. We studied as example 1-aminocyclopropane-1-carboxylic acid (ACC) oxidases known as ACO, which control fruit ripening having importance for biotechnology industry. First, we calculated tau(k)(2D-RNA) values to a set of 90-folded RNAs, including 28 transcripts of ACO and control sequences. Afterwards, we compared the classification performance of 10 different classifiers implemented in the software WEKA. In particular, the logistic equation ACO = 23.8 . tau(1)(2D-RNA) + 41.4 predicts ACOs with 98.9%, 98.0%, and 97.8% of accuracy in training, leave-one-out and 10-fold cross-validation, respectively. Afterwards, with this equation we predict ACO function to a sequence isolated in this work from Coffea arabica (GenBank accession DQ218452). The tau(1)(2D-RNA) also favorably compare with other descriptors. This equation allows us to map the codification of ACO activity on different mRNA topology features. The present computational-chemistry approach is general and could be extended to connect RNA secondary structure topology to other functions.
Collapse
Affiliation(s)
- Humberto González-Díaz
- Department of Organic Chemistry, University of Santiago de Compostela, Santiago de Compostela 15782, Spain.
| | | | | | | | | | | | | | | |
Collapse
|