1
Suma LS, Vinod Chandra SS. Mining of structural motifs in proteins using artificial bee colony optimization framework for druggability. J Bioinform Comput Biol 2021; 19:2150025. [PMID: 34590991 DOI: 10.1142/s0219720021500256] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
In this work, we have developed an optimization framework for mining common structural patterns inherent in DNA-binding proteins. A novel variant of the artificial bee colony optimization algorithm is proposed to improve the exploitation process. Experiments on four benchmark objective functions at different dimensions showed faster convergence of the algorithm. It also generated optimum features of the helix-turn-helix structural pattern, based on an objective function defined with the occurrence count on secondary structure. The proposed algorithm outperformed the compared methods in convergence speed and in the quality of the generated motif features. The motif locations obtained using the derived common pattern were compared with the results of two other motif detection tools; 92% of the tested proteins produced locations matching the results of the compared methods. The performance of the approach was analyzed with various measures, and higher sensitivity, specificity and area under the curve values were observed. A novel strategy for assessing druggability through docking studies that target the motif locations is also discussed.
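For orientation, the sketch below shows the textbook artificial bee colony loop (employed, onlooker and scout phases) minimizing the sphere benchmark function. It is not the novel variant proposed in the paper, whose details the abstract does not give. The function names, parameter defaults and the choice of the sphere function are all assumptions made for illustration.

```python
import random


def sphere(x):
    """Sphere benchmark: sum of squares, with minimum 0 at the origin."""
    return sum(v * v for v in x)


def abc_minimize(f, dim, lower, upper, n_sources=20, limit=30, iters=200, seed=0):
    """Textbook ABC sketch (hypothetical helper, not the paper's variant).

    Assumes f(x) >= 0 so that 1 / (1 + f) is a valid selection weight.
    Returns the best position, its cost, and the best cost per cycle.
    """
    rng = random.Random(seed)

    def new_source():
        return [rng.uniform(lower, upper) for _ in range(dim)]

    sources = [new_source() for _ in range(n_sources)]
    costs = [f(s) for s in sources]
    trials = [0] * n_sources
    best_cost = min(costs)
    best = list(sources[costs.index(best_cost)])
    history = [best_cost]

    def try_neighbour(i):
        # Perturb one coordinate relative to a randomly chosen other source,
        # then keep the candidate only if it improves (greedy selection).
        k = rng.choice([j for j in range(n_sources) if j != i])
        d = rng.randrange(dim)
        cand = list(sources[i])
        cand[d] += rng.uniform(-1.0, 1.0) * (sources[i][d] - sources[k][d])
        cand[d] = min(max(cand[d], lower), upper)
        c = f(cand)
        if c < costs[i]:
            sources[i], costs[i], trials[i] = cand, c, 0
        else:
            trials[i] += 1

    for _ in range(iters):
        # Employed bees: one local search step per food source.
        for i in range(n_sources):
            try_neighbour(i)
        # Onlooker bees: revisit sources with probability proportional to fitness.
        fitness = [1.0 / (1.0 + c) for c in costs]
        for _ in range(n_sources):
            i = rng.choices(range(n_sources), weights=fitness)[0]
            try_neighbour(i)
        # Remember the best source seen so far.
        i_best = min(range(n_sources), key=costs.__getitem__)
        if costs[i_best] < best_cost:
            best_cost, best = costs[i_best], list(sources[i_best])
        # Scout bees: abandon sources that failed to improve `limit` times.
        for i in range(n_sources):
            if trials[i] > limit:
                sources[i] = new_source()
                costs[i] = f(sources[i])
                trials[i] = 0
        history.append(best_cost)
    return best, best_cost, history
```

Calling `abc_minimize(sphere, dim=5, lower=-5.0, upper=5.0)` returns the best position found, its cost, and a non-increasing history of best costs. A convergence-speed comparison of the kind the abstract mentions would plot such histories for competing variants.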
Affiliation(s)
- L S Suma
- Department of Computational Biology and Bioinformatics, University of Kerala, Thiruvananthapuram, Kerala, India
- S S Vinod Chandra
- Department of Computer Science, University of Kerala, Thiruvananthapuram, Kerala, India
2
Emamjomeh A, Choobineh D, Hajieghrari B, MahdiNezhad N, Khodavirdipour A. DNA-protein interaction: identification, prediction and data analysis. Mol Biol Rep 2019; 46:3571-3596. [PMID: 30915687 DOI: 10.1007/s11033-019-04763-1] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 11/22/2018] [Accepted: 03/14/2019] [Indexed: 12/30/2022]
Abstract
Life depends on specific and purposeful interactions between molecules. Such interactions make possible the various processes inside the cells and bodies of living organisms. Among all the types of interactions between different molecules, DNA-protein interactions are of considerable importance. With the development of numerous experimental techniques, diverse methods are now available for recognizing and investigating such interactions. While the traditional experimental techniques for identifying DNA-protein complexes are time-consuming and unsuitable for genome-scale studies, current high-throughput approaches are more efficient at determining such interactions at a large scale, but they are too costly to be practical for daily applications. Hence, the availability of much information on different biological sequences, a clearer picture of the conditions under which such interactions form, and developments in computing, mathematics, and statistics have motivated scientists to develop bioinformatics tools for predicting the interaction site(s). There has been much progress in this field. In this review, the factors and conditions governing the interaction and the laboratory techniques for examining such interactions are addressed. In addition, the bioinformatics tools developed for this purpose are introduced and compared and, in the end, several suggestions are offered for improving the precision of such tools.
Affiliation(s)
- Abbasali Emamjomeh
- Laboratory of Computational Biotechnology and Bioinformatics (CBB), Department of Plant Breeding and Biotechnology (PBB), University of Zabol, Zabol, 98615-538, Iran
- Darush Choobineh
- Agricultural Biotechnology, Department of Plant Breeding and Biotechnology (PBB), Faculty of Agriculture, University of Zabol, Zabol, Iran
- Behzad Hajieghrari
- Department of Agricultural Biotechnology, College of Agriculture, Jahrom University, Jahrom, 74135-111, Iran
- Nafiseh MahdiNezhad
- Laboratory of Computational Biotechnology and Bioinformatics (CBB), Department of Plant Breeding and Biotechnology (PBB), University of Zabol, Zabol, 98615-538, Iran
- Amir Khodavirdipour
- Division of Human Genetics, Department of Anatomy, St. John's hospital, Bangalore, India
3
Oldfield CJ, Chen K, Kurgan L. Computational Prediction of Secondary and Supersecondary Structures from Protein Sequences. Methods Mol Biol 2019; 1958:73-100. [PMID: 30945214 DOI: 10.1007/978-1-4939-9161-7_4] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 12/12/2022]
Abstract
Many new methods for the sequence-based prediction of the secondary and supersecondary structures have been developed over the last several years. These and older sequence-based predictors are widely applied for the characterization and prediction of protein structure and function. These efforts have produced countless accurate predictors, many of which rely on state-of-the-art machine learning models and evolutionary information generated from multiple sequence alignments. We describe and motivate both types of predictions. We introduce concepts related to the annotation and computational prediction of the three-state and eight-state secondary structure as well as several types of supersecondary structures, such as β hairpins, coiled coils, and α-turn-α motifs. We review 34 predictors focusing on recent tools and provide detailed information for a selected set of 14 secondary structure and 3 supersecondary structure predictors. We conclude with several practical notes for the end users of these predictive methods.
Affiliation(s)
- Christopher J Oldfield
- Department of Computer Science, College of Engineering, Virginia Commonwealth University, Richmond, VA, USA
- Ke Chen
- School of Computer Science and Software Engineering, Tianjin Polytechnic University, Tianjin, People's Republic of China
- Lukasz Kurgan
- Department of Computer Science, College of Engineering, Virginia Commonwealth University, Richmond, VA, USA
4
Flot M, Mishra A, Kuchi AS, Hoque MT. StackSSSPred: A Stacking-Based Prediction of Supersecondary Structure from Sequence. Methods Mol Biol 2019; 1958:101-122. [PMID: 30945215 DOI: 10.1007/978-1-4939-9161-7_5] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 06/09/2023]
Abstract
Supersecondary structure (SSS) refers to specific geometric arrangements of several secondary structure (SS) elements that are connected by loops. The SSS can provide useful information about the spatial structure and function of a protein. As such, the SSS is a bridge between the secondary structure and tertiary structure. In this chapter, we propose a stacking-based machine learning method for the prediction of two types of SSSs, namely, β-hairpins and β-α-β, from the protein sequence based on comprehensive feature encoding. To encode protein residues, we utilize key features such as solvent accessibility, conservation profile, half surface exposure, torsion angle fluctuation, disorder probabilities, and more. The usefulness of the proposed approach is assessed using a widely used threefold cross-validation technique. The obtained empirical result shows that the proposed approach is useful and prediction can be improved further.
Affiliation(s)
- Michael Flot
- Department of Computer Science, University of New Orleans, New Orleans, LA, USA
- Avdesh Mishra
- Department of Computer Science, University of New Orleans, New Orleans, LA, USA
- Aditi Sharma Kuchi
- Department of Computer Science, University of New Orleans, New Orleans, LA, USA
- Md Tamjidul Hoque
- Department of Computer Science, University of New Orleans, New Orleans, LA, USA
5
Predicting DNA binding proteins using support vector machine with hybrid fractal features. J Theor Biol 2013; 343:186-92. [PMID: 24189096 DOI: 10.1016/j.jtbi.2013.10.009] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Received: 05/14/2013] [Revised: 08/12/2013] [Accepted: 10/17/2013] [Indexed: 11/20/2022]
Abstract
DNA-binding proteins play a vitally important role in many biological processes. Prediction of DNA-binding proteins from amino acid sequence is a significant but not yet fully resolved scientific problem. Chaos game representation (CGR) investigates the patterns hidden in protein sequences and visually reveals previously unknown structure. Fractal dimensions (FD) are good tools for measuring the sizes of complex, highly irregular geometric objects. To extract the intrinsic correlation with the DNA-binding property from protein sequences, the CGR algorithm, fractal dimension and amino acid composition are applied in this paper to formulate the numerical features of protein samples. Seven groups of features are extracted, each of which can be computed directly from the primary sequence, and each group is evaluated by the 10-fold cross-validation test and the jackknife test. Comparing the results of the numerical experiments, the group combining amino acid composition and fractal dimension (a 21-dimensional vector) gives the best result, with an average accuracy of 81.82% and an average Matthews correlation coefficient (MCC) of 0.6017. The resulting predictor is also compared with the existing method DNA-Prot and shows better performance.
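Two of the quantities named in this abstract have standard definitions that can be sketched briefly: the 20-dimensional amino acid composition of a sequence, and the Matthews correlation coefficient computed from confusion-matrix counts. The function names below are hypothetical, the fractal-dimension feature that completes the 21-dimensional vector is omitted because the abstract does not say how it is computed, and the zero-denominator convention for MCC is an assumption.

```python
import math
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid, in AMINO_ACIDS order.

    Non-standard characters are ignored (an assumption of this sketch).
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

A feature vector built this way could be passed to any classifier; the paper itself uses a support vector machine.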
6
Garcin P, Delalande O, Zhang JY, Cassier-Chauvat C, Chauvat F, Boulard Y. A transcriptional-switch model for Slr1738-controlled gene expression in the cyanobacterium Synechocystis. BMC STRUCTURAL BIOLOGY 2012; 12:1. [PMID: 22289274 PMCID: PMC3293774 DOI: 10.1186/1472-6807-12-1] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Received: 07/11/2011] [Accepted: 01/30/2012] [Indexed: 12/13/2022]
Abstract
BACKGROUND Protein-DNA interactions play a crucial role in the life of biological organisms in controlling transcription, regulation, as well as DNA recombination and repair. The deep understanding of these processes, which requires the atomic description of the interactions occurring between the proteins and their DNA partners is often limited by the absence of a 3D structure of such complexes. RESULTS In this study, using a method combining sequence homology, structural analogy modeling and biochemical data, we first build the 3D structure of the complex between the poorly-characterized PerR-like regulator Slr1738 and its target DNA, which controls the defences against metal and oxidative stresses in Synechocystis. In a second step, we propose an expanded version of the Slr1738-DNA structure, which accommodates the DNA binding of Slr1738 multimers, a feature likely operating in the complex Slr1738-mediated regulation of stress responses. Finally, in agreement with experimental data we present a 3D-structure of the Slr1738-DNA complex resulting from the binding of multimers of the FUR-like regulator onto its target DNA that possesses internal repeats. CONCLUSION Using a combination of different types of data, we build and validate a relevant model of the tridimensional structure of a biologically important protein-DNA complex. Then, based on published observations, we propose more elaborated multimeric models that may be biologically important to understand molecular mechanisms.
Affiliation(s)
- Paul Garcin
- CEA, Institut de Biologie et de Technologies de Saclay, Service de Biologie Intégrative et Génétique Moléculaire, LBI, CEA-Saclay, F-91191 Gif sur Yvette CEDEX, France
- Olivier Delalande
- CEA, Institut de Biologie et de Technologies de Saclay, Service de Biologie Intégrative et Génétique Moléculaire, LBI, CEA-Saclay, F-91191 Gif sur Yvette CEDEX, France
- Ju-Yuan Zhang
- CEA, Institut de Biologie et de Technologies de Saclay, Service de Biologie Intégrative et Génétique Moléculaire, LBI, CEA-Saclay, F-91191 Gif sur Yvette CEDEX, France
- Corinne Cassier-Chauvat
- CEA, Institut de Biologie et de Technologies de Saclay, Service de Biologie Intégrative et Génétique Moléculaire, LBI, CEA-Saclay, F-91191 Gif sur Yvette CEDEX, France
- CNRS, URA 2096, F-91191 Gif sur Yvette CEDEX, France
- Franck Chauvat
- CEA, Institut de Biologie et de Technologies de Saclay, Service de Biologie Intégrative et Génétique Moléculaire, LBI, CEA-Saclay, F-91191 Gif sur Yvette CEDEX, France
- Yves Boulard
- CEA, Institut de Biologie et de Technologies de Saclay, Service de Biologie Intégrative et Génétique Moléculaire, LBI, CEA-Saclay, F-91191 Gif sur Yvette CEDEX, France
7
Chen K, Kurgan L. Computational prediction of secondary and supersecondary structures. Methods Mol Biol 2012; 932:63-86. [PMID: 22987347 DOI: 10.1007/978-1-62703-065-6_5] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Indexed: 12/21/2022]
Abstract
The sequence-based prediction of the secondary and supersecondary structures enjoys strong interest and finds applications in numerous areas related to the characterization and prediction of protein structure and function. Substantial efforts in these areas over the last three decades resulted in the development of accurate predictors, which take advantage of modern machine learning models and availability of evolutionary information extracted from multiple sequence alignment. In this chapter, we first introduce and motivate both prediction areas and introduce basic concepts related to the annotation and prediction of the secondary and supersecondary structures, focusing on the β hairpin, coiled coil, and α-turn-α motifs. Next, we overview state-of-the-art prediction methods, and we provide details for 12 modern secondary structure predictors and 4 representative supersecondary structure predictors. Finally, we provide several practical notes for the users of these prediction tools.
Affiliation(s)
- Ke Chen
- Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada
8
Georgescu J, Munhoz VHO, Bechinger B. NMR structures of the histidine-rich peptide LAH4 in micellar environments: membrane insertion, pH-dependent mode of antimicrobial action, and DNA transfection. Biophys J 2011; 99:2507-15. [PMID: 20959091 DOI: 10.1016/j.bpj.2010.05.038] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.9] [Received: 11/18/2009] [Revised: 05/12/2010] [Accepted: 05/21/2010] [Indexed: 11/26/2022] Open
Abstract
The LAH4 family of histidine-rich peptides exhibits potent antimicrobial and DNA transfection activities, both of which require interactions with cellular membranes. The bilayer association of the peptides has been shown to be strongly pH-dependent, with in-planar alignments under acidic conditions and transmembrane orientations when the histidines are discharged. Therefore, we investigated the pH- and temperature-dependent conformations of LAH4 in DPC micellar solutions and in a TFE/PBS solvent mixture. In the presence of detergent and at pH 4.1, LAH4 adopts helical conformations between residues 9 and 24 concomitantly with a high hydrophobic moment. At pH 6.1, a helix-loop-helix structure forms with a hinge encompassing residues His¹⁰-Ala¹³. The data suggest that the high density of histidine residues and the resulting electrostatic repulsion lead to both a decrease in the pK values of the histidines and a less stable α-helical conformation of this region. The hinged structure at pH 6.1 facilitates membrane anchoring and insertion. At pH 7.8, the histidines are uncharged and an extended helical conformation including residues 4-21 is again obtained. LAH4 thus exhibits a high degree of conformational plasticity. The structures provide a stroboscopic view of the conformational changes that occur during membrane insertion, and are discussed in the context of antimicrobial activity and DNA transfection.
Affiliation(s)
- Julia Georgescu
- Institut de Chimie, Université de Strasbourg/Centre National de Recherche Scientifique, France
9
Langlois RE, Lu H. Boosting the prediction and understanding of DNA-binding domains from sequence. Nucleic Acids Res 2010; 38:3149-58. [PMID: 20156993 PMCID: PMC2879530 DOI: 10.1093/nar/gkq061] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.1] [Indexed: 11/14/2022] Open
Abstract
DNA-binding proteins perform vital functions related to transcription, repair and replication. We have developed a new sequence-based machine learning protocol to identify DNA-binding proteins. We compare our method with an extensive benchmark of previously published structure-based machine learning methods as well as a standard sequence alignment technique, BLAST. Furthermore, we elucidate important feature interactions found in a learned model and analyze how specific rules capture general mechanisms that extend across DNA-binding motifs. This analysis is carried out using the malibu machine learning workbench available at http://proteomics.bioengr.uic.edu/malibu and the corresponding data sets and features are available at http://proteomics.bioengr.uic.edu/dna.
Affiliation(s)
- Robert E Langlois
- Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60612, USA
10
Xiong W, Li T, Chen K, Tang K. Local combinational variables: an approach used in DNA-binding helix-turn-helix motif prediction with sequence information. Nucleic Acids Res 2009; 37:5632-40. [PMID: 19651875 PMCID: PMC2761287 DOI: 10.1093/nar/gkp628] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.5] [Received: 04/28/2009] [Revised: 07/14/2009] [Accepted: 07/14/2009] [Indexed: 11/24/2022] Open
Abstract
Sequence-based approach for motif prediction is of great interest and remains a challenge. In this work, we develop a local combinational variable approach for sequence-based helix-turn-helix (HTH) motif prediction. First we choose a sequence data set for 88 proteins of 22 amino acids in length to launch an optimized traversal for extracting local combinational segments (LCS) from the data set. Then after LCS refinement, local combinational variables (LCV) are generated to construct prediction models for HTH motifs. Prediction ability of LCV sets at different thresholds is calculated to settle a moderate threshold. The large data set we used comprises 13 HTH families, with 17 455 sequences in total. Our approach predicts HTH motifs more precisely using only primary protein sequence information, with 93.29% accuracy, 93.93% sensitivity and 92.66% specificity. Prediction results of newly reported HTH-containing proteins compared with other prediction web service presents a good prediction model derived from the LCV approach. Comparisons with profile-HMM models from the Pfam protein families database show that the LCV approach maintains a good balance while dealing with HTH-containing proteins and non-HTH proteins at the same time. The LCV approach is to some extent a complementary to the profile-HMM models for its better identification of false-positive data. Furthermore, genome-wide predictions detect new HTH proteins in both Homo sapiens and Escherichia coli organisms, which enlarge applications of the LCV approach. Software for mining LCVs from sequence data set can be obtained from anonymous ftp site ftp://cheminfo.tongji.edu.cn/LCV/freely.
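The accuracy, sensitivity and specificity figures quoted above follow the usual confusion-matrix definitions, sketched below. The function name and the example counts in the usage note are hypothetical and are not taken from the paper.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts.

    tp/fn count HTH-containing sequences predicted as HTH/non-HTH;
    tn/fp count non-HTH sequences predicted as non-HTH/HTH.
    """
    positives = tp + fn
    negatives = tn + fp
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    return {
        "accuracy": (tp + tn) / (positives + negatives),
        "sensitivity": tp / positives,  # true-positive rate
        "specificity": tn / negatives,  # true-negative rate
    }
```

For instance, with illustrative counts of 90 true positives, 10 false negatives, 80 true negatives and 20 false positives, sensitivity is 90/100, specificity is 80/100 and accuracy is 170/200. When the positive and negative sets are the same size, accuracy is the mean of sensitivity and specificity.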
Affiliation(s)
- Tonghua Li
- Department of Chemistry, Tongji University, Shanghai, 200092, China
11
Gupta S, Bansal S, Deb JK, Kundu B. Interplay between DtxR and nitric oxide reductase activities: a functional genomics approach indicating involvement of homologous protein domains in bacterial pathogenesis. Int J Exp Pathol 2007; 88:377-85. [PMID: 17877539 PMCID: PMC2517329 DOI: 10.1111/j.1365-2613.2007.00544.x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/27/2022] Open
Abstract
Corynebacterium diphtheriae pathogenesis depends on the production of toxin (Dtx), which in turn depends on a micromolar concentration of nitric oxide (NO)-mediated deactivation of DtxR (an iron-dependent regulator). Inside a host, the pathogen often encounters an excess of NO, which acts as an oxidative toxicant. Therefore a critical level of NO needs to be maintained by the pathogen. This necessitates reduction of excess NO by a reductase, namely nitric oxide reductase (NOR). Similar to the expression of toxin, the expression of NOR is possibly regulated by another regulator, NorR, as has been found in other gram-positive and gram-negative bacteria. Therefore, a correlation between the concentration of NO, the deactivation of DtxR and the transactivation of NorR becomes apparent. However, unlike in many other pathogens, the presence of NOR and NorR in C. diphtheriae has not been established. We applied a combination of bioinformatics and comparative genomics approaches to the C. diphtheriae genome, using Escherichia coli as a model organism, to find structural and functional homologues for the two genes in question. The various domain characteristics of the two proteins (NOR and NorR) have been taken into account in this analysis. Through an extensive genome and proteome search we have been able to identify key regulatory genes that are possibly involved in the coordination and control of NO stress in C. diphtheriae. Our findings will advance the understanding of the complete regulatory mechanism for evasion and maintenance of pathogenesis by this and other pathogenic organisms.
Affiliation(s)
- Shwetank Gupta
- Department of Biochemical Engineering and Biotechnology, Indian Institute of Technology, New Delhi, India
12
Iyer LM, Anantharaman V, Wolf MY, Aravind L. Comparative genomics of transcription factors and chromatin proteins in parasitic protists and other eukaryotes. Int J Parasitol 2007; 38:1-31. [PMID: 17949725 DOI: 10.1016/j.ijpara.2007.07.018] [Citation(s) in RCA: 192] [Impact Index Per Article: 11.3] [Received: 04/18/2007] [Revised: 07/26/2007] [Accepted: 07/30/2007] [Indexed: 11/18/2022]
Abstract
Comparative genomics of parasitic protists and their free-living relatives are profoundly impacting our understanding of the regulatory systems involved in transcription and chromatin dynamics. While some parts of these systems are highly conserved, other parts are rapidly evolving, thereby providing the molecular basis for the variety in the regulatory adaptations of eukaryotes. The gross number of specific transcription factors and chromatin proteins are positively correlated with proteome size in eukaryotes. However, the individual types of specific transcription factors show an enormous variety across different eukaryotic lineages. The dominant families of specific transcription factors even differ between sister lineages, and have been shaped by gene loss and lineage-specific expansions. Recognition of this principle has helped in identifying the hitherto unknown, major specific transcription factors of several parasites, such as apicomplexans, Entamoeba histolytica, Trichomonas vaginalis, Phytophthora and ciliates. Comparative analysis of predicted chromatin proteins from protists allows reconstruction of the early evolutionary history of histone and DNA modification, nucleosome assembly and chromatin-remodeling systems. Many key catalytic, peptide-binding and DNA-binding domains in these systems ultimately had bacterial precursors, but were put together into distinctive regulatory complexes that are unique to the eukaryotes. In the case of histone methylases, histone demethylases and SWI2/SNF2 ATPases, proliferation of paralogous families followed by acquisition of novel domain architectures, seem to have played a major role in producing a diverse set of enzymes that create and respond to an epigenetic code of modified histones. 
The diversification of histone acetylases and DNA methylases appears to have proceeded via repeated emergence of new versions, most probably via transfers from bacteria to different eukaryotic lineages, again resulting in lineage-specific diversity in epigenetic signals. Even though the key histone modifications are universal to eukaryotes, domain architectures of proteins binding post-translationally modified-histones vary considerably across eukaryotes. This indicates that the histone code might be "interpreted" differently from model organisms in parasitic protists and their relatives. The complexity of domain architectures of chromatin proteins appears to have increased during eukaryotic evolution. Thus, Trichomonas, Giardia, Naegleria and kinetoplastids have relatively simple domain architectures, whereas apicomplexans and oomycetes have more complex architectures. RNA-dependent post-transcriptional silencing systems, which interact with chromatin-level regulatory systems, show considerable variability across parasitic protists, with complete loss in many apicomplexans and partial loss in Trichomonas vaginalis. This evolutionary synthesis offers a robust scaffold for future investigation of transcription and chromatin structure in parasitic protists.
Affiliation(s)
- Lakshminarayan M Iyer
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
13
Langlois RE, Carson MB, Bhardwaj N, Lu H. Learning to translate sequence and structure to function: identifying DNA binding and membrane binding proteins. Ann Biomed Eng 2007; 35:1043-52. [PMID: 17436108 PMCID: PMC2706547 DOI: 10.1007/s10439-007-9312-z] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 09/18/2006] [Accepted: 04/02/2007] [Indexed: 10/23/2022]
Abstract
A protein's function depends in a large part on interactions with other molecules. With an increasing number of protein structures becoming available every year, a corresponding structural annotation approach identifying such interactions grows more expedient. At the same time, machine learning has gained popularity in bioinformatics providing robust annotation of genes and proteins without sequence homology. Here we have developed a general machine learning protocol to identify proteins that bind DNA and membrane. In general, there is no theory or even rule of thumb to pick the best machine learning algorithm. Thus, a systematic comparison of several classification algorithms known to perform well is investigated. Indeed, the boosted tree classifier is found to give the best performance, achieving 93% and 88% accuracy to discriminate non-homologous proteins that bind membrane and DNA, respectively, significantly outperforming all previously published works. We also attempted to address the importance of the attributes in function prediction and the relationships between relevant attributes. A graphical model based on boosted trees is applied to study the important features in discriminating DNA-binding proteins. In summary, the current protocol identified physical features important in DNA and membrane binding, rather than annotating function through sequence similarity.
Affiliation(s)
- Hui Lu
- Corresponding Author: Hui Lu, 851 S Morgan, Rm 218, M/C063, Chicago, IL 60607. Phone: (312) 413-2021. Fax: (312) 413-2018