101
Shen HB, Yang J, Chou KC. Fuzzy KNN for predicting membrane protein types from pseudo-amino acid composition. J Theor Biol 2006; 240:9-13. [PMID: 16197963 DOI: 10.1016/j.jtbi.2005.08.016] [Citation(s) in RCA: 140] [Impact Index Per Article: 7.8] [Received: 07/13/2005] [Revised: 08/15/2005] [Accepted: 08/18/2005] [Indexed: 11/30/2022]
Abstract
Cell membranes are vitally important to the life of a cell. Although the basic structure of biological membranes is provided by the lipid bilayer, membrane proteins perform most of the specific functions. Membrane proteins are putatively classified into five different types. Identification of their types is currently an important topic in bioinformatics and proteomics. In this paper, based on the concept of representing protein samples in terms of their pseudo-amino acid composition, the fuzzy K-nearest neighbors (KNN) algorithm has been introduced to predict membrane protein types, and high success rates were observed. It is anticipated that the current approach, which is based on a branch of fuzzy mathematics and represents a new strategy, may play an important complementary role to the existing methods in this area. The novel approach may also have a notable impact on the prediction of other attributes, such as protein structural class, protein subcellular localization, and enzyme family class, among many others.
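The fuzzy KNN idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes the widely used distance-weighted membership rule with crisp training labels and a fuzzifier m; the abstract does not state the exact formulation used in the paper, so the function name `fuzzy_knn`, the parameters `k`, `m`, and `eps`, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict


def fuzzy_knn(train, query, k=3, m=2.0, eps=1e-12):
    """Return {label: membership} for `query`.

    `train` is a list of (feature_vector, label) pairs with crisp labels.
    Each of the k nearest neighbours votes for its label with weight
    1 / d**(2 / (m - 1)); memberships are the normalised vote totals.
    This weighting is an assumption, not taken from the cited paper.
    """
    # Keep the k training samples closest to the query (Euclidean distance).
    neighbours = sorted(
        ((math.dist(x, query), label) for x, label in train),
        key=lambda pair: pair[0],
    )[:k]
    exponent = 2.0 / (m - 1.0)
    votes = defaultdict(float)
    for distance, label in neighbours:
        # eps avoids division by zero when the query coincides with a sample.
        votes[label] += 1.0 / (distance ** exponent + eps)
    total = sum(votes.values())
    return {label: vote / total for label, vote in votes.items()}


# Toy two-class data standing in for pseudo-amino acid composition vectors.
toy_train = [((0.0, 0.0), "A"), ((0.0, 1.0), "A"),
             ((5.0, 5.0), "B"), ((5.0, 6.0), "B")]
memberships = fuzzy_knn(toy_train, (0.0, 0.5), k=3)
predicted = max(memberships, key=memberships.get)
```

Unlike plain KNN, the result is a graded membership per class rather than a single vote count, so a borderline query can be flagged instead of being forced into one type.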
Affiliation(s)
- Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, 200030 Shanghai, China
102
Cai YD, Feng KY, Lu WC, Chou KC. Using LogitBoost classifier to predict protein structural classes. J Theor Biol 2006; 238:172-6. [PMID: 16043193 DOI: 10.1016/j.jtbi.2005.05.034] [Citation(s) in RCA: 156] [Impact Index Per Article: 8.7] [Received: 04/05/2005] [Revised: 05/04/2005] [Accepted: 05/05/2005] [Indexed: 11/19/2022]
Abstract
Prediction of protein classification is an important topic in molecular biology, because it can not only provide useful information from the viewpoint of structure itself, but also greatly stimulate the characterization of many other features of proteins that may be closely correlated with their biological functions. In this paper, LogitBoost, one of the recently developed boosting algorithms, is introduced for predicting protein structural classes. It performs classification using a regression scheme as the base learner, which can handle multi-class problems and copes particularly well with noisy data. It was demonstrated that LogitBoost outperformed support vector machines in predicting the structural classes for a given dataset, indicating that the new classifier is very promising. It is anticipated that the power in predicting protein structural classes, as well as many other bio-macromolecular attributes, will be further strengthened if LogitBoost and other existing algorithms can effectively complement each other.
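To make "classification using a regression scheme as the base learner" concrete, here is a minimal two-class LogitBoost sketch with regression stumps on a single feature. It is an illustration only: the paper addresses multi-class structural classes, and the function names, the clipping of the working response, the number of rounds, and the toy data are all assumptions.

```python
import math


def stump_value(stump, x):
    """Evaluate a regression stump (threshold, left_value, right_value)."""
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_stump(xs, zs, ws):
    """Fit a stump to targets zs by weighted least squares."""
    best = None
    for threshold in sorted(set(xs)):
        sums = {True: [0.0, 0.0], False: [0.0, 0.0]}  # [sum w*z, sum w]
        for x, z, w in zip(xs, zs, ws):
            side = sums[x <= threshold]
            side[0] += w * z
            side[1] += w
        left = sums[True][0] / sums[True][1] if sums[True][1] > 0 else 0.0
        right = sums[False][0] / sums[False][1] if sums[False][1] > 0 else 0.0
        stump = (threshold, left, right)
        error = sum(w * (z - stump_value(stump, x)) ** 2
                    for x, z, w in zip(xs, zs, ws))
        if best is None or error < best[0]:
            best = (error, stump)
    return best[1]


def logitboost(xs, ys, rounds=10):
    """Two-class LogitBoost; ys must contain 0/1 labels."""
    scores = [0.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        probs = [1.0 / (1.0 + math.exp(-2.0 * f)) for f in scores]
        ws = [max(p * (1.0 - p), 1e-10) for p in probs]
        # Working response, clipped to keep the fit numerically stable.
        zs = [max(-4.0, min(4.0, (y - p) / w))
              for y, p, w in zip(ys, probs, ws)]
        stump = fit_stump(xs, zs, ws)
        stumps.append(stump)
        scores = [f + 0.5 * stump_value(stump, x)
                  for f, x in zip(scores, xs)]
    return stumps


def predict(stumps, x):
    """Return 1 if the summed stump score is positive, else 0."""
    return 1 if sum(0.5 * stump_value(s, x) for s in stumps) > 0 else 0


toy_x = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
toy_y = [0, 0, 0, 1, 1, 1]
model = logitboost(toy_x, toy_y)
```

Each round fits a weak regressor to a reweighted response derived from the log-likelihood, which is why misclassified outliers receive bounded rather than exponentially growing influence.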
Affiliation(s)
- Yu-Dong Cai
- Department of Chemistry, College of Sciences, Shanghai University, 99 Shang-Da Road, Shanghai 200436, China
103
Guo YZ, Li ML, Wang KL, Wen ZN, Lu MC, Liu LX, Jiang L. Fast Fourier transform-based support vector machine for prediction of G-protein coupled receptor subfamilies. Acta Biochim Biophys Sin (Shanghai) 2005; 37:759-66. [PMID: 16270155 DOI: 10.1111/j.1745-7270.2005.00110.x] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Indexed: 11/27/2022]
Abstract
Although the sequence information on G-protein coupled receptors (GPCRs) continues to grow, many GPCRs remain orphaned (i.e. ligand specificity unknown) or poorly characterized with little structural information available, so an automated and reliable method is badly needed to facilitate the identification of novel receptors. In this study, a fast Fourier transform-based support vector machine method has been developed for predicting GPCR subfamilies according to protein hydrophobicity. In classifying Class B, C, D and F subfamilies, the method achieved an overall Matthews correlation coefficient and accuracy of 0.95 and 93.3%, respectively, when evaluated using the jackknife test. The method achieved an accuracy of 100% on the Class B independent dataset. The results show that this method can classify GPCR subfamilies as well as their functional classification with high accuracy. A web server implementing the prediction is available at http://chem.scu.edu.cn/blast/Pred-GPCR.
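As a rough sketch of how a sequence can be turned into frequency-domain features from hydrophobicity, the code below maps residues to the Kyte-Doolittle scale and computes a power spectrum. Both choices are assumptions: the abstract does not say which hydrophobicity scale was used, and the sketch evaluates the discrete Fourier transform directly (an FFT yields the same coefficients faster). The function names are hypothetical.

```python
import cmath

# Kyte-Doolittle hydropathy index, used here only as a stand-in scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_signal(sequence):
    """Map a protein sequence to a numeric hydrophobicity series."""
    return [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]


def power_spectrum(signal):
    """Power at frequencies 0..n//2 of the mean-centred signal."""
    n = len(signal)
    mean = sum(signal) / n
    centred = [value - mean for value in signal]
    spectrum = []
    for k in range(n // 2 + 1):
        coefficient = sum(
            value * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, value in enumerate(centred)
        )
        spectrum.append(abs(coefficient) ** 2 / n)
    return spectrum


# A strictly alternating toy sequence has all its power at period 2.
toy_spectrum = power_spectrum(hydrophobicity_signal("ILILILIL"))
peak_index = max(range(len(toy_spectrum)), key=toy_spectrum.__getitem__)
```

Spectra of this kind, suitably fixed in length, could then be passed to a classifier such as an SVM as a feature vector.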
Affiliation(s)
- Yan-Zhi Guo
- College of Chemistry, Sichuan University, Chengdu 610064, China
104
Shen HB, Yang J, Liu XJ, Chou KC. Using supervised fuzzy clustering to predict protein structural classes. Biochem Biophys Res Commun 2005; 334:577-81. [PMID: 16023077 DOI: 10.1016/j.bbrc.2005.06.128] [Citation(s) in RCA: 125] [Impact Index Per Article: 6.6] [Received: 05/26/2005] [Accepted: 06/12/2005] [Indexed: 11/23/2022]
Abstract
Prediction of protein classification is both an important and a tempting topic in protein science. This is not only because the knowledge thus obtained can provide useful information about the overall structure of a query protein, but also because the practice itself can stimulate the development of novel predictors that may be straightforwardly applied to many other relevant areas. In this paper, a novel approach, the so-called "supervised fuzzy clustering approach", is introduced; its distinguishing feature is that it utilizes the class label information during the training process. Based on this approach, a set of "if-then" fuzzy rules for predicting protein structural classes is extracted from a training dataset. It has been demonstrated on two different working datasets that the overall prediction success rates obtained by the supervised fuzzy clustering approach are all higher than those obtained by the unsupervised fuzzy c-means introduced by previous investigators [C.T. Zhang, K.C. Chou, G.M. Maggiora. Protein Eng. (1995) 8, 425-435]. It is anticipated that the current predictor may play an important complementary role to other existing predictors in this area to further strengthen the power in predicting the structural classes of proteins and their other characteristic attributes.
Affiliation(s)
- Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, Shanghai 200030, China
105
Feng KY, Cai YD, Chou KC. Boosting classifier for predicting protein domain structural class. Biochem Biophys Res Commun 2005; 334:213-7. [PMID: 15993842 DOI: 10.1016/j.bbrc.2005.06.075] [Citation(s) in RCA: 111] [Impact Index Per Article: 5.8] [Received: 06/09/2005] [Accepted: 06/14/2005] [Indexed: 11/26/2022]
Abstract
A novel classifier, the so-called "LogitBoost" classifier, was introduced to predict the structural class of a protein domain according to its amino acid sequence. LogitBoost uses a log-likelihood loss function to reduce sensitivity to noise and outliers, and performs classification by combining many weak classifiers to build a strong and robust classifier. It was demonstrated through jackknife cross-validation tests that LogitBoost outperformed other classifiers, including the "support vector machine," a very powerful classifier widely used in the biological literature. It is anticipated that LogitBoost can also become a useful vehicle for classifying other attributes of proteins according to their sequences, such as subcellular localization and enzyme family class, among many others.
Affiliation(s)
- Kai-Yan Feng
- Imaging Science and Biomedical Engineering, Medical School, The University of Manchester, Manchester, M13 9PT, UK
106
Shen H, Chou KC. Using optimized evidence-theoretic K-nearest neighbor classifier and pseudo-amino acid composition to predict membrane protein types. Biochem Biophys Res Commun 2005; 334:288-92. [PMID: 16002049 DOI: 10.1016/j.bbrc.2005.06.087] [Citation(s) in RCA: 128] [Impact Index Per Article: 6.7] [Received: 06/09/2005] [Accepted: 06/14/2005] [Indexed: 10/25/2022]
Abstract
Knowledge of membrane protein type often provides crucial hints toward determining the function of an uncharacterized membrane protein. With the avalanche of new protein sequences emerging during the post-genomic era, it is highly desirable to develop an automated method that can serve as a high-throughput tool for identifying the types of newly found membrane proteins according to their primary sequences, so that they can be annotated in a timely manner for reference in both basic research and drug discovery. Based on the concept of pseudo-amino acid composition [K.C. Chou, Proteins: Struct. Funct. Genet. 43 (2001) 246-255; Erratum: Proteins: Struct. Funct. Genet. 44 (2001) 60], which has made it possible to incorporate a considerable amount of sequence-order effects by representing a protein sample in terms of a set of discrete numbers, a novel predictor, the so-called "optimized evidence-theoretic K-nearest neighbor" or "OET-KNN" classifier, was proposed. It was demonstrated via the self-consistency test, jackknife test, and independent dataset test that the new predictor, compared with many previous ones, yielded higher success rates in most cases. The new predictor can also be used to improve the prediction quality for, among many other protein attributes, structural class, subcellular localization, enzyme family class, and G-protein coupled receptor type. The OET-KNN classifier will be available as a web-server at http://www.pami.sjtu.edu.cn/kcchou.
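The phrase "representing a protein sample in terms of a set of discrete numbers" can be illustrated with a simplified pseudo-amino acid composition. The sketch below appends `lam` sequence-order correlation factors to the 20 amino acid frequencies. It is a simplification under stated assumptions: it uses a single standardized property (the Kyte-Doolittle scale as a stand-in) rather than the several physicochemical properties of the cited formulation, and the names `pseudo_aac`, `lam`, and `weight`, as well as the default values, are illustrative.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index, a stand-in for the original properties.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def standardize(scale):
    """Rescale a property to zero mean and unit variance over 20 residues."""
    values = [scale[a] for a in AMINO_ACIDS]
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return {a: (scale[a] - mean) / std for a in AMINO_ACIDS}


def pseudo_aac(sequence, lam=2, weight=0.05):
    """Return a (20 + lam)-dimensional vector whose components sum to 1."""
    seq = sequence.upper()
    n = len(seq)
    if lam >= n:
        raise ValueError("lam must be smaller than the sequence length")
    prop = standardize(KYTE_DOOLITTLE)
    # Conventional amino acid composition: occurrence frequencies.
    freqs = [seq.count(a) / n for a in AMINO_ACIDS]
    # Tier-j correlation factor: mean squared property difference between
    # residues that are j positions apart along the chain.
    thetas = [
        sum((prop[seq[i + j]] - prop[seq[i]]) ** 2 for i in range(n - j))
        / (n - j)
        for j in range(1, lam + 1)
    ]
    denominator = sum(freqs) + weight * sum(thetas)
    return ([f / denominator for f in freqs]
            + [weight * t / denominator for t in thetas])


toy_vector = pseudo_aac("MKTAYIAKQR", lam=2)
```

The first 20 components carry composition and the remaining `lam` components carry sequence-order information, so two proteins with identical composition but different residue order get different vectors.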
Affiliation(s)
- Hongbin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, Shanghai 200030, China
107
Sirois S, Hatzakis G, Wei D, Du Q, Chou KC. Assessment of chemical libraries for their druggability. Comput Biol Chem 2005; 29:55-67. [PMID: 15680586 DOI: 10.1016/j.compbiolchem.2004.11.003] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.4] [Received: 08/25/2004] [Revised: 09/22/2004] [Accepted: 11/30/2004] [Indexed: 11/17/2022]
Abstract
High throughput virtual screening is acknowledged as the initial means for identifying hit compounds that will eventually be transformed into leads or drug candidates. To improve the quality of screening, it is essential to have powerful methods for the analysis of compound databases. For this purpose, we have developed a novel and practical scoring function to assess the druggability of compounds. The proposed function consists of 12 metrics that take into account physical, chemical and structural properties as well as the presence of undesirable functional groups. We have applied this 12-metric scoring function to 44 different databases that together include more than 3.8 million commercially available compounds. The overall quality of each database was evaluated according to the score and rank measured by our 12-metric function. Our findings suggest that the majority of compounds that fail to satisfy the druggability rules do so because of high molecular weight, high logP values and the presence of reactive functional groups.
Affiliation(s)
- Suzanne Sirois
- Immune Deficiency Treatment Centre, Montreal General Hospital, McGill University, 1650 Cedar av., Que., H3G 1A4, Canada A5 140
108
Abstract
The completion of the human genome sequencing project has identified approximately 720 genes that belong to the G-protein coupled receptor (GPCR) superfamily. Approximately half of these genes are thought to encode sensory receptors. Of the remaining 360 receptors, the natural ligand has been identified for approximately 210 receptors, leaving 150 so-called orphan GPCRs with no known ligand or function. The identification of ligands active at orphan GPCRs has been achieved through the development of a number of experimental approaches, including the screening of putative small molecule and peptide ligands, reverse pharmacology, and the use of bioinformatics to predict candidate ligands. In this review, we discuss the methodologies developed for the identification of ligands at orphan GPCRs and include examples of their successful application.
Affiliation(s)
- Alan Wise
- 7TMR Systems Research Europe, GlaxoSmithKline, Gunnels Wood Road, Stevenage, Herts SG1 2NY, United Kingdom.
109
Cai YD, Ricardo PW, Jen CH, Chou KC. Application of SVM to predict membrane protein types. J Theor Biol 2004; 226:373-6. [PMID: 14759643 DOI: 10.1016/j.jtbi.2003.08.015] [Citation(s) in RCA: 115] [Impact Index Per Article: 5.8] [Received: 05/29/2003] [Revised: 08/22/2003] [Accepted: 08/28/2003] [Indexed: 11/28/2022]
Abstract
As a continuation of the effort to develop automated methods for predicting membrane protein types that was initiated by Chou and Elrod (PROTEINS: Structure, Function, and Genetics, 1999, 34, 137-153), the support vector machine (SVM) is introduced. Results obtained through re-substitution, jackknife, and independent dataset tests indicate that the SVM approach is quite promising, suggesting that the covariant discriminant algorithm (Chou and Elrod, Protein Eng. 12 (1999) 107) and SVM, if they effectively complement each other, will become a powerful tool for predicting membrane protein types and other protein attributes as well.
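The difference between the re-substitution and jackknife tests named in this abstract can be shown with a small sketch. For brevity it uses a 1-nearest-neighbour classifier in place of the SVM, which is a deliberate substitution; the function names and the toy data are hypothetical.

```python
import math


def nearest_label(train, query):
    """Label of the training sample closest to `query` (1-NN)."""
    return min(train, key=lambda sample: math.dist(sample[0], query))[1]


def resubstitution_accuracy(samples):
    """Train and test on the same samples (self-consistency test)."""
    hits = sum(nearest_label(samples, x) == y for x, y in samples)
    return hits / len(samples)


def jackknife_accuracy(samples):
    """Leave each sample out in turn and predict it from the rest."""
    hits = 0
    for i, (x, y) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        hits += nearest_label(rest, x) == y
    return hits / len(samples)


# One-dimensional toy data; the last sample sits closer to the other class.
toy_samples = [((0.0,), "a"), ((0.1,), "a"), ((1.0,), "b"),
               ((1.1,), "b"), ((0.6,), "a")]
resub = resubstitution_accuracy(toy_samples)
jack = jackknife_accuracy(toy_samples)
```

On this toy set re-substitution is trivially perfect because every sample is its own nearest neighbour, while the jackknife test exposes the borderline sample, which is why jackknife results are usually treated as the more demanding figure.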
Affiliation(s)
- Yu-Dong Cai
- Shanghai Research Centre of Biotechnology, Chinese Academy of Sciences, Shanghai 200233, China.
110
Chou KC, Cai YD. Prediction and classification of protein subcellular location-sequence-order effect and pseudo amino acid composition. J Cell Biochem 2003; 90:1250-60. [PMID: 14635197 DOI: 10.1002/jcb.10719] [Citation(s) in RCA: 136] [Impact Index Per Article: 6.5] [Indexed: 11/08/2022]
Abstract
Given a protein sequence, how can its subcellular location be identified? With the rapid increase in newly found protein sequences entering databanks, the problem has become more and more important because the function of a protein is closely correlated with its localization. To deal with the challenge in practice, a dataset has been established that allows identification among the following 14 subcellular locations: (1) cell wall, (2) centriole, (3) chloroplast, (4) cytoplasm, (5) cytoskeleton, (6) endoplasmic reticulum, (7) extracellular, (8) Golgi apparatus, (9) lysosome, (10) mitochondria, (11) nucleus, (12) peroxisome, (13) plasma membrane, and (14) vacuole. Compared with the datasets constructed by previous investigators, the current one covers the largest scope of localizations, and hence many proteins that were entirely outside the scope of previous treatments can now be investigated. Meanwhile, to enhance the potential and flexibility in taking into account the sequence-order effect, the series-mode pseudo-amino-acid-composition has been introduced as a representation for a protein. High success rates are obtained by the re-substitution test, jackknife test, and independent dataset test. It is anticipated that the current automated method can be developed into a high-throughput tool for practical use in both basic research and the pharmaceutical industry.
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, San Diego, CA 92130, USA
111
Abstract
In the protein universe, many proteins are composed of two or more polypeptide chains, generally referred to as subunits, that associate through noncovalent interactions and, occasionally, disulfide bonds. With the number of protein sequences entering into data banks rapidly increasing, we are confronted with a challenge: how to develop an automated method to identify the quaternary attribute for a new polypeptide chain (i.e., whether it is formed just as a monomer, or as a dimer, trimer, or any other oligomer). This is important, because the functions of proteins are closely related to their quaternary attribute. For example, some critical ligands only bind to dimers but not to monomers; some marvelous allosteric transitions only occur in tetramers but not other oligomers; and some ion channels are formed by tetramers, whereas others are formed by pentamers. To explore this problem, we adopted the pseudo amino acid composition originally proposed for improving the prediction of protein subcellular location (Chou, Proteins, 2001; 43:246-255). The advantage of using the pseudo amino acid composition to represent a protein is that it has paved a way that can take into account a considerable amount of sequence-order effects to significantly improve prediction quality. Results obtained by resubstitution, jack-knife, and independent data set tests, have indicated that the current approach might be quite promising in dealing with such an extremely complicated and difficult problem.
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Kalamazoo, Michigan 49009, USA.
112
Cai YD, Chou KC. Nearest neighbour algorithm for predicting protein subcellular location by combining functional domain composition and pseudo-amino acid composition. Biochem Biophys Res Commun 2003; 305:407-11. [PMID: 12745090 DOI: 10.1016/s0006-291x(03)00775-7] [Citation(s) in RCA: 66] [Impact Index Per Article: 3.1] [Indexed: 10/27/2022]
Abstract
In this paper, based on an approach that combines the "functional domain composition" [K.C. Chou, Y. D. Cai, J. Biol. Chem. 277 (2002) 45765] and the pseudo-amino acid composition [K.C. Chou, Proteins Struct. Funct. Genet. 43 (2001) 246; Correction: Proteins Struct. Funct. Genet. 44 (2001) 60], the Nearest Neighbour Algorithm (NNA) was developed for predicting protein subcellular location. Very high success rates were observed, suggesting that such a hybrid approach may become a useful high-throughput tool in the area of bioinformatics and proteomics.
Affiliation(s)
- Yu-Dong Cai
- Shanghai Research Centre of Biotechnology, Chinese Academy of Sciences, Shanghai 200233, China.
113
Pan YX, Zhang ZZ, Guo ZM, Feng GY, Huang ZD, He L. Application of pseudo amino acid composition for predicting protein subcellular location: stochastic signal processing approach. J Protein Chem 2003; 22:395-402. [PMID: 13678304 DOI: 10.1023/a:1025350409648] [Citation(s) in RCA: 98] [Impact Index Per Article: 4.7] [Indexed: 11/12/2022]
Abstract
The function of a protein is closely correlated with its subcellular location. With the success of the Human Genome Project and the rapid increase in the number of newly found protein sequences entering data banks, it is highly desirable to develop an automated method for predicting the subcellular location of proteins. The establishment of such a predictor will no doubt expedite the determination of the function of newly found proteins and the process of prioritizing genes and proteins identified by genomics efforts as potential molecular targets for drug design. Based on the concept of pseudo amino acid composition originally proposed by K. C. Chou (Proteins: Struct. Funct. Genet. 43: 246-255, 2001), the digital signal processing approach has been introduced to partially incorporate the sequence-order effect. One notable merit of doing so is that many existing tools in mathematics and engineering can be straightforwardly used in predicting protein subcellular location. The results thus obtained are quite encouraging. It is anticipated that digital signal processing may serve as a useful vehicle for many other areas of protein science as well.
Affiliation(s)
- Yu-Xi Pan
- Bio-X Life Science Research Center, Shanghai Jiao Tong University, Shanghai, China.
114
Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2003; 4:277-84. [PMID: 18629117 PMCID: PMC2447404 DOI: 10.1002/cfg.227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]