1
|
Abstract
During the last three decades or so, many efforts have been made to study the protein cleavage
sites by some disease-causing enzyme, such as HIV (Human Immunodeficiency Virus) protease
and SARS (Severe Acute Respiratory Syndrome) coronavirus main proteinase. It has become increasingly
clear <i>via</i> this mini-review that the motivation driving the aforementioned studies is quite wise,
and that the results acquired through these studies are very rewarding, particularly for developing peptide
drugs.
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
Collapse
|
2
|
Chou KC. An Insightful 10-year Recollection Since the Emergence of the 5-steps Rule. Curr Pharm Des 2020; 25:4223-4234. [PMID: 31782354 DOI: 10.2174/1381612825666191129164042] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2019] [Accepted: 11/25/2019] [Indexed: 11/22/2022]
Abstract
OBJECTIVE One of the most challenging and also the most difficult problems is how to formulate a biological sequence with a vector but considerably keep its sequence order information. METHODS To address such a problem, the approach of Pseudo Amino Acid Components or PseAAC has been developed. RESULTS AND CONCLUSION It has become increasingly clear via the 10-year recollection that the aforementioned proposal has been indeed very powerful.
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, Massachusetts 02478, United States.,Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| |
Collapse
|
3
|
Protein sequence information extraction and subcellular localization prediction with gapped k-Mer method. BMC Bioinformatics 2019; 20:719. [PMID: 31888447 PMCID: PMC6936157 DOI: 10.1186/s12859-019-3232-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/19/2022] Open
Abstract
BACKGROUND Subcellular localization prediction of protein is an important component of bioinformatics, which has great importance for drug design and other applications. A multitude of computational tools for proteins subcellular location have been developed in the recent decades, however, existing methods differ in the protein sequence representation techniques and classification algorithms adopted. RESULTS In this paper, we firstly introduce two kinds of protein sequences encoding schemes: dipeptide information with space and Gapped k-mer information. Then, the Gapped k-mer calculation method which is based on quad-tree is also introduced. CONCLUSIONS >From the prediction results, this method not only reduces the dimension, but also improves the prediction precision of protein subcellular localization.
Collapse
|
4
|
Chou KC. Proposing Pseudo Amino Acid Components is an Important Milestone for Proteome and Genome Analyses. Int J Pept Res Ther 2019. [DOI: 10.1007/s10989-019-09910-7] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/28/2022]
|
5
|
Abstract
Background:
Revealing the subcellular location of a newly discovered protein can
bring insight into their function and guide research at the cellular level. The experimental methods
currently used to identify the protein subcellular locations are both time-consuming and expensive.
Thus, it is highly desired to develop computational methods for efficiently and effectively identifying
the protein subcellular locations. Especially, the rapidly increasing number of protein sequences
entering the genome databases has called for the development of automated analysis methods.
Methods:
In this review, we will describe the recent advances in predicting the protein subcellular
locations with machine learning from the following aspects: i) Protein subcellular location benchmark
dataset construction, ii) Protein feature representation and feature descriptors, iii) Common
machine learning algorithms, iv) Cross-validation test methods and assessment metrics, v) Web
servers.
Result & Conclusion:
Concomitant with a large number of protein sequences generated by highthroughput
technologies, four future directions for predicting protein subcellular locations with
machine learning should be paid attention. One direction is the selection of novel and effective features
(e.g., statistics, physical-chemical, evolutional) from the sequences and structures of proteins.
Another is the feature fusion strategy. The third is the design of a powerful predictor and the fourth
one is the protein multiple location sites prediction.
Collapse
Affiliation(s)
- Ting-He Zhang
- School of Automation, Northwestern Polytechnical University, Xi'an, 710072, China
| | - Shao-Wu Zhang
- School of Automation, Northwestern Polytechnical University, Xi'an, 710072, China
| |
Collapse
|
6
|
Yu B, Li S, Qiu W, Wang M, Du J, Zhang Y, Chen X. Prediction of subcellular location of apoptosis proteins by incorporating PsePSSM and DCCA coefficient based on LFDA dimensionality reduction. BMC Genomics 2018; 19:478. [PMID: 29914358 PMCID: PMC6006758 DOI: 10.1186/s12864-018-4849-9] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2017] [Accepted: 06/01/2018] [Indexed: 01/05/2023] Open
Abstract
BACKGROUND Apoptosis is associated with some human diseases, including cancer, autoimmune disease, neurodegenerative disease and ischemic damage, etc. Apoptosis proteins subcellular localization information is very important for understanding the mechanism of programmed cell death and the development of drugs. Therefore, the prediction of subcellular localization of apoptosis protein is still a challenging task. RESULTS In this paper, we propose a novel method for predicting apoptosis protein subcellular localization, called PsePSSM-DCCA-LFDA. Firstly, the protein sequences are extracted by combining pseudo-position specific scoring matrix (PsePSSM) and detrended cross-correlation analysis coefficient (DCCA coefficient), then the extracted feature information is reduced dimensionality by LFDA (local Fisher discriminant analysis). Finally, the optimal feature vectors are input to the SVM classifier to predict subcellular location of the apoptosis proteins. The overall prediction accuracy of 99.7, 99.6 and 100% are achieved respectively on the three benchmark datasets by the most rigorous jackknife test, which is better than other state-of-the-art methods. CONCLUSION The experimental results indicate that our method can significantly improve the prediction accuracy of subcellular localization of apoptosis proteins, which is quite high to be able to become a promising tool for further proteomics studies. The source code and all datasets are available at https://github.com/QUST-BSBRC/PsePSSM-DCCA-LFDA/ .
Collapse
Affiliation(s)
- Bin Yu
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China. .,Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China. .,School of Life Sciences, University of Science and Technology of China, Hefei, 230027, China.
| | - Shan Li
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China.,Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Wenying Qiu
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China.,Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Minghui Wang
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China.,Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Junwei Du
- College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Yusen Zhang
- School of Mathematics and Statistics, Shandong University at Weihai, Weihai, 264209, China
| | - Xing Chen
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 21116, China
| |
Collapse
|
7
|
An Efficient Approach for Prediction of Nuclear Receptor and Their Subfamilies Based on Fuzzy k-Nearest Neighbor with Maximum Relevance Minimum Redundancy. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES INDIA SECTION A-PHYSICAL SCIENCES 2018. [DOI: 10.1007/s40010-016-0325-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
|
8
|
Zai X, Yang Q, Liu K, Li R, Qian M, Zhao T, Li Y, Yin Y, Dong D, Fu L, Li S, Xu J, Chen W. A comprehensive proteogenomic study of the human Brucella vaccine strain 104 M. BMC Genomics 2017; 18:402. [PMID: 28535754 PMCID: PMC5442703 DOI: 10.1186/s12864-017-3800-9] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2016] [Accepted: 05/16/2017] [Indexed: 03/21/2023] Open
Abstract
BACKGROUND Brucella spp. are Gram-negative, facultative intracellular pathogens that cause brucellosis in both humans and animals. The B. abortus vaccine strain 104 M is the only vaccine available in China for the prevention of brucellosis in humans. Although the B. abortus 104 M genome has been fully sequenced, the current genome annotations are not yet complete. In addition, the main mechanisms underpinning its residual toxicity and vaccine-induced immune protection have yet to be elucidated. Mapping the proteome of B. abortus 104 M will help to improve genome annotation quality, thereby facilitating a greater understanding of its biology. RESULTS In this study, we utilized a proteogenomic approach that combined subcellular fractionation and peptide fractionation to perform a whole-proteome analysis and genome reannotation of B. abortus 104 M using high-resolution mass spectrometry. In total, 1,729 proteins (56.3% of 3,072) including 218 hypothetical proteins were identified using the culture conditions that were employed this study. The annotations of the B. abortus 104 M genome were also refined following identification and validation by reverse transcription-PCR. In addition, 14 pivotal virulence factors and 17 known protective antigens known to be involved in residual toxicity and immune protection were confirmed at the protein level following induction by the 104 M vaccine. Moreover, a further insight into the cell biology of multichromosomal bacteria was obtained following the elucidation of differences in protein expression levels between the small and large chromosomes. CONCLUSIONS The work presented in this report used a proteogenomic approach to perform whole-proteome analysis and genome reannotation in B. abortus 104 M; this work helped to improve genome annotation quality. Our analysis of virulence factors, protective antigens and other protein effectors provided the basis for further research to elucidate the mechanisms of residual toxicity and immune protection induced by the 104 M vaccine. Finally, the potential link between replication dynamics, gene function, and protein expression levels in this multichromosomal bacterium was detailed.
Collapse
Affiliation(s)
- Xiaodong Zai
- Laboratory of Vaccine and Antibody Engineering, Beijing Institute of Biotechnology, Beijing, China
| | - Qiaoling Yang
- Laboratory of Vaccine and Antibody Engineering, Beijing Institute of Biotechnology, Beijing, China
| | - Kun Liu
- Laboratory of Vaccine and Antibody Engineering, Beijing Institute of Biotechnology, Beijing, China
| | - Ruihua Li
- Laboratory of Vaccine and Antibody Engineering, Beijing Institute of Biotechnology, Beijing, China
| | - Mengying Qian
- Laboratory of Vaccine and Antibody Engineering, Beijing Institute of Biotechnology, Beijing, China
| | - Taoran Zhao
- Laboratory of Vaccine and Antibody Engineering, Beijing Institute of Biotechnology, Beijing, China
| | - Yaohui Li
- Laboratory of Vaccine and Antibody Engineering, Beijing Institute of Biotechnology, Beijing, China
| | - Ying Yin
- Laboratory of Vaccine and Antibody Engineering, Beijing Institute of Biotechnology, Beijing, China
| | - Dayong Dong
- Laboratory of Vaccine and Antibody Engineering, Beijing Institute of Biotechnology, Beijing, China
| | - Ling Fu
- Laboratory of Vaccine and Antibody Engineering, Beijing Institute of Biotechnology, Beijing, China
| | - Shanhu Li
- Laboratory of Vaccine and Antibody Engineering, Beijing Institute of Biotechnology, Beijing, China
| | - Junjie Xu
- Laboratory of Vaccine and Antibody Engineering, Beijing Institute of Biotechnology, Beijing, China.
| | - Wei Chen
- Laboratory of Vaccine and Antibody Engineering, Beijing Institute of Biotechnology, Beijing, China.
| |
Collapse
|
9
|
Liu B, Wu H, Chou KC. Pse-in-One 2.0: An Improved Package of Web Servers for Generating Various Modes of Pseudo Components of DNA, RNA, and Protein Sequences. ACTA ACUST UNITED AC 2017. [DOI: 10.4236/ns.2017.94007] [Citation(s) in RCA: 91] [Impact Index Per Article: 11.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
10
|
Kavianpour H, Vasighi M. Structural classification of proteins using texture descriptors extracted from the cellular automata image. Amino Acids 2016; 49:261-271. [DOI: 10.1007/s00726-016-2354-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2016] [Accepted: 10/18/2016] [Indexed: 12/12/2022]
|
11
|
Tiwari AK. Prediction of G-protein coupled receptors and their subfamilies by incorporating various sequence features into Chou's general PseAAC. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2016; 134:197-213. [PMID: 27480744 DOI: 10.1016/j.cmpb.2016.07.004] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/19/2016] [Revised: 05/27/2016] [Accepted: 07/01/2016] [Indexed: 06/06/2023]
Abstract
BACKGROUND AND OBJECTIVE The G-protein coupled receptors are the largest superfamilies of membrane proteins and important targets for the drug design. G-protein coupled receptors are responsible for many physiochemical processes such as smell, taste, vision, neurotransmission, metabolism, cellular growth and immune response. So it is necessary to design a robust and efficient approach for the prediction of G-protein coupled receptors and their subfamilies. METHODS In this paper, the protein samples are represented by amino acid composition, dipeptide composition, correlation features, composition, transition, distribution, sequence order descriptors and pseudo amino acid composition with total 1497 number of sequence derived features. To address the issue of efficient classification of G-protein coupled receptors and their subfamilies, we propose to use a weighted k-nearest neighbor classifier with UNION of best 50 features, selected by Fisher score based feature selection, ReliefF, fast correlation based filter, minimum redundancy maximum relevancy, and support vector machine based recursive elimination feature selection methods to exploit the advantages of these feature selection methods. RESULTS The proposed method achieved an overall accuracy of 99.9%, 98.3%, 95.4%, MCC values of 1.00, 0.98, 0.95, ROC area values of 1.00, 0.998, 0.996 and precision of 99.9%, 98.3% and 95.5% using 10-fold cross-validation to predict the G-protein coupled receptors and non-G-protein coupled receptors, subfamilies of G-protein coupled receptors, and subfamilies of class A G-protein coupled receptors, respectively. CONCLUSIONS The high accuracies, MCC, ROC area values, and precision values indicate that the proposed method is better for the prediction of G-protein coupled receptors families and their subfamilies.
Collapse
|
12
|
Protein subcellular localization of fluorescence microscopy images: Employing new statistical and Texton based image features and SVM based ensemble classification. Inf Sci (N Y) 2016. [DOI: 10.1016/j.ins.2016.01.064] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/10/2023]
|
13
|
An efficient approach for the prediction of ion channels and their subfamilies. Comput Biol Chem 2015; 58:205-21. [PMID: 26256801 DOI: 10.1016/j.compbiolchem.2015.07.002] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/16/2015] [Revised: 06/25/2015] [Accepted: 07/08/2015] [Indexed: 01/25/2023]
Abstract
Ion channels are integral membrane proteins that are responsible for controlling the flow of ions across the cell. There are various biological functions that are performed by different types of ion channels. Therefore for new drug discovery it is necessary to develop a novel computational intelligence techniques based approach for the reliable prediction of ion channels families and their subfamilies. In this paper random forest based approach is proposed to predict ion channels families and their subfamilies by using sequence derived features. Here, seven feature vectors are used to represent the protein sample, including amino acid composition, dipeptide composition, correlation features, composition, transition and distribution and pseudo amino acid composition. The minimum redundancy and maximum relevance feature selection is used to find the optimal number of features for improving the prediction performance. The proposed method achieved an overall accuracy of 100%, 98.01%, 91.5%, 93.0%, 92.2%, 78.6%, 95.5%, 84.9%, MCC values of 1.00, 0.92, 0.88, 0.88, 0.90, 0.79, 0.91, 0.81 and ROC area values of 1.00, 0.99, 0.99, 0.99, 0.99, 0.95, 0.99 and 0.96 using 10-fold cross validation to predict the ion channels and non-ion channels, voltage gated ion channels and ligand gated ion channels, four subfamilies (calcium, potassium, sodium and chloride) of voltage gated ion channels, and four subfamilies of ligand gated ion channels and predict subfamilies of voltage gated calcium, potassium, sodium and chloride ion channels respectively.
Collapse
|
14
|
Chen W, Lin H, Chou KC. Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences. MOLECULAR BIOSYSTEMS 2015; 11:2620-34. [DOI: 10.1039/c5mb00155b] [Citation(s) in RCA: 262] [Impact Index Per Article: 26.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
Abstract
With the avalanche of DNA/RNA sequences generated in the post-genomic age, it is urgent to develop automated methods for analyzing the relationship between the sequences and their functions.
Collapse
Affiliation(s)
- Wei Chen
- Department of Physics
- School of Sciences
- and Center for Genomics and Computational Biology
- Hebei United University
- Tangshan 063000
| | - Hao Lin
- Gordon Life Science Institute
- Boston
- USA
- Key Laboratory for Neuro-Information of Ministry of Education
- Center of Bioinformatics
| | - Kuo-Chen Chou
- Department of Physics
- School of Sciences
- and Center for Genomics and Computational Biology
- Hebei United University
- Tangshan 063000
| |
Collapse
|
15
|
Zhu PP, Li WC, Zhong ZJ, Deng EZ, Ding H, Chen W, Lin H. Predicting the subcellular localization of mycobacterial proteins by incorporating the optimal tripeptides into the general form of pseudo amino acid composition. MOLECULAR BIOSYSTEMS 2015; 11:558-63. [DOI: 10.1039/c4mb00645c] [Citation(s) in RCA: 97] [Impact Index Per Article: 9.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
Abstract
Mycobacterium tuberculosis is a bacterium that causes tuberculosis, one of the most prevalent infectious diseases.
Collapse
Affiliation(s)
- Pan-Pan Zhu
- Key Laboratory for Neuro-Information of Ministry of Education
- Center of Bioinformatics
- School of Life Science and Technology
- University of Electronic Science and Technology of China
- Chengdu 610054
| | - Wen-Chao Li
- Key Laboratory for Neuro-Information of Ministry of Education
- Center of Bioinformatics
- School of Life Science and Technology
- University of Electronic Science and Technology of China
- Chengdu 610054
| | - Zhe-Jin Zhong
- Key Laboratory for Neuro-Information of Ministry of Education
- Center of Bioinformatics
- School of Life Science and Technology
- University of Electronic Science and Technology of China
- Chengdu 610054
| | - En-Ze Deng
- Key Laboratory for Neuro-Information of Ministry of Education
- Center of Bioinformatics
- School of Life Science and Technology
- University of Electronic Science and Technology of China
- Chengdu 610054
| | - Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education
- Center of Bioinformatics
- School of Life Science and Technology
- University of Electronic Science and Technology of China
- Chengdu 610054
| | - Wei Chen
- Department of Physics
- School of Sciences
- and Center for Genomics and Computational Biology
- Hebei United University
- Tangshan 063000
| | - Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education
- Center of Bioinformatics
- School of Life Science and Technology
- University of Electronic Science and Technology of China
- Chengdu 610054
| |
Collapse
|
16
|
Tiwari AK, Srivastava R. A survey of computational intelligence techniques in protein function prediction. INTERNATIONAL JOURNAL OF PROTEOMICS 2014; 2014:845479. [PMID: 25574395 PMCID: PMC4276698 DOI: 10.1155/2014/845479] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Received: 09/10/2014] [Revised: 10/31/2014] [Accepted: 11/07/2014] [Indexed: 02/08/2023]
Abstract
During the past, there was a massive growth of knowledge of unknown proteins with the advancement of high throughput microarray technologies. Protein function prediction is the most challenging problem in bioinformatics. In the past, the homology based approaches were used to predict the protein function, but they failed when a new protein was different from the previous one. Therefore, to alleviate the problems associated with homology based traditional approaches, numerous computational intelligence techniques have been proposed in the recent past. This paper presents a state-of-the-art comprehensive review of various computational intelligence techniques for protein function predictions using sequence, structure, protein-protein interaction network, and gene expression data used in wide areas of applications such as prediction of DNA and RNA binding sites, subcellular localization, enzyme functions, signal peptides, catalytic residues, nuclear/G-protein coupled receptors, membrane proteins, and pathway analysis from gene expression datasets. This paper also summarizes the result obtained by many researchers to solve these problems by using computational intelligence techniques with appropriate datasets to improve the prediction performance. The summary shows that ensemble classifiers and integration of multiple heterogeneous data are useful for protein function prediction.
Collapse
Affiliation(s)
- Arvind Kumar Tiwari
- Department of Computer Science & Engineering, Indian Institute of Technology (BHU), Varanasi 221005, India
| | - Rajeev Srivastava
- Department of Computer Science & Engineering, Indian Institute of Technology (BHU), Varanasi 221005, India
| |
Collapse
|
17
|
acACS: improving the prediction accuracy of protein subcellular locations and protein classification by incorporating the average chemical shifts composition. ScientificWorldJournal 2014; 2014:864135. [PMID: 25110749 PMCID: PMC4106170 DOI: 10.1155/2014/864135] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/19/2014] [Revised: 06/15/2014] [Accepted: 06/16/2014] [Indexed: 11/17/2022] Open
Abstract
The chemical shift is sensitive to changes in the local environments and can report the structural changes. The structure information of a protein can be represented by the average chemical shifts (ACS) composition, which has been broadly applied for enhancing the prediction accuracy in protein subcellular locations and protein classification. However, different kinds of ACS composition can solve different problems. We established an online web server named acACS, which can convert secondary structure into average chemical shift and then compose the vector for representing a protein by using the algorithm of auto covariance. Our solution is easy to use and can meet the needs of users.
Collapse
|
18
|
Fan YN, Xiao X, Min JL, Chou KC. iNR-Drug: predicting the interaction of drugs with nuclear receptors in cellular networking. Int J Mol Sci 2014; 15:4915-37. [PMID: 24651462 PMCID: PMC3975431 DOI: 10.3390/ijms15034915] [Citation(s) in RCA: 66] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/13/2014] [Revised: 02/12/2014] [Accepted: 02/16/2014] [Indexed: 12/20/2022] Open
Abstract
Nuclear receptors (NRs) are closely associated with various major diseases such as cancer, diabetes, inflammatory disease, and osteoporosis. Therefore, NRs have become a frequent target for drug development. During the process of developing drugs against these diseases by targeting NRs, we are often facing a problem: Given a NR and chemical compound, can we identify whether they are really in interaction with each other in a cell? To address this problem, a predictor called “iNR-Drug” was developed. In the predictor, the drug compound concerned was formulated by a 256-D (dimensional) vector derived from its molecular fingerprint, and the NR by a 500-D vector formed by incorporating its sequential evolution information and physicochemical features into the general form of pseudo amino acid composition, and the prediction engine was operated by the SVM (support vector machine) algorithm. Compared with the existing prediction methods in this area, iNR-Drug not only can yield a higher success rate, but is also featured by a user-friendly web-server established at http://www.jci-bioinfo.cn/iNR-Drug/, which is particularly useful for most experimental scientists to obtain their desired data in a timely manner. It is anticipated that the iNR-Drug server may become a useful high throughput tool for both basic research and drug development, and that the current approach may be easily extended to study the interactions of drug with other targets as well.
Collapse
Affiliation(s)
- Yue-Nong Fan
- Computer Department, Jing-De-Zhen Ceramic Institute, Jingdezhen 333046, Jiangxi, China.
| | - Xuan Xiao
- Computer Department, Jing-De-Zhen Ceramic Institute, Jingdezhen 333046, Jiangxi, China.
| | - Jian-Liang Min
- Computer Department, Jing-De-Zhen Ceramic Institute, Jingdezhen 333046, Jiangxi, China.
| | - Kuo-Chen Chou
- Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah 21589, Saudi Arabia.
| |
Collapse
|
19
|
Zhang SW, Liu YF, Yu Y, Zhang TH, Fan XN. MSLoc-DT: A new method for predicting the protein subcellular location of multispecies based on decision templates. Anal Biochem 2014; 449:164-71. [DOI: 10.1016/j.ab.2013.12.013] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/21/2013] [Revised: 11/08/2013] [Accepted: 12/12/2013] [Indexed: 12/12/2022]
|
20
|
Protein subcellular localization in human and hamster cell lines: Employing local ternary patterns of fluorescence microscopy images. J Theor Biol 2014; 340:85-95. [DOI: 10.1016/j.jtbi.2013.08.017] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2013] [Revised: 07/09/2013] [Accepted: 08/15/2013] [Indexed: 11/21/2022]
|
21
|
Min JL, Xiao X, Chou KC. iEzy-drug: a web server for identifying the interaction between enzymes and drugs in cellular networking. BIOMED RESEARCH INTERNATIONAL 2013; 2013:701317. [PMID: 24371828 PMCID: PMC3858977 DOI: 10.1155/2013/701317] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/07/2013] [Accepted: 09/17/2013] [Indexed: 01/16/2023]
Abstract
With the features of extremely high selectivity and efficiency in catalyzing almost all the chemical reactions in cells, enzymes play vitally important roles for the life of an organism and hence have become frequent targets for drug design. An essential step in developing drugs by targeting enzymes is to identify drug-enzyme interactions in cells. It is both time-consuming and costly to do this purely by means of experimental techniques alone. Although some computational methods were developed in this regard based on the knowledge of the three-dimensional structure of enzyme, unfortunately their usage is quite limited because three-dimensional structures of many enzymes are still unknown. Here, we reported a sequence-based predictor, called "iEzy-Drug," in which each drug compound was formulated by a molecular fingerprint with 258 feature components, each enzyme by the Chou's pseudo amino acid composition generated via incorporating sequential evolution information and physicochemical features derived from its sequence, and the prediction engine was operated by the fuzzy K-nearest neighbor algorithm. The overall success rate achieved by iEzy-Drug via rigorous cross-validations was about 91%. Moreover, to maximize the convenience for the majority of experimental scientists, a user-friendly web server was established, by which users can easily obtain their desired results.
Collapse
Affiliation(s)
- Jian-Liang Min
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333046, China
| | - Xuan Xiao
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333046, China
- Information School, ZheJiang Textile & Fashion College, NingBo 315211, China
- Gordon Life Science Institute, Belmont, MA 02478, USA
| | - Kuo-Chen Chou
- Gordon Life Science Institute, Belmont, MA 02478, USA
- Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, Saudi Arabia
| |
Collapse
|
22
|
Fan GL, Li QZ. Discriminating bioluminescent proteins by incorporating average chemical shift and evolutionary information into the general form of Chou's pseudo amino acid composition. J Theor Biol 2013; 334:45-51. [DOI: 10.1016/j.jtbi.2013.06.003] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/24/2013] [Revised: 05/30/2013] [Accepted: 06/03/2013] [Indexed: 01/22/2023]
|
23
|
Predicting acidic and alkaline enzymes by incorporating the average chemical shift and gene ontology informations into the general form of Chou's PseAAC. Process Biochem 2013. [DOI: 10.1016/j.procbio.2013.05.012] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
24
|
Tang S, Li T, Cong P, Xiong W, Wang Z, Sun J. PlantLoc: an accurate web server for predicting plant protein subcellular localization by substantiality motif. Nucleic Acids Res 2013; 41:W441-7. [PMID: 23729470 PMCID: PMC3692052 DOI: 10.1093/nar/gkt428] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
Knowledge of subcellular localizations (SCLs) of plant proteins relates to their functions and aids in understanding the regulation of biological processes at the cellular level. We present PlantLoc, a highly accurate and fast webserver for predicting the multi-label SCLs of plant proteins. The PlantLoc server has two innovative characters: building localization motif libraries by a recursive method without alignment and Gene Ontology information; and establishing simple architecture for rapidly and accurately identifying plant protein SCLs without a machine learning algorithm. PlantLoc provides predicted SCLs results, confidence estimates and which is the substantiality motif and where it is located on the sequence. PlantLoc achieved the highest accuracy (overall accuracy of 80.8%) of identification of plant protein SCLs as benchmarked by using a new test dataset compared other plant SCL prediction webservers. The ability of PlantLoc to predict multiple sites was also significantly higher than for any other webserver. The predicted substantiality motifs of queries also have great potential for analysis of relationships with protein functional regions. The PlantLoc server is available at http://cal.tongji.edu.cn/PlantLoc/.
Collapse
Affiliation(s)
- Shengnan Tang
- Department of Chemistry, Tongji University, Shanghai 200092, China
| | | | | | | | | | | |
Collapse
|
25
|
A novel algorithm combining support vector machine with the discrete wavelet transform for the prediction of protein subcellular localization. Comput Biol Med 2012; 42:180-7. [DOI: 10.1016/j.compbiomed.2011.11.006] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2011] [Revised: 09/29/2011] [Accepted: 11/15/2011] [Indexed: 02/03/2023]
|
26
|
Fan GL, Li QZ. Predicting protein submitochondria locations by combining different descriptors into the general form of Chou's pseudo amino acid composition. Amino Acids 2011; 43:545-55. [PMID: 22102053 DOI: 10.1007/s00726-011-1143-4] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2011] [Accepted: 10/27/2011] [Indexed: 10/15/2022]
Abstract
Knowledge of the submitochondria location of protein is integral to understanding its function and a necessity in the proteomics era. In this work, a new submitochondria data set is constructed, and an approach for predicting protein submitochondria locations is proposed by combining the amino acid composition, dipeptide composition, reduced physicochemical properties, gene ontology, evolutionary information, and pseudo-average chemical shift. The overall prediction accuracy is 93.57% for the submitochondria location and 97.79% for the three membrane protein types in the mitochondria inner membrane using the algorithm of the increment of diversity combined with the support vector machine. The performance of the pseudo-average chemical shift is excellent. For contrast, the method is also used to predict submitochondria locations in the data set constructed by Du and Li; an accuracy of 94.95% is obtained by our method, which is better than that of other existing methods.
Collapse
Affiliation(s)
- Guo-Liang Fan
- Department of Physics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
| | | |
Collapse
|
27
|
Khan A, Majid A, Hayat M. CE-PLoc: an ensemble classifier for predicting protein subcellular locations by fusing different modes of pseudo amino acid composition. Comput Biol Chem 2011; 35:218-29. [PMID: 21864791 DOI: 10.1016/j.compbiolchem.2011.05.003] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/23/2011] [Revised: 05/17/2011] [Accepted: 05/18/2011] [Indexed: 12/18/2022]
Abstract
Precise information about protein locations in a cell facilitates in the understanding of the function of a protein and its interaction in the cellular environment. This information further helps in the study of the specific metabolic pathways and other biological processes. We propose an ensemble approach called "CE-PLoc" for predicting subcellular locations based on fusion of individual classifiers. The proposed approach utilizes features obtained from both dipeptide composition (DC) and amphiphilic pseudo amino acid composition (PseAAC) based feature extraction strategies. Different feature spaces are obtained by varying the dimensionality using PseAAC for a selected base learner. The performance of the individual learning mechanisms such as support vector machine, nearest neighbor, probabilistic neural network, covariant discriminant, which are trained using PseAAC based features is first analyzed. Classifiers are developed using same learning mechanism but trained on PseAAC based feature spaces of varying dimensions. These classifiers are combined through voting strategy and an improvement in prediction performance is achieved. Prediction performance is further enhanced by developing CE-PLoc through the combination of different learning mechanisms trained on both DC based feature space and PseAAC based feature spaces of varying dimensions. The predictive performance of proposed CE-PLoc is evaluated for two benchmark datasets of protein subcellular locations using accuracy, MCC, and Q-statistics. Using the jackknife test, prediction accuracies of 81.47 and 83.99% are obtained for 12 and 14 subcellular locations datasets, respectively. In case of independent dataset test, prediction accuracies are 87.04 and 87.33% for 12 and 14 class datasets, respectively.
Collapse
Affiliation(s)
- Asifullah Khan
- Department of Information and Computer Sciences, Pakistan Institute of Engineering and Applied Sciences, Nilore, Islamabad, Pakistan.
| | | | | |
Collapse
|
28
|
ur-Rehman Z, Khan A. G-protein-coupled receptor prediction using pseudo-amino-acid composition and multiscale energy representation of different physiochemical properties. Anal Biochem 2011; 412:173-82. [DOI: 10.1016/j.ab.2011.01.040] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2010] [Revised: 01/26/2011] [Accepted: 01/27/2011] [Indexed: 11/28/2022]
|
29
|
Naveed M, Khan AU. GPCR-MPredictor: multi-level prediction of G protein-coupled receptors using genetic ensemble. Amino Acids 2011; 42:1809-23. [DOI: 10.1007/s00726-011-0902-6] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/03/2010] [Accepted: 03/26/2011] [Indexed: 11/27/2022]
|
30
|
Wu ZC, Xiao X, Chou KC. iLoc-Plant: a multi-label classifier for predicting the subcellular localization of plant proteins with both single and multiple sites. MOLECULAR BIOSYSTEMS 2011; 7:3287-97. [PMID: 21984117 DOI: 10.1039/c1mb05232b] [Citation(s) in RCA: 181] [Impact Index Per Article: 12.9] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
Affiliation(s)
- Zhi-Cheng Wu
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333046, China
| | | | | |
Collapse
|
31
|
Chou KC. Some remarks on protein attribute prediction and pseudo amino acid composition. J Theor Biol 2010; 273:236-47. [PMID: 21168420 PMCID: PMC7125570 DOI: 10.1016/j.jtbi.2010.12.024] [Citation(s) in RCA: 971] [Impact Index Per Article: 64.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2010] [Revised: 12/08/2010] [Accepted: 12/13/2010] [Indexed: 11/29/2022]
Abstract
With the accomplishment of human genome sequencing, the number of sequence-known proteins has increased explosively. In contrast, the pace is much slower in determining their biological attributes. As a consequence, the gap between sequence-known proteins and attribute-known proteins has become increasingly large. The unbalanced situation, which has critically limited our ability to timely utilize the newly discovered proteins for basic research and drug development, has called for developing computational methods or high-throughput automated tools for fast and reliably identifying various attributes of uncharacterized proteins based on their sequence information alone. Actually, during the last two decades or so, many methods in this regard have been established in hope to bridge such a gap. In the course of developing these methods, the following things were often needed to consider: (1) benchmark dataset construction, (2) protein sample formulation, (3) operating algorithm (or engine), (4) anticipated accuracy, and (5) web-server establishment. In this review, we are to discuss each of the five procedures, with a special focus on the introduction of pseudo amino acid composition (PseAAC), its different modes and applications as well as its recent development, particularly in how to use the general formulation of PseAAC to reflect the core and essential features that are deeply hidden in complicated protein sequences.
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, 13784 Torrey Del Mar Drive, San Diego, CA 92130, USA.
| |
Collapse
|
32
|
Prediction of protein–protein interaction types using the decision templates based on multiple classier fusion. ACTA ACUST UNITED AC 2010. [DOI: 10.1016/j.mcm.2010.01.025] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/22/2022]
|
33
|
Jaramillo-Garzon JA, Perera-Lluna A, Castellanos-Dominguez CG. Predictability of protein subcellular locations by pattern recognition techniques. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2010; 2010:5512-5. [PMID: 21096466 DOI: 10.1109/iembs.2010.5626772] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Abstract
An analysis of the predictability of subcellular locations is performed by using simple pattern recognition techniques in an attempt to capture the real dimensions of the problem at hand. Results show that there are some particular locations that does not need of high complexity classification models to be predicted with high accuracies, and some partial biological explanations are formulated. All the experiments were carried out over a set of Arabidopsis Thaliana proteins and classes were defined according to the plants GO slim.
Collapse
Affiliation(s)
- J A Jaramillo-Garzon
- Deperatamento de Ingeniería Eléctrica, Electrónica y Computación, Universidad Nacional de Colombia sede Manizales, Campus La Nubia, km 7 vía al Magdalena, (Caldas), Colombia.
| | | | | |
Collapse
|
34
|
Zakeri P, Moshiri B, Sadeghi M. Prediction of protein submitochondria locations based on data fusion of various features of sequences. J Theor Biol 2010; 269:208-16. [PMID: 21040732 DOI: 10.1016/j.jtbi.2010.10.026] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/10/2010] [Revised: 10/16/2010] [Accepted: 10/22/2010] [Indexed: 01/16/2023]
Abstract
In this study, the predictors are developed for protein submitochondria locations based on various features of sequences. Information about the submitochondria location for a mitochondria protein can provide much better understanding about its function. We use ten representative models of protein samples such as pseudo amino acid composition, dipeptide composition, functional domain composition, the combining discrete model based on prediction of solvent accessibility and secondary structure elements, the discrete model of pairwise sequence similarity, etc. We construct a predictor based on support vector machines (SVMs) for each representative model. The overall prediction accuracy by the leave-one-out cross validation test obtained by the predictor which is based on the discrete model of pairwise sequence similarity is 1% better than the best computational system that exists for this problem. Moreover, we develop a method based on ordered weighted averaging (OWA) which is one of the fusion data operators. Therefore, OWA is applied on the 11 best SVM-based classifiers that are constructed based on various features of sequence. This method is called Mito-Loc. The overall leave-one-out cross validation accuracy obtained by Mito-Loc is about 95%. This indicates that our proposed approach (Mito-Loc) is superior to the result of the best existing approach which has already been reported.
Collapse
Affiliation(s)
- Pooya Zakeri
- Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan, Iran
| | | | | |
Collapse
|
35
|
Medema MH, Zhou M, van Hijum SAFT, Gloerich J, Wessels HJCT, Siezen RJ, Strous M. A predicted physicochemically distinct sub-proteome associated with the intracellular organelle of the anammox bacterium Kuenenia stuttgartiensis. BMC Genomics 2010; 11:299. [PMID: 20459862 PMCID: PMC2881027 DOI: 10.1186/1471-2164-11-299] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2010] [Accepted: 05/12/2010] [Indexed: 12/31/2022] Open
Abstract
BACKGROUND Anaerobic ammonium-oxidizing (anammox) bacteria perform a key step in global nitrogen cycling. These bacteria make use of an organelle to oxidize ammonia anaerobically to nitrogen (N2) and so contribute approximately 50% of the nitrogen in the atmosphere. It is currently unknown which proteins constitute the organellar proteome and how anammox bacteria are able to specifically target organellar and cell-envelope proteins to their correct final destinations. Experimental approaches are complicated by the absence of pure cultures and genetic accessibility. However, the genome of the anammox bacterium Candidatus "Kuenenia stuttgartiensis" has recently been sequenced. Here, we make use of these genome data to predict the organellar sub-proteome and address the molecular basis of protein sorting in anammox bacteria. RESULTS Two training sets representing organellar (30 proteins) and cell envelope (59 proteins) proteins were constructed based on previous experimental evidence and comparative genomics. Random forest (RF) classifiers trained on these two sets could differentiate between organellar and cell envelope proteins with ~89% accuracy using 400 features consisting of frequencies of two adjacent amino acid combinations. A physicochemically distinct organellar sub-proteome containing 562 proteins was predicted with the best RF classifier. This set included almost all catabolic and respiratory factors encoded in the genome. Apparently, the cytoplasmic membrane performs no catabolic functions. We predict that the Tat-translocation system is located exclusively in the organellar membrane, whereas the Sec-translocation system is located on both the organellar and cytoplasmic membranes. Canonical signal peptides were predicted and validated experimentally, but a specific (N- or C-terminal) signal that could be used for protein targeting to the organelle remained elusive. CONCLUSIONS A physicochemically distinct organellar sub-proteome was predicted from the genome of the anammox bacterium K. stuttgartiensis. This result provides strong in silico support for the existing experimental evidence for the existence of an organelle in this bacterium, and is an important step forward in unravelling a geochemically relevant case of cytoplasmic differentiation in bacteria. The predicted dual location of the Sec-translocation system and the apparent absence of a specific N- or C-terminal signal in the organellar proteins suggests that additional chaperones may be necessary that act on an as-yet unknown property of the targeted proteins.
Collapse
Affiliation(s)
- Marnix H Medema
- Department of Microbiology, Radboud University Nijmegen, Toernooiveld 1, 6525 ED Nijmegen, the Netherlands
| | | | | | | | | | | | | |
Collapse
|
36
|
Protein location prediction using atomic composition and global features of the amino acid sequence. Biochem Biophys Res Commun 2010; 391:1670-4. [DOI: 10.1016/j.bbrc.2009.12.118] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/14/2009] [Accepted: 12/21/2009] [Indexed: 11/17/2022]
|
37
|
Keerthikumar S, Bhadra S, Kandasamy K, Raju R, Ramachandra YL, Bhattacharyya C, Imai K, Ohara O, Mohan S, Pandey A. Prediction of candidate primary immunodeficiency disease genes using a support vector machine learning approach. DNA Res 2009; 16:345-51. [PMID: 19801557 PMCID: PMC2780952 DOI: 10.1093/dnares/dsp019] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/07/2023] Open
Abstract
Screening and early identification of primary immunodeficiency disease (PID) genes is a major challenge for physicians. Many resources have catalogued molecular alterations in known PID genes along with their associated clinical and immunological phenotypes. However, these resources do not assist in identifying candidate PID genes. We have recently developed a platform designated Resource of Asian PDIs, which hosts information pertaining to molecular alterations, protein-protein interaction networks, mouse studies and microarray gene expression profiling of all known PID genes. Using this resource as a discovery tool, we describe the development of an algorithm for prediction of candidate PID genes. Using a support vector machine learning approach, we have predicted 1442 candidate PID genes using 69 binary features of 148 known PID genes and 3162 non-PID genes as a training data set. The power of this approach is illustrated by the fact that six of the predicted genes have recently been experimentally confirmed to be PID genes. The remaining genes in this predicted data set represent attractive candidates for testing in patients where the etiology cannot be ascribed to any of the known PID genes.
Collapse
|
38
|
Qiu JD, Luo SH, Huang JH, Sun XY, Liang RP. Predicting subcellular location of apoptosis proteins based on wavelet transform and support vector machine. Amino Acids 2009; 38:1201-8. [DOI: 10.1007/s00726-009-0331-y] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/06/2009] [Accepted: 07/20/2009] [Indexed: 11/28/2022]
|
39
|
Khan A, Majid A, Choi TS. Predicting protein subcellular location: exploiting amino acid based sequence of feature spaces and fusion of diverse classifiers. Amino Acids 2009; 38:347-50. [DOI: 10.1007/s00726-009-0238-7] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2008] [Accepted: 01/12/2009] [Indexed: 10/21/2022]
|
40
|
Gao QB, Jin ZC, Ye XF, Wu C, He J. Prediction of nuclear receptors with optimal pseudo amino acid composition. Anal Biochem 2009; 387:54-9. [PMID: 19454254 DOI: 10.1016/j.ab.2009.01.018] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2008] [Revised: 12/04/2008] [Accepted: 01/09/2009] [Indexed: 10/21/2022]
Abstract
Nuclear receptors are involved in multiple cellular signaling pathways that affect and regulate processes such as organ development and maintenance, ion transport, homeostasis, and apoptosis. In this article, an optimal pseudo amino acid composition based on physicochemical characters of amino acids is suggested to represent proteins for predicting the subfamilies of nuclear receptors. Six physicochemical characters of amino acids were adopted to generate the protein sequence features via web server PseAAC. The optimal values of the rank of correlation factor and the weighting factor about PseAAC were determined to get the appropriate descriptor of proteins that leads to the best performance. A nonredundant dataset of nuclear receptors in four subfamilies is constructed to evaluate the method using support vector machines. An overall accuracy of 99.6% was achieved in the fivefold cross-validation test as well as the jackknife test, and an overall accuracy of 98.4% was reached in a blind dataset test. The performance is very competitive with that of some previous methods.
Collapse
Affiliation(s)
- Qing-Bin Gao
- Department of Health Statistics, Second Military Medical University, Shanghai 200433, China
| | | | | | | | | |
Collapse
|
41
|
Improving Protein Localization Prediction Using Amino Acid Group Based Physichemical Encoding. ACTA ACUST UNITED AC 2009. [DOI: 10.1007/978-3-642-00727-9_24] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/25/2023]
|
42
|
Prediction of subcellular location apoptosis proteins with ensemble classifier and feature selection. Amino Acids 2008; 38:975-83. [PMID: 19048186 DOI: 10.1007/s00726-008-0209-4] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/09/2008] [Accepted: 11/03/2008] [Indexed: 10/21/2022]
Abstract
Apoptosis proteins have a central role in the development and the homeostasis of an organism. These proteins are very important for understanding the mechanism of programmed cell death. The function of an apoptosis protein is closely related to its subcellular location. It is crucial to develop powerful tools to predict apoptosis protein locations for rapidly increasing gap between the number of known structural proteins and the number of known sequences in protein databank. In this study, amino acids pair compositions with different spaces are used to construct feature sets for representing sample of protein feature selection approach based on binary particle swarm optimization, which is applied to extract effective feature. Ensemble classifier is used as prediction engine, of which the basic classifier is the fuzzy K-nearest neighbor. Each basic classifier is trained with different feature sets. Two datasets often used in prior works are selected to validate the performance of proposed approach. The results obtained by jackknife test are quite encouraging, indicating that the proposed method might become a potentially useful tool for subcellular location of apoptosis protein, or at least can play a complimentary role to the existing methods in the relevant areas. The supplement information and software written in Matlab are available by contacting the corresponding author.
Collapse
|
43
|
Gupta K, Sehgal V, Levchenko A. A method for probabilistic mapping between protein structure and function taxonomies through cross training. BMC STRUCTURAL BIOLOGY 2008; 8:40. [PMID: 18834528 PMCID: PMC2573881 DOI: 10.1186/1472-6807-8-40] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/10/2008] [Accepted: 10/03/2008] [Indexed: 02/06/2023]
Abstract
BACKGROUND Prediction of function of proteins on the basis of structure and vice versa is a partially solved problem, largely in the domain of biophysics and biochemistry. This underlies the need of computational and bioinformatics approach to solve the problem. Large and organized latent knowledge on protein classification exists in the form of independently created protein classification databases. By creating probabilistic maps between classes of structural classification databases (e.g. SCOP) and classes of functional classification databases (e.g. PROSITE), structure and function of proteins could be probabilistically related. RESULTS We demonstrate that PROSITE and SCOP have significant semantic overlap, in spite of independent classification schemes. By training classifiers of SCOP using classes of PROSITE as attributes and vice versa, accuracy of Support Vector Machine classifiers for both SCOP and PROSITE was improved. Novel attributes, 2-D elastic profiles and Blocks were used to improve time complexity and accuracy. Many relationships were extracted between classes of SCOP and PROSITE using decision trees. CONCLUSION We demonstrate that presented approach can discover new probabilistic relationships between classes of different taxonomies and render a more accurate classification. Extensive mappings between existing protein classification databases can be created to link the large amount of organized data. Probabilistic maps were created between classes of SCOP and PROSITE allowing predictions of structure using function, and vice versa. In our experiments, we also found that functions are indeed more strongly related to structure than are structure to functions.
Collapse
Affiliation(s)
- Kshitiz Gupta
- The Whitaker Institute for Biomedical Engineering, Johns Hopkins School of Medicine, Baltimore, MD, USA
- Department of Computer Science & Engineering, Indian Institute of Technology, Bombay, Mumbai, India
| | - Vivek Sehgal
- Department of Computer Science & Engineering, Indian Institute of Technology, Bombay, Mumbai, India
- Department of Computer Science, University of Maryland, College ParkCollege Park, MD, USA
- Yahoo! Inc., 701 First Avenue, Sunnyvale, CA, USA
| | - Andre Levchenko
- The Whitaker Institute for Biomedical Engineering, Johns Hopkins School of Medicine, Baltimore, MD, USA
| |
Collapse
|
44
|
Using Chou’s pseudo amino acid composition to predict subcellular localization of apoptosis proteins: An approach with immune genetic algorithm-based ensemble classifier. Pattern Recognit Lett 2008. [DOI: 10.1016/j.patrec.2008.06.007] [Citation(s) in RCA: 135] [Impact Index Per Article: 7.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
|
45
|
Xiao X, Lin WZ, Chou KC. Using grey dynamic modeling and pseudo amino acid composition to predict protein structural classes. J Comput Chem 2008; 29:2018-24. [PMID: 18381630 DOI: 10.1002/jcc.20955] [Citation(s) in RCA: 64] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Abstract
Using the pseudo amino acid (PseAA) composition to represent the sample of a protein can incorporate a considerable amount of sequence pattern information so as to improve the prediction quality for its structural or functional classification. However, how to optimally formulate the PseAA composition is an important problem yet to be solved. In this article the grey modeling approach is introduced that is particularly efficient in coping with complicated systems such as the one consisting of many proteins with different sequence orders and lengths. On the basis of the grey model, four coefficients derived from each of the protein sequences concerned are adopted for its PseAA components. The PseAA composition thus formulated is called the "grey-PseAA" composition that can catch the essence of a protein sequence and better reflect its overall pattern. In our study we have demonstrated that introduction of the grey-PseAA composition can remarkably enhance the success rates in predicting the protein structural class. It is anticipated that the concept of grey-PseAA composition can be also used to predict many other protein attributes, such as subcellular localization, membrane protein type, enzyme functional class, GPCR type, protease type, among many others.
Collapse
Affiliation(s)
- Xuan Xiao
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333000, China.
| | | | | |
Collapse
|
46
|
Prediction of protein structural classes by Chou’s pseudo amino acid composition: approached using continuous wavelet transform and principal component analysis. Amino Acids 2008; 37:415-25. [DOI: 10.1007/s00726-008-0170-2] [Citation(s) in RCA: 66] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2008] [Accepted: 08/03/2008] [Indexed: 10/21/2022]
|
47
|
Xiao X, Wang P, Chou KC. Predicting protein structural classes with pseudo amino acid composition: an approach using geometric moments of cellular automaton image. J Theor Biol 2008; 254:691-6. [PMID: 18634802 DOI: 10.1016/j.jtbi.2008.06.016] [Citation(s) in RCA: 89] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/02/2008] [Revised: 06/18/2008] [Accepted: 06/18/2008] [Indexed: 11/28/2022]
Abstract
A novel approach was developed for predicting the structural classes of proteins based on their sequences. It was assumed that proteins belonging to the same structural class must bear some sort of similar texture on the images generated by the cellular automaton evolving rule [Wolfram, S., 1984. Cellular automation as models of complexity. Nature 311, 419-424]. Based on this, two geometric invariant moment factors derived from the image functions were used as the pseudo amino acid components [Chou, K.C., 2001. Prediction of protein cellular attributes using pseudo amino acid composition. Proteins: Struct., Funct., Genet. (Erratum: ibid., 2001, vol. 44, 60) 43, 246-255] to formulate the protein samples for statistical prediction. The success rates thus obtained on a previously constructed benchmark dataset are quite promising, implying that the cellular automaton image can help to reveal some inherent and subtle features deeply hidden in a pile of long and complicated amino acid sequences.
Collapse
Affiliation(s)
- Xuan Xiao
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 33300, China.
| | | | | |
Collapse
|
48
|
Feng Y, Luo L. Use of tetrapeptide signals for protein secondary-structure prediction. Amino Acids 2008; 35:607-14. [PMID: 18431531 DOI: 10.1007/s00726-008-0089-7] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2007] [Accepted: 03/04/2008] [Indexed: 10/22/2022]
Abstract
This paper develops a novel sequence-based method, tetra-peptide-based increment of diversity with quadratic discriminant analysis (TPIDQD for short), for protein secondary-structure prediction. The proposed TPIDQD method is based on tetra-peptide signals and is used to predict the structure of the central residue of a sequence fragment. The three-state overall per-residue accuracy (Q (3)) is about 80% in the threefold cross-validated test for 21-residue fragments in the CB513 dataset. The accuracy can be further improved by taking long-range sequence information (fragments of more than 21 residues) into account in prediction. The results show the tetra-peptide signals can indeed reflect some relationship between an amino acid's sequence and its secondary structure, indicating the importance of tetra-peptide signals as the protein folding code in the protein structure prediction.
Collapse
Affiliation(s)
- Yonge Feng
- Laboratory of Theoretical Biophysics, Faculty of Science and Technology, Inner Mongolia University, Hohhot, 010021, China.
| | | |
Collapse
|
49
|
Zhang SW, Chen W, Yang F, Pan Q. Using Chou's pseudo amino acid composition to predict protein quaternary structure: a sequence-segmented PseAAC approach. Amino Acids 2008; 35:591-8. [PMID: 18427713 DOI: 10.1007/s00726-008-0086-x] [Citation(s) in RCA: 71] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/19/2008] [Accepted: 02/28/2008] [Indexed: 12/11/2022]
Abstract
In the protein universe, many proteins are composed of two or more polypeptide chains, generally referred to as subunits, which associate through noncovalent interactions and, occasionally, disulfide bonds to form protein quaternary structures. It has long been known that the functions of proteins are closely related to their quaternary structures; some examples include enzymes, hemoglobin, DNA polymerase, and ion channels. However, it is extremely labor-expensive and even impossible to quickly determine the structures of hundreds of thousands of protein sequences solely from experiments. Since the number of protein sequences entering databanks is increasing rapidly, it is highly desirable to develop computational methods for classifying the quaternary structures of proteins from their primary sequences. Since the concept of Chou's pseudo amino acid composition (PseAAC) was introduced, a variety of approaches, such as residue conservation scores, von Neumann entropy, multiscale energy, autocorrelation function, moment descriptors, and cellular automata, have been utilized to formulate the PseAAC for predicting different attributes of proteins. Here, in a different approach, a sequence-segmented PseAAC is introduced to represent protein samples. Meanwhile, multiclass SVM classifier modules were adopted to classify protein quaternary structures. As a demonstration, the dataset constructed by Chou and Cai [(2003) Proteins 53:282-289] was adopted as a benchmark dataset. The overall jackknife success rates thus obtained were 88.2-89.1%, indicating that the new approach is quite promising for predicting protein quaternary structure.
Collapse
Affiliation(s)
- Shao-Wu Zhang
- College of Automation, Northwestern Polytechnical University, 710072, Xi'an, China.
| | | | | | | |
Collapse
|
50
|
Zhao XM, Chen L, Aihara K. Protein function prediction with high-throughput data. Amino Acids 2008; 35:517-30. [PMID: 18427717 DOI: 10.1007/s00726-008-0077-y] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2008] [Accepted: 03/13/2008] [Indexed: 12/12/2022]
Abstract
Protein function prediction is one of the main challenges in post-genomic era. The availability of large amounts of high-throughput data provides an alternative approach to handling this problem from the computational viewpoint. In this review, we provide a comprehensive description of the computational methods that are currently applicable to protein function prediction, especially from the perspective of machine learning. Machine learning techniques can generally be classified as supervised learning, semi-supervised learning and unsupervised learning. By classifying the existing computational methods for protein annotation into these three groups, we are able to present a comprehensive framework on protein annotation based on machine learning techniques. In addition to describing recently developed theoretical methodologies, we also cover representative databases and software tools that are widely utilized in the prediction of protein function.
Collapse
Affiliation(s)
- Xing-Ming Zhao
- ERATO Aihara Complexity Modelling Project, JST, Tokyo, 151-0064, Japan
| | | | | |
Collapse
|