1
|
iTAGPred: A Two-Level Prediction Model for Identification of Angiogenesis and Tumor Angiogenesis Biomarkers. Appl Bionics Biomech 2021; 2021:2803147. [PMID: 34616486 PMCID: PMC8490072 DOI: 10.1155/2021/2803147] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2021] [Accepted: 09/02/2021] [Indexed: 12/09/2022] Open
Abstract
A crucial biological process called angiogenesis plays a vital role in migration, growth, and wound healing of endothelial cells and other processes that are controlled by chemical signals. Angiogenesis is the process that controls the growth of blood vessels within tissues while angiogenesis proteins play a significant role in the proper working of this process. The balancing of these signals is necessary for the proper working of angiogenesis. Unbalancing of these signals increases blood vessel formation, which causes abnormal growth or several diseases including cancer. The proposed work focuses on developing a two-layered prediction model using different classifiers like random forest (RF), neural network, and support vector machine. The first level performs in silico identification of angiogenesis proteins based on the primary structure. In the case the protein is an angiogenesis protein, then the second level predicts whether the protein is linked with tumor angiogenesis or not. The performance of the model is evaluated through various validation techniques. The model was evaluated using k-fold cross-validation, independent, self-consistency, and jackknife testing. The overall accuracy using an RF classifier for angiogenesis at the first level was 97.8% and for tumor angiogenesis at the second level was 99.5%, ANN showed 94.1% accuracy for angiogenesis and 79.9% for tumor angiogenesis, and the accuracy of SVM for angiogenesis was 78.8% and for tumor angiogenesis was 65.19%.
Collapse
|
2
|
Perpetuo L, Klein J, Ferreira R, Guedes S, Amado F, Leite-Moreira A, Silva AMS, Thongboonkerd V, Vitorino R. How can artificial intelligence be used for peptidomics? Expert Rev Proteomics 2021; 18:527-556. [PMID: 34343059 DOI: 10.1080/14789450.2021.1962303] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023]
Abstract
INTRODUCTION Peptidomics is an emerging field of omics sciences using advanced isolation, analysis, and computational techniques that enable qualitative and quantitative analyses of various peptides in biological samples. Peptides can act as useful biomarkers and as therapeutic molecules for diseases. AREAS COVERED The use of therapeutic peptides can be predicted quickly and efficiently using data-driven computational methods, particularly artificial intelligence (AI) approach. Various AI approaches are useful for peptide-based drug discovery, such as support vector machine, random forest, extremely randomized trees, and other more recently developed deep learning methods. AI methods are relatively new to the development of peptide-based therapies, but these techniques already become essential tools in protein science by dissecting novel therapeutic peptides and their functions (Figure 1).[Figure: see text]. EXPERT OPINION Researchers have shown that AI models can facilitate the development of peptidomics and selective peptide therapies in the field of peptide science. Biopeptide prediction is important for the discovery and development of successful peptide-based drugs. Due to their ability to predict therapeutic roles based on sequence details, many AI-dependent prediction tools have been developed (Figure 1).
Collapse
Affiliation(s)
- Luís Perpetuo
- iBiMED, Department of Medical Sciences, University of Aveiro, Aveiro
| | - Julie Klein
- Institut National de la Santé et de la Recherche Médicale (INSERM), U1297, Institute of Cardiovascular and Metabolic Disease, Université Toulouse III, Toulouse, France
| | - Rita Ferreira
- LAQV/REQUIMTE, Department of Chemistry, University of Aveiro, Aveiro
| | - Sofia Guedes
- LAQV/REQUIMTE, Department of Chemistry, University of Aveiro, Aveiro
| | - Francisco Amado
- LAQV/REQUIMTE, Department of Chemistry, University of Aveiro, Aveiro
| | - Adelino Leite-Moreira
- UnIC, Departamento de Cirurgia e Fisiologia, Faculdade de Medicina da Universidade do Porto, Porto
| | - Artur M S Silva
- LAQV/REQUIMTE, Department of Chemistry, University of Aveiro, Aveiro
| | - Visith Thongboonkerd
- Medical Proteomics Unit, Office for Research and Development, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok 10700, Thailand
| | - Rui Vitorino
- iBiMED, Department of Medical Sciences, University of Aveiro, Aveiro.,LAQV/REQUIMTE, Department of Chemistry, University of Aveiro, Aveiro.,UnIC, Departamento de Cirurgia e Fisiologia, Faculdade de Medicina da Universidade do Porto, Porto
| |
Collapse
|
3
|
Feng P, Feng L, Tang C. Comparison and Analysis of Computational Methods for Identifying N6-Methyladenosine Sites in Saccharomyces cerevisiae. Curr Pharm Des 2021; 27:1219-1229. [PMID: 33167827 DOI: 10.2174/1381612826666201109110703] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/09/2020] [Accepted: 07/20/2020] [Indexed: 11/22/2022]
Abstract
BACKGROUND N6-methyladenosine (m6A) plays critical roles in a broad range of biological processes. Knowledge about the precise location of m6A site in the transcriptome is vital for deciphering its biological functions. Although experimental techniques have made substantial contributions to identify m6A, they are still labor intensive and time consuming. As complement to experimental methods, in the past few years, a series of computational approaches have been proposed to identify m6A sites. METHODS In order to facilitate researchers to select appropriate methods for identifying m6A sites, it is necessary to conduct a comprehensive review and comparison of existing methods. RESULTS Since research works on m6A in Saccharomyces cerevisiae are relatively clear, in this review, we summarized recent progress of computational prediction of m6A sites in S. cerevisiae and assessed the performance of existing computational methods. Finally, future directions of computationally identifying m6A sites are presented. CONCLUSION Taken together, we anticipate that this review will serve as an important guide for computational analysis of m6A modifications.
Collapse
Affiliation(s)
- Pengmian Feng
- School of Basic Medical Sciences, Chengdu University of Traditional Chinese Medicine, Chengdu 611730, China
| | - Lijing Feng
- School of Sciences, North China University of Science and Technology, Tangshan 063000, China
| | - Chaohui Tang
- School of Basic Medical Sciences, Chengdu University of Traditional Chinese Medicine, Chengdu 611730, China
| |
Collapse
|
4
|
Yang XF, Zhou YK, Zhang L, Gao Y, Du PF. Predicting LncRNA Subcellular Localization Using Unbalanced Pseudo-k Nucleotide Compositions. Curr Bioinform 2020. [DOI: 10.2174/1574893614666190902151038] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/17/2022]
Abstract
Background:
Long non-coding RNAs (lncRNAs) are transcripts with a length more
than 200 nucleotides, functioning in the regulation of gene expression. More evidence has shown
that the biological functions of lncRNAs are intimately related to their subcellular localizations.
Therefore, it is very important to confirm the lncRNA subcellular localization.
Methods:
In this paper, we proposed a novel method to predict the subcellular localization of
lncRNAs. To more comprehensively utilize lncRNA sequence information, we exploited both kmer
nucleotide composition and sequence order correlated factors of lncRNA to formulate
lncRNA sequences. Meanwhile, a feature selection technique which was based on the Analysis Of
Variance (ANOVA) was applied to obtain the optimal feature subset. Finally, we used the support
vector machine (SVM) to perform the prediction.
Results:
The AUC value of the proposed method can reach 0.9695, which indicated the proposed
predictor is an efficient and reliable tool for determining lncRNA subcellular localization. Furthermore,
the predictor can reach the maximum overall accuracy of 90.37% in leave-one-out cross
validation, which clearly outperforms the existing state-of- the-art method.
Conclusion:
It is demonstrated that the proposed predictor is feasible and powerful for the prediction
of lncRNA subcellular. To facilitate subsequent genetic sequence research, we shared the
source code at https://github.com/NicoleYXF/lncRNA.
Collapse
Affiliation(s)
- Xiao-Fei Yang
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
| | - Yuan-Ke Zhou
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
| | - Lin Zhang
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
| | - Yang Gao
- School of Medicine, Nankai University, Tianjin 300071, China
| | - Pu-Feng Du
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
| |
Collapse
|
5
|
Abstract
During the last three decades or so, many efforts have been made to study the protein cleavage
sites by some disease-causing enzyme, such as HIV (Human Immunodeficiency Virus) protease
and SARS (Severe Acute Respiratory Syndrome) coronavirus main proteinase. It has become increasingly
clear <i>via</i> this mini-review that the motivation driving the aforementioned studies is quite wise,
and that the results acquired through these studies are very rewarding, particularly for developing peptide
drugs.
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
Collapse
|
6
|
|
7
|
Some illuminating remarks on molecular genetics and genomics as well as drug development. Mol Genet Genomics 2020; 295:261-274. [PMID: 31894399 DOI: 10.1007/s00438-019-01634-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2019] [Accepted: 12/05/2019] [Indexed: 02/07/2023]
Abstract
Facing the explosive growth of biological sequences unearthed in the post-genomic age, one of the most important but also most difficult problems in computational biology is how to express a biological sequence with a discrete model or a vector, but still keep it with considerable sequence-order information or its special pattern. To deal with such a challenging problem, the ideas of "pseudo amino acid components" and "pseudo K-tuple nucleotide composition" have been proposed. The ideas and their approaches have further stimulated the birth for "distorted key theory", "wenxing diagram", and substantially strengthening the power in treating the multi-label systems, as well as the establishment of the famous "5-steps rule". All these logic developments are quite natural that are very useful not only for theoretical scientists but also for experimental scientists in conducting genetics/genomics analysis and drug development. Presented in this review paper are also their future perspectives; i.e., their impacts will become even more significant and propounding.
Collapse
|
8
|
Shao YT, Liu XX, Lu Z, Chou KC. pLoc_Deep-mHum: Predict Subcellular Localization of Human Proteins by Deep Learning. ACTA ACUST UNITED AC 2020. [DOI: 10.4236/ns.2020.127042] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
9
|
Shao Y, Chou KC. pLoc_Deep-mEuk: Predict Subcellular Localization of Eukaryotic Proteins by Deep Learning. ACTA ACUST UNITED AC 2020. [DOI: 10.4236/ns.2020.126034] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
10
|
pLoc_bal-mHum: Predict subcellular localization of human proteins by PseAAC and quasi-balancing training dataset. Genomics 2019; 111:1274-1282. [DOI: 10.1016/j.ygeno.2018.08.007] [Citation(s) in RCA: 56] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2018] [Revised: 08/14/2018] [Accepted: 08/16/2018] [Indexed: 12/17/2022]
|
11
|
Chou KC. Impacts of Pseudo Amino Acid Components and 5-steps Rule to Proteomics and Proteome Analysis. Curr Top Med Chem 2019; 19:2283-2300. [DOI: 10.2174/1568026619666191018100141] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/07/2019] [Revised: 08/18/2019] [Accepted: 08/26/2019] [Indexed: 01/27/2023]
Abstract
Stimulated by the 5-steps rule during the last decade or so, computational proteomics has achieved remarkable progresses in the following three areas: (1) protein structural class prediction; (2) protein subcellular location prediction; (3) post-translational modification (PTM) site prediction. The results obtained by these predictions are very useful not only for an in-depth study of the functions of proteins and their biological processes in a cell, but also for developing novel drugs against major diseases such as cancers, Alzheimer’s, and Parkinson’s. Moreover, since the targets to be predicted may have the multi-label feature, two sets of metrics are introduced: one is for inspecting the global prediction quality, while the other for the local prediction quality. All the predictors covered in this review have a userfriendly web-server, through which the majority of experimental scientists can easily obtain their desired data without the need to go through the complicated mathematics.
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 610054, China
| |
Collapse
|
12
|
Chou KC. Advances in Predicting Subcellular Localization of Multi-label Proteins and its Implication for Developing Multi-target Drugs. Curr Med Chem 2019; 26:4918-4943. [PMID: 31060481 DOI: 10.2174/0929867326666190507082559] [Citation(s) in RCA: 78] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2018] [Revised: 01/29/2019] [Accepted: 01/31/2019] [Indexed: 12/16/2022]
Abstract
The smallest unit of life is a cell, which contains numerous protein molecules. Most
of the functions critical to the cell’s survival are performed by these proteins located in its different
organelles, usually called ‘‘subcellular locations”. Information of subcellular localization
for a protein can provide useful clues about its function. To reveal the intricate pathways at the
cellular level, knowledge of the subcellular localization of proteins in a cell is prerequisite.
Therefore, one of the fundamental goals in molecular cell biology and proteomics is to determine
the subcellular locations of proteins in an entire cell. It is also indispensable for prioritizing
and selecting the right targets for drug development. Unfortunately, it is both timeconsuming
and costly to determine the subcellular locations of proteins purely based on experiments.
With the avalanche of protein sequences generated in the post-genomic age, it is highly
desired to develop computational methods for rapidly and effectively identifying the subcellular
locations of uncharacterized proteins based on their sequences information alone. Actually,
considerable progresses have been achieved in this regard. This review is focused on those
methods, which have the capacity to deal with multi-label proteins that may simultaneously
exist in two or more subcellular location sites. Protein molecules with this kind of characteristic
are vitally important for finding multi-target drugs, a current hot trend in drug development.
Focused in this review are also those methods that have use-friendly web-servers established so
that the majority of experimental scientists can use them to get the desired results without the
need to go through the detailed mathematics involved.
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
Collapse
|
13
|
Abstract
The smallest unit of life is a cell, which contains numerous protein molecules. Most
of the functions critical to the cell’s survival are performed by these proteins located in its different
organelles, usually called ‘‘subcellular locations”. Information of subcellular localization
for a protein can provide useful clues about its function. To reveal the intricate pathways at the
cellular level, knowledge of the subcellular localization of proteins in a cell is prerequisite.
Therefore, one of the fundamental goals in molecular cell biology and proteomics is to determine
the subcellular locations of proteins in an entire cell. It is also indispensable for prioritizing
and selecting the right targets for drug development. Unfortunately, it is both timeconsuming
and costly to determine the subcellular locations of proteins purely based on experiments.
With the avalanche of protein sequences generated in the post-genomic age, it is highly
desired to develop computational methods for rapidly and effectively identifying the subcellular
locations of uncharacterized proteins based on their sequences information alone. Actually,
considerable progresses have been achieved in this regard. This review is focused on those
methods, which have the capacity to deal with multi-label proteins that may simultaneously
exist in two or more subcellular location sites. Protein molecules with this kind of characteristic
are vitally important for finding multi-target drugs, a current hot trend in drug development.
Focused in this review are also those methods that have use-friendly web-servers established so
that the majority of experimental scientists can use them to get the desired results without the
need to go through the detailed mathematics involved.
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
Collapse
|
14
|
Chou KC. Proposing Pseudo Amino Acid Components is an Important Milestone for Proteome and Genome Analyses. Int J Pept Res Ther 2019. [DOI: 10.1007/s10989-019-09910-7] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/28/2022]
|
15
|
|
16
|
Xiao X, Cheng X, Chen G, Mao Q, Chou KC. pLoc_bal-mVirus: Predict Subcellular Localization of Multi-Label Virus Proteins by Chou's General PseAAC and IHTS Treatment to Balance Training Dataset. Med Chem 2019; 15:496-509. [DOI: 10.2174/1573406415666181217114710] [Citation(s) in RCA: 44] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/20/2018] [Revised: 10/23/2018] [Accepted: 12/12/2018] [Indexed: 12/17/2022]
Abstract
Background/Objective:Knowledge of protein subcellular localization is vitally important for both basic research and drug development. Facing the avalanche of protein sequences emerging in the post-genomic age, it is urgent to develop computational tools for timely and effectively identifying their subcellular localization based on the sequence information alone. Recently, a predictor called “pLoc-mVirus” was developed for identifying the subcellular localization of virus proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems in which some proteins, known as “multiplex proteins”, may simultaneously occur in, or move between two or more subcellular location sites. Despite the fact that it is indeed a very powerful predictor, more efforts are definitely needed to further improve it. This is because pLoc-mVirus was trained by an extremely skewed dataset in which some subset was over 10 times the size of the other subsets. Accordingly, it cannot avoid the biased consequence caused by such an uneven training dataset.Methods:Using the Chou's general PseAAC (Pseudo Amino Acid Composition) approach and the IHTS (Inserting Hypothetical Training Samples) treatment to balance out the training dataset, we have developed a new predictor called “pLoc_bal-mVirus” for predicting the subcellular localization of multi-label virus proteins.Results:Cross-validation tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mVirus, the existing state-of-theart predictor for the same purpose.Conclusion:Its user-friendly web-server is available at http://www.jci-bioinfo.cn/pLoc_balmVirus/, by which the majority of experimental scientists can easily get their desired results without the need to go through the detailed complicated mathematics. Accordingly, pLoc_bal-mVirus will become a very useful tool for designing multi-target drugs and in-depth understanding of the biological process in a cell.
Collapse
Affiliation(s)
- Xuan Xiao
- Gordon Life Science Institute, Boston, MA 02478, United States
| | - Xiang Cheng
- Gordon Life Science Institute, Boston, MA 02478, United States
| | - Genqiang Chen
- College of Chemistry, Chemical Engineering and Biotechnology, Donghua University, Shanghai 201620, China
| | - Qi Mao
- College of Information Science and Technology, Donghua University, Shanghai, China
| | - Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
Collapse
|
17
|
Niu B, Liang C, Lu Y, Zhao M, Chen Q, Zhang Y, Zheng L, Chou KC. Glioma stages prediction based on machine learning algorithm combined with protein-protein interaction networks. Genomics 2019; 112:837-847. [PMID: 31150762 DOI: 10.1016/j.ygeno.2019.05.024] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2019] [Accepted: 05/25/2019] [Indexed: 12/18/2022]
Abstract
BACKGROUND Glioma is the most lethal nervous system cancer. Recent studies have made great efforts to study the occurrence and development of glioma, but the molecular mechanisms are still unclear. This study was designed to reveal the molecular mechanisms of glioma based on protein-protein interaction network combined with machine learning methods. Key differentially expressed genes (DEGs) were screened and selected by using the protein-protein interaction (PPI) networks. RESULTS As a result, 19 genes between grade I and grade II, 21 genes between grade II and grade III, and 20 genes between grade III and grade IV. Then, five machine learning methods were employed to predict the gliomas stages based on the selected key genes. After comparison, Complement Naive Bayes classifier was employed to build the prediction model for grade II-III with accuracy 72.8%. And Random forest was employed to build the prediction model for grade I-II and grade III-VI with accuracy 97.1% and 83.2%, respectively. Finally, the selected genes were analyzed by PPI networks, Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, and the results improve our understanding of the biological functions of select DEGs involved in glioma growth. We expect that the key genes expressed have a guiding significance for the occurrence of gliomas or, at the very least, that they are useful for tumor researchers. CONCLUSION Machine learning combined with PPI networks, GO and KEGG analyses of selected DEGs improve our understanding of the biological functions involved in glioma growth.
Collapse
Affiliation(s)
- Bing Niu
- School of Life Sciences, Shanghai University, Shanghai 200444, China; Gordon Life Science Institute, Boston, MA 02478, USA.
| | - Chaofeng Liang
- Department of Neurosurgery, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, China
| | - Yi Lu
- School of Life Sciences, Shanghai University, Shanghai 200444, China
| | - Manman Zhao
- School of Life Sciences, Shanghai University, Shanghai 200444, China
| | - Qin Chen
- School of Life Sciences, Shanghai University, Shanghai 200444, China.
| | - Yuhui Zhang
- Renji Hospital, Medical School, Shanghai Jiaotong University, 160 Pujian Rd, New Pudong District, Shanghai 200127, China; Changhai Hospital, Second Military Medical University, Shanghai 200433, China.
| | - Linfeng Zheng
- Department of Radiology, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200080, China; Department of Radiology, Shanghai First People's Hospital, Baoshan Branch, Shanghai 200940, China.
| | - Kuo-Chen Chou
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; Gordon Life Science Institute, Boston, MA 02478, USA.
| |
Collapse
|
18
|
Yang W, Zhu XJ, Huang J, Ding H, Lin H. A Brief Survey of Machine Learning Methods in Protein Sub-Golgi Localization. Curr Bioinform 2019. [DOI: 10.2174/1574893613666181113131415] [Citation(s) in RCA: 111] [Impact Index Per Article: 18.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023]
Abstract
Background:The location of proteins in a cell can provide important clues to their functions in various biological processes. Thus, the application of machine learning method in the prediction of protein subcellular localization has become a hotspot in bioinformatics. As one of key organelles, the Golgi apparatus is in charge of protein storage, package, and distribution.Objective:The identification of protein location in Golgi apparatus will provide in-depth insights into their functions. Thus, the machine learning-based method of predicting protein location in Golgi apparatus has been extensively explored. The development of protein sub-Golgi apparatus localization prediction should be reviewed for providing a whole background for the fields.Method:The benchmark dataset, feature extraction, machine learning method and published results were summarized.Results:We briefly introduced the recent progresses in protein sub-Golgi apparatus localization prediction using machine learning methods and discussed their advantages and disadvantages.Conclusion:We pointed out the perspective of machine learning methods in protein sub-Golgi localization prediction.
Collapse
Affiliation(s)
- Wuritu Yang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
| | - Xiao-Juan Zhu
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
| | - Jian Huang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
| | - Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
| | - Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
| |
Collapse
|
19
|
MFSC: Multi-voting based feature selection for classification of Golgi proteins by adopting the general form of Chou's PseAAC components. J Theor Biol 2019; 463:99-109. [DOI: 10.1016/j.jtbi.2018.12.017] [Citation(s) in RCA: 39] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/03/2018] [Revised: 12/02/2018] [Accepted: 12/14/2018] [Indexed: 12/29/2022]
|
20
|
Abstract
Cancer is a serious health issue worldwide. Traditional treatment methods focus on killing cancer cells by using anticancer drugs or radiation therapy, but the cost of these methods is quite high, and in addition there are side effects. With the discovery of anticancer peptides, great progress has been made in cancer treatment. For the purpose of prompting the application of anticancer peptides in cancer treatment, it is necessary to use computational methods to identify anticancer peptides (ACPs). In this paper, we propose a sequence-based model for identifying ACPs (SAP). In our proposed SAP, the peptide is represented by 400D features or 400D features with g-gap dipeptide features, and then the unrelated features are pruned using the maximum relevance-maximum distance method. The experimental results demonstrate that our model performs better than some existing methods. Furthermore, our model has also been extended to other classifiers, and the performance is stable compared with some state-of-the-art works.
Collapse
Affiliation(s)
- Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen 518060, China.
| | - Guangmin Liang
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen 518060, China.
| | - Longjie Wang
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen 518060, China.
| | - Changrui Liao
- Key Laboratory of Optoelectronic Devices and Systems of Ministry of Education and Guangdong Province, College of Optoelectronic Engineering, Shenzhen University, Shenzhen 518060, China.
| |
Collapse
|
21
|
iACP: a sequence-based tool for identifying anticancer peptides. Oncotarget 2017; 7:16895-909. [PMID: 26942877 PMCID: PMC4941358 DOI: 10.18632/oncotarget.7815] [Citation(s) in RCA: 313] [Impact Index Per Article: 39.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2016] [Accepted: 02/11/2016] [Indexed: 02/07/2023] Open
Abstract
Cancer remains a major killer worldwide. Traditional methods of cancer treatment are expensive and have some deleterious side effects on normal cells. Fortunately, the discovery of anticancer peptides (ACPs) has paved a new way for cancer treatment. With the explosive growth of peptide sequences generated in the post genomic age, it is highly desired to develop computational methods for rapidly and effectively identifying ACPs, so as to speed up their application in treating cancer. Here we report a sequence-based predictor called iACP developed by the approach of optimizing the g-gap dipeptide components. It was demonstrated by rigorous cross-validations that the new predictor remarkably outperformed the existing predictors for the same purpose in both overall accuracy and stability. For the convenience of most experimental scientists, a publicly accessible web-server for iACP has been established at http://lin.uestc.edu.cn/server/iACP, by which users can easily obtain their desired results.
Collapse
|
22
|
Rahman MS, Rahman MK, Kaykobad M, Rahman MS. isGPT: An optimized model to identify sub-Golgi protein types using SVM and Random Forest based feature selection. Artif Intell Med 2017; 84:90-100. [PMID: 29183738 DOI: 10.1016/j.artmed.2017.11.003] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/05/2017] [Revised: 11/13/2017] [Accepted: 11/17/2017] [Indexed: 10/18/2022]
Abstract
The Golgi Apparatus (GA) is a key organelle for protein synthesis within the eukaryotic cell. The main task of GA is to modify and sort proteins for transport throughout the cell. Proteins permeate through the GA on the ER (Endoplasmic Reticulum) facing side (cis side) and depart on the other side (trans side). Based on this phenomenon, we get two types of GA proteins, namely, cis-Golgi protein and trans-Golgi protein. Any dysfunction of GA proteins can result in congenital glycosylation disorders and some other forms of difficulties that may lead to neurodegenerative and inherited diseases like diabetes, cancer and cystic fibrosis. So, the exact classification of GA proteins may contribute to drug development which will further help in medication. In this paper, we focus on building a new computational model that not only introduces easy ways to extract features from protein sequences but also optimizes classification of trans-Golgi and cis-Golgi proteins. After feature extraction, we have employed Random Forest (RF) model to rank the features based on the importance score obtained from it. After selecting the top ranked features, we have applied Support Vector Machine (SVM) to classify the sub-Golgi proteins. We have trained regression model as well as classification model and found the former to be superior. The model shows improved performance over all previous methods. As the benchmark dataset is significantly imbalanced, we have applied Synthetic Minority Over-sampling Technique (SMOTE) to the dataset to make it balanced and have conducted experiments on both versions. Our method, namely, identification of sub-Golgi Protein Types (isGPT), achieves accuracy values of 95.4%, 95.9% and 95.3% for 10-fold cross-validation test, jackknife test and independent test respectively. According to different performance metrics, isGPT performs better than state-of-the-art techniques. The source code of isGPT, along with relevant dataset and detailed experimental results, can be found at https://github.com/srautonu/isGPT.
Collapse
Affiliation(s)
- M Saifur Rahman
- Department of CSE, BUET, ECE Building, West Palasi, Dhaka 1205, Bangladesh.
| | | | - M Kaykobad
- Department of CSE, BUET, ECE Building, West Palasi, Dhaka 1205, Bangladesh.
| | - M Sohel Rahman
- Department of CSE, BUET, ECE Building, West Palasi, Dhaka 1205, Bangladesh.
| |
Collapse
|
23
|
Akbar S, Hayat M, Iqbal M, Jan MA. iACP-GAEnsC: Evolutionary genetic algorithm based ensemble classification of anticancer peptides by utilizing hybrid feature space. Artif Intell Med 2017; 79:62-70. [PMID: 28655440 DOI: 10.1016/j.artmed.2017.06.008] [Citation(s) in RCA: 96] [Impact Index Per Article: 12.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2017] [Revised: 06/12/2017] [Accepted: 06/16/2017] [Indexed: 01/10/2023]
Abstract
Cancer is a fatal disease, responsible for one-quarter of all deaths in developed countries. Traditional anticancer therapies such as, chemotherapy and radiation, are highly expensive, susceptible to errors and ineffective techniques. These conventional techniques induce severe side-effects on human cells. Due to perilous impact of cancer, the development of an accurate and highly efficient intelligent computational model is desirable for identification of anticancer peptides. In this paper, evolutionary intelligent genetic algorithm-based ensemble model, 'iACP-GAEnsC', is proposed for the identification of anticancer peptides. In this model, the protein sequences are formulated, using three different discrete feature representation methods, i.e., amphiphilic Pseudo amino acid composition, g-Gap dipeptide composition, and Reduce amino acid alphabet composition. The performance of the extracted feature spaces are investigated separately and then merged to exhibit the significance of hybridization. In addition, the predicted results of individual classifiers are combined together, using optimized genetic algorithm and simple majority technique in order to enhance the true classification rate. It is observed that genetic algorithm-based ensemble classification outperforms than individual classifiers as well as simple majority voting base ensemble. The performance of genetic algorithm-based ensemble classification is highly reported on hybrid feature space, with an accuracy of 96.45%. In comparison to the existing techniques, 'iACP-GAEnsC' model has achieved remarkable improvement in terms of various performance metrics. Based on the simulation results, it is observed that 'iACP-GAEnsC' model might be a leading tool in the field of drug design and proteomics for researchers.
Collapse
Affiliation(s)
- Shahid Akbar
- Department of Computer Science, Abdul Wali Khan University Mardan, KP 23200, Pakistan.
| | - Maqsood Hayat
- Department of Computer Science, Abdul Wali Khan University Mardan, KP 23200, Pakistan.
| | - Muhammad Iqbal
- Department of Computer Science, Abdul Wali Khan University Mardan, KP 23200, Pakistan.
| | - Mian Ahmad Jan
- Department of Computer Science, Abdul Wali Khan University Mardan, KP 23200, Pakistan.
| |
Collapse
|
24
|
Liu B, Wu H, Chou KC. Pse-in-One 2.0: An Improved Package of Web Servers for Generating Various Modes of Pseudo Components of DNA, RNA, and Protein Sequences. ACTA ACUST UNITED AC 2017. [DOI: 10.4236/ns.2017.94007] [Citation(s) in RCA: 91] [Impact Index Per Article: 11.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
25
|
Classifying Multifunctional Enzymes by Incorporating Three Different Models into Chou’s General Pseudo Amino Acid Composition. J Membr Biol 2016; 249:551-7. [DOI: 10.1007/s00232-016-9904-3] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/23/2015] [Accepted: 04/11/2016] [Indexed: 10/21/2022]
|
26
|
Yang R, Zhang C, Gao R, Zhang L. A Novel Feature Extraction Method with Feature Selection to Identify Golgi-Resident Protein Types from Imbalanced Data. Int J Mol Sci 2016; 17:218. [PMID: 26861308 PMCID: PMC4783950 DOI: 10.3390/ijms17020218] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/14/2015] [Accepted: 01/26/2016] [Indexed: 01/08/2023] Open
Abstract
The Golgi Apparatus (GA) is a major collection and dispatch station for numerous proteins destined for secretion, plasma membranes and lysosomes. The dysfunction of GA proteins can result in neurodegenerative diseases. Therefore, accurate identification of protein subGolgi localizations may assist in drug development and understanding the mechanisms of the GA involved in various cellular processes. In this paper, a new computational method is proposed for identifying cis-Golgi proteins from trans-Golgi proteins. Based on the concept of Common Spatial Patterns (CSP), a novel feature extraction technique is developed to extract evolutionary information from protein sequences. To deal with the imbalanced benchmark dataset, the Synthetic Minority Over-sampling Technique (SMOTE) is adopted. A feature selection method called Random Forest-Recursive Feature Elimination (RF-RFE) is employed to search the optimal features from the CSP based features and g-gap dipeptide composition. Based on the optimal features, a Random Forest (RF) module is used to distinguish cis-Golgi proteins from trans-Golgi proteins. Through the jackknife cross-validation, the proposed method achieves a promising performance with a sensitivity of 0.889, a specificity of 0.880, an accuracy of 0.885, and a Matthew's Correlation Coefficient (MCC) of 0.765, which remarkably outperforms previous methods. Moreover, when tested on a common independent dataset, our method also achieves a significantly improved performance. These results highlight the promising performance of the proposed method to identify Golgi-resident protein types. Furthermore, the CSP based feature extraction method may provide guidelines for protein function predictions.
Collapse
Affiliation(s)
- Runtao Yang
- School of Control Science and Engineering, Shandong University, Jinan 250061, China.
| | - Chengjin Zhang
- School of Control Science and Engineering, Shandong University, Jinan 250061, China.
- School of Mechanical, Electrical and Information Engineering, Shandong University atWeihai, Weihai 264209, China.
| | - Rui Gao
- School of Control Science and Engineering, Shandong University, Jinan 250061, China.
| | - Lina Zhang
- School of Control Science and Engineering, Shandong University, Jinan 250061, China.
| |
Collapse
|
27
|
Zou HL, Xiao X. Predicting the Functional Types of Singleplex and Multiplex Eukaryotic Membrane Proteins via Different Models of Chou's Pseudo Amino Acid Compositions. J Membr Biol 2015; 249:23-9. [PMID: 26458844 DOI: 10.1007/s00232-015-9830-9] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2015] [Accepted: 07/30/2015] [Indexed: 01/14/2023]
Abstract
Given a membrane protein sequence, how can we identify its type, particularly when a query protein may have the multiplex character, i.e., simultaneously exist at two or more different types. However, most of the existing predictors or methods can only be used to deal with the single-type or "singleplex" membrane proteins. Actually, multiple-type or "multiplex" membrane proteins should not be ignored because they usually posses some unique biological functions worthy of our special notice. In this study, three different models were developed, which have the ability to deal with the systems containing both singleplex and multiplex membrane proteins. The overall success rate thus obtained was 0.6440, indicating that the study may become a very useful high-throughput tool in identifying the functional types of membrane proteins.
Collapse
Affiliation(s)
- Hong-Liang Zou
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen, 333046, China.
| | - Xuan Xiao
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen, 333046, China.
- Information School, ZheJiang Textile & Fashion College, Ningbo, 315211, China.
- Gordon Life Science Institute, 53 South Cottage Road, Belmont, MA, 02478, USA.
| |
Collapse
|
28
|
Liu B, Chen J, Wang X. Protein remote homology detection by combining Chou’s distance-pair pseudo amino acid composition and principal component analysis. Mol Genet Genomics 2015; 290:1919-31. [DOI: 10.1007/s00438-015-1044-4] [Citation(s) in RCA: 61] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/05/2015] [Accepted: 04/06/2015] [Indexed: 02/07/2023]
|
29
|
Chen W, Lin H, Chou KC. Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences. MOLECULAR BIOSYSTEMS 2015; 11:2620-34. [DOI: 10.1039/c5mb00155b] [Citation(s) in RCA: 262] [Impact Index Per Article: 26.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
Abstract
With the avalanche of DNA/RNA sequences generated in the post-genomic age, it is urgent to develop automated methods for analyzing the relationship between the sequences and their functions.
Collapse
Affiliation(s)
- Wei Chen
- Department of Physics
- School of Sciences
- and Center for Genomics and Computational Biology
- Hebei United University
- Tangshan 063000
| | - Hao Lin
- Gordon Life Science Institute
- Boston
- USA
- Key Laboratory for Neuro-Information of Ministry of Education
- Center of Bioinformatics
| | - Kuo-Chen Chou
- Department of Physics
- School of Sciences
- and Center for Genomics and Computational Biology
- Hebei United University
- Tangshan 063000
| |
Collapse
|
30
|
A new multi-label classifier in identifying the functional types of human membrane proteins. J Membr Biol 2014; 248:179-86. [PMID: 25433431 DOI: 10.1007/s00232-014-9755-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2014] [Accepted: 11/11/2014] [Indexed: 10/24/2022]
Abstract
Membrane proteins were found to be involved in various cellular processes performing various important functions, which are mainly associated to their type. Given a membrane protein sequence, how can we identify its type(s)? Particularly, how can we deal with the multi-type problem since one membrane protein may simultaneously belong to two or more different types? To address these problems, which are obviously very important to both basic research and drug development, a new multi-label classifier was developed based on pseudo amino acid composition with multi-label k-nearest neighbor algorithm. The success rate achieved by the new predictor on the benchmark dataset by jackknife test is 73.94%, indicating that the method is promising and the predictor may become a very useful high-throughput tool, or at least play a complementary role to the existing predictors in identifying functional types of membrane proteins.
Collapse
|
31
|
Xu Y, Wen X, Wen LS, Wu LY, Deng NY, Chou KC. iNitro-Tyr: prediction of nitrotyrosine sites in proteins with general pseudo amino acid composition. PLoS One 2014; 9:e105018. [PMID: 25121969 PMCID: PMC4133382 DOI: 10.1371/journal.pone.0105018] [Citation(s) in RCA: 167] [Impact Index Per Article: 15.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2014] [Accepted: 07/16/2014] [Indexed: 12/31/2022] Open
Abstract
Nitrotyrosine is one of the post-translational modifications (PTMs) in proteins that occurs when their tyrosine residue is nitrated. Compared with healthy people, a remarkably increased level of nitrotyrosine is detected in those suffering from rheumatoid arthritis, septic shock, and coeliac disease. Given an uncharacterized protein sequence that contains many tyrosine residues, which one of them can be nitrated and which one cannot? This is a challenging problem, not only directly related to in-depth understanding the PTM’s mechanism but also to the nitrotyrosine-based drug development. Particularly, with the avalanche of protein sequences generated in the postgenomic age, it is highly desired to develop a high throughput tool in this regard. Here, a new predictor called “iNitro-Tyr” was developed by incorporating the position-specific dipeptide propensity into the general pseudo amino acid composition for discriminating the nitrotyrosine sites from non-nitrotyrosine sites in proteins. It was demonstrated via the rigorous jackknife tests that the new predictor not only can yield higher success rate but also is much more stable and less noisy. A web-server for iNitro-Tyr is accessible to the public at http://app.aporc.org/iNitro-Tyr/. For the convenience of most experimental scientists, we have further provided a protocol of step-by-step guide, by which users can easily get their desired results without the need to follow the complicated mathematics that were presented in this paper just for the integrity of its development process. It has not escaped our notice that the approach presented here can be also used to deal with the other PTM sites in proteins.
Collapse
Affiliation(s)
- Yan Xu
- Department of Information and Computer Science, University of Science and Technology Beijing, Beijing, China
- * E-mail:
| | - Xin Wen
- Department of Information and Computer Science, University of Science and Technology Beijing, Beijing, China
| | - Li-Shu Wen
- College of Sciences, Liaoning Shiyou University, FuShun, China
| | - Ling-Yun Wu
- Institute of Applied Mathematics, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
| | - Nai-Yang Deng
- College of Science, China Agricultural University, Beijing, China
| | - Kuo-Chen Chou
- Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, Saudi Arabia
- Gordon Life Science Institute, Boston, Massachusetts, United States of America
| |
Collapse
|
32
|
Screening drug target proteins based on sequence information. J Biomed Inform 2014; 49:269-74. [DOI: 10.1016/j.jbi.2014.03.009] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2013] [Revised: 02/05/2014] [Accepted: 03/14/2014] [Indexed: 01/15/2023]
|
33
|
iCTX-type: a sequence-based predictor for identifying the types of conotoxins in targeting ion channels. BIOMED RESEARCH INTERNATIONAL 2014; 2014:286419. [PMID: 24991545 PMCID: PMC4058692 DOI: 10.1155/2014/286419] [Citation(s) in RCA: 137] [Impact Index Per Article: 12.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/13/2014] [Revised: 04/22/2014] [Accepted: 05/07/2014] [Indexed: 11/30/2022]
Abstract
Conotoxins are small disulfide-rich neurotoxic peptides, which can bind to ion channels with very high specificity and modulate their activities. Over the last few decades, conotoxins have been the drug candidates for treating chronic pain, epilepsy, spasticity, and cardiovascular diseases. According to their functions and targets, conotoxins are generally categorized into three types: potassium-channel type, sodium-channel type, and calcium-channel types. With the avalanche of peptide sequences generated in the postgenomic age, it is urgent and challenging to develop an automated method for rapidly and accurately identifying the types of conotoxins based on their sequence information alone. To address this challenge, a new predictor, called iCTX-Type, was developed by incorporating the dipeptide occurrence frequencies of a conotoxin sequence into a 400-D (dimensional) general pseudoamino acid composition, followed by the feature optimization procedure to reduce the sample representation from 400-D to 50-D vector. The overall success rate achieved by iCTX-Type via a rigorous cross-validation was over 91%, outperforming its counterpart (RBF network). Besides, iCTX-Type is so far the only predictor in this area with its web-server available, and hence is particularly useful for most experimental scientists to get their desired results without the need to follow the complicated mathematics involved.
Collapse
|
34
|
iSS-PseDNC: identifying splicing sites using pseudo dinucleotide composition. BIOMED RESEARCH INTERNATIONAL 2014; 2014:623149. [PMID: 24967386 PMCID: PMC4055483 DOI: 10.1155/2014/623149] [Citation(s) in RCA: 97] [Impact Index Per Article: 8.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/19/2014] [Revised: 04/22/2014] [Accepted: 04/23/2014] [Indexed: 11/17/2022]
Abstract
In eukaryotic genes, exons are generally interrupted by introns. Accurately removing introns and joining exons together are essential processes in eukaryotic gene expression. With the avalanche of genome sequences generated in the postgenomic age, it is highly desired to develop automated methods for rapid and effective detection of splice sites that play important roles in gene structure annotation and even in RNA splicing. Although a series of computational methods were proposed for splice site identification, most of them neglected the intrinsic local structural properties. In the present study, a predictor called “iSS-PseDNC” was developed for identifying splice sites. In the new predictor, the sequences were formulated by a novel feature-vector called “pseudo dinucleotide composition” (PseDNC) into which six DNA local structural properties were incorporated. It was observed by the rigorous cross-validation tests on two benchmark datasets that the overall success rates achieved by iSS-PseDNC in identifying splice donor site and splice acceptor site were 85.45% and 87.73%, respectively. It is anticipated that iSS-PseDNC may become a useful tool for identifying splice sites and that the six DNA local structural properties described in this paper may provide novel insights for in-depth investigations into the mechanism of RNA splicing.
Collapse
|
35
|
Xu Y, Wen X, Shao XJ, Deng NY, Chou KC. iHyd-PseAAC: predicting hydroxyproline and hydroxylysine in proteins by incorporating dipeptide position-specific propensity into pseudo amino acid composition. Int J Mol Sci 2014; 15:7594-610. [PMID: 24857907 PMCID: PMC4057693 DOI: 10.3390/ijms15057594] [Citation(s) in RCA: 177] [Impact Index Per Article: 16.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/07/2014] [Revised: 04/04/2014] [Accepted: 04/17/2014] [Indexed: 11/16/2022] Open
Abstract
Post-translational modifications (PTMs) play crucial roles in various cell functions and biological processes. Protein hydroxylation is one type of PTM that usually occurs at the sites of proline and lysine. Given an uncharacterized protein sequence, which site of its Pro (or Lys) can be hydroxylated and which site cannot? This is a challenging problem, not only for in-depth understanding of the hydroxylation mechanism, but also for drug development, because protein hydroxylation is closely relevant to major diseases, such as stomach and lung cancers. With the avalanche of protein sequences generated in the post-genomic age, it is highly desired to develop computational methods to address this problem. In view of this, a new predictor called “iHyd-PseAAC” (identify hydroxylation by pseudo amino acid composition) was proposed by incorporating the dipeptide position-specific propensity into the general form of pseudo amino acid composition. It was demonstrated by rigorous cross-validation tests on stringent benchmark datasets that the new predictor is quite promising and may become a useful high throughput tool in this area. A user-friendly web-server for iHyd-PseAAC is accessible at http://app.aporc.org/iHyd-PseAAC/. Furthermore, for the convenience of the majority of experimental scientists, a step-by-step guide on how to use the web-server is given. Users can easily obtain their desired results by following these steps without the need of understanding the complicated mathematical equations presented in this paper just for its integrity.
Collapse
Affiliation(s)
- Yan Xu
- Department of Information and Computer Science, University of Science and Technology Beijing, Beijing 100083, China.
| | - Xin Wen
- Department of Information and Computer Science, University of Science and Technology Beijing, Beijing 100083, China.
| | - Xiao-Jian Shao
- Department of Mathematics and Information Science, Binzhou University, Binzhou 256603, China.
| | - Nai-Yang Deng
- College of Science, China Agricultural University, Beijing 100083, China.
| | - Kuo-Chen Chou
- Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah 21589, Saudi Arabia.
| |
Collapse
|
36
|
Fan YN, Xiao X, Min JL, Chou KC. iNR-Drug: predicting the interaction of drugs with nuclear receptors in cellular networking. Int J Mol Sci 2014; 15:4915-37. [PMID: 24651462 PMCID: PMC3975431 DOI: 10.3390/ijms15034915] [Citation(s) in RCA: 66] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/13/2014] [Revised: 02/12/2014] [Accepted: 02/16/2014] [Indexed: 12/20/2022] Open
Abstract
Nuclear receptors (NRs) are closely associated with various major diseases such as cancer, diabetes, inflammatory disease, and osteoporosis. Therefore, NRs have become a frequent target for drug development. During the process of developing drugs against these diseases by targeting NRs, we are often facing a problem: Given a NR and chemical compound, can we identify whether they are really in interaction with each other in a cell? To address this problem, a predictor called “iNR-Drug” was developed. In the predictor, the drug compound concerned was formulated by a 256-D (dimensional) vector derived from its molecular fingerprint, and the NR by a 500-D vector formed by incorporating its sequential evolution information and physicochemical features into the general form of pseudo amino acid composition, and the prediction engine was operated by the SVM (support vector machine) algorithm. Compared with the existing prediction methods in this area, iNR-Drug not only can yield a higher success rate, but is also featured by a user-friendly web-server established at http://www.jci-bioinfo.cn/iNR-Drug/, which is particularly useful for most experimental scientists to obtain their desired data in a timely manner. It is anticipated that the iNR-Drug server may become a useful high throughput tool for both basic research and drug development, and that the current approach may be easily extended to study the interactions of drug with other targets as well.
Collapse
Affiliation(s)
- Yue-Nong Fan
- Computer Department, Jing-De-Zhen Ceramic Institute, Jingdezhen 333046, Jiangxi, China.
| | - Xuan Xiao
- Computer Department, Jing-De-Zhen Ceramic Institute, Jingdezhen 333046, Jiangxi, China.
| | - Jian-Liang Min
- Computer Department, Jing-De-Zhen Ceramic Institute, Jingdezhen 333046, Jiangxi, China.
| | - Kuo-Chen Chou
- Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah 21589, Saudi Arabia.
| |
Collapse
|
37
|
Du P, Gu S, Jiao Y. PseAAC-General: fast building various modes of general form of Chou's pseudo-amino acid composition for large-scale protein datasets. Int J Mol Sci 2014; 15:3495-506. [PMID: 24577312 PMCID: PMC3975349 DOI: 10.3390/ijms15033495] [Citation(s) in RCA: 211] [Impact Index Per Article: 19.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/20/2014] [Revised: 02/13/2014] [Accepted: 02/14/2014] [Indexed: 11/16/2022] Open
Abstract
The general form pseudo-amino acid composition (PseAAC) has been widely used to represent protein sequences in predicting protein structural and functional attributes. We developed the program PseAAC-General to generate various different modes of Chou’s general PseAAC, such as the gene ontology mode, the functional domain mode, and the sequential evolution mode. This program allows the users to define their own desired modes. In every mode, 544 physicochemical properties of the amino acids are available for choosing. The computing efficiency is at least 100 times that of existing programs, which makes it able to facilitate the extensive studies on proteins and peptides. The PseAAC-General is freely available via SourceForge. It runs on both Linux and Windows.
Collapse
Affiliation(s)
- Pufeng Du
- School of Computer Science and Technology, Tianjin University, Tianjin 300072, China.
| | - Shuwang Gu
- School of Computer Science and Technology, Tianjin University, Tianjin 300072, China.
| | - Yasen Jiao
- School of Computer Science and Technology, Tianjin University, Tianjin 300072, China.
| |
Collapse
|
38
|
iRSpot-TNCPseAAC: identify recombination spots with trinucleotide composition and pseudo amino acid components. Int J Mol Sci 2014; 15:1746-66. [PMID: 24469313 PMCID: PMC3958819 DOI: 10.3390/ijms15021746] [Citation(s) in RCA: 211] [Impact Index Per Article: 19.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/02/2014] [Revised: 01/14/2014] [Accepted: 01/16/2014] [Indexed: 01/22/2023] Open
Abstract
Meiosis and recombination are the two opposite aspects that coexist in a DNA system. As a driving force for evolution by generating natural genetic variations, meiotic recombination plays a very important role in the formation of eggs and sperm. Interestingly, the recombination does not occur randomly across a genome, but with higher probability in some genomic regions called “hotspots”, while with lower probability in so-called “coldspots”. With the ever-increasing amount of genome sequence data in the postgenomic era, computational methods for effectively identifying the hotspots and coldspots have become urgent as they can timely provide us with useful insights into the mechanism of meiotic recombination and the process of genome evolution as well. To meet the need, we developed a new predictor called “iRSpot-TNCPseAAC”, in which a DNA sample was formulated by combining its trinucleotide composition (TNC) and the pseudo amino acid components (PseAAC) of the protein translated from the DNA sample according to its genetic codes. The former was used to incorporate its local or short-rage sequence order information; while the latter, its global and long-range one. Compared with the best existing predictor in this area, iRSpot-TNCPseAAC achieved higher rates in accuracy, Mathew’s correlation coefficient, and sensitivity, indicating that the new predictor may become a useful tool for identifying the recombination hotspots and coldspots, or, at least, become a complementary tool to the existing methods. It has not escaped our notice that the aforementioned novel approach to incorporate the DNA sequence order information into a discrete model may also be used for many other genome analysis problems. The web-server for iRSpot-TNCPseAAC is available at http://www.jci-bioinfo.cn/iRSpot-TNCPseAAC. Furthermore, for the convenience of the vast majority of experimental scientists, a step-by-step guide is provided on how to use the current web server to obtain their desired result without the need to follow the complicated mathematical equations.
Collapse
|