1
|
Liu T, Chen J, Zhang Q, Hippe K, Hunt C, Le T, Cao R, Tang H. The Development of Machine Learning Methods in discriminating Secretory Proteins of Malaria Parasite. Curr Med Chem 2021; 29:807-821. [PMID: 34636289 DOI: 10.2174/0929867328666211005140625] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2021] [Revised: 07/28/2021] [Accepted: 08/15/2021] [Indexed: 11/22/2022]
Abstract
Malaria caused by Plasmodium falciparum is one of the major infectious diseases in the world. It is essential to exploit an effective method to predict secretory proteins of malaria parasites to develop effective cures and treatment. Biochemical assays can provide details for accurate identification of the secretory proteins, but these methods are expensive and time-consuming. In this paper, we summarized the machine learning-based identification algorithms and compared the construction strategies between different computational methods. Also, we discussed the use of machine learning to improve the ability of algorithms to identify proteins secreted by malaria parasites.
Collapse
Affiliation(s)
- Ting Liu
- School of Basic Medical Sciences, Southwest Medical University, Luzhou. China
| | - Jiamao Chen
- School of Basic Medical Sciences, Southwest Medical University, Luzhou. China
| | - Qian Zhang
- School of Basic Medical Sciences, Southwest Medical University, Luzhou. China
| | - Kyle Hippe
- Department of Computer Science, Pacific Lutheran University. United States
| | - Cassandra Hunt
- Department of Computer Science, Pacific Lutheran University. United States
| | - Thu Le
- Department of Computer Science, Pacific Lutheran University. United States
| | - Renzhi Cao
- Department of Computer Science, Pacific Lutheran University. United States
| | - Hua Tang
- School of Basic Medical Sciences, Southwest Medical University, Luzhou. China
| |
Collapse
|
2
|
Abstract
Aims:
The discontinuous pattern of genome size variation in angiosperms is an unsolved
problem related to genome evolution. In this study, we introduced a genome evolution operator
and solved the related eigenvalue equation to deduce the discontinuous pattern.
Background:
Genome is a well-defined system for studying the evolution of species. One of the
basic problems is the genome size evolution. The DNA amounts for angiosperm species are highly
variable, differing over 1000-fold. One big surprise is the discovery of the discontinuous
distribution of nuclear DNA amounts in many angiosperm genera.
Objective:
The discontinuous distribution of nuclear DNA amounts has certain regularity, much
like a group of quantum states in atomic physics. The quantum pattern has not been explained by
all the evolutionary theories so far and we shall interpret it through the quantum simulation of
genome evolution.
Methods:
We introduced a genome evolution operator H to deduce the distribution of DNA
amount. The nuclear DNA amount in angiosperms is studied from the eigenvalue equation of the
genome evolution operator H. The operator H is introduced by physical simulation and it is
defined as a function of the genome size N and the derivative with respect to the size.
Results:
The discontinuity of DNA size distribution and its synergetic occurrence in related
angiosperms species are successfully deduced from the solution of the equation. The results agree
well with the existing experimental data of Aloe, Clarkia, Nicotiana, Lathyrus, Allium and other
genera.
Conclusion:
The success of our approach may infer the existence of a set of genomic evolutionary
equations satisfying classical-quantum duality. The classical phase of evolution means it obeys the
classical deterministic law, while the quantum phase means it obeys the quantum stochastic law.
The discontinuity of DNA size distribution provides novel evidences on the quantum evolution of
angiosperms. It has been realized that the discontinuous pattern is due to the existence of some
unknown evolutionary constraints. However, our study indicates that these constraints on the
angiosperm genome essentially originate from quantum.
Collapse
Affiliation(s)
- Liaofu Luo
- School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
| | - Lirong Zhang
- School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
| |
Collapse
|
3
|
Li Y, Zhang Z, Teng Z, Liu X. PredAmyl-MLP: Prediction of Amyloid Proteins Using Multilayer Perceptron. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2020; 2020:8845133. [PMID: 33294004 PMCID: PMC7700051 DOI: 10.1155/2020/8845133] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/14/2020] [Revised: 10/06/2020] [Accepted: 10/31/2020] [Indexed: 01/20/2023]
Abstract
Amyloid is generally an aggregate of insoluble fibrin; its abnormal deposition is the pathogenic mechanism of various diseases, such as Alzheimer's disease and type II diabetes. Therefore, accurately identifying amyloid is necessary to understand its role in pathology. We proposed a machine learning-based prediction model called PredAmyl-MLP, which consists of the following three steps: feature extraction, feature selection, and classification. In the step of feature extraction, seven feature extraction algorithms and different combinations of them are investigated, and the combination of SVMProt-188D and tripeptide composition (TPC) is selected according to the experimental results. In the step of feature selection, maximum relevant maximum distance (MRMD) and binomial distribution (BD) are, respectively, used to remove the redundant or noise features, and the appropriate features are selected according to the experimental results. In the step of classification, we employed multilayer perceptron (MLP) to train the prediction model. The 10-fold cross-validation results show that the overall accuracy of PredAmyl-MLP reached 91.59%, and the performance was better than the existing methods.
Collapse
Affiliation(s)
- Yanjuan Li
- College of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China
| | - Zitong Zhang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China
| | - Zhixia Teng
- College of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China
| | - Xiaoyan Liu
- College of Computer Science and Technology, Harbin Institute of Technology, Harbin 150040, China
| |
Collapse
|
4
|
Dao FY, Lv H, Wang F, Ding H. Recent Advances on the Machine Learning Methods in Identifying DNA Replication Origins in Eukaryotic Genomics. Front Genet 2018; 9:613. [PMID: 30619452 PMCID: PMC6295579 DOI: 10.3389/fgene.2018.00613] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2018] [Accepted: 11/21/2018] [Indexed: 01/01/2023] Open
Abstract
The initiate site of DNA replication is called origins of replication (ORI) which is regulated by a set of regulatory proteins and plays important roles in the basic biochemical process during cell growth and division in all living organisms. Therefore, the study of ORIs is essential for understanding the cell-division cycle and gene expression regulation so that scholars can develop a new strategy against genetic diseases by using the knowledge of DNA replication. Thus, the accurate identification of ORIs will provide key clues for DNA replication research and clinical medicine. Although, the conventional experiments could provide accurate results, they are time-consuming and cost ineffective. On the contrary, bioinformatics-based methods can overcome these shortcomings. Especially, with the emergence of DNA sequences in the post-genomic era, it is highly expected to develop high throughput tools to identify ORIs based on sequence information. In this review, we will summarize the current progress in computational prediction of eukaryotic ORIs including the collection of benchmark dataset, the application of machine learning-based techniques, the results obtained by these methods, and the construction of web servers. Finally, we gave the future perspectives on ORIs prediction. The review provided readers with a whole background of ORIs prediction based on machine learning methods, which will be helpful for researchers to study DNA replication in-depth and drug therapy of genetic defect.
Collapse
Affiliation(s)
- Fu-Ying Dao
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Hao Lv
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Fang Wang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| |
Collapse
|
5
|
Lai HY, Chen XX, Chen W, Tang H, Lin H. Sequence-based predictive modeling to identify cancerlectins. Oncotarget 2018; 8:28169-28175. [PMID: 28423655 PMCID: PMC5438640 DOI: 10.18632/oncotarget.15963] [Citation(s) in RCA: 90] [Impact Index Per Article: 15.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/18/2017] [Accepted: 02/24/2017] [Indexed: 11/25/2022] Open
Abstract
Lectins are a diverse type of glycoproteins or carbohydrate-binding proteins that have a wide distribution to various species. They can specially identify and exclusively bind to a certain kind of saccharide groups. Cancerlectins are a group of lectins that are closely related to cancer and play a major role in the initiation, survival, growth, metastasis and spread of tumor. Several computational methods have emerged to discriminate cancerlectins from non-cancerlectins, which promote the study on pathogenic mechanisms and clinical treatment of cancer. However, the predictive accuracies of most of these techniques are very limited. In this work, by constructing a benchmark dataset based on the CancerLectinDB database, a new amino acid sequence-based strategy for feature description was developed, and then the binomial distribution was applied to screen the optimal feature set. Ultimately, an SVM-based predictor was performed to distinguish cancerlectins from non-cancerlectins, and achieved an accuracy of 77.48% with AUC of 85.52% in jackknife cross-validation. The results revealed that our prediction model could perform better comparing with published predictive tools.
Collapse
Affiliation(s)
- Hong-Yan Lai
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Xin-Xin Chen
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Wei Chen
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China.,Department of Physics, School of Sciences, and Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan, Tangshan, China
| | - Hua Tang
- Department of Pathophysiology, Southwest Medical University, Luzhou, China
| | - Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| |
Collapse
|
6
|
Dao FY, Yang H, Su ZD, Yang W, Wu Y, Hui D, Chen W, Tang H, Lin H. Recent Advances in Conotoxin Classification by Using Machine Learning Methods. Molecules 2017; 22:molecules22071057. [PMID: 28672838 PMCID: PMC6152242 DOI: 10.3390/molecules22071057] [Citation(s) in RCA: 43] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/17/2017] [Revised: 06/12/2017] [Accepted: 06/19/2017] [Indexed: 11/16/2022] Open
Abstract
Conotoxins are disulfide-rich small peptides, which are invaluable peptides that target ion channel and neuronal receptors. Conotoxins have been demonstrated as potent pharmaceuticals in the treatment of a series of diseases, such as Alzheimer's disease, Parkinson's disease, and epilepsy. In addition, conotoxins are also ideal molecular templates for the development of new drug lead compounds and play important roles in neurobiological research as well. Thus, the accurate identification of conotoxin types will provide key clues for the biological research and clinical medicine. Generally, conotoxin types are confirmed when their sequence, structure, and function are experimentally validated. However, it is time-consuming and costly to acquire the structure and function information by using biochemical experiments. Therefore, it is important to develop computational tools for efficiently and effectively recognizing conotoxin types based on sequence information. In this work, we reviewed the current progress in computational identification of conotoxins in the following aspects: (i) construction of benchmark dataset; (ii) strategies for extracting sequence features; (iii) feature selection techniques; (iv) machine learning methods for classifying conotoxins; (v) the results obtained by these methods and the published tools; and (vi) future perspectives on conotoxin classification. The paper provides the basis for in-depth study of conotoxins and drug therapy research.
Collapse
Affiliation(s)
- Fu-Ying Dao
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
| | - Hui Yang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
| | - Zhen-Dong Su
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
| | - Wuritu Yang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
- Development and Planning Department, Inner Mongolia University, Hohhot 010021, China.
| | - Yun Wu
- College of Computer and Information Engineering, Xiamen University of Technology, Xiamen 361024, China.
| | - Ding Hui
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
| | - Wei Chen
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
- Department of Physics, School of Sciences, and Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan 063000, China.
| | - Hua Tang
- Department of Pathophysiology, Southwest Medical University, Luzhou 646000, China.
| | - Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
| |
Collapse
|
7
|
赵 微. Addition of Protein Secondary Structure Information for Prediction of Anticancer Peptide. Biophysics (Nagoya-shi) 2017. [DOI: 10.12677/biphy.2017.52002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/10/2023] Open
|
8
|
YongE F, GaoShan K. Identify Beta-Hairpin Motifs with Quadratic Discriminant Algorithm Based on the Chemical Shifts. PLoS One 2015; 10:e0139280. [PMID: 26422468 PMCID: PMC4589334 DOI: 10.1371/journal.pone.0139280] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2015] [Accepted: 09/09/2015] [Indexed: 01/13/2023] Open
Abstract
Successful prediction of the beta-hairpin motif will be helpful for understanding the of the fold recognition. Some algorithms have been proposed for the prediction of beta-hairpin motifs. However, the parameters used by these methods were primarily based on the amino acid sequences. Here, we proposed a novel model for predicting beta-hairpin structure based on the chemical shift. Firstly, we analyzed the statistical distribution of chemical shifts of six nuclei in not beta-hairpin and beta-hairpin motifs. Secondly, we used these chemical shifts as features combined with three algorithms to predict beta-hairpin structure. Finally, we achieved the best prediction, namely sensitivity of 92%, the specificity of 94% with 0.85 of Mathew’s correlation coefficient using quadratic discriminant analysis algorithm, which is clearly superior to the same method for the prediction of beta-hairpin structure from 20 amino acid compositions in the three-fold cross-validation. Our finding showed that the chemical shift is an effective parameter for beta-hairpin prediction, suggesting the quadratic discriminant analysis is a powerful algorithm for the prediction of beta-hairpin.
Collapse
Affiliation(s)
- Feng YongE
- College of Science, Inner Mongolia Agriculture University, Hohhot, PR China
- * E-mail:
| | - Kou GaoShan
- College of Science, Inner Mongolia Agriculture University, Hohhot, PR China
| |
Collapse
|
9
|
Kou G, Feng Y. Identify five kinds of simple super-secondary structures with quadratic discriminant algorithm based on the chemical shifts. J Theor Biol 2015; 380:392-8. [DOI: 10.1016/j.jtbi.2015.06.006] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/09/2015] [Revised: 06/02/2015] [Accepted: 06/04/2015] [Indexed: 10/23/2022]
|
10
|
Feng YE. Identify Secretory Protein of Malaria Parasite with Modified Quadratic Discriminant Algorithm and Amino Acid Composition. Interdiscip Sci 2015; 8:156-161. [DOI: 10.1007/s12539-015-0112-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/25/2014] [Revised: 12/15/2014] [Accepted: 03/16/2015] [Indexed: 12/13/2022]
|
11
|
Bioinformatics-Aided Venomics. Toxins (Basel) 2015; 7:2159-87. [PMID: 26110505 PMCID: PMC4488696 DOI: 10.3390/toxins7062159] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/01/2015] [Revised: 06/03/2015] [Accepted: 06/05/2015] [Indexed: 12/12/2022] Open
Abstract
Venomics is a modern approach that combines transcriptomics and proteomics to explore the toxin content of venoms. This review will give an overview of computational approaches that have been created to classify and consolidate venomics data, as well as algorithms that have helped discovery and analysis of toxin nucleic acid and protein sequences, toxin three-dimensional structures and toxin functions. Bioinformatics is used to tackle specific challenges associated with the identification and annotations of toxins. Recognizing toxin transcript sequences among second generation sequencing data cannot rely only on basic sequence similarity because toxins are highly divergent. Mass spectrometry sequencing of mature toxins is challenging because toxins can display a large number of post-translational modifications. Identifying the mature toxin region in toxin precursor sequences requires the prediction of the cleavage sites of proprotein convertases, most of which are unknown or not well characterized. Tracing the evolutionary relationships between toxins should consider specific mechanisms of rapid evolution as well as interactions between predatory animals and prey. Rapidly determining the activity of toxins is the main bottleneck in venomics discovery, but some recent bioinformatics and molecular modeling approaches give hope that accurate predictions of toxin specificity could be made in the near future.
Collapse
|
12
|
Wang X, Zhou Y, Yan R. AAFreqCoil: a new classifier to distinguish parallel dimeric and trimeric coiled coils. MOLECULAR BIOSYSTEMS 2015; 11:1794-801. [DOI: 10.1039/c5mb00119f] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
Abstract
Coiled coils are characteristic rope-like protein structures, constituted by one or more heptad repeats.
Collapse
Affiliation(s)
- Xiaofeng Wang
- School of Mathematics and Computer Science
- Shanxi Normal University
- Linfen 041004
- China
| | - Yuan Zhou
- College of Biological Sciences
- China Agricultural University
- Beijing 100193
- China
| | - Renxiang Yan
- Institute of Applied Genomics
- School of Biological Sciences and Engineering
- Fuzhou University
- Fuzhou 350108
- China
| |
Collapse
|
13
|
Zhu PP, Li WC, Zhong ZJ, Deng EZ, Ding H, Chen W, Lin H. Predicting the subcellular localization of mycobacterial proteins by incorporating the optimal tripeptides into the general form of pseudo amino acid composition. MOLECULAR BIOSYSTEMS 2015; 11:558-63. [DOI: 10.1039/c4mb00645c] [Citation(s) in RCA: 97] [Impact Index Per Article: 10.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
Abstract
Mycobacterium tuberculosis is a bacterium that causes tuberculosis, one of the most prevalent infectious diseases.
Collapse
Affiliation(s)
- Pan-Pan Zhu
- Key Laboratory for Neuro-Information of Ministry of Education
- Center of Bioinformatics
- School of Life Science and Technology
- University of Electronic Science and Technology of China
- Chengdu 610054
| | - Wen-Chao Li
- Key Laboratory for Neuro-Information of Ministry of Education
- Center of Bioinformatics
- School of Life Science and Technology
- University of Electronic Science and Technology of China
- Chengdu 610054
| | - Zhe-Jin Zhong
- Key Laboratory for Neuro-Information of Ministry of Education
- Center of Bioinformatics
- School of Life Science and Technology
- University of Electronic Science and Technology of China
- Chengdu 610054
| | - En-Ze Deng
- Key Laboratory for Neuro-Information of Ministry of Education
- Center of Bioinformatics
- School of Life Science and Technology
- University of Electronic Science and Technology of China
- Chengdu 610054
| | - Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education
- Center of Bioinformatics
- School of Life Science and Technology
- University of Electronic Science and Technology of China
- Chengdu 610054
| | - Wei Chen
- Department of Physics
- School of Sciences
- and Center for Genomics and Computational Biology
- Hebei United University
- Tangshan 063000
| | - Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education
- Center of Bioinformatics
- School of Life Science and Technology
- University of Electronic Science and Technology of China
- Chengdu 610054
| |
Collapse
|
14
|
Li Q, Dahl DB, Vannucci M, Hyun Joo, Tsai JW. Bayesian model of protein primary sequence for secondary structure prediction. PLoS One 2014; 9:e109832. [PMID: 25314659 PMCID: PMC4196994 DOI: 10.1371/journal.pone.0109832] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2014] [Accepted: 09/02/2014] [Indexed: 01/26/2023] Open
Abstract
Determining the primary structure (i.e., amino acid sequence) of a protein has become cheaper, faster, and more accurate. Higher order protein structure provides insight into a protein's function in the cell. Understanding a protein's secondary structure is a first step towards this goal. Therefore, a number of computational prediction methods have been developed to predict secondary structure from just the primary amino acid sequence. The most successful methods use machine learning approaches that are quite accurate, but do not directly incorporate structural information. As a step towards improving secondary structure reduction given the primary structure, we propose a Bayesian model based on the knob-socket model of protein packing in secondary structure. The method considers the packing influence of residues on the secondary structure determination, including those packed close in space but distant in sequence. By performing an assessment of our method on 2 test sets we show how incorporation of multiple sequence alignment data, similarly to PSIPRED, provides balance and improves the accuracy of the predictions. Software implementing the methods is provided as a web application and a stand-alone implementation.
Collapse
Affiliation(s)
- Qiwei Li
- Department of Statistics, Rice University, Houston, Texas, United States of America
| | - David B. Dahl
- Department of Statistics, Brigham Young University, Provo, Utah, United States of America
| | - Marina Vannucci
- Department of Statistics, Rice University, Houston, Texas, United States of America
| | - Hyun Joo
- Department of Chemistry, University of the Pacific, Stockton, California, United States of America
| | - Jerry W. Tsai
- Department of Chemistry, University of the Pacific, Stockton, California, United States of America
| |
Collapse
|
15
|
Ding H, Lin H, Chen W, Li ZQ, Guo FB, Huang J, Rao N. Prediction of protein structural classes based on feature selection technique. Interdiscip Sci 2014; 6:235-40. [PMID: 25205501 DOI: 10.1007/s12539-013-0205-6] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2013] [Revised: 10/02/2013] [Accepted: 11/12/2013] [Indexed: 11/24/2022]
Abstract
The prediction of protein structural classes is beneficial to understanding folding patterns, functions and interactions of proteins. In this study, we proposed a feature selection-based method to accurately predict protein structural classes. Three datasets with sequence identity lower than 25% were used to test the prediction performance of the method. Through jackknife cross-validation, we have verified that the overall accuracies of these three datasets are 92.1%, 89.7% and 84.0%, respectively. The proposed method is more efficient and accurate than other existing methods. The present study will offer an excellent alternative to other methods for predicting protein structural classes.
Collapse
Affiliation(s)
- Hui Ding
- Key Laboratory for NeuroInformation of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 610054, China
| | | | | | | | | | | | | |
Collapse
|
16
|
Feng Y, Luo L. Using long-range contact number information for protein secondary structure prediction. INT J BIOMATH 2014. [DOI: 10.1142/s1793524514500521] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
In this paper, we first combine tetra-peptide structural words with contact number for protein secondary structure prediction. We used the method of increment of diversity combined with quadratic discriminant analysis to predict the structure of central residue for a sequence fragment. The method is used tetra-peptide structural words and long-range contact number as information resources. The accuracy of Q3 is over 83% in 194 proteins. The accuracies of predicted secondary structures for 20 amino acid residues are ranged from 81% to 88%. Moreover, we have introduced the residue long-range contact, which directly indicates the separation of contacting residue in terms of the position in the sequence, and examined the negative influence of long-range residue interactions on predicting secondary structure in a protein. The method is also compared with existing prediction methods. The results show that our method is more effective in protein secondary structures prediction.
Collapse
Affiliation(s)
- Yonge Feng
- College of Science, Inner Mongolia Agriculture University, Hohhot 010018, P. R. China
| | - Liaofu Luo
- School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, P. R. China
| |
Collapse
|
17
|
Prediction of four kinds of simple supersecondary structures in protein by using chemical shifts. ScientificWorldJournal 2014; 2014:978503. [PMID: 25050407 PMCID: PMC4090465 DOI: 10.1155/2014/978503] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/07/2014] [Revised: 06/03/2014] [Accepted: 06/04/2014] [Indexed: 12/23/2022] Open
Abstract
Knowledge of supersecondary structures can provide important information about its spatial structure of protein. Some approaches have been developed for the prediction of protein supersecondary structure. However, the feature used by these approaches is primarily based on amino acid sequences. In this study, a novel model is presented to predict protein supersecondary structure by use of chemical shifts (CSs) information derived from nuclear magnetic resonance (NMR) spectroscopy. Using these CSs as inputs of the method of quadratic discriminant analysis (QD), we achieve the overall prediction accuracy of 77.3%, which is competitive with the same method for predicting supersecondary structures from amino acid compositions in threefold cross-validation. Moreover, our finding suggests that the combined use of different chemical shifts will influence the accuracy of prediction.
Collapse
|
18
|
Feng Y, Lin H, Luo L. Prediction of protein secondary structure using feature selection and analysis approach. Acta Biotheor 2014; 62:1-14. [PMID: 24052343 DOI: 10.1007/s10441-013-9203-7] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/29/2012] [Accepted: 08/24/2013] [Indexed: 01/09/2023]
Abstract
The prediction of the secondary structure of a protein from its amino acid sequence is an important step towards the prediction of its three-dimensional structure. However, the accuracy of ab initio secondary structure prediction from sequence is about 80% currently, which is still far from satisfactory. In this study, we proposed a novel method that uses binomial distribution to optimize tetrapeptide structural words and increment of diversity with quadratic discriminant to perform prediction for protein three-state secondary structure. A benchmark dataset including 2,640 proteins with sequence identity of less than 25% was used to train and test the proposed method. The results indicate that overall accuracy of 87.8% was achieved in secondary structure prediction by using ten-fold cross-validation. Moreover, the accuracy of predicted secondary structures ranges from 84 to 89% at the level of residue. These results suggest that the feature selection technique can detect the optimized tetrapeptide structural words which affect the accuracy of predicted secondary structures.
Collapse
|
19
|
Liu L, Li QZ, Lin H, Zuo YC. The effect of regions flanking target site on siRNA potency. Genomics 2013; 102:215-22. [PMID: 23891614 DOI: 10.1016/j.ygeno.2013.07.009] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2012] [Revised: 07/14/2013] [Accepted: 07/16/2013] [Indexed: 11/28/2022]
Abstract
For a successful RNA interference (RNAi) experiment, selecting the small interference RNA (siRNA) candidates which maximize the knock down effect of the given gene is the critical step. Although various computational approaches have been attempted, the design of efficient siRNA candidates is far from satisfactory yet. In this study, we proposed a novel feature selection algorithm of combined random forest and support vector machine to predict active siRNAs. Using a publically available dataset, we demonstrated that the predictive accuracy would be markedly improved when the context sequence features outside the target site were included. The Pearson correlation coefficient for regression is as high as 0.721, compared to 0.671, 0.668, 0.680, and 0.645, for Biopredsi, i-score, ThermoComposition21 and DSIR, respectively. It revealed that siRNA-target interaction requires appropriate sequence context not only in the target site but also in a broad region flanking the target site.
Collapse
Affiliation(s)
- Li Liu
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China.
| | | | | | | |
Collapse
|
20
|
Using over-represented tetrapeptides to predict protein submitochondria locations. Acta Biotheor 2013; 61:259-68. [PMID: 23475502 DOI: 10.1007/s10441-013-9181-9] [Citation(s) in RCA: 62] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2012] [Accepted: 02/23/2013] [Indexed: 01/25/2023]
Abstract
The mitochondrion is a key organelle of eukaryotic cell that provides the energy for cellular activities. Correctly identifying submitochondria locations of proteins can provide plentiful information for understanding their functions. However, using web-experimental methods to recognize submitochondria locations of proteins are time-consuming and costly. Thus, it is highly desired to develop a bioinformatics method to predict the submitochondria locations of mitochondrion proteins. In this work, a novel method based on support vector machine was developed to predict the submitochondria locations of mitochondrion proteins by using over-represented tetrapeptides selected by using binomial distribution. A reliable and rigorous benchmark dataset including 495 mitochondrion proteins with sequence identity ≤25% was constructed for testing and evaluating the proposed model. Jackknife cross-validated results showed that the 91.1% of the 495 mitochondrion proteins can be correctly predicted. Subsequently, our model was estimated by three existing benchmark datasets. The overall accuracies are 94.0, 94.7 and 93.4%, respectively, suggesting that the proposed model is potentially useful in the realm of mitochondrion proteome research. Based on this model, we built a predictor called TetraMito which is freely available at http://lin.uestc.edu.cn/server/TetraMito.
Collapse
|
21
|
Georgoulia PS, Glykos NM. On the foldability of tryptophan-containing tetra- and pentapeptides: an exhaustive molecular dynamics study. J Phys Chem B 2013; 117:5522-32. [PMID: 23597287 DOI: 10.1021/jp401239v] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023]
Abstract
Short peptides serve as minimal model systems to decipher the determinants of foldability due to their simplicity arising from their smaller size, their ability to echo protein-like structural characteristics, and their direct implication in force field validation. Here, we describe an effort to identify small peptides that can still form stable structures in aqueous solutions. We followed the in silico folding of a selected set of 8640 tryptophan-containing tetra- and pentapeptides through 15 210 molecular dynamics simulations amounting to a total of 272.46 μs using explicit representation of the solute and full treatment of the electrostatics. The evaluation and sorting of peptides is achieved through scoring functions, which include terms based on interatomic vector distances, atomic fluctuations, and rmsd matrices between successive frames of a trajectory. Highly scored peptides are studied further via successive simulation rounds of increasing simulation length and using different empirical force fields. Our method suggested only a handful of peptides with strong foldability prognosis. The discrepancies between the predictions of the various force fields for such short sequences are also extensively discussed. We conclude that the vast majority of such short peptides do not adopt significantly stable structures in water solutions, at least based on our computational predictions. The present work can be utilized in the rational design and engineering of bioactive peptides with desired molecular properties.
Collapse
Affiliation(s)
- Panagiota S Georgoulia
- Department of Molecular Biology and Genetics, Democritus University of Thrace, Alexandroupolis, Greece
| | | |
Collapse
|
22
|
LIN HAO, DING CHEN, YUAN LUFENG, CHEN WEI, DING HUI, LI ZIQIANG, GUO FENGBIAO, HUANG JIAN, RAO NINI. PREDICTING SUBCHLOROPLAST LOCATIONS OF PROTEINS BASED ON THE GENERAL FORM OF CHOU'S PSEUDO AMINO ACID COMPOSITION: APPROACHED FROM OPTIMAL TRIPEPTIDE COMPOSITION. INT J BIOMATH 2013. [DOI: 10.1142/s1793524513500034] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Chloroplasts are organelles found in plant cells that conduct photosynthesis. The subchloroplast locations of proteins are correlated with their functions. With the availability of a great number of protein data, it is highly desired to develop a computational method to predict the subchloroplast locations of chloroplast proteins. In this study, we proposed a novel method to predict subchloroplast locations of proteins using tripeptide compositions. It first used the binomial distribution to optimize the feature sets. Then the support vector machine was selected to perform the prediction of subchloroplast locations of proteins. The proposed method was tested on a reliable and rigorous dataset including 259 chloroplast proteins with sequence identity ≤ 25%. In the jack-knife cross-validation, 92.21% envelope proteins, 93.20% thylakoid membrane, 52.63% thylakoid lumen and 85.00% stroma can be correctly identified. The overall accuracy achieves 88.03% which is higher than that of other models. Based on this method, a predictor called ChloPred has been built and can be freely available from http://cobi.uestc.edu.cn/people/hlin/tools/ChloPred/ . The predictor will provide important information for theoretical and experimental research of chloroplast proteins.
Collapse
Affiliation(s)
- HAO LIN
- Key Laboratory for NeuroInformation of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, P. R. China
| | - CHEN DING
- Key Laboratory for NeuroInformation of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, P. R. China
| | - LU-FENG YUAN
- Key Laboratory for NeuroInformation of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, P. R. China
| | - WEI CHEN
- Center for Genomics and Computational Biology, Department of Physics, College of Sciences, Hebei United University, Tangshan 063000, P. R. China
| | - HUI DING
- Key Laboratory for NeuroInformation of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, P. R. China
| | - ZI-QIANG LI
- School of Information and Engineering, Sichuan Agricultural University, Yaan 625014, P. R. China
| | - FENG-BIAO GUO
- Key Laboratory for NeuroInformation of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, P. R. China
| | - JIAN HUANG
- Key Laboratory for NeuroInformation of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, P. R. China
| | - NI-NI RAO
- Key Laboratory for NeuroInformation of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, P. R. China
| |
Collapse
|
23
|
Prediction of the types of ion channel-targeted conotoxins based on radial basis function network. Toxicol In Vitro 2013; 27:852-6. [DOI: 10.1016/j.tiv.2012.12.024] [Citation(s) in RCA: 51] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2012] [Revised: 12/06/2012] [Accepted: 12/22/2012] [Indexed: 11/20/2022]
|
24
|
Ding C, Yuan LF, Guo SH, Lin H, Chen W. Identification of mycobacterial membrane proteins and their types using over-represented tripeptide compositions. J Proteomics 2012; 77:321-8. [DOI: 10.1016/j.jprot.2012.09.006] [Citation(s) in RCA: 82] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/11/2012] [Revised: 08/18/2012] [Accepted: 09/08/2012] [Indexed: 11/25/2022]
|
25
|
Chou KC. Some remarks on protein attribute prediction and pseudo amino acid composition. J Theor Biol 2010; 273:236-47. [PMID: 21168420 PMCID: PMC7125570 DOI: 10.1016/j.jtbi.2010.12.024] [Citation(s) in RCA: 956] [Impact Index Per Article: 68.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2010] [Revised: 12/08/2010] [Accepted: 12/13/2010] [Indexed: 11/29/2022]
Abstract
With the accomplishment of human genome sequencing, the number of sequence-known proteins has increased explosively. In contrast, the pace is much slower in determining their biological attributes. As a consequence, the gap between sequence-known proteins and attribute-known proteins has become increasingly large. The unbalanced situation, which has critically limited our ability to timely utilize the newly discovered proteins for basic research and drug development, has called for developing computational methods or high-throughput automated tools for fast and reliably identifying various attributes of uncharacterized proteins based on their sequence information alone. Actually, during the last two decades or so, many methods in this regard have been established in hope to bridge such a gap. In the course of developing these methods, the following things were often needed to consider: (1) benchmark dataset construction, (2) protein sample formulation, (3) operating algorithm (or engine), (4) anticipated accuracy, and (5) web-server establishment. In this review, we are to discuss each of the five procedures, with a special focus on the introduction of pseudo amino acid composition (PseAAC), its different modes and applications as well as its recent development, particularly in how to use the general formulation of PseAAC to reflect the core and essential features that are deeply hidden in complicated protein sequences.
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, 13784 Torrey Del Mar Drive, San Diego, CA 92130, USA.
| |
Collapse
|
26
|
Eukaryotic and prokaryotic promoter prediction using hybrid approach. Theory Biosci 2010; 130:91-100. [DOI: 10.1007/s12064-010-0114-8] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/04/2010] [Accepted: 10/23/2010] [Indexed: 12/27/2022]
|
27
|
Chen W, Luo L. Classification of antimicrobial peptide using diversity measure with quadratic discriminant analysis. J Microbiol Methods 2009; 78:94-6. [PMID: 19348863 DOI: 10.1016/j.mimet.2009.03.013] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/05/2009] [Revised: 03/20/2009] [Accepted: 03/30/2009] [Indexed: 11/27/2022]
Abstract
Accurate classification of antimicrobial peptides according to their biological activities will facilitate the design of novel antimicrobial agents and the discovery of new therapeutic targets. In this work, an excellent algorithm of Increment of Diversity with Quadratic Discriminant analysis (IDQD) was proposed to classify antimicrobial peptides with diverse biological activities.
Collapse
Affiliation(s)
- Wei Chen
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China
| | | |
Collapse
|