1
Singh L, Singh S, Singh DD. A Machine Learning Approach to Identify C Type Lectin Domain (CTLD) Containing Proteins. Protein J 2024; 43:718-725. [PMID: 39068630 DOI: 10.1007/s10930-024-10224-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 07/07/2024] [Indexed: 07/30/2024]
Abstract
Lectins are sugar-interacting proteins that bind specific glycans reversibly and are ubiquitous in all forms of life. They have diverse biological functions such as cell signaling and molecular recognition. C-type lectins (CTL) are a group of proteins from the lectin family that have been studied extensively in animals and are reported to be involved in immune functions, carcinogenesis, cell signaling, etc. The carbohydrate recognition domain (CRD) in CTL has a highly variable protein sequence, and proteins carrying this domain are also referred to as C-type lectin domain containing proteins (CTLD). Because of this low sequence homology, identification of CTLD from hypothetical proteins in sequenced genomes using homology-based programs has limitations. Machine learning (ML) tools use characteristic features to identify homologous sequences, and ML has been used here to develop a tool for identification of CTLD. Initially, 500 sequences of well-annotated CTLD and 500 sequences of non-CTLD were used in developing the machine learning model. The LinearSVC classifier from the scikit-learn library for Python was used, and characteristic features of CTLD sequences such as dipeptide and tripeptide composition were used as training attributes in various classifiers. Precision, recall and Matthews correlation coefficient (MCC) values of 0.92, 0.91 and 0.82, respectively, were obtained on an external test set. After fine-tuning parameters such as kernel, C value, gamma and degree, and increasing the number of non-CTLD sequences, precision, recall and MCC improved to 0.99, 0.99 and 0.96. New CTLD have also been identified in the hypothetical segment of the human genome using the trained model. The tool is available on our local server for interested users.
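The dipeptide composition feature and the MCC metric named in this abstract can be sketched in a few lines. This is only an illustration under assumptions: the function names `dipeptide_composition` and `mcc` are hypothetical, the handling of non-standard residues (skipping them) is a choice made here, and nothing below reproduces the authors' actual pipeline or results.

```python
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def dipeptide_composition(sequence):
    """Return a 400-dimensional vector of dipeptide frequencies.

    Assumption: pairs containing a non-standard residue are skipped.
    """
    dipeptides = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(dipeptides, 0)
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    # Normalise counts so the vector sums to 1 (or is all zeros if empty).
    return [counts[d] / total if total else 0.0 for d in dipeptides]


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


features = dipeptide_composition("ACAC")  # overlapping pairs: AC, CA, AC
```

A vector like `features` could then be passed to any classifier; tripeptide composition follows the same pattern with `repeat=3`, giving 8000 dimensions.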
Affiliation(s)
- Lovepreet Singh
  - Department of Biotechnology, Panjab University, Sector-25, Chandigarh, 160014, India
- Sukhwinder Singh
  - University Institute of Engineering & Technology, Panjab University, Sector-25, Chandigarh, 160014, India
- Desh Deepak Singh
  - Department of Biotechnology, Panjab University, Sector-25, Chandigarh, 160014, India
2
Qian Y, Ding Y, Zou Q, Guo F. Multi-View Kernel Sparse Representation for Identification of Membrane Protein Types. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:1234-1245. [PMID: 35857734 DOI: 10.1109/tcbb.2022.3191325] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Membrane proteins are the main carriers of biomembrane functions and play a vital role in many biological activities of organisms. Predicting membrane protein types is of great help in determining the function of proteins and understanding the interactions of membrane proteins. However, biochemical experiments are expensive and not suitable for large-scale identification of membrane protein types. Therefore, computational methods have been used to improve the efficiency of biological experiments. Most existing computational methods use only a single protein feature, or use multiple features but do not integrate them well. In our study, the protein sequence is described via three different views (features): amino acid composition, evolutionary information and physicochemical properties of amino acids. To exploit information among all views (features), we introduce a coupling strategy for Kernel Sparse Representation based Classification (KSRC) and construct a new model called Multi-view KSRC (MvKSRC). We implement our method on 4 benchmark data sets of membrane proteins. The comparison results indicate that our method is much superior to all existing methods.
3
Gogoi CR, Rahman A, Saikia B, Baruah A. Protein Dihedral Angle Prediction: The State of the Art. ChemistrySelect 2023. [DOI: 10.1002/slct.202203427] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/05/2023]
Affiliation(s)
- Aziza Rahman
  - Department of Chemistry, Dibrugarh University, Dibrugarh, Assam, India
- Bondeepa Saikia
  - Department of Chemistry, Dibrugarh University, Dibrugarh, Assam, India
- Anupaul Baruah
  - Department of Chemistry, Dibrugarh University, Dibrugarh, Assam, India
4
Moosaei H, Ganaie M, Hladík M, Tanveer M. Inverse free reduced universum twin support vector machine for imbalanced data classification. Neural Netw 2023; 157:125-135. [DOI: 10.1016/j.neunet.2022.10.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2022] [Revised: 10/04/2022] [Accepted: 10/04/2022] [Indexed: 11/09/2022]
5
A Review on Data-Driven Quality Prediction in the Production Process with Machine Learning for Industry 4.0. Processes (Basel) 2022. [DOI: 10.3390/pr10101966] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Open
Abstract
The quality-control process in manufacturing must ensure the product is free of defects and performs according to the customer’s expectations. Maintaining the quality of a firm’s products at the highest level is very important for keeping an edge over the competition. To maintain and enhance the quality of their products, manufacturers invest a lot of resources in quality control and quality assurance. During the assembly line, parts will arrive at a constant interval for assembly. The quality criteria must first be met before the parts are sent to the assembly line where the parts and subparts are assembled to get the final product. Once the product has been assembled, it is again inspected and tested before it is delivered to the customer. Because manufacturers are mostly focused on visual quality inspection, there can be bottlenecks before and after assembly. The manufacturer may suffer a loss if the assembly line is slowed down by this bottleneck. To improve quality, state-of-the-art sensors are being used to replace visual inspections and machine learning is used to help determine which part will fail. Using machine learning techniques, a review of quality assessment in various production processes is presented, along with a summary of the four industrial revolutions that have occurred in manufacturing, highlighting the need to detect anomalies in assembly lines, the need to detect the features of the assembly line, the use of machine learning algorithms in manufacturing, the research challenges, the computing paradigms, and the use of state-of-the-art sensors in Industry 4.0.
6
Waqas S, Harun NY, Sambudi NS, Arshad U, Nordin NAHM, Bilad MR, Saeed AAH, Malik AA. SVM and ANN Modelling Approach for the Optimization of Membrane Permeability of a Membrane Rotating Biological Contactor for Wastewater Treatment. Membranes 2022; 12:membranes12090821. [PMID: 36135840 PMCID: PMC9504877 DOI: 10.3390/membranes12090821] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 08/02/2022] [Revised: 08/15/2022] [Accepted: 08/17/2022] [Indexed: 05/31/2023]
Abstract
Membrane fouling significantly hinders the widespread application of membrane technology. In the current study, a support vector machine (SVM) and artificial neural networks (ANN) modelling approach was adopted to optimize the membrane permeability in a novel membrane rotating biological contactor (MRBC). The MRBC utilizes the disk rotation mechanism to generate a shear rate at the membrane surface to scour off the foulants. The effect of operational parameters (disk rotational speed, hydraulic retention time (HRT), and sludge retention time (SRT)) on the membrane permeability was studied. ANN and SVM are machine learning algorithms that make predictions from a model fitted to training data sets. The implementation and efficacy of machine learning and statistical approaches have been demonstrated through real-time experimental results. Feed-forward ANN with the back-propagation algorithm and SVM regression models for various kernel functions were trained to augment the membrane permeability. An overall comparison of predictive models for the test data sets reveals the model’s significance. ANN modelling with 13 hidden layers gives the highest R² value of >0.99, and the SVM model with the Bayesian optimizer approach results in R² values higher than 0.99. The MRBC is a promising substitute for traditional suspended growth processes, which aligns with the stipulations of ecological evolution and environmentally friendly treatment.
Affiliation(s)
- Sharjeel Waqas
  - Chemical Engineering Department, Universiti Teknologi PETRONAS, Bandar Seri Iskandar 32610, Perak, Malaysia
- Noorfidza Yub Harun
  - Chemical Engineering Department, Universiti Teknologi PETRONAS, Bandar Seri Iskandar 32610, Perak, Malaysia
- Nonni Soraya Sambudi
  - Department of Chemical Engineering, Universitas Pertamina, Simprug, Jakarta Selatan 12220, Indonesia
- Ushtar Arshad
  - Chemical Engineering Department, Universiti Teknologi PETRONAS, Bandar Seri Iskandar 32610, Perak, Malaysia
- Nik Abdul Hadi Md Nordin
  - Chemical Engineering Department, Universiti Teknologi PETRONAS, Bandar Seri Iskandar 32610, Perak, Malaysia
- Muhammad Roil Bilad
  - Faculty of Integrated Technologies, Universiti Brunei Darussalam, Gadong BE1410, Brunei
- Anwar Ameen Hezam Saeed
  - Chemical Engineering Department, Universiti Teknologi PETRONAS, Bandar Seri Iskandar 32610, Perak, Malaysia
- Asher Ahmed Malik
  - Chemical Engineering Department, Universiti Teknologi PETRONAS, Bandar Seri Iskandar 32610, Perak, Malaysia
7
Lu W, Shen J, Zhang Y, Wu H, Qian Y, Chen X, Fu Q. Identifying Membrane Protein Types Based on Lifelong Learning With Dynamically Scalable Networks. Front Genet 2022; 12:834488. [PMID: 35371189 PMCID: PMC8964460 DOI: 10.3389/fgene.2021.834488] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/13/2021] [Accepted: 12/21/2021] [Indexed: 11/13/2022] Open
Abstract
Membrane proteins are an essential part of the body's ability to maintain normal life activities. Further research into membrane proteins, which are present in all aspects of life science research, will help to advance the development of cells and drugs. The current methods for predicting proteins are usually based on machine learning, but further improvements in prediction effectiveness and accuracy are needed. In this paper, we propose a dynamic deep network architecture based on lifelong learning in order to use computers to classify membrane proteins more effectively. The model extends the application area of lifelong learning and provides new ideas for multiple classification problems in bioinformatics. To demonstrate the performance of our model, we conducted experiments on top of two datasets and compared them with other classification methods. The results show that our model achieves high accuracy (95.3 and 93.5%) on benchmark datasets and is more effective compared to other methods.
Affiliation(s)
- Weizhong Lu
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China; Suzhou Key Laboratory of Virtual Reality Intelligent Interaction and Application Technology, Suzhou University of Science and Technology, Suzhou, China; Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, China
- Jiawei Shen
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
- Yu Zhang
  - Suzhou Industrial Park Institute of Services Outsourcing, Suzhou, China
- Hongjie Wu
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China; Suzhou Key Laboratory of Virtual Reality Intelligent Interaction and Application Technology, Suzhou University of Science and Technology, Suzhou, China
- Yuqing Qian
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
- Xiaoyi Chen
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
- Qiming Fu
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
8
Zhang Z, Wang L. Using Chou's 5-steps rule to identify N6-methyladenine sites by ensemble learning combined with multiple feature extraction methods. J Biomol Struct Dyn 2022; 40:796-806. [PMID: 32948102 DOI: 10.1080/07391102.2020.1821778] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
Abstract
N6-methyladenine (m6A), a type of modification mostly affecting downstream biological functions and determining the levels of gene expression, is mediated by the methylation of adenine in nucleic acids. It is also a key factor influencing biological processes and has attracted attention as a target for treating diseases. Here, an ensemble predictor named TL-Methy was developed to identify m6A sites across the genome. TL-Methy is a 2-level machine learning method developed by combining the support vector machine model with multiple feature extraction methods, including nucleic acid composition, di-nucleotide composition, tri-nucleotide composition, position-specific trinucleotide propensity, Bi-profile Bayes, binary encoding, and accumulated nucleotide frequency. For Homo sapiens, the TL-Methy method reached an accuracy of 91.68% on the jackknife test and 92.23% on the 10-fold cross validation test; for Mus musculus, it achieved 93.66% on the jackknife test and 97.07% on the 10-fold cross validation test; for Saccharomyces cerevisiae, it obtained 81.57% on the jackknife test and 82.54% on the 10-fold cross validation test; for the rice genome, it achieved 91.87% on the jackknife test and 93.04% on the 10-fold cross validation test. The results from these two test approaches demonstrated the robustness and practicality of our TL-Methy model. The TL-Methy model may serve as a potential method for m6A site identification. Communicated by Ramaswamy H. Sarma.
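The two validation schemes reported above differ only in how samples are partitioned. The sketch below illustrates that relationship: the jackknife test is the limiting case of k-fold cross validation with one sample per fold. The function names are hypothetical, and the absence of shuffling or stratification is a simplification assumed here, not something stated in the abstract.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds.

    Assumption: samples are taken in their given order, with no
    shuffling or class stratification.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def jackknife_splits(n_samples):
    """Jackknife (leave-one-out): each sample is the test set exactly once."""
    return k_fold_splits(n_samples, n_samples)


ten_fold = list(k_fold_splits(100, 10))   # 10 folds of 10 samples each
leave_one_out = list(jackknife_splits(4))  # 4 folds of 1 sample each
```

Accuracy under either scheme is the fraction of held-out samples predicted correctly, pooled over all folds.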
Affiliation(s)
- Zhongwang Zhang
  - College of Science, Dalian Maritime University, Dalian, P.R. China
- Lidong Wang
  - College of Science, Dalian Maritime University, Dalian, P.R. China
9
The Limitations in Current Studies of Organic Fouling and Future Prospects. Membranes 2021; 11:membranes11120922. [PMID: 34940423 PMCID: PMC8708778 DOI: 10.3390/membranes11120922] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/27/2021] [Revised: 11/22/2021] [Accepted: 11/24/2021] [Indexed: 11/16/2022]
Abstract
Microfiltration and ultrafiltration for water/wastewater treatment have gained global attention due to their high separation efficiency, while membrane fouling still remains one of their bottlenecks. In such a situation, many researchers attempt to obtain a deep understanding of fouling mechanisms and to develop effective fouling controls. Therefore, this article intends to trigger discussions on the appropriate choice of foulant surrogates and the application of mathematic models to analyze fouling mechanisms in these filtration processes. It has been found that the commonly used foulant surrogate (sodium alginate) cannot ideally represent the organic foulants in practical feed water to explore the fouling mechanisms. More surrogate foulants or extracellular polymeric substance (EPS) extracted from practical source water may be more suitable for use in the studies of membrane fouling problems. On the other hand, the support vector machine (SVM) which focuses on the general trends of filtration data may work as a more powerful simulation tool than traditional empirical models to predict complex filtration behaviors. Careful selection of foulant surrogate substances and the application of accurate mathematical modeling for fouling mechanisms would provide deep insights into the fouling problems.
10
iMPT-FDNPL: Identification of Membrane Protein Types with Functional Domains and a Natural Language Processing Approach. Computational and Mathematical Methods in Medicine 2021; 2021:7681497. [PMID: 34671418 PMCID: PMC8523280 DOI: 10.1155/2021/7681497] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 08/22/2021] [Revised: 09/15/2021] [Accepted: 09/27/2021] [Indexed: 12/20/2022]
Abstract
Membrane proteins are an important class of proteins. They play essential roles in several cellular processes. Based on their intramolecular arrangements and positions in a cell, membrane proteins can be divided into several types. It is reported that the types of a membrane protein are highly related to its functions. Determination of membrane protein types has been a hot topic in recent years. Plenty of computational methods have been proposed so far. Some of them used functional domain information to encode proteins. However, this procedure was still crude. In this study, we designed a novel feature extraction scheme to obtain informative features of proteins from their functional domain information. The scheme treated domains as words and proteins, represented by their domains, as sentences. The natural language processing approach, word2vector, was applied to obtain the features of domains, which were further refined into protein features. Based on these features, RAndom k-labELsets with random forest as the base classifier was employed to build the multilabel classifier, namely, iMPT-FDNPL. The tenfold cross-validation results indicated the good performance of the classifier. Furthermore, the classifier was superior to other classifiers based on features derived from functional domains via a one-hot scheme or derived from other properties of proteins, suggesting the effectiveness of the protein features generated by the proposed scheme.
11
Cao Y, Yu C, Huang S, Wang S, Zuo Y, Yang L. Characterization and Prediction of Presynaptic and Postsynaptic Neurotoxins Based on Reduced Amino Acids and Biological Properties. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200707150512] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/22/2022]
Abstract
Background:
Presynaptic and postsynaptic neurotoxins are two important neurotoxins. Due to the important role of presynaptic and postsynaptic neurotoxins in pharmacology and neuroscience, identification of them becomes very important in biology.
Method:
In this study, the statistical test and F-score were used to calculate the difference between amino acids and biological properties. The support vector machine was used to predict the presynaptic and postsynaptic neurotoxins by using the reduced amino acid alphabet types.
Results:
By using the reduced amino acid alphabet as the input parameters of support vector machine, the overall accuracy of our classifier had increased to 91.07%, which was the highest overall accuracy in this study. When compared with the other published methods, better predictive results were obtained by our classifier.
Conclusion:
In summary, we analyzed the differences between two neurotoxins in amino acids and biological properties, and constructed a classifier that could predict these two neurotoxins by using the reduced amino acid alphabet.
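A reduced amino acid alphabet maps the 20 residues onto a smaller set of group symbols before features are computed. The sketch below shows the idea only. The six-group scheme in `REDUCED_GROUPS` is an illustrative assumption based on broad physicochemical classes, not the grouping used in the paper, and the function names are hypothetical.

```python
# Hypothetical six-group reduction; the paper's actual groupings are not
# given in the abstract.
REDUCED_GROUPS = {
    "a": "AVLIMC",  # aliphatic / hydrophobic
    "r": "FWYH",    # aromatic
    "p": "STNQ",    # polar, uncharged
    "k": "KR",      # positively charged
    "d": "DE",      # negatively charged
    "g": "GP",      # conformationally special
}
RESIDUE_TO_GROUP = {
    residue: group
    for group, members in REDUCED_GROUPS.items()
    for residue in members
}


def reduce_sequence(sequence):
    """Rewrite a protein sequence in the reduced alphabet.

    Residues outside the 20 standard amino acids become 'x'.
    """
    return "".join(RESIDUE_TO_GROUP.get(res, "x") for res in sequence.upper())


def reduced_composition(sequence):
    """Fraction of each reduced-alphabet group in the sequence."""
    reduced = reduce_sequence(sequence)
    return {
        group: (reduced.count(group) / len(reduced) if reduced else 0.0)
        for group in REDUCED_GROUPS
    }


composition = reduced_composition("AAKK")
```

The resulting group fractions form a 6-dimensional input vector instead of a 20-dimensional one, which is the dimensionality reduction such alphabets are meant to provide.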
Affiliation(s)
- Yiyin Cao
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
- Chunlu Yu
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
- Shenghui Huang
  - The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China
- Shiyuan Wang
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
- Yongchun Zuo
  - The State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot, 010070, China
- Lei Yang
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
12
Prediction of Maize Yield at the City Level in China Using Multi-Source Data. Remote Sensing 2021. [DOI: 10.3390/rs13010146] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 01/13/2023]
Abstract
Maize is a widely grown crop in China, and the relationships between agroclimatic parameters and maize yield are complicated; hence, accurate and timely yield prediction is challenging. Here, climate, satellite data, and meteorological indices were integrated to predict maize yield at the city level in China from 2000 to 2015 using four machine learning approaches, namely Cubist, random forest (RF), extreme gradient boosting (Xgboost), and support vector machine (SVM). The climate variables included the diffuse flux of photosynthetic active radiation (PDf), the diffuse flux of shortwave radiation (SDf), the direct flux of shortwave radiation (SDr), minimum temperature (Tmn), potential evapotranspiration (Pet), vapor pressure deficit (Vpd), vapor pressure (Vap), and wet day frequency (Wet). Satellite data, including the enhanced vegetation index (EVI), normalized difference vegetation index (NDVI), and soil-adjusted vegetation index (SAVI) from the Moderate Resolution Imaging Spectroradiometer (MODIS), were used. Meteorological indices, including growing degree day (GDD), extreme degree day (EDD), and the Standardized Precipitation Evapotranspiration Index (SPEI), were used. The results showed that integrating all climate, satellite data, and meteorological indices could achieve the highest accuracy. The highest estimated correlation coefficient (R) values for the Cubist, RF, SVM, and Xgboost methods were 0.828, 0.806, 0.742, and 0.758, respectively. The climate, satellite data, or meteorological indices inputs from all growth stages were essential for maize yield prediction, especially in late growth stages. R improved by about 0.126, 0.117, and 0.143 by adding climate data from the early, peak, and late periods, respectively, to satellite data and meteorological indices from all stages via the four machine learning algorithms. R increased by 0.016, 0.016, and 0.017 when adding satellite data from the early, peak, and late stages to climate data and meteorological indices from all stages, respectively. R increased by 0.003, 0.032, and 0.042 when adding meteorological indices from the early, peak, and late stages to climate and satellite data from all stages, respectively. The analysis found that the spatial divergences were large and the R value in the Northwest region reached 0.942, 0.904, 0.934, and 0.850 for Cubist, RF, SVM, and Xgboost, respectively. This study highlights the advantages of using climate, satellite data, and meteorological indices for large-scale maize yield estimation with machine learning algorithms.
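The correlation coefficient R used throughout this abstract compares predicted and observed yields. A minimal sketch of the standard Pearson formula follows; the function name `pearson_r` and the toy yield values are assumptions for illustration and are unrelated to the study's data.

```python
from math import sqrt


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    covariance = sum((o - mean_o) * (p - mean_p)
                     for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / sqrt(var_o * var_p)


# Toy values only (hypothetical yields in t/ha).
observed_yield = [5.1, 6.3, 4.8, 7.0]
predicted_yield = [5.0, 6.0, 5.2, 6.8]
r = pearson_r(observed_yield, predicted_yield)
```

Note that R measures linear association only: a model whose predictions are consistently biased can still score a high R, which is why it is often reported alongside an error metric.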
13
Zhang S, Duan Z, Yang W, Qian C, You Y. iDHS-DASTS: identifying DNase I hypersensitive sites based on LASSO and stacking learning. Mol Omics 2021; 17:130-141. [PMID: 33295914 DOI: 10.1039/d0mo00115e] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 11/21/2022]
Abstract
The DNase I hypersensitivity site (DHS) is an important marker of the DNA regulatory region, and its identification in the DNA sequence is of great significance for biomedical research. However, traditional identification methods are extremely time-consuming and cannot obtain an accurate result. In this paper, we propose a predictor called iDHS-DASTS to identify DHSs based on benchmark datasets. First, we adopt a feature extraction method called PseDNC, which can incorporate the original DNA properties and spatial information of the DNA sequence. Then we use a method called LASSO to reduce the dimensions of the original data. Finally, we utilize stacking learning as a classifier, which includes Adaboost, random forest, gradient boosting, extra trees and SVM. Before we train the classifier, we use SMOTE-Tomek to overcome the imbalance of the datasets. In the experiment, our iDHS-DASTS achieves remarkable performance on three benchmark datasets. We achieve state-of-the-art results with over 92.06%, 91.06% and 90.72% accuracy for datasets 𝕊1, 𝕊2 and 𝕊3, respectively. To verify the validity and transferability of our model, we establish another independent dataset 𝕊4, for which the accuracy can reach 90.31%. Furthermore, we used the proposed model to construct a user-friendly web server called iDHS-DASTS, which is available at http://www.xdu-duan.cn/.
Affiliation(s)
- Shengli Zhang
  - School of Mathematics and Statistics, Xidian University, Xi'an 710071, P. R. China
- Zhengpeng Duan
  - School of Electronic Engineering, Xidian University, Xi'an 710071, P. R. China
- Wenhao Yang
  - School of Electronic Engineering, Xidian University, Xi'an 710071, P. R. China
- Chenlai Qian
  - School of Electronic Engineering, Xidian University, Xi'an 710071, P. R. China
- Yiwei You
  - International Business School, Shanghai University of International Business and Economics, Shanghai, 201620, P. R. China
14
Zhang S, Qiao H. KD-KLNMF: Identification of lncRNAs subcellular localization with multiple features and nonnegative matrix factorization. Anal Biochem 2020; 610:113995. [PMID: 33080214 DOI: 10.1016/j.ab.2020.113995] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 06/28/2020] [Revised: 09/07/2020] [Accepted: 10/12/2020] [Indexed: 12/18/2022]
Abstract
Long non-coding RNAs (lncRNAs) are functional RNA molecules longer than 200 nucleotides that have minimal or no capacity to encode proteins. In recent years, more studies have shown that the subcellular localization of lncRNAs offers valuable clues to their biological functions. It is therefore important to identify lncRNA subcellular localization. In this paper, a novel statistical model named KD-KLNMF is constructed to predict lncRNA subcellular localization. Firstly, k-mer and dinucleotide-based spatial autocorrelation are incorporated as the feature vector. Then, the Synthetic Minority Over-sampling Technique is used to deal with the imbalanced dataset. Next, Kullback-Leibler divergence-based nonnegative matrix factorization is applied to select optimal features. We then utilize a support vector machine as the classifier after comparing it with other classifiers. Finally, the jackknife test is performed to evaluate the model. The overall accuracies reach 97.24% and 92.86% on the training dataset and independent dataset, respectively. The results are better than those of previous methods, which indicates that our model will be a useful and feasible tool to identify lncRNA subcellular localization. The datasets and source code are freely available at https://github.com/HuijuanQiao/KD-KLNMF.
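The oversampling step mentioned above can be illustrated with a simplified SMOTE-style routine: each synthetic sample is a random point on the segment between a minority sample and one of its nearest minority neighbours. This is a sketch of the general idea only. The function name, the default `k`, and the fixed seed are assumptions made here, and the code is not the implementation in the linked repository.

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by linear interpolation.

    minority: list of equal-length feature vectors (lists of floats).
    k: number of nearest minority neighbours to choose from (assumed default).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


minority_class = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
new_samples = smote_like_oversample(minority_class, n_new=5)
```

Because every synthetic point is a convex combination of two real minority samples, it always stays inside the bounding box of the minority class. Oversampling should be applied to training folds only, never to held-out test samples.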
Affiliation(s)
- Shengli Zhang
  - School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
- Huijuan Qiao
  - School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
15
Alphonse AS, Mary NAB, Starvin MS. Classification of membrane protein using Tetra Peptide Pattern. Anal Biochem 2020; 606:113845. [PMID: 32739352 DOI: 10.1016/j.ab.2020.113845] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 04/16/2020] [Revised: 06/17/2020] [Accepted: 06/22/2020] [Indexed: 11/29/2022]
Abstract
Membrane proteins play an important role in the life activities of organisms. The mechanism of cell structures and biological activities can be identified only by knowing the functional types of membrane proteins, which accelerates the process. Therefore, it is necessary to build computational approaches for timely and accurate prediction of the functional types of membrane proteins. The proposed method analyzes the structure of the membrane proteins using a novel Tetra Peptide Pattern (TPP)-based feature extraction technique. A frequency occurrence matrix is created, from which a feature vector is formed. This feature vector captures the pattern among amino acids in a membrane protein sequence. The dimension of the feature vector is reduced using General Kernel-based Supervised Principal Component Analysis (GKSPCA). Stacked Restricted Boltzmann Machines (RBM) in a Deep Belief Network (DBN) are used for classification. The RBM is the building block of the Deep Belief Network. The proposed method achieves good results on two datasets. The performance of the proposed method was analyzed using accuracy, specificity, sensitivity and Matthews correlation coefficient. The proposed method achieves good results when compared to other state-of-the-art techniques.
Affiliation(s)
- M S Starvin
  - University College of Engineering, Nagercoil, 629004, India
16
Zhang X, Chen L. Prediction of membrane protein types by fusing protein-protein interaction and protein sequence information. Biochimica et Biophysica Acta - Proteins and Proteomics 2020; 1868:140524. [PMID: 32858174 DOI: 10.1016/j.bbapap.2020.140524] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 05/26/2020] [Revised: 07/17/2020] [Accepted: 07/30/2020] [Indexed: 11/30/2022]
Abstract
Membrane proteins are gatekeepers to the cell and essential for determination of the function of cells. Identification of the types of membrane proteins is an essential problem in cell biology. It is time-consuming and expensive to identify the type of membrane proteins with traditional experimental methods. The alternative way is to design effective computational methods, which can provide quick and reliable predictions. To date, several computational methods have been proposed in this regard. Several of them used the features extracted from the sequence information of individual proteins. Recently, networks are more and more popular to tackle different protein-related problems, which can organize proteins in a system level and give an overview of all proteins. However, such form weakens the essential properties of proteins, such as their sequence information. In this study, a novel feature fusion scheme was proposed, which integrated the information of protein sequences and protein-protein interaction network. The fused features of a protein were defined as the linear combination of sequence features of all proteins in the network, where the combination coefficients were the probabilities yielded by the random walk with restart algorithm with the protein as the seed node. Several models with such fused features and different classification algorithms were built and evaluated. Their performance for predicting the type of membrane proteins was improved compared with the models only with the sequence features or network information.
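The fusion scheme described above can be sketched as follows: run a random walk with restart (RWR) from the protein of interest, then use the resulting probabilities as weights in a linear combination of every protein's sequence features. This is a minimal illustration under assumptions. The restart probability of 0.8, the column normalisation, the convergence tolerance, and all function names are choices made here, since the abstract does not specify them.

```python
def random_walk_with_restart(adjacency, seed, restart=0.8,
                             tol=1e-10, max_iter=1000):
    """Return steady-state visiting probabilities for a walk from `seed`.

    adjacency: symmetric matrix (list of lists) of non-negative edge weights.
    restart: probability of jumping back to the seed node at each step
             (0.8 is an assumed value, not taken from the paper).
    """
    n = len(adjacency)
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    start = [1.0 if i == seed else 0.0 for i in range(n)]
    probs = start[:]
    for _ in range(max_iter):
        updated = []
        for i in range(n):
            # One step along column-normalised edges into node i.
            walk = sum(adjacency[i][j] / col_sums[j] * probs[j]
                       for j in range(n) if col_sums[j])
            updated.append((1 - restart) * walk + restart * start[i])
        if sum(abs(a - b) for a, b in zip(updated, probs)) < tol:
            return updated
        probs = updated
    return probs


def fuse_features(adjacency, sequence_features, seed, restart=0.8):
    """Weight every protein's sequence features by its RWR probability."""
    probs = random_walk_with_restart(adjacency, seed, restart)
    dim = len(sequence_features[0])
    return [sum(probs[i] * sequence_features[i][d] for i in range(len(probs)))
            for d in range(dim)]


# Toy 3-protein path network: 0 - 1 - 2.
network = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
probabilities = random_walk_with_restart(network, seed=0)
fused = fuse_features(network, [[1, 0, 0], [0, 1, 0], [0, 0, 1]], seed=0)
```

With a high restart probability the fused vector stays close to the seed protein's own sequence features, while a lower value lets network neighbours contribute more.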
Affiliation(s)
- Xiaolin Zhang
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, People's Republic of China
- Lei Chen
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, People's Republic of China.
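The fusion scheme described in the abstract of Zhang and Chen can be sketched as follows. This is only an illustration under stated assumptions: the restart probability of 0.8, the column normalisation of the adjacency matrix, the convergence tolerance, and the function names are all hypothetical choices, since the abstract does not specify them.

```python
def random_walk_with_restart(adj, seed, restart=0.8, tol=1e-10, max_iter=1000):
    """Return the steady-state visiting probabilities of a walk restarted at `seed`.

    adj is a square adjacency matrix (list of lists); restart=0.8 is an assumed value.
    """
    n = len(adj)
    # Column-normalise so that each column with at least one edge sums to 1.
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    w = [[adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
         for i in range(n)]
    e = [1.0 if i == seed else 0.0 for i in range(n)]
    p = e[:]
    for _ in range(max_iter):
        nxt = [(1 - restart) * sum(w[i][j] * p[j] for j in range(n)) + restart * e[i]
               for i in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


def fuse_features(adj, features, seed, restart=0.8):
    """Linear combination of all proteins' sequence features, weighted by RWR probabilities."""
    probs = random_walk_with_restart(adj, seed, restart)
    dim = len(features[0])
    return [sum(probs[j] * features[j][k] for j in range(len(features)))
            for k in range(dim)]


# Toy network of three proteins in a chain, each with a 2-dimensional feature vector.
toy_adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
toy_features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
fused = fuse_features(toy_adj, toy_features, seed=0)
```

With a restart probability of 1.0 the walk never leaves the seed, so the fused vector reduces to the seed protein's own sequence features; lower values mix in more information from network neighbours.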
|
17
|
Identification of membrane protein types via multivariate information fusion with Hilbert–Schmidt Independence Criterion. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2019.11.103] [Citation(s) in RCA: 88] [Impact Index Per Article: 22.0] [Indexed: 11/24/2022]
|
18
|
Xu ZC, Xiao X, Qiu WR, Wang P, Fang XZ. iAI-DSAE: A Computational Method for Adenosine to Inosine Editing Site Prediction. Lett Org Chem 2019. [DOI: 10.2174/1570178615666181016112546] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
As an important post-transcriptional modification, adenosine-to-inosine RNA editing occurs in both coding and noncoding RNA transcripts, converting adenosines to inosines. This modification can therefore diversify the transcriptome. Accurately identifying adenosine-to-inosine editing sites is important for understanding their biological functions. Currently, these sites are determined by experimental methods, which can be costly and time-consuming, and only a few computational prediction models exist in this field. This study therefore develops a further computational method to address these problems. Given an uncharacterized RNA sequence that contains many adenosine residues, can we identify which of them can be converted to inosine and which cannot? To deal with this problem, a novel predictor called iAI-DSAE is proposed in the current study. There are two key issues to address: one is ‘what feature extraction methods should be adopted to formulate the given sample sequence?’ The other is ‘what classification algorithms should be used to construct the classification model?’ For the former, a 540-dimensional feature vector is extracted to formulate the sample sequence by dinucleotide-based auto-cross covariance, pseudo dinucleotide composition, and nucleotide density methods. For the latter, we use a currently popular method, the deep sparse autoencoder, to construct the classification model. ACC and MCC are generally considered two of the most important performance indicators of a predictor. In this study, compared with those of the predictor PAI, they are higher by 2.46% and 4.14%, respectively. The two other indicators, Sn and Sp, also rise to some degree. This indicates that our predictor can serve as an important complementary tool for identifying adenosine-to-inosine RNA editing sites.
For the convenience of most experimental scientists, an easy-to-use web-server for identifying adenosine-to-inosine editing sites has been established at: http://www.jci-bioinfo.cn/iAI-DSAE, by which users can easily obtain their desired results without the need to go through the complicated mathematical equations involved. It is important to identify adenosine-to-inosine editing sites in RNA sequences for the intensive study of RNA function and the development of new medicines. In the current study, a novel predictor, called iAI-DSAE, was proposed using three feature extraction methods: dinucleotide-based auto-cross covariance, pseudo dinucleotide composition and nucleotide density. The jackknife test results of the iAI-DSAE predictor, based on a deep sparse auto-encoder model, show that our predictor is more stable and reliable. It has not escaped our notice that the methods proposed in the current paper can be used to solve many other problems in genome analysis.
Affiliation(s)
- Zhao-Chun Xu
  - Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China
- Xuan Xiao
  - Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China
- Wang-Ren Qiu
  - Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China
- Peng Wang
  - Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China
- Xin-Zhu Fang
  - Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China
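The iAI-DSAE abstract reports four indicators: ACC, MCC, Sn and Sp. For reference, here is a minimal sketch of how they are commonly computed from a binary confusion matrix. The function name and the example counts are illustrative assumptions and are not taken from the paper.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy and Matthews correlation coefficient."""
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc}


# Hypothetical counts, for illustration only.
example = binary_metrics(tp=90, tn=85, fp=15, fn=10)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside ACC when the classes are unevenly sized.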
|
19
|
Jayapriya K, Mary NAB. Employing a novel 2-gram subgroup intra pattern (2GSIP) with stacked auto encoder for membrane protein classification. Mol Biol Rep 2019; 46:2259-2272. [PMID: 30778923 DOI: 10.1007/s11033-019-04680-3] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 10/29/2018] [Accepted: 02/07/2019] [Indexed: 12/01/2022]
Abstract
Cell membrane proteins play an essential role in controlling the behaviour of cells. Examination of amino acid sequences can offer useful insights into the tertiary structures of proteins and their biological functions. One of the important problems in amino acid analysis is the difficulty of establishing a digital coding system that better reflects the properties of amino acids and their degeneracy. To overcome this shortcoming, the proposed method uses a novel representation of protein sequences that incorporates a new feature named 2-gram subgroup intra pattern. Classifying membrane proteins into functional types helps to explain their biological functions. For classification, a stacked auto encoder deep learning method is applied. The performance of the proposed method is evaluated on two benchmark data sets. In evaluations using the self-consistency test, the jackknife test and an independent data set, measured by accuracy, specificity, sensitivity and Matthews correlation coefficient, the proposed method outperformed other existing techniques generally used in the literature.
Affiliation(s)
- K Jayapriya
  - Vin Solutions, Tirunelveli, Tamilnadu, India.
|
20
|
Prediction of membrane protein types by exploring local discriminative information from evolutionary profiles. Anal Biochem 2019; 564-565:123-132. [DOI: 10.1016/j.ab.2018.10.027] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 08/25/2018] [Revised: 10/23/2018] [Accepted: 10/25/2018] [Indexed: 11/17/2022]
|
21
|
Sankari ES, Manimegalai D. Predicting membrane protein types by incorporating a novel feature set into Chou's general PseAAC. J Theor Biol 2018; 455:319-328. [DOI: 10.1016/j.jtbi.2018.07.032] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Received: 03/22/2018] [Revised: 06/27/2018] [Accepted: 07/23/2018] [Indexed: 10/28/2022]
|
22
|
Huang G, Li J, Zhao C. Computational Prediction and Analysis of Associations between Small Molecules and Binding-Associated S-Nitrosylation Sites. Molecules 2018; 23:954. [PMID: 29671802 PMCID: PMC6017196 DOI: 10.3390/molecules23040954] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 02/24/2018] [Revised: 03/30/2018] [Accepted: 04/09/2018] [Indexed: 01/12/2023] [Open Access]
Abstract
Interactions between drugs and proteins occupy a central position during the process of drug discovery and development. Numerous methods have recently been developed for identifying drug–target interactions, but few have been devoted to finding interactions between post-translationally modified proteins and drugs. We presented a machine learning-based method for identifying associations between small molecules and binding-associated S-nitrosylated (SNO-) proteins. Namely, small molecules were encoded by molecular fingerprint, SNO-proteins were encoded by the information entropy-based method, and the random forest was used to train a classifier. Ten-fold and leave-one-out cross validations achieved, respectively, 0.7235 and 0.7490 of the area under a receiver operating characteristic curve. Computational analysis of similarity suggested that SNO-proteins associated with the same drug shared statistically significant similarity, and vice versa. This method and finding are useful to identify drug–SNO associations and further facilitate the discovery and development of SNO-associated drugs.
Affiliation(s)
- Guohua Huang
  - Provincial Key Laboratory of Informational Service for Rural Area of Southwestern Hunan, Shaoyang University, Shaoyang 422000, China.
  - College of Information Engineering, Shaoyang University, Shaoyang 422000, China.
- Jincheng Li
  - Provincial Key Laboratory of Informational Service for Rural Area of Southwestern Hunan, Shaoyang University, Shaoyang 422000, China.
  - College of Information Engineering, Shaoyang University, Shaoyang 422000, China.
- Chenglin Zhao
  - Provincial Key Laboratory of Informational Service for Rural Area of Southwestern Hunan, Shaoyang University, Shaoyang 422000, China.
  - College of Information Engineering, Shaoyang University, Shaoyang 422000, China.
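The abstract of Huang et al. evaluates its classifier by the area under the ROC curve under ten-fold and leave-one-out cross-validation. A minimal sketch of both pieces follows. The helper names are hypothetical, the fold assignment is a simple interleaved split, and the scores and labels are made-up values; the paper's own data handling may differ.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k):
    """Yield (train, test) index lists; k == n gives leave-one-out."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test


# Made-up classifier scores and true labels, for illustration only.
auc = roc_auc([0.9, 0.1, 0.8, 0.3], [1, 1, 0, 0])
splits = list(k_fold_indices(10, 3))
```

In a full evaluation, a model would be trained on each `train` split, the held-out scores pooled, and `roc_auc` applied to the pooled scores.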
|
23
|
iMem-2LSAAC: A two-level model for discrimination of membrane proteins and their types by extending the notion of SAAC into Chou's pseudo amino acid composition. J Theor Biol 2018; 442:11-21. [DOI: 10.1016/j.jtbi.2018.01.008] [Citation(s) in RCA: 83] [Impact Index Per Article: 13.8] [Received: 10/02/2017] [Revised: 12/23/2017] [Accepted: 01/10/2018] [Indexed: 02/08/2023]
|
24
|
Sankari ES, Manimegalai D. Predicting membrane protein types using various decision tree classifiers based on various modes of general PseAAC for imbalanced datasets. J Theor Biol 2017; 435:208-217. [PMID: 28941868 DOI: 10.1016/j.jtbi.2017.09.018] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Received: 04/13/2017] [Revised: 09/15/2017] [Accepted: 09/18/2017] [Indexed: 12/19/2022]
Abstract
Predicting membrane protein types is an important and challenging research area in bioinformatics and proteomics. Traditional biophysical methods are used to classify membrane protein types. Because of the large number of uncharacterized protein sequences in databases, traditional methods are very time-consuming, expensive and susceptible to errors. Hence, it is highly desirable to develop a robust, reliable, and efficient method to predict membrane protein types. Imbalanced datasets and large datasets are often handled well by decision tree classifiers. Since imbalanced datasets are used, the performance of various decision tree classifiers such as Decision Tree (DT), Classification And Regression Tree (CART), C4.5, Random tree and REP (Reduced Error Pruning) tree, and of ensemble methods such as Adaboost, RUS (Random Under Sampling) boost, Rotation forest and Random forest, is analysed. Among these classifiers, Random forest performs well in less time, with a good accuracy of 96.35%. Another finding is that the RUS boost decision tree classifier is able to classify one or two samples in the class with very few samples, while other classifiers such as DT, Adaboost, Rotation forest and Random forest are not sensitive to the classes with fewer samples. The performance of the decision tree classifiers is also compared with that of SVM (Support Vector Machine) and Naive Bayes classifiers.
Affiliation(s)
- E Siva Sankari
  - Department of CSE, Government College of Engineering, Tirunelveli, Tamil Nadu, India.
- D Manimegalai
  - Department of IT, National Engineering College, Kovilpatti, Tamil Nadu, India.
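The random under-sampling step that gives RUS boost its name can be illustrated in a few lines. This sketch covers only the sampling step, not the boosting loop, and it assumes every class is reduced to the size of the smallest class. The function name, the seed and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so every class has as many members as the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    target = min(len(members) for members in by_class.values())
    out_x, out_y = [], []
    for y, members in by_class.items():
        for x in rng.sample(members, target):
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Toy imbalanced set: seven samples of class "A", three of class "B".
toy_x = list(range(10))
toy_y = ["A"] * 7 + ["B"] * 3
balanced_x, balanced_y = random_undersample(toy_x, toy_y)
```

Balancing the training set this way keeps every minority-class sample but discards majority-class samples, which is the trade-off that makes the minority classes visible to the classifier.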
|
25
|
Xu ZC, Wang P, Qiu WR, Xiao X. iSS-PC: Identifying Splicing Sites via Physical-Chemical Properties Using Deep Sparse Auto-Encoder. Sci Rep 2017; 7:8222. [PMID: 28811565 PMCID: PMC5557945 DOI: 10.1038/s41598-017-08523-8] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Received: 05/15/2017] [Accepted: 07/10/2017] [Indexed: 12/13/2022] [Open Access]
Abstract
Gene splicing, such as RNA splicing, is one of the most significant biological processes in eukaryotic gene expression; it can cause a pre-mRNA to produce one or more mature messenger RNAs containing the coded information with multiple biological functions. Thus, identifying splicing sites in DNA/RNA sequences is significant for both biomedical research and the discovery of new drugs. However, relying only on experimental techniques is expensive and time-consuming, so new computational methods are needed. To identify splice donor sites and splice acceptor sites accurately and quickly, a deep sparse auto-encoder model with two hidden layers, called iSS-PC, was constructed based on the minimum error law, in which we incorporated twelve physical-chemical properties of the dinucleotides within DNA into PseDNC to formulate given sequence samples via a battery of cross-covariance and auto-covariance transformations. Five-fold cross-validation results on the same benchmark data-sets indicated that the new predictor remarkably outperformed the existing prediction methods in this field. Furthermore, it is expected that many other related problems can also be studied with this approach. To implement classification accurately and quickly, an easy-to-use web-server for identifying splicing sites has been established for free access at: http://www.jci-bioinfo.cn/iSS-PC.
Affiliation(s)
- Zhao-Chun Xu
  - Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen, 333403, China.
- Peng Wang
  - Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen, 333403, China
- Wang-Ren Qiu
  - Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen, 333403, China.
  - Department of Computer Science and Bond Life Science Center, University of Missouri, Columbia, MO, USA.
- Xuan Xiao
  - Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen, 333403, China.
  - Gordon Life Science Institute, Boston, Massachusetts, 02478, United States of America.
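The auto-covariance and cross-covariance transformations mentioned in the iSS-PC abstract can be sketched generically as below. This assumes the common definition over a series of per-dinucleotide property values, without any prior standardisation of the values. The property table here is a made-up placeholder, not one of the twelve physical-chemical properties used in the paper, and the function names are hypothetical.

```python
def property_series(seq, table):
    """Map a DNA sequence to the property values of its overlapping dinucleotides."""
    seq = seq.upper()
    return [table[seq[i:i + 2]] for i in range(len(seq) - 1)]


def cross_covariance(x, y, lag):
    """Average product of deviations between x at position i and y at position i + lag."""
    if len(x) != len(y) or not 0 <= lag < len(x):
        raise ValueError("series must have equal length and lag must be in range")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    n = len(x) - lag
    return sum((x[i] - mean_x) * (y[i + lag] - mean_y) for i in range(n)) / n


def auto_covariance(x, lag):
    """Cross-covariance of a property series with itself."""
    return cross_covariance(x, x, lag)


# Placeholder property values for three dinucleotides (hypothetical numbers).
toy_table = {"AC": 1.0, "CG": 2.0, "GT": 3.0}
series = property_series("ACGT", toy_table)
ac_lag1 = auto_covariance(series, 1)
```

Collecting these values over several lags and over every property (and every ordered pair of properties for the cross terms) yields a fixed-length vector regardless of sequence length, which is what makes the transformation convenient as classifier input.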
|
26
|
Butt AH, Rasool N, Khan YD. A Treatise to Computational Approaches Towards Prediction of Membrane Protein and Its Subtypes. J Membr Biol 2016; 250:55-76. [DOI: 10.1007/s00232-016-9937-7] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.3] [Received: 04/24/2016] [Accepted: 11/02/2016] [Indexed: 10/20/2022]
|
27
|
Xiao X, Hui M, Liu Z. iAFP-Ense: An Ensemble Classifier for Identifying Antifreeze Protein by Incorporating Grey Model and PSSM into PseAAC. J Membr Biol 2016; 249:845-854. [DOI: 10.1007/s00232-016-9935-9] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 07/26/2016] [Accepted: 10/24/2016] [Indexed: 12/12/2022]
|
28
|
A Prediction Model for Membrane Proteins Using Moments Based Features. BioMed Research International 2016; 2016:8370132. [PMID: 26966690 PMCID: PMC4761391 DOI: 10.1155/2016/8370132] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.9] [Received: 11/27/2015] [Accepted: 01/12/2016] [Indexed: 01/29/2023]
Abstract
The cell is the basic unit of the human body. Encapsulated within the cell are many tiny entities and molecules, which are protected by a cell membrane. The proteins associated with this lipid-based bilayer membrane are known as membrane proteins and play a significant role: they affect cellular activities both inside and outside the cell. According to scientists in pharmaceutical organizations, membrane proteins perform a key task in drug interactions. In this study, a technique is presented that uses computationally intelligent methods to predict membrane proteins without the experimental use of mass spectrometry. Statistical moments were used to extract features, and a multilayer neural network was then trained using backpropagation for the prediction of membrane proteins. Results show that the proposed technique performs better than existing methodologies.
|
29
|
Wu CY, Li QZ, Feng ZX. Non-coding RNA identification based on topology secondary structure and reading frame in organelle genome level. Genomics 2015; 107:9-15. [PMID: 26697761 DOI: 10.1016/j.ygeno.2015.12.002] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 10/16/2015] [Revised: 12/08/2015] [Accepted: 12/12/2015] [Indexed: 10/22/2022]
Abstract
Non-coding RNA (ncRNA) genes produce transcripts just as protein-coding genes do, but ncRNAs function directly as RNAs rather than serving as blueprints for proteins. As the function of ncRNA is closely related to organelle genomes, it is desirable to explore ncRNA function by confirming its provenance. In this paper, the topology secondary structure, motifs and the triplets under three reading frames are considered as parameters of ncRNAs. An SVM combined with the increment of diversity (ID) algorithm is used to construct the classifier. When the method is applied to an ncRNA dataset with less than 80% sequence identity, the overall accuracies reach 95.57% and 96.40% in the five-fold cross-validation and the jackknife test, respectively. Further, on the independent testing dataset, the average prediction success rate of our method was 93.24%. These high success rates indicate that our method is very helpful for distinguishing ncRNAs from various organelle genomes.
Affiliation(s)
- Cheng-Yan Wu
  - Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
- Qian-Zhong Li
  - Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China.
- Zhen-Xing Feng
  - Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
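One of the parameter sets named in the abstract of Wu et al., the triplets under three reading frames, can be sketched as a feature extractor. This is an assumed reading of the abstract: triplet frequencies are computed separately for frames 0, 1 and 2, giving 3 x 64 = 192 values. The function name, the frequency normalisation and the example sequence are hypothetical.

```python
from itertools import product


def reading_frame_triplets(seq, alphabet="ACGU"):
    """Return 192 triplet frequencies: 64 per reading frame, frames 0, 1 and 2 in order."""
    seq = seq.upper().replace("T", "U")
    triplets = ["".join(p) for p in product(alphabet, repeat=3)]
    features = []
    for frame in range(3):
        counts = dict.fromkeys(triplets, 0)
        # Step through the sequence in non-overlapping triplets starting at this frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in counts:  # skip triplets containing ambiguous symbols
                counts[codon] += 1
        total = sum(counts.values())
        features.extend(counts[t] / total if total else 0.0 for t in triplets)
    return features


# Short made-up RNA fragment, for illustration only.
frame_features = reading_frame_triplets("AUGGCC")
```

For the fragment above, frame 0 contains AUG and GCC, frame 1 contains UGG, and frame 2 contains GGC, so each frame's 64 frequencies sum to one.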
|
30
|
Tripathi V, Tripathi P, Gupta D. Statistical approach for lysosomal membrane proteins (LMPs) identification. Systems and Synthetic Biology 2014; 8:313-9. [PMID: 26396655 PMCID: PMC4571724 DOI: 10.1007/s11693-014-9153-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/11/2014] [Revised: 06/11/2014] [Accepted: 07/26/2014] [Indexed: 10/25/2022]
Abstract
Discriminating lysosomal membrane proteins (LMPs) from folding types of globular proteins (GPs) and other membrane proteins (OtMPs) is an important task, both for identifying LMPs from genomic sequences and for the successful prediction of their secondary and tertiary structures. We systematically analyzed the amino acid frequencies as well as the dipeptide counts of GPs, LMPs and OtMPs. Using the single amino acid frequencies combined with the dipeptide count information, we statistically discriminated LMPs from GPs and OtMPs. This approach correctly classified the LMPs with an accuracy of 95%. By contrast, amino acid frequency alone discriminates LMPs with an accuracy of only 79%, and dipeptide count alone with an accuracy of 87%. Thus, combining amino acid frequencies and dipeptide composition gives significantly more accurate results.
Affiliation(s)
- Vijay Tripathi
  - Center of Bioinformatics, University of Allahabad, Allahabad, India
  - Genome Diversity Center, The Institute of Evolution, University of Haifa, Haifa, Israel
- Pooja Tripathi
  - Center of Bioinformatics, University of Allahabad, Allahabad, India
- Dwijendra Gupta
  - Center of Bioinformatics, University of Allahabad, Allahabad, India
  - Department of Biochemistry, University of Allahabad, Allahabad, India
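The two descriptors combined in the abstract of Tripathi et al., single amino acid frequencies and dipeptide counts, can be computed as in the sketch below. It assumes the 20 standard residues and overlapping dipeptides, giving a 20 + 400 = 420-dimensional vector. The function name, the residue ordering and the toy sequence are hypothetical, and the statistical discrimination step itself is not shown.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(seq):
    """Return 20 amino acid frequencies followed by 400 overlapping dipeptide counts."""
    seq = seq.upper()
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    freq = [0.0] * 20
    dipeptides = [0] * 400
    valid = 0
    for i, aa in enumerate(seq):
        if aa not in index:
            continue  # ignore non-standard residue codes such as X
        freq[index[aa]] += 1
        valid += 1
        # Count the dipeptide only when the next residue is also a standard one.
        if i + 1 < len(seq) and seq[i + 1] in index:
            dipeptides[index[aa] * 20 + index[seq[i + 1]]] += 1
    if valid:
        freq = [count / valid for count in freq]
    return freq + dipeptides


# Toy peptide, for illustration only.
toy_vector = composition_features("ACAC")
```

Dipeptide counts capture local residue order that single-residue frequencies discard, which is consistent with the abstract's finding that the combined descriptor discriminates better than either one alone.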
|
31
|
Chou's pseudo amino acid composition improves sequence-based antifreeze protein prediction. J Theor Biol 2014; 356:30-5. [DOI: 10.1016/j.jtbi.2014.04.006] [Citation(s) in RCA: 116] [Impact Index Per Article: 11.6] [Received: 02/25/2014] [Revised: 03/28/2014] [Accepted: 04/02/2014] [Indexed: 11/22/2022]
|
32
|
A Multi-label Classifier for Prediction Membrane Protein Functional Types in Animal. J Membr Biol 2014; 247:1141-8. [DOI: 10.1007/s00232-014-9708-2] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 04/14/2014] [Accepted: 07/14/2014] [Indexed: 11/26/2022]
|
33
|
Piao H, Froula J, Du C, Kim TW, Hawley ER, Bauer S, Wang Z, Ivanova N, Clark DS, Klenk HP, Hess M. Identification of novel biomass-degrading enzymes from genomic dark matter: Populating genomic sequence space with functional annotation. Biotechnol Bioeng 2014; 111:1550-65. [PMID: 24728961 DOI: 10.1002/bit.25250] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 11/23/2013] [Revised: 02/21/2014] [Accepted: 03/24/2014] [Indexed: 11/06/2022]
Abstract
Although recent nucleotide sequencing technologies have significantly enhanced our understanding of microbial genomes, the function of ∼35% of genes identified in a genome currently remains unknown. To improve the understanding of microbial genomes and consequently of microbial processes it will be crucial to assign a function to this "genomic dark matter." Due to the urgent need for additional carbohydrate-active enzymes for improved production of transportation fuels from lignocellulosic biomass, we screened the genomes of more than 5,500 microorganisms for hypothetical proteins that are located in the proximity of already known cellulases. We identified, synthesized and expressed a total of 17 putative cellulase genes with insufficient sequence similarity to currently known cellulases to be identified as such using traditional sequence annotation techniques that rely on significant sequence similarity. The recombinant proteins of the newly identified putative cellulases were subjected to enzymatic activity assays to verify their hydrolytic activity towards cellulose and lignocellulosic biomass. Eleven (65%) of the tested enzymes had significant activity towards at least one of the substrates. This high success rate highlights that a gene context-based approach can be used to assign function to genes that are otherwise categorized as "genomic dark matter" and to identify biomass-degrading enzymes that have little sequence similarity to already known cellulases. The ability to assign function to genes that have no related sequence representatives with functional annotation will be important to enhance our understanding of microbial processes and to identify microbial proteins for a wide range of applications.
Affiliation(s)
- Hailan Piao
  - School of Molecular Biosciences, Washington State University, Richland, Washington, 99352; Pacific Northwest National Laboratory, Richland, Washington
|
34
|
Prediction of multi-type membrane proteins in human by an integrated approach. PLoS One 2014; 9:e93553. [PMID: 24676214 PMCID: PMC3968155 DOI: 10.1371/journal.pone.0093553] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Received: 10/23/2013] [Accepted: 03/05/2014] [Indexed: 11/29/2022] [Open Access]
Abstract
Membrane proteins are involved in various cellular processes and perform various important functions, which are mainly associated with their types. However, identifying membrane protein types with traditional biophysical methods is very time-consuming and expensive. Although some computational tools for predicting membrane protein types have been developed, most of them can recognize only one type per protein. They are therefore less effective, because one membrane protein can have several types at the same time. To our knowledge, few methods handling multiple types of membrane proteins have been reported. In this study, we proposed an integrated approach to predict multiple types of membrane proteins by employing sequence homology and a protein-protein interaction network. The prediction accuracies reached 87.65%, 81.39% and 70.79%, respectively, in the leave-one-out test on three datasets. The approach outperformed the nearest neighbor algorithm adopting pseudo amino acid composition. The method is anticipated to be an alternative tool for identifying membrane protein types. New metrics for evaluating the performance of methods dealing with multi-label problems were also presented. The program of the method is available upon request.
|
35
|
Han GS, Yu ZG, Anh V. A two-stage SVM method to predict membrane protein types by incorporating amino acid classifications and physicochemical properties into a general form of Chou's PseAAC. J Theor Biol 2013; 344:31-9. [PMID: 24316387 DOI: 10.1016/j.jtbi.2013.11.017] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.5] [Received: 08/16/2013] [Revised: 10/16/2013] [Accepted: 11/24/2013] [Indexed: 01/12/2023]
Abstract
Membrane proteins play important roles in many biochemical processes and are also attractive targets of drug discovery for various diseases. The elucidation of membrane protein types provides clues for understanding the structure and function of proteins. Recently we developed a novel system for predicting protein subnuclear localizations. In this paper, we propose a simplified version of our system for predicting membrane protein types directly from primary protein structures, which incorporates amino acid classifications and physicochemical properties into a general form of pseudo-amino acid composition. In this simplified system, we design a two-stage multi-class support vector machine combined with a two-step optimal feature selection process, which proves very effective in our experiments. The performance of the present method is evaluated on two benchmark datasets consisting of five types of membrane proteins. The overall accuracies of prediction for the five types are 93.25% and 96.61% via the jackknife test and independent dataset test, respectively. These results indicate that our method is effective and valuable for predicting membrane protein types. A web server for the proposed method is available at http://www.juemengt.com/jcc/memty_page.php.
Affiliation(s)
- Guo-Sheng Han
  - School of Mathematics and Computational Science, Xiangtan University, Hunan 411105, China
- Zu-Guo Yu
  - School of Mathematics and Computational Science, Xiangtan University, Hunan 411105, China; School of Mathematical Science, Queensland University of Technology, GPO Box 2434, Brisbane Q 4001, Australia.
- Vo Anh
  - School of Mathematical Science, Queensland University of Technology, GPO Box 2434, Brisbane Q 4001, Australia
|
36
|
Fan GL, Li QZ. Discriminating bioluminescent proteins by incorporating average chemical shift and evolutionary information into the general form of Chou's pseudo amino acid composition. J Theor Biol 2013; 334:45-51. [DOI: 10.1016/j.jtbi.2013.06.003] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.5] [Received: 04/24/2013] [Revised: 05/30/2013] [Accepted: 06/03/2013] [Indexed: 01/22/2023]
|
37
|
Xiaohui N, Nana L, Jingbo X, Dingyan C, Yuehua P, Yang X, Weiquan W, Dongming W, Zengzhen W. Using the concept of Chou's pseudo amino acid composition to predict protein solubility: An approach with entropies in information theory. J Theor Biol 2013; 332:211-7. [DOI: 10.1016/j.jtbi.2013.03.010] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.1] [Received: 08/18/2012] [Revised: 03/10/2013] [Accepted: 03/11/2013] [Indexed: 11/15/2022]
|
38
|
Tripathi V, Gupta DK. Discriminating lysosomal membrane protein types using dynamic neural network. J Biomol Struct Dyn 2013; 32:1575-82. [PMID: 23968467 DOI: 10.1080/07391102.2013.827133] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 10/26/2022]
Abstract
This work presents a dynamic artificial neural network methodology that classifies proteins from their sequences alone into lysosomal membrane protein classes and various other membrane protein classes. A neural network-based system for predicting lysosomal-associated membrane protein types is proposed. Different protein sequence representations are fused to extract the features of a protein sequence, comprising seven feature sets: amino acid (AA) composition, sequence length, hydrophobic group, electronic group, sum of hydrophobicity, R-group, and dipeptide composition. To reduce the dimensionality of the large feature vector, we applied principal component analysis. The probabilistic neural network, generalized regression neural network, and Elman regression neural network (RNN) are used as classifiers and compared with the layer recurrent network (LRN), a dynamic network. Dynamic networks have memory, i.e. their output depends not only on the input but also on previous outputs. Accordingly, the LRN classifier achieves the highest accuracy among the artificial neural networks tested. The overall accuracy of jackknife cross-validation is 93.2% for the data-set. These results suggest that the method can be effectively applied to discriminate lysosomal-associated membrane proteins from other membrane proteins (Type-I, outer membrane proteins, GPI-anchored) and globular proteins, and indicate that the fused protein sequence representation reflects the core features of membrane proteins better than the classical AA composition.
Affiliation(s)
- Vijay Tripathi
  - Genome Diversity Center, Institute of Evolution, University of Haifa, Haifa, Israel
|
39
|
Liu B, Wang X, Zou Q, Dong Q, Chen Q. Protein Remote Homology Detection by Combining Chou’s Pseudo Amino Acid Composition and Profile-Based Protein Representation. Mol Inform 2013; 32:775-82. [DOI: 10.1002/minf.201300084] [Citation(s) in RCA: 97] [Impact Index Per Article: 8.8] [Received: 05/04/2013] [Accepted: 06/11/2013] [Indexed: 11/12/2022]
|
40
|
Predicting acidic and alkaline enzymes by incorporating the average chemical shift and gene ontology informations into the general form of Chou's PseAAC. Process Biochem 2013. [DOI: 10.1016/j.procbio.2013.05.012] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.3] [Indexed: 11/20/2022]
|
41
|
Yu L, Luo J, Guo Y, Li Y, Pu X, Li M. In silico identification of Gram-negative bacterial secreted proteins from primary sequence. Comput Biol Med 2013; 43:1177-81. [PMID: 23930811 DOI: 10.1016/j.compbiomed.2013.06.001] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Received: 01/10/2013] [Revised: 05/30/2013] [Accepted: 06/04/2013] [Indexed: 11/26/2022]
Abstract
In this study, we focus on different types of Gram-negative bacterial secreted proteins and analyze the relationships and differences among them. Through an extensive literature search, 1612 secreted proteins were collected as a standard data set from three data sources: Swiss-Prot, TrEMBL and RefSeq. To explore the relationships among different types of secreted proteins, we model this data set as a sequence similarity network. Finally, a multi-classifier named SecretP is proposed to distinguish different types of secreted proteins; it yields a high total sensitivity of 90.12% on the test set. When the classifier is applied to another public independent dataset for further evaluation, a promising prediction result is obtained. Predictions can be run freely online at http://cic.scu.edu.cn/bioinformatics/secretPv2_1/index.htm.
Affiliation(s)
- Lezheng Yu
- College of Chemistry, Sichuan University, Chengdu 610064, PR China
|
42
|
Li T, Li QZ, Liu S, Fan GL, Zuo YC, Peng Y. PreDNA: accurate prediction of DNA-binding sites in proteins by integrating sequence and geometric structure information. Bioinformatics 2013; 29:678-85. [PMID: 23335013 DOI: 10.1093/bioinformatics/btt029] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Indexed: 02/03/2023]
Abstract
MOTIVATION: Protein-DNA interactions take part in various crucial processes that are essential for cellular function. The identification of DNA-binding sites in proteins is important for understanding the molecular mechanisms of protein-DNA interaction. Thus, we have developed an improved method to predict DNA-binding sites by integrating a structural alignment algorithm and support vector machine-based methods. RESULTS: Evaluated on a new non-redundant protein set with 224 chains, the method has 80.7% sensitivity and 82.9% specificity in the 5-fold cross-validation test. In addition, it predicts DNA-binding sites with 85.1% sensitivity and 85.3% specificity when tested on a dataset with 62 protein-DNA complexes. Compared with a recently published method, BindN+, our method predicts DNA-binding sites with a 7% better area under the receiver operating characteristic curve when tested on the same dataset. Many important problems in cell biology require that the dense non-linear interactions between functional modules be considered. Thus, our prediction method will be useful in detecting such complex interactions.
Affiliation(s)
- Tao Li
- Laboratory of Theoretical Biophysics, School of Physical Sciences and Technology, College of Computer Science and The National Research Center for Animal Transgenic Biotechnology, Inner Mongolia University, Hohhot, 010021, China
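The sensitivity and specificity figures quoted in the abstract of this entry are ratios over a binary confusion matrix. As a minimal sketch (the function names and the toy labels are illustrative assumptions, not data from the paper), they can be computed per residue like this, with 1 marking a binding site and 0 a non-binding site:

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity(tp, fn):
    """Fraction of true binding sites that were predicted as binding."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of true non-binding sites that were predicted as non-binding."""
    return tn / (tn + fp)


if __name__ == "__main__":
    # Toy labels for ten residues (hypothetical values).
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    print(sensitivity(tp, fn), specificity(tn, fp))
```

Reporting both values matters for binding-site prediction because non-binding residues usually far outnumber binding residues, so accuracy alone can look high for a predictor that rarely flags a site.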
|
43
|
Lei JB, Yin JB, Shen HB. GFO: A data driven approach for optimizing the Gaussian function based similarity metric in computational biology. Neurocomputing 2013. [DOI: 10.1016/j.neucom.2012.07.003] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Indexed: 11/30/2022]
|
44
|
Ding C, Yuan LF, Guo SH, Lin H, Chen W. Identification of mycobacterial membrane proteins and their types using over-represented tripeptide compositions. J Proteomics 2012; 77:321-8. [DOI: 10.1016/j.jprot.2012.09.006] [Citation(s) in RCA: 82] [Impact Index Per Article: 6.8] [Received: 05/11/2012] [Revised: 08/18/2012] [Accepted: 09/08/2012] [Indexed: 11/25/2022]
|
45
|
Chen YK, Li KB. Predicting membrane protein types by incorporating protein topology, domains, signal peptides, and physicochemical properties into the general form of Chou's pseudo amino acid composition. J Theor Biol 2012; 318:1-12. [PMID: 23137835 DOI: 10.1016/j.jtbi.2012.10.033] [Citation(s) in RCA: 98] [Impact Index Per Article: 8.2] [Received: 09/21/2012] [Revised: 10/25/2012] [Accepted: 10/26/2012] [Indexed: 01/04/2023]
Abstract
The type information of un-annotated membrane proteins provides an important hint about their biological functions. The experimental determination of membrane protein types, despite being more accurate and reliable, is not always feasible because of costly laboratory procedures, which creates a need for bioinformatics methods. This article describes a novel computational classifier for the prediction of membrane protein types from protein sequences. The classifier, comprising a collection of one-versus-one support vector machines, makes use of the following sequence attributes: (1) the cationic patch sizes, the orientation, and the topology of transmembrane segments; (2) the amino acid physicochemical properties; (3) the presence of signal peptides or anchors; and (4) the specific protein motifs. A new voting scheme was implemented to cope with the multi-class prediction. Both the training and the testing sequences were collected from SwissProt. Homologous proteins were removed such that no pair of sequences left in the datasets has a sequence identity higher than 40%. The performance of the classifier was evaluated by jackknife cross-validation and an independent testing experiment. Results show that the proposed classifier outperforms earlier predictors in prediction accuracy in seven of the eight membrane protein types. The overall accuracy was increased from 78.3% to 88.2%. Unlike earlier approaches, which largely depend on position-specific substitution matrices and amino acid compositions, most of the sequence attributes implemented in the proposed classifier are supported by evidence in the literature. The classifier has been deployed as a web server and can be accessed at http://bsaltools.ym.edu.tw/predmpt.
Affiliation(s)
- Yen-Kuang Chen
- Institute of Biomedical Informatics, National Yang-Ming University, No.155, Sec 2, Lih-Nong Street, Taipei, 112, Taiwan, ROC
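The abstract of this entry describes a collection of one-versus-one classifiers combined by a voting scheme. The paper's own "new voting scheme" is not specified in the abstract, so the sketch below shows only plain majority voting as an illustration. The pairwise deciders are hypothetical callables standing in for trained support vector machines, and the tie-break rule (earliest class in the list wins) is an assumption.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(classes, pairwise, x):
    """Majority vote over all class pairs.

    `pairwise` maps each pair (a, b), in the order produced by
    itertools.combinations(classes, 2), to a function that returns a or b for x.
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise[(a, b)](x)] += 1
    best = max(votes.values())
    # Assumed tie-break: the earliest class in `classes` among the top scorers.
    return next(c for c in classes if votes[c] == best)


if __name__ == "__main__":
    classes = ["A", "B", "C"]
    # Toy deciders on a single number; real ones would be trained binary classifiers.
    pairwise = {
        ("A", "B"): lambda x: "A" if x < 1 else "B",
        ("A", "C"): lambda x: "A" if x < 2 else "C",
        ("B", "C"): lambda x: "B" if x < 3 else "C",
    }
    for x in (0.5, 1.5, 5.0):
        print(x, one_vs_one_predict(classes, pairwise, x))
```

With eight membrane protein types, as in the abstract, a one-versus-one design needs 8 × 7 / 2 = 28 pairwise classifiers, which is why a rule for combining their outputs is required.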
|
46
|
Li T, Li QZ. Annotating the protein-RNA interaction sites in proteins using evolutionary information and protein backbone structure. J Theor Biol 2012; 312:55-64. [PMID: 22874580 DOI: 10.1016/j.jtbi.2012.07.020] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 01/16/2012] [Revised: 07/19/2012] [Accepted: 07/21/2012] [Indexed: 12/11/2022]
Abstract
RNA-protein interactions play important roles in various biological processes. The precise detection of RNA-protein interaction sites is very important for understanding essential biological processes and annotating the function of proteins. In this study, a computational method for predicting RNA-binding sites in proteins is proposed, based on various features from amino acid sequence and structure, including evolutionary information, solvent accessible surface area and torsion angles (φ, ψ) in the backbone structure of the polypeptide chain. When the method is applied to predict RNA-binding sites in three datasets (RBP86 containing 86 protein chains, RBP107 containing 107 protein chains and RBP109 containing 109 protein chains), better sensitivities and specificities are obtained than with previously published methods in five-fold cross-validation tests. To further examine the performance of our method, the RBP107 dataset is used as the training set and the RBP86 and RBP109 datasets are used as independent test sets. In addition, as examples of our prediction, RNA-binding sites in a few proteins are presented. The annotated results are consistent with the PDB annotation. These results show that our method is useful for annotating RNA-binding sites of novel proteins.
Affiliation(s)
- Tao Li
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
- Qian-Zhong Li
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
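The five-fold cross-validation mentioned in the abstract of this entry partitions a dataset into five disjoint folds and tests on each fold in turn. The sketch below illustrates the partitioning only. The abstract does not say whether the split was made by protein chain or by residue, so treating each index as one chain is an assumption, as are the function name and the unshuffled round-robin assignment.

```python
def k_fold_splits(n_items, k=5):
    """Yield (train_indices, test_indices) for each of k disjoint folds.

    Items are assigned to folds round-robin without shuffling (an assumption;
    a real experiment would normally shuffle or stratify first).
    """
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and the number of items")
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


if __name__ == "__main__":
    # 86 items, matching the number of chains in the RBP86 dataset.
    for train, test in k_fold_splits(86, k=5):
        print(len(train), len(test))
```

Every item appears in exactly one test fold, so each chain contributes to the reported sensitivity and specificity once.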
|
47
|
Predict mycobacterial proteins subcellular locations by incorporating pseudo-average chemical shift into the general form of Chou’s pseudo amino acid composition. J Theor Biol 2012; 304:88-95. [DOI: 10.1016/j.jtbi.2012.03.017] [Citation(s) in RCA: 89] [Impact Index Per Article: 7.4] [Received: 12/17/2011] [Revised: 03/13/2012] [Accepted: 03/14/2012] [Indexed: 11/18/2022]
|
48
|
Hayat M, Khan A. Mem-PHybrid: hybrid features-based prediction system for classifying membrane protein types. Anal Biochem 2012; 424:35-44. [PMID: 22342883 DOI: 10.1016/j.ab.2012.02.007] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.2] [Received: 08/27/2011] [Revised: 02/04/2012] [Accepted: 02/06/2012] [Indexed: 11/29/2022]
Abstract
Membrane proteins are a major class of proteins and are encoded by approximately 20% to 30% of genes in most organisms. In this work, a novel two-layer membrane protein prediction system, called Mem-PHybrid, is proposed. In the first layer, it identifies the query protein as a membrane or nonmembrane protein. In the second layer, it further identifies the type of membrane protein. The proposed Mem-PHybrid prediction system is based on hybrid features, whereby physicochemical features and split amino acid composition-based features are fused. This enables Mem-PHybrid to exploit the discrimination capabilities of both types of feature extraction strategy. In addition, minimum redundancy maximum relevance feature selection has been applied to reduce the dimensionality of the feature vector. We employ random forest, evidence-theoretic K-nearest neighbor, and support vector machine (SVM) as classifiers and analyze their performance on two datasets. SVM using hybrid features yields the highest accuracy of 89.6% and 97.3% on dataset1 and 91.5% and 95.5% on dataset2 for jackknife and independent dataset tests, respectively. The enhanced prediction performance of Mem-PHybrid is largely attributed to the discrimination power of the hybrid features and the learning capability of SVM. Mem-PHybrid is accessible at http://www.111.68.99.218/Mem-PHybrid.
Affiliation(s)
- Maqsood Hayat
- Department of Computer and Information Sciences, Pakistan Institute of Engineering and Applied Sciences, Nilore, Islamabad, Pakistan
|
49
|
Su CH, Pal NR, Lin KL, Chung IF. Identification of amino acid propensities that are strong determinants of linear B-cell epitope using neural networks. PLoS One 2012; 7:e30617. [PMID: 22347389 PMCID: PMC3275595 DOI: 10.1371/journal.pone.0030617] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Received: 08/23/2011] [Accepted: 12/22/2011] [Indexed: 11/18/2022]
Abstract
BACKGROUND: Identification of amino acid propensities that are strong determinants of linear B-cell epitopes is very important to enrich our knowledge about epitopes. This can also help to obtain better epitope prediction. Typical linear B-cell epitope prediction methods combine various propensities in different ways to improve prediction accuracy. However, fewer but better features may yield better prediction. Moreover, for a propensity, when the sequence length is k, there will be k values, which should be treated as a single unit for feature selection; hence the usual feature selection methods will not work. Here we use a novel Group Feature Selecting Multilayered Perceptron, GFSMLP, which treats a group of related information as a single entity, selects useful propensities related to linear B-cell epitopes, and uses them to predict epitopes. METHODOLOGY/PRINCIPAL FINDINGS: We use eight widely known propensities and four data sets. We use GFSMLP to rank propensities by the frequency with which they are selected. We find that Chou's beta-turn and Ponnuswamy's polarity are better features for prediction of linear B-cell epitopes. We examine the individual and combined discriminating power of the selected propensities and analyze the correlation between paired propensities. Our results show that the selected propensities are indeed good features, which also cooperate with other propensities to enhance the discriminating power for predicting epitopes. We find that individually polarity is not the best predictor, but it collaborates with others to yield good prediction. The usual feature selection methods cannot provide such information. CONCLUSIONS/SIGNIFICANCE: Our results confirm the effectiveness of active (group) feature selection by GFSMLP over the traditional passive approaches of evaluating various combinations of propensities. The GFSMLP-based feature selection can be extended to more than 500 remaining propensities to enhance our biological knowledge about epitopes and to obtain better prediction. A graphical-user-interface version of GFSMLP is available at: http://bio.classcloud.org/GFSMLP/.
Affiliation(s)
- Chun-Hung Su
- Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan, Republic of China
- Nikhil R. Pal
- Electronics and Communication Sciences Unit, Indian Statistical Institute, Calcutta, India
- Ken-Li Lin
- Computer Center, Chung Hua University, Hsinchu, Taiwan, Republic of China
- I-Fang Chung
- Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan, Republic of China
|
50
|
Identification of voltage-gated potassium channel subfamilies from sequence information using support vector machine. Comput Biol Med 2012; 42:504-7. [PMID: 22297432 DOI: 10.1016/j.compbiomed.2012.01.003] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.3] [Received: 12/31/2010] [Revised: 10/16/2011] [Accepted: 01/12/2012] [Indexed: 02/05/2023]
Abstract
Proteins belonging to different subfamilies of voltage-gated K(+) channels (VKC) are functionally divergent. Traditional methods of classifying ion channels are time consuming. Thus, it is highly desirable to develop novel computational methods for VKC subfamily classification. In this study, a support vector machine-based method is proposed to predict VKC subfamilies using amino acid and dipeptide compositions. To remove redundant information, a novel feature selection technique was employed to single out optimized features. In jackknife cross-validation, the proposed method (VKCPred) achieved an overall accuracy of 93.09% with 93.22% average sensitivity and 98.34% average specificity, which are superior to those of two other state-of-the-art classifiers. These results indicate that VKCPred can be efficiently used to identify and annotate subfamilies of voltage-gated K(+) channels. The VKCPred software and dataset are freely available at http://cobi.uestc.edu.cn/people/hlin/tools/VKCPred/.
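Amino acid and dipeptide compositions, the features named in the abstract above, turn a sequence of any length into a fixed-length vector of 20 + 400 = 420 frequencies. The following is a minimal sketch of that encoding under stated assumptions: residues outside the 20 standard amino acids are skipped, dipeptides are counted with overlap, and the function names are hypothetical. The feature selection step described in the abstract is not reproduced.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """Return 20 frequencies, one per standard amino acid (non-standard residues skipped)."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in seq.upper():
        if residue in counts:
            counts[residue] += 1
    total = sum(counts.values())
    return [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return 400 frequencies of overlapping adjacent residue pairs."""
    seq = seq.upper()
    counts = {dp: 0 for dp in DIPEPTIDES}
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[dp] / total if total else 0.0 for dp in DIPEPTIDES]


def feature_vector(seq):
    """Concatenate both compositions into a single 420-dimensional vector."""
    return amino_acid_composition(seq) + dipeptide_composition(seq)


if __name__ == "__main__":
    vec = feature_vector("MKVLAAGIVK")  # toy sequence, not from the VKCPred dataset
    print(len(vec))
```

A vector of this kind can be passed directly to a support vector machine. Most of the 400 dipeptide entries are zero for a short sequence, which is one motivation for selecting a smaller subset of features.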
|