1
Ahmad HI, Ijaz N, Afzal G, Asif AR, ur Rehman A, Rahman A, Ahmed I, Yousaf M, Elokil A, Muhammad SA, Albogami SM, Alotaibi SS. Computational Insights into the Structural and Functional Impacts of nsSNPs of Bone Morphogenetic Proteins. BIOMED RESEARCH INTERNATIONAL 2022; 2022:4013729. [PMID: 35832847 PMCID: PMC9273450 DOI: 10.1155/2022/4013729] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/23/2022] [Accepted: 06/15/2022] [Indexed: 12/12/2022]
Abstract
BMPs (bone morphogenetic proteins) are multipurpose secreted cytokines of the transforming growth factor (TGF) superfamily. These glycoproteins, acting as disulfide-linked homo- or heterodimers, are highly potent regulators of bone and cartilage production and repair, cell proliferation throughout embryonic development, and bone homeostasis in adults. Because genetic variation might influence structure and function, this study aimed to determine the pathogenic effect of nonsynonymous single-nucleotide polymorphisms (nsSNPs) in BMP genes. The implications of these variations, investigated using computational analysis and molecular models of the mature TGF-β domain, revealed the impact of the modifications on the function of the BMP protein. Three-dimensional (3D) structure analysis was performed on the nsSNPs Y316S, V386G, E387G, C389G, and C391G in the TGF-β domain of chicken BMP2 and on the nsSNPs H344P, S347P, and V357A in the TGF-β domain of chicken BMP4, all of which were predicted to be harmful and of high risk. The ability of proteins to perform a variety of tasks and interact with other molecules depends on their tertiary structure. The current analysis revealed the most damaging variants (Y316S, V386G, E387G, C389G, and C391G), which are highly conserved and functional and are located in the TGF-β domain of BMP2 and BMP4. The amino acid substitutions E387G, C389G, and C391G are found in the binding region. The mutations in the TGF-β domain caused significant changes in its structural organization, including the substrate binding sites. These findings will assist future research on the role of these variants in BMP function loss and in skeletal disorders, and may help to develop practical strategies for treating bone-related conditions.
Affiliation(s)
- Hafiz Ishfaq Ahmad
  - Department of Animal Breeding and Genetics, University of Veterinary and Animal Sciences, Lahore, Pakistan
- Nabeel Ijaz
  - Department of Clinical Science, Faculty of Veterinary Sciences, Bahauddin Zakariya University Multan, Pakistan
- Gulnaz Afzal
  - Department of Zoology, The Islamia University of Bahawalpur, Bahawalpur, Pakistan
- Akhtar Rasool Asif
  - Key Laboratory of Animal Genetics, Breeding and Reproduction, Huazhong Agricultural University, Wuhan, China
  - University of Veterinary and Animal Sciences, Lahore, Sub-Campus Jhang, Pakistan
- Aziz ur Rehman
  - Key Laboratory of Animal Genetics, Breeding and Reproduction, Huazhong Agricultural University, Wuhan, China
  - University of Veterinary and Animal Sciences, Lahore, Sub-Campus Jhang, Pakistan
- Abdur Rahman
  - University of Veterinary and Animal Sciences, Lahore, Sub-Campus Jhang, Pakistan
  - Department of Animal Nutrition, Afyon Kocatepe University, Turkey
- Irfan Ahmed
  - Department of Animal Nutrition, Faculty of Veterinary and Animal Sciences, The Islamia University of Bahawalpur, Pakistan
- Muhammad Yousaf
  - Department of Animal Nutrition, Faculty of Veterinary and Animal Sciences, The Islamia University of Bahawalpur, Pakistan
- Abdelmotaleb Elokil
  - Department of Animal Production, Faculty of Agriculture, Benha University, Moshtohor 13736, Egypt
- Sayyed Aun Muhammad
  - University of Veterinary and Animal Sciences, Lahore, Sub-Campus Jhang, Pakistan
- Sarah M. Albogami
  - Department of Biotechnology, College of Science, Taif University, P.O. Box 11099, Taif 21944, Saudi Arabia
- Saqer S. Alotaibi
  - Department of Biotechnology, College of Science, Taif University, P.O. Box 11099, Taif 21944, Saudi Arabia
2
Liu B, Li S. ProtDet-CCH: Protein Remote Homology Detection by Combining Long Short-Term Memory and Ranking Methods. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:1203-1210. [PMID: 29993950 DOI: 10.1109/tcbb.2018.2789880] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Indexed: 06/08/2023]
Abstract
As one of the most challenging tasks in sequence analysis, protein remote homology detection has been extensively studied. Methods based on discriminative models and ranking approaches have achieved state-of-the-art performance, and these two kinds of methods are complementary. In this study, three LSTM models, ULSTM, BLSTM, and CNN-BLSTM, were applied to construct predictors for protein remote homology detection. They are able to automatically extract local and global sequence-order information. Combined with PSSMs, the CNN-BLSTM achieved the best performance among the three LSTM-based models. We named this method CNN-BLSTM-PSSM. Finally, a new method called ProtDet-CCH was proposed by combining CNN-BLSTM-PSSM with the ranking method HHblits. Tested on a widely used SCOP benchmark dataset, ProtDet-CCH achieved an ROC score of 0.998 and an ROC50 score of 0.982, significantly outperforming other existing state-of-the-art methods. Experimental results on two updated SCOPe independent datasets showed that ProtDet-CCH can achieve stable performance. Furthermore, our method can provide useful insights for studying the features and motifs of protein families and superfamilies. It is anticipated that ProtDet-CCH will become a very useful tool for protein remote homology detection.
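The ROC50 score quoted in this abstract is commonly understood as the area under the ROC curve up to the first 50 false positives, normalized to the range 0 to 1. The sketch below is one minimal way to compute it, not the authors' evaluation code. The function name `roc_n`, the tie handling (input order) and the choice to cap `n` at the number of negatives are assumptions made for illustration.

```python
def roc_n(scores, labels, n=50):
    """Normalized area under the ROC curve up to the first n false positives.

    scores: higher means "more likely homologous".
    labels: truthy for true homologs, falsy otherwise.
    Assumption: if fewer than n negatives exist, n is capped at that count.
    """
    # Rank by descending score; sorted() is stable, so ties keep input order.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_pos = sum(1 for _, is_pos in ranked if is_pos)
    total_neg = len(ranked) - total_pos
    if total_pos == 0 or total_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    n = min(n, total_neg)

    true_pos_seen = 0
    false_pos_seen = 0
    area = 0
    for _, is_pos in ranked:
        if is_pos:
            true_pos_seen += 1
        else:
            # Each false positive contributes the number of true positives
            # ranked above it.
            false_pos_seen += 1
            area += true_pos_seen
            if false_pos_seen == n:
                break
    return area / (n * total_pos)


if __name__ == "__main__":
    # Two positives and two negatives, interleaved by score.
    print(roc_n([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0], n=2))  # 0.75
```

A perfect ranking (all homologs scored above all non-homologs) gives 1.0 under this definition, which is why values such as 0.982 indicate very few early false positives.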
3
Brick TR, Gray AL, Staples AD. Recurrence Quantification for the Analysis of Coupled Processes in Aging. J Gerontol B Psychol Sci Soc Sci 2017; 73:134-147. [PMID: 28958046 DOI: 10.1093/geronb/gbx018] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 07/07/2016] [Accepted: 02/11/2017] [Indexed: 11/14/2022]
Abstract
Objectives: Aging is a complex phenomenon, with numerous simultaneous processes that interact with each other on a moment-to-moment basis. One way to quantify the interactions of these processes is by measuring how much a process is similar to its own past states or the past states of another system through the analysis of recurrence. This paper presents an introduction to recurrence quantification analysis (RQA) and cross-recurrence quantification analysis (CRQA), two dynamical systems analysis techniques that provide ways to characterize the self-similar nature of each process and the properties of their mutual temporal co-occurrence. Method: We present RQA and CRQA and demonstrate their effectiveness with an example of conversational movements across age groups. Results: RQA and CRQA provide methods of analyzing the repetitive processes that occur in day-to-day life, describing how different processes co-occur, synchronize, or predict each other and comparing the characteristics of those processes between groups. Discussion: With intensive longitudinal data becoming increasingly available, it is possible to examine how the processes of aging unfold. RQA and CRQA provide information about how one process may show patterns of internal repetition or echo the patterning of another process and how those characteristics may change across the process of aging.
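To make the idea of recurrence concrete, the following is a minimal sketch of RQA and CRQA on scalar time series. It is a simplification under stated assumptions, not the procedure used in the paper: there is no delay embedding, two states count as recurrent when their absolute difference is within a chosen `radius`, and the function names are hypothetical. Passing the same series twice gives the auto-recurrence case (RQA); passing two different series gives the cross-recurrence case (CRQA).

```python
def recurrence_matrix(x, y, radius):
    """R[i][j] = 1 when state x[i] is within `radius` of state y[j]."""
    return [[1 if abs(a - b) <= radius else 0 for b in y] for a in x]


def diagonal_line_lengths(matrix, skip_main_diagonal=False):
    """Lengths of all runs of recurrent points along diagonals."""
    rows, cols = len(matrix), len(matrix[0])
    lengths = []
    for offset in range(-(rows - 1), cols):
        if skip_main_diagonal and offset == 0:
            continue  # in auto-recurrence the main diagonal is trivially recurrent
        run = 0
        i, j = max(0, -offset), max(0, offset)
        while i < rows and j < cols:
            if matrix[i][j]:
                run += 1
            elif run:
                lengths.append(run)
                run = 0
            i += 1
            j += 1
        if run:
            lengths.append(run)
    return lengths


def recurrence_rate(matrix, skip_main_diagonal=False):
    """Fraction of (considered) cells that are recurrent."""
    rows, cols = len(matrix), len(matrix[0])
    cells = rows * cols - (min(rows, cols) if skip_main_diagonal else 0)
    return sum(diagonal_line_lengths(matrix, skip_main_diagonal)) / cells


def determinism(matrix, min_line=2, skip_main_diagonal=False):
    """Fraction of recurrent points lying on diagonal lines of length >= min_line."""
    lengths = diagonal_line_lengths(matrix, skip_main_diagonal)
    total = sum(lengths)
    if total == 0:
        return 0.0
    return sum(length for length in lengths if length >= min_line) / total


if __name__ == "__main__":
    series = [0, 1, 0, 1, 0, 1]  # a perfectly periodic toy signal
    rm = recurrence_matrix(series, series, radius=0)
    print(recurrence_rate(rm, skip_main_diagonal=True))  # 0.4
    print(determinism(rm, skip_main_diagonal=True))      # 1.0
```

A perfectly periodic signal yields determinism 1.0 because every recurrent point sits on a diagonal line, which matches the intuition that the process repeats its own past exactly.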
Affiliation(s)
- Timothy R Brick
  - Departments of Human Development and Family Studies and Psychology
- Allison L Gray
  - Department of Human Development and Family Studies, Pennsylvania State University, University Park
- Angela D Staples
  - Department of Psychology, Eastern Michigan University, Ypsilanti
4
Karain WI. Detecting transitions in protein dynamics using a recurrence quantification analysis based bootstrap method. BMC Bioinformatics 2017; 18:525. [PMID: 29179670 PMCID: PMC5704401 DOI: 10.1186/s12859-017-1943-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 08/11/2017] [Accepted: 11/15/2017] [Indexed: 11/17/2022]
Abstract
Background: Proteins undergo conformational transitions over different time scales. These transitions are closely intertwined with the protein’s function. Numerous standard techniques such as principal component analysis are used to detect these transitions in molecular dynamics simulations. In this work, we add a new method that has the ability to detect transitions in dynamics based on the recurrences in the dynamical system. It combines bootstrapping and recurrence quantification analysis. We start from the assumption that a protein has a “baseline” recurrence structure over a given period of time. Any statistically significant deviation from this recurrence structure, as inferred from complexity measures provided by recurrence quantification analysis, is considered a transition in the dynamics of the protein. Results: We apply this technique to a 132 ns long molecular dynamics simulation of the β-Lactamase Inhibitory Protein BLIP. We are able to detect conformational transitions in the nanosecond range in the recurrence dynamics of the BLIP protein during the simulation. The results compare favorably to those extracted using the principal component analysis technique. Conclusions: The bootstrap technique based on recurrence quantification analysis is able to detect transitions between different dynamics states for a protein over different time scales. It is not limited to linear dynamics regimes, and can be generalized to any time scale. It also has the potential to be used to cluster frames in molecular dynamics trajectories according to the nature of their recurrence dynamics. One shortcoming of this method is the need for time windows large enough to ensure good statistical quality for the recurrence complexity measures needed to detect the transitions.
Affiliation(s)
- Wael I Karain
  - Department of Physics, Birzeit University, P.O.Box 14, Birzeit, Palestine.
5
Li S, Chen J, Liu B. Protein remote homology detection based on bidirectional long short-term memory. BMC Bioinformatics 2017; 18:443. [PMID: 29017445 PMCID: PMC5634958 DOI: 10.1186/s12859-017-1842-2] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.9] [Received: 07/19/2017] [Accepted: 09/21/2017] [Indexed: 01/05/2023]
Abstract
BACKGROUND: Protein remote homology detection plays a vital role in studies of protein structures and functions. Almost all of the traditional machine learning methods require fixed-length features to represent the protein sequences. However, it is never an easy task to extract discriminative features with limited knowledge of proteins. On the other hand, deep learning has demonstrated its advantage in automatically learning representations. It is worthwhile to explore the application of deep learning techniques to protein remote homology detection. RESULTS: In this study, we employ the Bidirectional Long Short-Term Memory (BLSTM) to learn effective features from pseudo proteins, and propose a predictor called ProDec-BLSTM, which consists of an input layer, a bidirectional LSTM, a time distributed dense layer and an output layer. This neural network can automatically extract the discriminative features by using the bidirectional LSTM and the time distributed dense layer. CONCLUSION: Experimental results on a widely used benchmark dataset show that ProDec-BLSTM outperforms other related methods in terms of both the mean ROC and mean ROC50 scores. This promising result shows that ProDec-BLSTM is a useful tool for protein remote homology detection. Furthermore, the hidden patterns learnt by ProDec-BLSTM can be interpreted and visualized, and therefore additional useful information can be obtained.
Affiliation(s)
- Shumin Li
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, HIT Campus Shenzhen University Town, Xili, Shenzhen, 518055, China
- Junjie Chen
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, HIT Campus Shenzhen University Town, Xili, Shenzhen, 518055, China
- Bin Liu
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, HIT Campus Shenzhen University Town, Xili, Shenzhen, 518055, China.
6
Konopka BM, Marciniak M, Dyrka W. Quantiprot - a Python package for quantitative analysis of protein sequences. BMC Bioinformatics 2017; 18:339. [PMID: 28716000 PMCID: PMC5512976 DOI: 10.1186/s12859-017-1751-4] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 03/29/2017] [Accepted: 07/05/2017] [Indexed: 11/17/2022]
Abstract
Background: The field of protein sequence analysis is dominated by tools rooted in substitution matrices and alignments. A complementary approach is provided by methods of quantitative characterization. A major advantage of the approach is that quantitative properties define a multidimensional solution space, where sequences can be related to each other and differences can be meaningfully interpreted. Results: Quantiprot is a software package in Python, which provides a simple and consistent interface to multiple methods for quantitative characterization of protein sequences. The package can be used to calculate dozens of characteristics directly from sequences or using physico-chemical properties of amino acids. Besides basic measures, Quantiprot performs quantitative analysis of recurrence and determinism in the sequence, calculates the distribution of n-grams and computes the Zipf’s law coefficient. Conclusions: We propose three main fields of application of the Quantiprot package. First, quantitative characteristics can be used in alignment-free similarity searches, and in clustering of large and/or divergent sequence sets. Second, a feature space defined by quantitative properties can be used in comparative studies of protein families and organisms. Third, the feature space can be used for evaluating generative models, where a large number of sequences generated by the model can be compared to actually observed sequences.
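As an illustration of one of the quantities mentioned in this abstract, the sketch below computes an n-gram distribution for a single sequence using only the Python standard library. It deliberately does not use Quantiprot's own interface, which is not described here; the function name `ngram_distribution` and the toy sequence are assumptions for illustration.

```python
from collections import Counter


def ngram_distribution(sequence, n=2):
    """Relative frequency of each overlapping n-gram in `sequence`."""
    if n < 1 or n > len(sequence):
        raise ValueError("n must be between 1 and the sequence length")
    # Slide a window of width n over the sequence, one position at a time.
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


if __name__ == "__main__":
    # Toy sequence (hypothetical): three overlapping 2-grams, AA twice and AB once.
    print(ngram_distribution("AAAB", n=2))
```

Such frequency tables give each sequence a fixed set of numeric coordinates, which is what allows alignment-free comparison in a shared feature space.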
Affiliation(s)
- Bogumił M Konopka
  - Katedra Inżynierii Biomedycznej, Wydział Podstawowych Problemów Techniki, Politechnika Wrocławska, Wybrzeże Wyspiańskiego 27, Wroclaw, 50-370, Poland
- Marta Marciniak
  - Katedra Inżynierii Biomedycznej, Wydział Podstawowych Problemów Techniki, Politechnika Wrocławska, Wybrzeże Wyspiańskiego 27, Wroclaw, 50-370, Poland
- Witold Dyrka
  - Katedra Inżynierii Biomedycznej, Wydział Podstawowych Problemów Techniki, Politechnika Wrocławska, Wybrzeże Wyspiańskiego 27, Wroclaw, 50-370, Poland.
7
Chen J, Guo M, Wang X, Liu B. A comprehensive review and comparison of different computational methods for protein remote homology detection. Brief Bioinform 2016; 19:231-244. [DOI: 10.1093/bib/bbw108] [Citation(s) in RCA: 81] [Impact Index Per Article: 10.1] [Received: 08/03/2016] [Indexed: 01/02/2023]
Affiliation(s)
- Junjie Chen
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- Mingyue Guo
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- Xiaolong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
  - Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- Bin Liu
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
  - Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
8
Koyano H, Hayashida M, Akutsu T. Maximum margin classifier working in a set of strings. Proc Math Phys Eng Sci 2016; 472:20150551. [PMID: 27118908 PMCID: PMC4841474 DOI: 10.1098/rspa.2015.0551] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/10/2015] [Accepted: 02/02/2016] [Indexed: 11/12/2022]
Abstract
Numbers and numerical vectors account for a large portion of data. However, recently, the amount of string data generated has increased dramatically. Consequently, classifying string data is a common problem in many fields. The most widely used approach to this problem is to convert strings into numerical vectors using string kernels and subsequently apply a support vector machine that works in a numerical vector space. However, this non-one-to-one conversion involves a loss of information and makes it impossible to evaluate, using probability theory, the generalization error of a learning machine, considering that the given data to train and test the machine are strings generated according to probability laws. In this study, we approach this classification problem by constructing a classifier that works in a set of strings. To evaluate the generalization error of such a classifier theoretically, probability theory for strings is required. Therefore, we first extend a limit theorem for a consensus sequence of strings demonstrated by one of the authors and co-workers in a previous study. Using the obtained result, we then demonstrate that our learning machine classifies strings in an asymptotically optimal manner. Furthermore, we demonstrate the usefulness of our machine in practical data analysis by applying it to predicting protein-protein interactions using amino acid sequences and classifying RNAs by the secondary structure using nucleotide sequences.
Affiliation(s)
- Hitoshi Koyano
  - Laboratory of Biostatistics and Bioinformatics, Graduate School of Medicine, Kyoto University, 54 Kawahara-cho, Shogoin, Sakyo-ku, Kyoto 606-8507, Japan
- Morihiro Hayashida
  - Laboratory of Mathematical Bioinformatics, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan
- Tatsuya Akutsu
  - Laboratory of Mathematical Bioinformatics, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan
9
Oh Brother, Where Art Thou? Finding Orthologs in the Twilight and Midnight Zones of Sequence Similarity. Evol Biol 2016. [DOI: 10.1007/978-3-319-41324-2_22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
10
Liu B, Chen J, Wang X. Protein remote homology detection by combining Chou’s distance-pair pseudo amino acid composition and principal component analysis. Mol Genet Genomics 2015; 290:1919-31. [DOI: 10.1007/s00438-015-1044-4] [Citation(s) in RCA: 61] [Impact Index Per Article: 6.8] [Received: 03/05/2015] [Accepted: 04/06/2015] [Indexed: 02/07/2023]
11
Bedoya O, Tischer I. Reducing dimensionality in remote homology detection using predicted contact maps. Comput Biol Med 2015; 59:64-72. [DOI: 10.1016/j.compbiomed.2015.01.020] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 10/22/2014] [Revised: 01/05/2015] [Accepted: 01/22/2015] [Indexed: 11/28/2022]
12
Detecting protein atom correlations using correlation of probability of recurrence. Proteins 2014; 82:2180-9. [DOI: 10.1002/prot.24574] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 03/11/2014] [Accepted: 03/29/2014] [Indexed: 11/07/2022]
13
Remote homology detection incorporating the context of physicochemical properties. Comput Biol Med 2014; 45:43-50. [PMID: 24480162 DOI: 10.1016/j.compbiomed.2013.11.012] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 07/21/2013] [Revised: 11/10/2013] [Accepted: 11/18/2013] [Indexed: 11/22/2022]
Abstract
A new method for remote protein homology detection, called support vector machine incorporating the context of physicochemical properties (SVM-CP), is presented. Recent discriminative methods are based on concatenating information extracted from each protein by considering several physicochemical properties. We show that there are physicochemical properties that reflect the functional or structural characteristics of each specific protein family, but there are also some physicochemical properties that affect the accuracy of the classification techniques. The research highlights the importance of the selection of physicochemical properties in remote homology detection. Most of the methods slide a window over every protein sequence to extract physicochemical information. This extraction is usually performed by giving the same importance to every value in the window, i.e., averaging the physicochemical values in the observation window. SVM-CP takes into account that every residue in a sliding window has a different weight, which reflects the importance or contribution to the representative value of the window. The SVM-CP method reaches a receiver operating characteristic (ROC) score of 0.93462, which is the highest value for a remote homology detection method based on the sequence composition information.
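The contrast this abstract draws between plain averaging and per-residue weighting inside a sliding window can be sketched as follows. This is a minimal illustration, not SVM-CP itself: the abstract does not state how the weights are chosen, so the weights and property values below are hypothetical, and the function name is an assumption.

```python
def weighted_window_average(values, weights):
    """Slide a window of len(weights) over `values` and return one
    weighted average per window position.

    Uniform weights reproduce the plain average that the abstract describes
    as the usual practice; non-uniform weights let some window positions
    contribute more to the representative value.
    """
    width = len(weights)
    if width == 0 or width > len(values):
        raise ValueError("window must be non-empty and fit inside the sequence")
    norm = sum(weights)
    return [
        sum(w * v for w, v in zip(weights, values[start:start + width])) / norm
        for start in range(len(values) - width + 1)
    ]


if __name__ == "__main__":
    # Hypothetical per-residue property values for a 4-residue sequence.
    prop = [1, 5, 2, 8]
    print(weighted_window_average(prop, [1, 1, 1]))  # plain average per window
    print(weighted_window_average(prop, [1, 2, 1]))  # centre residue counts double: [3.25, 4.25]
```

The two calls differ only in the weight vector, which isolates the design choice the abstract highlights: the window contents are the same, but the representative value shifts toward the more heavily weighted residue.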
14
Liu B, Xu J, Zou Q, Xu R, Wang X, Chen Q. Using distances between Top-n-gram and residue pairs for protein remote homology detection. BMC Bioinformatics 2014; 15 Suppl 2:S3. [PMID: 24564580 PMCID: PMC4015815 DOI: 10.1186/1471-2105-15-s2-s3] [Citation(s) in RCA: 42] [Impact Index Per Article: 4.2] [Indexed: 11/22/2022]
Abstract
Background: Protein remote homology detection is one of the central problems in bioinformatics, which is important for both basic research and practical application. Currently, discriminative methods based on Support Vector Machines (SVMs) achieve state-of-the-art performance. Exploring feature vectors incorporating the position information of amino acids or other protein building blocks is a key step to improve the performance of the SVM-based methods. Results: Two new methods for protein remote homology detection were proposed, called SVM-DR and SVM-DT. SVM-DR is a sequence-based method, in which the feature vector representation for a protein is based on the distances between residue pairs. SVM-DT is a profile-based method, which considers the distances between Top-n-gram pairs. A Top-n-gram can be viewed as a profile-based building block of proteins, which is calculated from the frequency profiles. These two methods are position-dependent approaches incorporating the sequence-order information of protein sequences. Various experiments were conducted on a benchmark dataset containing 54 families and 23 superfamilies. Experimental results showed that these two new methods are very promising. Compared with the position-independent methods, the performance improvement is obvious. Furthermore, the proposed methods can also provide useful insights for studying the features of protein families. Conclusion: The better performance of the proposed methods demonstrates that the position-dependent approaches are efficient for protein remote homology detection. Another advantage of our methods arises from the explicit feature space representation, which can be used to analyze the characteristic features of protein families. The source code of SVM-DT and SVM-DR is available at http://bioinformatics.hitsz.edu.cn/DistanceSVM/index.jsp
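A feature vector "based on the distances between residue pairs" can be sketched as below. This is one plausible reading of the description, not the published SVM-DR code: it assumes the feature for an ordered pair (a, b) at distance d is the number of positions i where residue a occurs at i and residue b at i + d, and the names `distance_pair_features` and `max_distance` are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def distance_pair_features(sequence, max_distance=3):
    """Count ordered residue pairs (a, b) separated by d = 1..max_distance.

    Returns (features, index), where index maps (a, b, d) to a position in
    the fixed-length vector of size 400 * max_distance.
    Non-standard residues (e.g. 'X') are skipped (an assumption).
    """
    index = {
        (a, b, d): k
        for k, (d, a, b) in enumerate(
            product(range(1, max_distance + 1), AMINO_ACIDS, AMINO_ACIDS)
        )
    }
    features = [0] * len(index)
    for d in range(1, max_distance + 1):
        for i in range(len(sequence) - d):
            key = (sequence[i], sequence[i + d], d)
            if key in index:
                features[index[key]] += 1
    return features, index


if __name__ == "__main__":
    feats, idx = distance_pair_features("ACAC", max_distance=2)
    print(len(feats))                 # 800
    print(feats[idx[("A", "C", 1)]])  # 2
```

Because the vector length depends only on `max_distance`, sequences of any length map to the same feature space, and each dimension stays interpretable as a specific residue pair at a specific separation.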
15
Kuksa PP. Biological sequence classification with multivariate string kernels. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:1201-1210. [PMID: 24384708 DOI: 10.1109/tcbb.2013.15] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 06/03/2023]
Abstract
String kernel-based machine learning methods have yielded great success in practical tasks of structured/sequential data analysis. They often exhibit state-of-the-art performance on many practical tasks of sequence analysis such as biological sequence classification, remote homology detection, or protein superfamily and fold prediction. However, typical string kernel methods rely on the analysis of discrete 1D string data (e.g., DNA or amino acid sequences). In this paper, we address multiclass biological sequence classification problems using multivariate representations in the form of sequences of feature vectors (as in biological sequence profiles, or sequences of individual amino acid physicochemical descriptors) and a class of multivariate string kernels that exploit these representations. On three protein sequence classification tasks, the proposed multivariate representations and kernels show significant 15-20 percent improvements compared to existing state-of-the-art sequence classification methods.
16
Huang CH, Chou SY, Ng KL. Improving protein complex classification accuracy using amino acid composition profile. Comput Biol Med 2013; 43:1196-204. [PMID: 23930814 DOI: 10.1016/j.compbiomed.2013.05.026] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 10/28/2012] [Revised: 05/29/2013] [Accepted: 05/30/2013] [Indexed: 11/18/2022]
Abstract
Protein complex prediction approaches are based on the assumptions that complexes have dense protein-protein interactions and high functional similarity between their subunits. We investigated those assumptions by studying the subunits' interaction topology, sequence similarity and molecular function for human and yeast protein complexes. Inclusion of amino acids' physicochemical properties can provide better understanding of protein complex properties. Principal component analysis is carried out to determine the major features. Adopting amino acid composition profile information with the SVM classifier serves as an effective post-processing step for complexes classification. Improvement is based on primary sequence information only, which is easy to obtain.
Affiliation(s)
- Chien-Hung Huang
  - Department of Computer Science and Information Engineering, National Formosa University, 64, Wen-Hwa Road, Hu-wei, Yun-Lin 632, Taiwan
17
Han GS, Yu ZG, Anh V, Krishnajith APD, Tian YC. An ensemble method for predicting subnuclear localizations from primary protein structures. PLoS One 2013; 8:e57225. [PMID: 23460833 PMCID: PMC3584121 DOI: 10.1371/journal.pone.0057225] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.0] [Received: 07/18/2012] [Accepted: 01/18/2013] [Indexed: 12/04/2022]
Abstract
Background: Predicting protein subnuclear localization is a challenging problem. Some previous works based on non-sequence information, including Gene Ontology annotations and kernel fusion, have respective limitations. The aim of this work is twofold: first, to propose a novel individual feature extraction method; second, to develop an ensemble method to improve prediction performance using comprehensive information represented in the form of a high-dimensional feature vector obtained by 11 feature extraction methods. Methodology/Principal Findings: A novel two-stage multiclass support vector machine is proposed to predict protein subnuclear localizations. It only considers those feature extraction methods based on amino acid classifications and physicochemical properties. In order to speed up our system, an automatic search method for the kernel parameter is used. The prediction performance of our method is evaluated on four datasets: the Lei dataset, a multi-localization dataset, the SNL9 dataset and a new independent dataset. The overall accuracy of prediction for 6 localizations on the Lei dataset is 75.2% and that for 9 localizations on the SNL9 dataset is 72.1% in the leave-one-out cross validation; it is 71.7% for the multi-localization dataset and 69.8% for the new independent dataset. Comparisons with existing methods show that our method performs better for both single-localization and multi-localization proteins and achieves more balanced sensitivities and specificities on large-size and small-size subcellular localizations. The overall accuracy improvements are 4.0% and 4.7% for single-localization proteins and 6.5% for multi-localization proteins. The reliability and stability of our classification model are further confirmed by permutation analysis. Conclusions: It can be concluded that our method is effective and valuable for predicting protein subnuclear localizations. A web server has been designed to implement the proposed method. It is freely available at http://bioinformatics.awowshop.com/snlpred_page.php.
Affiliation(s)
- Guo Sheng Han
  - School of Mathematics and Computational Science, Xiangtan University, Xiangtan City, Hunan, China
- Zu Guo Yu
  - School of Mathematics and Computational Science, Xiangtan University, Xiangtan City, Hunan, China
  - School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia
- Vo Anh
  - School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia
- Anaththa P. D. Krishnajith
  - School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, Queensland, Australia
- Yu-Chu Tian
  - School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, Queensland, Australia
18
Liu B, Wang X, Chen Q, Dong Q, Lan X. Using amino acid physicochemical distance transformation for fast protein remote homology detection. PLoS One 2012; 7:e46633. [PMID: 23029559 PMCID: PMC3460876 DOI: 10.1371/journal.pone.0046633] [Citation(s) in RCA: 81] [Impact Index Per Article: 6.8] [Received: 04/18/2012] [Accepted: 09/03/2012] [Indexed: 11/18/2022]
Abstract
Protein remote homology detection is one of the most important problems in bioinformatics. Discriminative methods such as support vector machines (SVM) have shown superior performance. However, the performance of SVM-based methods depends on the vector representations of the protein sequences. Prior works have demonstrated that sequence-order effects are relevant for discrimination, but little work has explored how to incorporate the sequence-order information along with the amino acid physicochemical properties into the prediction. In order to incorporate the sequence-order effects into protein remote homology detection, the physicochemical distance transformation (PDT) method is proposed. Each protein sequence is converted into a series of numbers by using the physicochemical property scores in the amino acid index (AAIndex), and then the sequence is converted into a fixed-length vector by PDT. With this approach, the sequence-order information can be efficiently included in the feature vector at little computational cost. Finally, the feature vectors are input into a support vector machine classifier to detect the protein remote homologies. Our experiments on a well-known benchmark show that the proposed method, SVM-PDT, achieves performance superior or comparable to current state-of-the-art methods, and its computational cost is considerably lower than that of other methods. When the evolutionary information extracted from the frequency profiles is combined with the PDT method, the profile-based PDT approach can improve the performance by 3.4% and 11.4% in terms of ROC score and ROC50 score, respectively. The local sequence-order information of the protein can be efficiently captured by the proposed PDT, and the physicochemical properties extracted from the amino acid index are incorporated into the prediction. The physicochemical distance transformation provides a general framework, which would be a valuable tool for protein-level study.
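The two-step conversion described in this abstract (residues to property scores, then scores to a fixed-length vector) can be sketched as follows. This is a hedged reconstruction from the description, not the published implementation: it assumes the transformation for property j and lag λ is the mean squared difference between the property values of residues λ positions apart, PDT(j, λ) = (1 / (L − λ)) · Σ_i (I_j(R_i) − I_j(R_{i+λ}))². The scale `TOY_SCALE` is a made-up three-letter example, not an AAIndex entry, and the function and parameter names are assumptions.

```python
def pdt_features(sequence, property_scales, max_lag):
    """Fixed-length vector of size len(property_scales) * max_lag.

    property_scales: list of dicts mapping residue -> numeric score.
    Each entry is the mean squared score difference between residues that
    are `lag` positions apart, for lag = 1..max_lag (assumed form of PDT).
    """
    length = len(sequence)
    if not 1 <= max_lag < length:
        raise ValueError("max_lag must be at least 1 and shorter than the sequence")
    features = []
    for scale in property_scales:
        # Step 1: convert the sequence into a series of numbers.
        encoded = [scale[residue] for residue in sequence]
        # Step 2: one feature per lag, independent of sequence length.
        for lag in range(1, max_lag + 1):
            pairs = length - lag
            total = sum((encoded[i] - encoded[i + lag]) ** 2 for i in range(pairs))
            features.append(total / pairs)
    return features


if __name__ == "__main__":
    TOY_SCALE = {"A": 0.0, "C": 1.0, "G": 2.0}  # hypothetical property scores
    print(pdt_features("ACGA", [TOY_SCALE], max_lag=2))  # [2.0, 2.5]
```

Each feature needs only one pass over the sequence per lag, which is consistent with the abstract's point that sequence-order information is added at little computational cost.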
Affiliation(s)
- Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, People's Republic of China.
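The transformation described in the Liu et al. abstract above (property scores, then a fixed-length vector) can be sketched as follows. The abstract does not give the PDT formula, so this sketch assumes a common lag-based form: for each lag, the mean squared difference between the property values of residues that many positions apart. The function name `pdt_features`, the `max_lag` parameter, and the use of the Kyte-Doolittle hydropathy scale as the single property are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List

# Kyte-Doolittle hydropathy scale, used here as one example property.
# (Assumption: the paper draws many properties from AAIndex; one is enough for a sketch.)
KD: Dict[str, float] = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def pdt_features(seq: str, prop: Dict[str, float], max_lag: int) -> List[float]:
    """Return one feature per lag 1..max_lag (hypothetical PDT-style transform).

    Each feature is the mean squared difference between the property values
    of residues separated by the given lag, so the output length depends
    only on max_lag and not on the sequence length.
    """
    values = [prop[res] for res in seq]  # sequence -> series of numbers
    n = len(values)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    features = []
    for lag in range(1, max_lag + 1):
        squared = [(values[i] - values[i + lag]) ** 2 for i in range(n - lag)]
        features.append(sum(squared) / (n - lag))
    return features


if __name__ == "__main__":
    # Two sequences of different length map to vectors of the same length.
    print(pdt_features("MKTAYIAKQRQISFVK", KD, max_lag=3))
    print(pdt_features("ACDEFGHIK", KD, max_lag=3))
```

Under these assumptions, repeating the computation for several properties and concatenating the results gives a vector of length (number of properties) × `max_lag`, which is the kind of fixed-length input an SVM classifier expects.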
|
19
|
Liu X, Zhao L, Dong Q. Protein remote homology detection based on auto-cross covariance transformation. Comput Biol Med 2011; 41:640-7. [DOI: 10.1016/j.compbiomed.2011.05.015] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Received: 02/27/2010] [Revised: 05/03/2011] [Accepted: 05/24/2011] [Indexed: 11/26/2022]
|
20
|
Shah AR, Agarwal K, Baker ES, Singhal M, Mayampurath AM, Ibrahim YM, Kangas LJ, Monroe ME, Zhao R, Belov ME, Anderson GA, Smith RD. Machine learning based prediction for peptide drift times in ion mobility spectrometry. Bioinformatics 2010; 26:1601-7. [PMID: 20495001 PMCID: PMC2913656 DOI: 10.1093/bioinformatics/btq245] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.4] [Received: 01/06/2010] [Revised: 04/18/2010] [Accepted: 05/02/2010] [Indexed: 11/14/2022]
Abstract
Motivation: Ion mobility spectrometry (IMS) has gained significant traction over the past few years for rapid, high-resolution separations of analytes based upon gas-phase ion structure, with significant potential impacts in the field of proteomic analysis. IMS coupled with mass spectrometry (MS) affords multiple improvements over traditional proteomics techniques, such as the elucidation of secondary structure information, the identification of post-translational modifications, and higher identification rates with reduced experiment times. The high-throughput nature of this technique benefits from accurate calculation of cross sections, mobilities and associated drift times of peptides, thereby enhancing downstream data analysis. Here, we present a model that uses physicochemical properties of peptides to accurately predict a peptide's drift time directly from its amino acid sequence. This model is used in conjunction with two mathematical techniques, a partial least squares regression and a support vector regression setting. Results: When tested on an experimentally created high-confidence database of 8675 peptide sequences with measured drift times, both techniques statistically significantly outperform the intrinsic size parameters-based calculations, the currently held practice in the field, on all charge states (+2, +3 and +4). Availability: The software executable, imPredict, is available for download from http://omics.pnl.gov/software/imPredict.php. Contact: rds@pnl.gov. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Anuj R Shah
- Fundamental and Computational Sciences Directorate, Pacific Northwest National Laboratory, 999 Battelle Boulevard, Richland, WA 99352, USA
|
21
|
Webb-Robertson BJM, Ratuiste KG, Oehmen CS. Physicochemical property distributions for accurate and rapid pairwise protein homology detection. BMC Bioinformatics 2010; 11:145. [PMID: 20302613 PMCID: PMC2851606 DOI: 10.1186/1471-2105-11-145] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.2] [Received: 10/24/2009] [Accepted: 03/19/2010] [Indexed: 11/10/2022]
Abstract
Background: The challenge of remote homology detection is that many evolutionarily related sequences have very little similarity at the amino acid level. Kernel-based discriminative methods, such as support vector machines (SVMs), that use vector representations of sequences derived from sequence properties have been shown to have superior accuracy when compared to traditional approaches for the task of remote homology detection. Results: We introduce a new method for feature vector representation based on the physicochemical properties of the primary protein sequence. A distribution of physicochemical property scores is assembled from 4-mers of the sequence and normalized based on the null distribution of the property over all possible 4-mers. With this approach there is little computational cost associated with the transformation of the protein into feature space, and overall performance in terms of remote homology detection is comparable with current state-of-the-art methods. We demonstrate that the features can be used for the task of pairwise remote homology detection with improved accuracy versus sequence-based methods such as BLAST and other feature-based methods of similar computational cost. Conclusions: A protein feature method based on physicochemical properties is a viable approach for extracting features in a computationally inexpensive manner while retaining the sensitivity of SVM protein homology detection. Furthermore, identifying features that can be used for generic pairwise homology detection in lieu of family-based homology detection is important for applications such as large database searches and comparative genomics.
Affiliation(s)
- Bobbie-Jo M Webb-Robertson
- Computational Biology and Bioinformatics, Pacific Northwest National Laboratory, Richland, WA 99352, USA.
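The 4-mer feature construction described in the Webb-Robertson et al. abstract above can be illustrated with a small sketch. The abstract does not specify how a 4-mer is scored, how the distribution is binned, or how the null distribution is applied, so the following choices are assumptions: a 4-mer's score is the sum of its residues' property values, the distribution is a fixed-width histogram, and normalization subtracts the null histogram computed over all 20^4 possible 4-mers. The Kyte-Doolittle scale and all function names are illustrative.

```python
from itertools import product
from typing import Dict, Iterable, List

# Kyte-Doolittle hydropathy scale, used as one example property (assumption).
KD: Dict[str, float] = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def histogram(scores: Iterable[float], lo: float, hi: float, bins: int) -> List[float]:
    """Relative-frequency histogram with `bins` equal-width bins on [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    total = 0
    for s in scores:
        # Clamp so that s == hi (or a tiny rounding error) lands in a valid bin.
        idx = max(0, min(int((s - lo) / width), bins - 1))
        counts[idx] += 1
        total += 1
    return [c / total for c in counts]


def kmer_property_features(seq: str, prop: Dict[str, float],
                           k: int = 4, bins: int = 10) -> List[float]:
    """Observed k-mer score histogram minus the null histogram over all k-mers."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    lo, hi = k * min(prop.values()), k * max(prop.values())
    # Scores of the overlapping k-mers actually present in the sequence.
    observed = (sum(prop[r] for r in seq[i:i + k]) for i in range(len(seq) - k + 1))
    # Null distribution: every possible k-mer over the 20-letter alphabet.
    null = (sum(prop[r] for r in kmer) for kmer in product(prop, repeat=k))
    obs_hist = histogram(observed, lo, hi, bins)
    null_hist = histogram(null, lo, hi, bins)
    return [o - n for o, n in zip(obs_hist, null_hist)]


if __name__ == "__main__":
    print(kmer_property_features("MKTAYIAKQRQISFVKSHFSRQ", KD))
```

Because the null histogram depends only on the property scale, it could be computed once and reused for every sequence, which is consistent with the low transformation cost the abstract reports.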
|
22
|
Bruni R, Costantino A, Tritarelli E, Marcantonio C, Ciccozzi M, Rapicetta M, El Sawaf G, Giuliani A, Ciccaglione AR. A computational approach identifies two regions of Hepatitis C Virus E1 protein as interacting domains involved in viral fusion process. BMC Structural Biology 2009; 9:48. [PMID: 19640267 PMCID: PMC2732612 DOI: 10.1186/1472-6807-9-48] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 10/22/2008] [Accepted: 07/29/2009] [Indexed: 01/01/2023]
Abstract
Background: The E1 protein of Hepatitis C Virus (HCV) can be dissected into two distinct hydrophobic regions: a central domain containing a hypothetical fusion peptide (FP), and a C-terminal domain (CT) comprising two segments, a pre-anchor and a trans-membrane (TM) region. In the currently accepted model of the viral fusion process, the FP and the TM regions are considered to be closely juxtaposed in the post-fusion structure, and their physical interaction cannot be excluded. In the present study, we took advantage of the natural sequence variability present among HCV strains to test, by purely sequence-based computational tools, the hypothesis that in this virus the fusion process involves the physical interaction of the FP and CT regions of E1. Results: Two computational approaches were applied. The first is based on the co-evolution paradigm of interacting peptides, and consequently on the correlation between the distance matrices generated by the sequence alignment method applied to the FP and CT primary structures, respectively. In spite of the relatively low random genetic drift between genotypes, co-evolution analysis of sequences from five HCV genotypes revealed a greater correlation between the FP and CT domains than with a control HCV sequence from the Core protein, thus giving clear, albeit still inconclusive, support to the physical interaction hypothesis. The second approach relies upon a non-linear signal analysis method widely used in protein science called Recurrence Quantification Analysis (RQA). This method allows a direct comparison of domains for the presence of common hydrophobicity patterns, on which the physical interaction is based. RQA greatly strengthened the reliability of the hypothesis by scoring many cross-recurrences between the hydrophobicity patterns of the FP and CT peptides, far outnumbering chance expectations and pointing to putative interaction sites. Intriguingly, mutations in the CT region of E1 that reduce the fusion process in vitro strongly reduced the amount of cross-recurrence, further supporting interaction between this region and the FP. Conclusion: Our results support a fusion model for HCV in which the FP and the C-terminal region of E1 are juxtaposed and interact in the post-fusion structure. These findings have general implications for viruses, as any visualization of the post-fusion FP-TM complex has been precluded by the impossibility of obtaining crystallised viral fusion proteins containing the trans-membrane region. This limitation gives sequence-based modelling efforts a crucial role in sketching a molecular interpretation of the fusion process. Moreover, our data also have a more general relevance for cell biology, as the mechanism of intracellular fusion shows remarkable similarities with viral fusion.
Affiliation(s)
- Roberto Bruni
- Department of Infectious, Parasitic and Immune-mediated Diseases, Istituto Superiore di Sanità, Rome, Italy.
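The first approach in the Bruni et al. abstract above, correlating the distance matrices of two regions across strains, can be sketched as below. The abstract does not name the distance measure or the correlation statistic, so this sketch assumes a p-distance (the fraction of differing aligned positions) and a Pearson correlation over the upper-triangle entries of the two matrices. The toy sequences and all function names are hypothetical and are not HCV data.

```python
from itertools import combinations
from math import sqrt
from typing import List, Sequence


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pairwise_distances(seqs: Sequence[str]) -> List[float]:
    """Upper-triangle entries of the strain-by-strain distance matrix."""
    return [p_distance(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(vx * vy)


def coevolution_score(region_a: Sequence[str], region_b: Sequence[str]) -> float:
    """Correlate the distance matrices of two regions, strains in the same order."""
    if len(region_a) != len(region_b):
        raise ValueError("both regions need one sequence per strain")
    return pearson(pairwise_distances(region_a), pairwise_distances(region_b))


if __name__ == "__main__":
    # Hypothetical aligned fragments for four strains (same strain order in each list).
    region_a = ["AAAA", "AAAT", "AATT", "TTTT"]
    region_b = ["CCCC", "CCCG", "CCGG", "GGGG"]
    print(coevolution_score(region_a, region_b))
```

In a comparison like the one the abstract describes, the score for the two candidate regions would be set against the score obtained with a control region from the same strains. A higher correlation for the candidate pair is suggestive rather than conclusive, as the abstract itself notes.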
|