1
Tang T, Zhang X, Li W, Wang Q, Liu Y, Cao X. Co-training based prediction of multi-label protein-protein interactions. Comput Biol Med 2024; 177:108623. [PMID: 38788374 DOI: 10.1016/j.compbiomed.2024.108623] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/12/2024] [Revised: 05/01/2024] [Accepted: 05/16/2024] [Indexed: 05/26/2024]
Abstract
Prediction of protein-protein interaction (PPI) types enhances the comprehension of the underlying structural characteristics and functions of proteins, which gives rise to a multi-label classification problem. The nominal features describe the physicochemical characteristics of proteins directly, establishing a more robust correlation with the interaction types between proteins than ordered features. Motivated by this, we propose a multi-label PPI prediction model referred to as CoMPPI (Co-training based Multi-Label prediction of Protein-Protein Interaction). This approach aims to maximize the utility of both ordered and nominal features extracted from protein sequences. Specifically, CoMPPI incorporates a graph convolutional network (GCN) and a 1D convolution operation to process the complementary subsets of features individually, leveraging both local and contextualized information in a more efficient way. In addition, two multi-type PPI datasets were constructed to eliminate the duplication in previous datasets. We compare the performance of CoMPPI with three state-of-the-art methods on three datasets partitioned using distinct schemes (Breadth-first search, Depth-first search, and Random). CoMPPI consistently outperforms the other methods across all cases, demonstrating improvements ranging from 3.81% to 32.40% in Micro-F1. The subsequent ablation experiment confirms the efficacy of employing the co-training framework for multi-label PPI prediction, indicating promising avenues for future advancements in this domain.
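The abstract above reports its gains in Micro-F1. As a reading aid, here is a minimal sketch of how micro-averaged F1 is usually computed for multi-label predictions: true positives, false positives and false negatives are pooled over every (sample, label) decision before the F1 formula is applied. The toy label matrices and the function name `micro_f1` are assumptions made for illustration and are not taken from the CoMPPI paper.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 pooled over all (sample, label) decisions."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1          # label present and predicted
            elif p and not t:
                fp += 1          # label predicted but absent
            elif t and not p:
                fn += 1          # label present but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data: three protein pairs, four hypothetical interaction types.
y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]]
y_pred = [[1, 0, 0, 0], [0, 1, 1, 0], [1, 1, 0, 1]]

score = micro_f1(y_true, y_pred)
print(round(score, 4))
```

Because the counts are pooled before averaging, frequent interaction types weigh more heavily than rare ones, which is worth keeping in mind when comparing Micro-F1 figures across differently partitioned datasets.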
Affiliation(s)
- Tao Tang: School of Modern Posts, Nanjing University of Posts and Telecommunications, 9 Wenyuan Rd, Nanjing, 210023, Jiangsu, China
- Xiaocai Zhang: Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR), 1 Fusionopolis Way, Singapore, 138632, Singapore
- Weizhuo Li: School of Modern Posts, Nanjing University of Posts and Telecommunications, 9 Wenyuan Rd, Nanjing, 210023, Jiangsu, China
- Qing Wang: School of Management, Nanjing University of Posts and Telecommunications, 9 Wenyuan Rd, Nanjing, 210023, Jiangsu, China
- Yuansheng Liu: College of Computer Science and Electronic Engineering, Hunan University, 2 Lushan Rd, Changsha, 410086, Hunan, China; Key Laboratory of Intelligent Computing & Signal Processing of Ministry of Education, Anhui University, 111 Jiulong Road, Hefei, 230601, Anhui, China
- Xiaofeng Cao: School of Artificial Intelligence, Jilin University, 2699 Qianjin St, Jilin, 130012, Changchun, China

2
Jia P, Zhang F, Wu C, Li M. A comprehensive review of protein-centric predictors for biomolecular interactions: from proteins to nucleic acids and beyond. Brief Bioinform 2024; 25:bbae162. [PMID: 38739759 PMCID: PMC11089422 DOI: 10.1093/bib/bbae162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/01/2024] [Revised: 02/17/2024] [Accepted: 03/31/2024] [Indexed: 05/16/2024] Open
Abstract
Proteins interact with diverse ligands to perform a large number of biological functions, such as gene expression and signal transduction. Accurate identification of these protein-ligand interactions is crucial to the understanding of molecular mechanisms and the development of new drugs. However, traditional biological experiments are time-consuming and expensive. With the development of high-throughput technologies, an increasing amount of protein data is available. In the past decades, many computational methods have been developed to predict protein-ligand interactions. Here, we review a comprehensive set of over 160 protein-ligand interaction predictors, which cover protein-protein, protein-nucleic acid, protein-peptide and protein-other ligands (nucleotide, heme, ion) interactions. We have carried out a comprehensive analysis of the above four types of predictors from several significant perspectives, including their inputs, feature profiles, models, availability, etc. The current methods primarily rely on protein sequences, especially utilizing evolutionary information. The significant improvement in predictions is attributed to deep learning methods. Additionally, sequence-based pretrained models and structure-based approaches are emerging as new trends.
Affiliation(s)
- Pengzhen Jia: School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
- Fuhao Zhang: School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China; College of Information Engineering, Northwest A&F University, No. 3 Taicheng Road, Yangling, Shaanxi 712100, China
- Chaojin Wu: School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
- Min Li: School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China

3
Liu B, Yang Z, Liu Q, Zhang Y, Ding H, Lai H, Li Q. Computational prediction of allergenic proteins based on multi-feature fusion. Front Genet 2023; 14:1294159. [PMID: 37928245 PMCID: PMC10622758 DOI: 10.3389/fgene.2023.1294159] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2023] [Accepted: 10/11/2023] [Indexed: 11/07/2023] Open
Abstract
Allergy is an autoimmune disorder described as an undesirable response of the immune system to typically innocuous substances in the environment. Studies have shown that the ability of proteins to trigger allergic reactions in susceptible individuals can be evaluated by bioinformatics tools. However, developing computational methods to accurately identify new allergenic proteins remains a vital challenge. This work aims to propose a machine learning model based on multi-feature fusion for predicting allergenic proteins efficiently. Firstly, we prepared a benchmark dataset of allergenic and non-allergenic protein sequences and pretested on it with a machine-learning platform. Then, three preferable feature extraction methods, including amino acid composition (AAC), dipeptide composition (DPC) and composition of k-spaced amino acid pairs (CKSAAP), were chosen to extract protein sequence features. Subsequently, these features were fused and optimized by Pearson correlation coefficient (PCC) and principal component analysis (PCA). Finally, the most representative features were picked out to build the optimal predictor based on the random forest (RF) algorithm. Performance evaluation results via 5-fold cross-validation showed that the final model, called iAller (https://github.com/laihongyan/iAller), could precisely distinguish allergenic proteins from non-allergenic proteins. The prediction accuracy and AUC value for the validation dataset were 91.4% and 0.97, respectively. This model will provide guidance for users to identify more allergenic proteins.
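The three sequence encodings named in this abstract (AAC, DPC and CKSAAP) can be sketched in a few lines. The code below is a hedged illustration of the commonly used definitions and not the iAller implementation: the function names, the `k_max=2` default and the choice to ignore non-standard residues in pair counts are assumptions made here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def aac(seq):
    """Amino acid composition: fraction of each standard residue."""
    n = len(seq)
    return [seq.count(a) / n if n else 0.0 for a in AMINO_ACIDS]


def spaced_pair_composition(seq, k):
    """Fraction of each residue pair separated by exactly k positions."""
    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    for i in range(len(seq) - k - 1):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    return [counts[p] / total if total else 0.0 for p in PAIRS]


def dpc(seq):
    """Dipeptide composition: adjacent pairs, i.e. a gap of k = 0."""
    return spaced_pair_composition(seq, 0)


def cksaap(seq, k_max=2):
    """Concatenate spaced-pair compositions for gaps k = 0 .. k_max."""
    features = []
    for k in range(k_max + 1):
        features.extend(spaced_pair_composition(seq, k))
    return features


def fused_features(seq, k_max=2):
    """Simple fusion by concatenation: AAC + DPC + CKSAAP."""
    return aac(seq) + dpc(seq) + cksaap(seq, k_max)


print(len(fused_features("ACDACD")))  # 20 + 400 + 3 * 400 = 1620
```

In this sketch the k = 0 block of CKSAAP repeats DPC exactly, which shows the kind of redundancy that a correlation-based filter or PCA step, as mentioned in the abstract, would be expected to remove before a classifier is trained.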
Affiliation(s)
- Bin Liu: Department of Anesthesiology, The Fourth People’s Hospital of Sichuan Province, Chengdu, Sichuan, China
- Ziman Yang: School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Qing Liu: Department of Pain, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou, Sichuan, China
- Ying Zhang: Department of Anesthesiology, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou, Sichuan, China
- Hui Ding: School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Hongyan Lai: Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, China
- Qun Li: Department of Pain, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou, Sichuan, China; Research Center of Integrated Traditional Chinese and Western Medicine, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou, Sichuan, China

4
Liu Y, Wang S, Li X, Liu Y, Zhu X. NeuroPpred-SVM: A New Model for Predicting Neuropeptides Based on Embeddings of BERT. J Proteome Res 2023; 22:718-728. [PMID: 36749151 DOI: 10.1021/acs.jproteome.2c00363] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 02/08/2023]
Abstract
Neuropeptides play pivotal roles in different physiological processes and are related to different kinds of diseases. Identification of neuropeptides is of great benefit for studying the mechanism of these physiological processes and the treatment of neurological disorders. Several state-of-the-art neuropeptide predictors have been developed by using a two-layer stacking ensemble algorithm. Although the two-layer stacking ensemble algorithm can improve the feature representability, these models are complex, which are not as efficient as the models based on one classifier. In this study, we proposed a new model, NeuroPpred-SVM, to predict neuropeptides based on the embeddings of Bidirectional Encoder Representations from Transformers and other sequential features by using a support vector machine (SVM). The experimental results indicate that our model achieved a cross-validation area under the receiver operating characteristic (AUROC) curve of 0.969 on the training data set and an AUROC of 0.966 on the independent test set. By comparing our model with the other four state-of-the-art models including NeuroPIpred, PredNeuroP, NeuroPpred-Fuse, and NeuroPpred-FRL on the independent test set, our model achieved the highest AUROC, Matthews correlation coefficient, accuracy, and specificity, which indicate that our model outperforms the existing models. We believed that NeuroPpred-SVM could be a useful tool for identifying neuropeptides with high accuracy and low cost. The data sets and Python code are available at https://github.com/liuyf-a/NeuroPpred-SVM.
Affiliation(s)
- Yufeng Liu: School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China
- Shuyu Wang: School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China
- Xiang Li: School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China
- Yinbo Liu: School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China
- Xiaolei Zhu: School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China

5
Garcia-Moreno FM, Gutiérrez-Naranjo MA. ALLERDET: A novel web app for prediction of protein allergenicity. J Biomed Inform 2022; 135:104217. [DOI: 10.1016/j.jbi.2022.104217] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2022] [Revised: 09/21/2022] [Accepted: 09/30/2022] [Indexed: 10/31/2022]

6
Benedé S, Lozano-Ojalvo D, Cristobal S, Costa J, D'Auria E, Velickovic TC, Garrido-Arandia M, Karakaya S, Mafra I, Mazzucchelli G, Picariello G, Romero-Sahagun A, Villa C, Roncada P, Molina E. New applications of advanced instrumental techniques for the characterization of food allergenic proteins. Crit Rev Food Sci Nutr 2021; 62:8686-8702. [PMID: 34060381 DOI: 10.1080/10408398.2021.1931806] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 10/21/2022]
Abstract
Current approaches based on electrophoretic, chromatographic or immunochemical principles have allowed characterizing multiple allergens, mapping their epitopes, studying their mechanisms of action, developing detection and diagnostic methods and therapeutic strategies for the food and pharmaceutical industry. However, some of the common structural features related to the allergenic potential of food proteins remain unknown, or the pathological mechanism of food allergy is not yet fully understood. In addition, it is also necessary to evaluate new allergens from novel protein sources that may pose a new risk for consumers. Technological development has allowed the expansion of advanced technologies for which their whole potential has not been entirely exploited and could provide novel contributions to still unexplored molecular traits underlying both the structure of food allergens and the mechanisms through which they sensitize or elicit adverse responses in human subjects, as well as improving analytical techniques for their detection. This review presents cutting-edge instrumental techniques recently applied when studying structural and functional aspects of proteins, mechanism of action and interaction between biomolecules. We also exemplify their role in the food allergy research and discuss their new possible applications in several areas of the food allergy field.
Affiliation(s)
- Sara Benedé: Instituto de Investigación en Ciencias de la Alimentación (CIAL, CSIC-UAM), Madrid, Spain
- Daniel Lozano-Ojalvo: Precision Immunology Institute, Icahn School of Medicine at Mount Sinai, Jaffe Food Allergy Institute, New York, NY, USA
- Susana Cristobal: Department of Biomedical and Clinical Sciences, Cell Biology, Faculty of Medicine, Linköping University, Linköping, Sweden; IKERBASQUE, Basque Foundation for Science, Department of Physiology, Faculty of Medicine and Nursing, University of the Basque Country UPV/EHU, Leioa, Spain
- Joana Costa: REQUIMTE-LAQV, Faculdade de Farmácia, Universidade do Porto, Porto, Portugal
- Enza D'Auria: Clinica Pediatrica, Ospedale dei Bambini Vittore Buzzi, Università degli Studi, Milano, Italy
- Tanja Cirkovic Velickovic: Faculty of Chemistry, University of Belgrade, Belgrade, Serbia; Ghent University Global Campus, Incheon, South Korea; Faculty of Bioscience Engineering, Ghent University, Ghent, Belgium; Serbian Academy of Sciences and Arts, Belgrade, Serbia
- María Garrido-Arandia: Centro de Biotecnología y Genómica de Plantas (UPM-INIA), Universidad Politécnica de Madrid, Pozuelo de Alarcón, Madrid, Spain
- Sibel Karakaya: Department of Food Engineering, Ege University, Izmir, Turkey
- Isabel Mafra: REQUIMTE-LAQV, Faculdade de Farmácia, Universidade do Porto, Porto, Portugal
- Gabriel Mazzucchelli: Mass Spectrometry Laboratory, MolSys Research Unit, University of Liege, Liege, Belgium
- Gianluca Picariello: Institute of Food Sciences, National Research Council (CNR), Avellino, Italy
- Alejandro Romero-Sahagun: Centro de Biotecnología y Genómica de Plantas (UPM-INIA), Universidad Politécnica de Madrid, Pozuelo de Alarcón, Madrid, Spain
- Caterina Villa: REQUIMTE-LAQV, Faculdade de Farmácia, Universidade do Porto, Porto, Portugal
- Paola Roncada: Department of Health Sciences, University Magna Graecia, Catanzaro, Italy
- Elena Molina: Instituto de Investigación en Ciencias de la Alimentación (CIAL, CSIC-UAM), Madrid, Spain

7
Homologies between SARS-CoV-2 and allergen proteins may direct T cell-mediated heterologous immune responses. Sci Rep 2021; 11:4792. [PMID: 33637823 PMCID: PMC7910599 DOI: 10.1038/s41598-021-84320-8] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 09/23/2020] [Accepted: 02/15/2021] [Indexed: 01/30/2023] Open
Abstract
The outbreak of the new severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) is a public health emergency. Asthma does not represent a risk factor for COVID-19 in several published cohorts. We hypothesized that the SARS-CoV-2 proteome contains T cell epitopes, which are potentially cross-reactive to allergen epitopes. We aimed at identifying homologous peptide sequences by means of two distinct complementary bioinformatics approaches. Pipeline 1 included prediction of MHC Class I and Class II epitopes contained in the SARS-CoV-2 proteome and allergens along with alignment and elaborate ranking approaches. Pipeline 2 involved alignment of SARS-CoV-2 overlapping peptides with known allergen-derived T cell epitopes. Our results indicate a large number of MHC Class I epitope pairs including known as well as de novo predicted allergen T cell epitopes with high probability for cross-reactivity. Allergen sources, such as Aspergillus fumigatus, Phleum pratense and Dermatophagoides species are of particular interest due to their association with multiple cross-reactive candidate peptides, independently of the applied bioinformatic approach. In contrast, peptides derived from food allergens, as well as MHC class II epitopes did not achieve high in silico ranking and were therefore not further investigated. Our findings warrant further experimental confirmation along with examination of the functional importance of such cross-reactive responses.

8
Khan YD, Alzahrani E, Alghamdi W, Ullah MZ. Sequence-based Identification of Allergen Proteins Developed by Integration of PseAAC and Statistical Moments via 5-Step Rule. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200424085947] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Indexed: 01/24/2023]
Abstract
Background: Allergens are antigens that can stimulate an atopic type I human hypersensitivity reaction by an immunoglobulin E (IgE) reaction. Some proteins are naturally more allergenic than others. The challenge for toxicologists is to identify properties that allow proteins to cause allergic sensitization and allergic diseases. The identification of allergen proteins is a very critical and pivotal task. The experimental identification of protein functions is a hectic, laborious and costly task; therefore, computer scientists have proposed various methods in the field of computational biology and bioinformatics using various data science approaches. Objectives: Herein, we report a novel predictor for the identification of allergen proteins. Methods: For feature extraction, statistical moments and various position-based features have been incorporated into Chou’s pseudo amino acid composition (PseAAC), and are used for training of a neural network. Results: The predictor is validated through 10-fold cross-validation and Jackknife testing, which gave accuracies of 99.43% and 99.87%, respectively. Conclusions: Thus, the proposed predictor can help in predicting allergen proteins in an efficient and accurate way and can provide baseline data for the discovery of new drugs and biomarkers.
Affiliation(s)
- Yaser Daanial Khan: Department of Computer Science, School of Systems and Technology, University of Management and Technology, C II Johar Town, Lahore 54770, Pakistan
- Ebraheem Alzahrani: Department of Mathematics, Faculty of Science, King Abdulaziz University, P.O. Box 80203, Jeddah 21589, Saudi Arabia
- Wajdi Alghamdi: Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, P.O. Box 80221, Jeddah, Saudi Arabia
- Malik Zaka Ullah: Department of Mathematics, Faculty of Science, King Abdulaziz University, P.O. Box 80203, Jeddah 21589, Saudi Arabia

9
Behbahani M, Rabiei P, Mohabatkar H. A Comparative Analysis of Allergen Proteins between Plants and Animals Using Several Computational Tools and Chou's PseAAC Concept. Int Arch Allergy Immunol 2020; 181:813-821. [PMID: 32906141 DOI: 10.1159/000509084] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 03/14/2020] [Accepted: 05/29/2020] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND A large number of allergens are derived from plant and animal proteins. A major challenge for researchers is to study the possible allergenic properties of proteins. The aim of this study was in silico analysis and comparison of several physiochemical and structural features of plant- and animal-derived allergen proteins, as well as classifying these proteins based on Chou's pseudo-amino acid composition (PseAAC) concept combined with bioinformatics algorithms. METHODS The physiochemical properties and secondary structure of plant and animal allergens were studied. The classification of the sequences was done using the PseAAC concept incorporated with the deep learning algorithm. Conserved motifs of plant and animal proteins were discovered using the MEME tool. B-cell and T-cell epitopes of the proteins were predicted in conserved motifs. Allergenicity and amino acid composition of epitopes were also analyzed via bioinformatics servers. RESULTS In comparison of physiochemical features of animal and plant allergens, extinction coefficient was different significantly. Secondary structure prediction showed more random coiled structure in plant allergen proteins compared with animal proteins. Classification of proteins based on PseAAC achieved 88.24% accuracy. The amino acid composition study of predicted B- and T-cell epitopes revealed more aliphatic index in plant-derived epitopes. CONCLUSIONS The results indicated that bioinformatics-based studies could be useful in comparing plant and animal allergens.
Affiliation(s)
- Mandana Behbahani: Department of Biotechnology, Faculty of Biological Science and Technology, University of Isfahan, Isfahan, Iran
- Parisa Rabiei: Department of Biotechnology, Faculty of Biological Science and Technology, University of Isfahan, Isfahan, Iran
- Hassan Mohabatkar: Department of Biotechnology, Faculty of Biological Science and Technology, University of Isfahan, Isfahan, Iran

10
Bekhouche S, Mohamed Ben Ali Y. Feature Selection in GPCR Classification Using BAT Algorithm. International Journal of Computational Intelligence and Applications 2020. [DOI: 10.1142/s1469026820500066] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
G-protein-coupled receptors (GPCRs) are a large family of membrane proteins, and some of them still remain orphans. Predicting GPCR functions is a challenging task that depends closely on their classification, which requires a digital representation of each protein chain as an attribute vector. A major problem of GPCR databases is their large number of features, which can produce a combinatorial explosion and increase the complexity of classification algorithms. Feature selection techniques deal with this problem by reducing the dimension of the feature space while keeping the most relevant features. In this paper, we propose to use the BAT algorithm to extract the pertinent features and to improve the classification results. We compared the results obtained by our system with two other bio-inspired algorithms, Evolutionary Algorithm and PSO search. The quality measures used for comparison are Error Rate, Accuracy, MCC and [Formula: see text]-measure. Experimental results indicate that our system is more efficient.
Affiliation(s)
- Safia Bekhouche: Department of Computer Science, Badji Mokhtar University, Annaba 23000, Algeria
- Yamina Mohamed Ben Ali: Laboratory of Research in Informatics (LRI), Badji Mokhtar University, Annaba 23000, Algeria

11
Kong M, Zhang Y, Xu D, Chen W, Dehmer M. FCTP-WSRC: Protein-Protein Interactions Prediction via Weighted Sparse Representation Based Classification. Front Genet 2020; 11:18. [PMID: 32117437 PMCID: PMC7010952 DOI: 10.3389/fgene.2020.00018] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 07/09/2019] [Accepted: 01/07/2020] [Indexed: 12/21/2022] Open
Abstract
The task of predicting protein–protein interactions (PPIs) has been essential in the context of understanding biological processes. This paper proposes a novel computational model namely FCTP-WSRC to predict PPIs effectively. Initially, combinations of the F-vector, composition (C) and transition (T) are used to map each protein sequence onto numeric feature vectors. Afterwards, an effective feature extraction method PCA (principal component analysis) is employed to reconstruct the most discriminative feature subspaces, which is subsequently used as input in weighted sparse representation based classification (WSRC) for prediction. The FCTP-WSRC model achieves accuracies of 96.67%, 99.82%, and 98.09% for H. pylori, Human and Yeast datasets respectively. Furthermore, the FCTP-WSRC model performs well when predicting three significant PPIs networks: the single-core network (CD9), the multiple-core network (Ras-Raf-Mek-Erk-Elk-Srf pathway), and the cross-connection network (Wnt-related Network). Consequently, the promising results show that the proposed method can be a powerful tool for PPIs prediction with excellent performance and less time.
Affiliation(s)
- Meng Kong: School of Mathematics and Statistics, Shandong University at Weihai, Weihai, China
- Yusen Zhang: School of Mathematics and Statistics, Shandong University at Weihai, Weihai, China
- Da Xu: School of Mathematics and Statistics, Shandong University at Weihai, Weihai, China
- Wei Chen: School of Mathematics and Statistics, Shandong University at Weihai, Weihai, China
- Matthias Dehmer: University of Applied Sciences Upper Austria, School of Management, Steyr, Austria; College of Artificial Intelligence, Nankai University, Tianjin, China; Department of Biomedical Computer Science and Mechatronics, UMIT Hall, Tyrol, Austria

12
Yang X, Yang S, Li Q, Wuchty S, Zhang Z. Prediction of human-virus protein-protein interactions through a sequence embedding-based machine learning method. Comput Struct Biotechnol J 2019; 18:153-161. [PMID: 31969974 PMCID: PMC6961065 DOI: 10.1016/j.csbj.2019.12.005] [Citation(s) in RCA: 69] [Impact Index Per Article: 13.8] [Received: 10/10/2019] [Revised: 11/29/2019] [Accepted: 12/10/2019] [Indexed: 12/11/2022] Open
Abstract
The identification of human-virus protein-protein interactions (PPIs) is an essential and challenging research topic, potentially providing a mechanistic understanding of viral infection. Given that the experimental determination of human-virus PPIs is time-consuming and labor-intensive, computational methods are playing an important role in providing testable hypotheses, complementing the determination of large-scale interactome between species. In this work, we applied an unsupervised sequence embedding technique (doc2vec) to represent protein sequences as rich feature vectors of low dimensionality. Training a Random Forest (RF) classifier through a training dataset that covers known PPIs between human and all viruses, we obtained excellent predictive accuracy outperforming various combinations of machine learning algorithms and commonly-used sequence encoding schemes. In a rigorous comparison with three existing human-virus PPI prediction methods, our proposed computational framework further provided very competitive and promising performance, suggesting that the doc2vec encoding scheme effectively captures context information of protein sequences, pertaining to corresponding protein-protein interactions. Our approach is freely accessible through our web server as part of our host-pathogen PPI prediction platform (http://zzdlab.com/InterSPPI/). Taken together, we hope the current work not only contributes a useful predictor to accelerate the exploration of human-virus PPIs, but also provides some meaningful insights into human-virus relationships.
Key Words
- AC, Auto Covariance
- ACC, Accuracy
- AUC, area under the ROC curve
- AUPRC, area under the PR curve
- Adaboost, Adaptive Boosting
- CT, Conjoint Triad
- Doc2vec
- Embedding
- Human-virus interaction
- LD, Local Descriptor
- MCC, Matthews correlation coefficient
- ML, machine learning
- MLP, Multiple Layer Perceptron
- MS, mass spectroscopy
- Machine learning
- PPIs, protein-protein interactions
- PR, Precision-Recall
- Prediction
- Protein-protein interaction
- RBF, radial basis function
- RF, Random Forest
- ROC, Receiver Operating Characteristic
- SGD, stochastic gradient descent
- SVM, Support Vector Machine
- Y2H, yeast two-hybrid
Affiliation(s)
- Xiaodi Yang: State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing 100193, China
- Shiping Yang: State Key Laboratory of Plant Physiology and Biochemistry, College of Biological Sciences, China Agricultural University, Beijing 100193, China
- Qinmengge Li: National Demonstration Center for Experimental Biological Sciences Education, College of Biological Sciences, China Agricultural University, Beijing 100193, China
- Stefan Wuchty: Dept. of Computer Science, University of Miami, Miami, FL 33146, USA; Dept. of Biology, University of Miami, Miami, FL 33146, USA; Center of Computational Science, University of Miami, Miami, FL 33146, USA; Sylvester Comprehensive Cancer Center, University of Miami, Miami, FL 33136, USA
- Ziding Zhang: State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing 100193, China

13
Bahrami AA, Payandeh Z, Khalili S, Zakeri A, Bandehpour M. Immunoinformatics: In Silico Approaches and Computational Design of a Multi-epitope, Immunogenic Protein. Int Rev Immunol 2019; 38:307-322. [PMID: 31478759 DOI: 10.1080/08830185.2019.1657426] [Citation(s) in RCA: 60] [Impact Index Per Article: 12.0] [Indexed: 10/26/2022]
Abstract
Immunoinformatics is a new and critical field whose tools and databases guide experimental selection, facilitate analysis of the large amount of immunologic data obtained from experimental research, and help in designing and introducing new hypotheses. Given these features, immunoinformatics appears to be a route for developing and advancing immunological research. Bioinformatics methods and applications are successfully employed in vaccine informatics to assist different sites of the preclinical, clinical, and post-licensure vaccine enterprises. In addition, progress in molecular biology and immunology has made epitope vaccines the focus of research on molecular vaccines. Moreover, reverse vaccinology could improve vaccine production and vaccination protocols by in silico prediction of protein-vaccine candidates from genome sequences. B- and T-cell immune epitopes could be predicted by immunoinformatics algorithms and computational methods to improve vaccine design, protective immunity analysis, assessment of vaccine safety and efficacy, and immunization modeling. This review aims to discuss the power of computational approaches in vaccine design and their relevance to the development of effective vaccines. Furthermore, the various divisions of this field and the available tools in each are introduced and reviewed.
Affiliation(s)
- Armina Alagheband Bahrami
  - Department of Biotechnology, School of Advanced Technologies in Medicine, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Zahra Payandeh
  - Immunology Research Center, Tabriz University of Medical Sciences, Tabriz, Iran
- Saeed Khalili
  - Department of Biology Sciences, Shahid Rajaee Teacher Training University, Tehran, Iran
- Alireza Zakeri
  - Department of Biology Sciences, Shahid Rajaee Teacher Training University, Tehran, Iran
- Mojgan Bandehpour
  - Department of Biotechnology, School of Advanced Technologies in Medicine, Shahid Beheshti University of Medical Sciences, Tehran, Iran
|
14
|
Xi B, Tao J, Liu X, Xu X, He P, Dai Q. RaaMLab: A MATLAB toolbox that generates amino acid groups and reduced amino acid modes. Biosystems 2019; 180:38-45. [PMID: 30904554 DOI: 10.1016/j.biosystems.2019.03.002] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 07/25/2018] [Revised: 12/25/2018] [Accepted: 03/06/2019] [Indexed: 01/31/2023]
Abstract
Amino acid (AA) classification and its different biophysical and chemical characteristics have been widely applied to analyze and predict the structural, functional, expression and interaction profiles of proteins and peptides. We present RaaMLab, a free and open-source MATLAB toolbox, to facilitate studies on proteins and peptides, to generate AA groups and to extract the structural and physicochemical features of reduced AAs (RedAA). This toolbox offers 4 kinds of databases, including the physicochemical properties of AAs and their groupings, 49 AA classification methods and 5 types of biophysicochemical features of RedAAs. These factors can be easily computed based on a user-defined alphabet size and the AA properties of AA groupings. RaaMLab is open source and freely available at https://github.com/bioinfo0706/RaaMLab. This website also contains a tutorial, extensive documentation and examples.
Affiliation(s)
- Baohang Xi
  - College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
- Jin Tao
  - College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
- Xiaoqing Liu
  - College of Sciences, Hangzhou Dianzi University, Hangzhou 310018, People's Republic of China
- Xinnan Xu
  - College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
- Pingan He
  - College of Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
- Qi Dai
  - College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
|
15
|
Zhang L, Yu G, Xia D, Wang J. Protein–protein interactions prediction based on ensemble deep neural networks. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2018.02.097] [Citation(s) in RCA: 74] [Impact Index Per Article: 14.8] [Indexed: 01/01/2023]
|
16
|
Chrysostomou C, Seker H. Prediction of protein allergenicity based on signal-processing bioinformatics approach. Annual International Conference of the IEEE Engineering in Medicine and Biology Society 2016; 2014:808-11. [PMID: 25570082 DOI: 10.1109/embc.2014.6943714] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Indexed: 11/08/2022]
Abstract
Current bioinformatics tools achieve high accuracy in classifying allergenic protein sequences with high homology but generally perform poorly on low-homology protein sequences. Although some homologous regions explain Immunoglobulin E (IgE) cross-reactivity in groups of allergens, no universal molecular structure could be associated with allergenicity. In addition, studies have shown that cross-reactivity is not directly linked to the homology between protein sequences. Therefore, a new homology-independent method needs to be developed to determine whether a protein is an allergen. The aim of this study is to differentiate sets of allergenic and non-allergenic proteins using a signal-processing based bioinformatics approach. In this paper, a new method is proposed for the characterisation and classification of allergenic protein sequences. In this method, a hydrophobicity amino acid index is used to encode proteins as numerical sequences, and the Discrete Fourier Transform is used to extract features for each protein. Finally, a classifier is constructed based on Support Vector Machines. To demonstrate the applicability of the proposed method, 857 allergen and 1000 non-allergen proteins were collected from the UniProt online database. The proposed method yielded: MCC: 0.752 ± 0.007, Specificity: 0.912 ± 0.005, Sensitivity: 0.835 ± 0.008 and Total Accuracy: 87.65% ± 0.004.
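The encoding and feature-extraction steps described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which hydrophobicity index was used, so the Kyte-Doolittle scale is assumed here, and the band-averaging in the hypothetical `spectral_features` helper is one possible way to obtain a fixed-length vector, not necessarily the authors' procedure. The SVM classification step is omitted.

```python
import cmath

# Kyte-Doolittle hydropathy values. ASSUMPTION: the abstract does not name
# the hydrophobicity index, so this scale is used purely for illustration.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence):
    """Map each residue of a protein sequence to its hydrophobicity value."""
    return [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]


def dft_magnitudes(signal):
    """Return |X[k]| for k = 0..N-1 using the direct O(N^2) DFT definition."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


def spectral_features(sequence, n_bins=8):
    """Hypothetical helper: mean spectral magnitude in n_bins frequency bands.

    Only the first half of the spectrum is used, because the spectrum of a
    real-valued signal is mirrored in the second half.
    """
    signal = encode(sequence)
    mean = sum(signal) / len(signal)
    centered = [x - mean for x in signal]  # remove the DC component
    half = dft_magnitudes(centered)[: len(centered) // 2 + 1]
    features = []
    for b in range(n_bins):
        lo = b * len(half) // n_bins
        hi = max(lo + 1, (b + 1) * len(half) // n_bins)
        band = half[lo:hi]
        features.append(sum(band) / len(band))
    return features
```

Calling `spectral_features("MKTAYIAKQRQISFVKSHFSRQ")` returns eight non-negative values regardless of sequence length, which is what lets sequences of different lengths share one classifier input format.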
|
17
|
Li YH, Xu JY, Tao L, Li XF, Li S, Zeng X, Chen SY, Zhang P, Qin C, Zhang C, Chen Z, Zhu F, Chen YZ. SVM-Prot 2016: A Web-Server for Machine Learning Prediction of Protein Functional Families from Sequence Irrespective of Similarity. PLoS One 2016; 11:e0155290. [PMID: 27525735 PMCID: PMC4985167 DOI: 10.1371/journal.pone.0155290] [Citation(s) in RCA: 85] [Impact Index Per Article: 10.6] [Received: 03/21/2016] [Accepted: 04/27/2016] [Indexed: 12/20/2022]
Abstract
Knowledge of protein function is important for biological, medical and therapeutic studies, but the function of many proteins is still unknown. There is a need for improved functional prediction methods. Our SVM-Prot web-server employed a machine learning method for predicting protein functional families from protein sequences irrespective of similarity, which complemented similarity-based and other methods in predicting diverse classes of proteins, including distantly-related proteins and homologous proteins of different functions. Since its publication in 2003, we have made major improvements to SVM-Prot with (1) expanded coverage from 54 to 192 functional families, (2) more diverse protein descriptors for protein representation, (3) improved predictive performance due to the use of more enriched training datasets and a greater variety of protein descriptors, (4) a newly integrated BLAST analysis option for assessing proteins in the SVM-Prot predicted functional families that are similar in sequence to a query protein, and (5) a newly added batch submission option for supporting the classification of multiple proteins. Moreover, 2 more machine learning approaches, K nearest neighbor and probabilistic neural networks, were added to facilitate collective assessment of protein functions by multiple methods. SVM-Prot can be accessed at http://bidd2.nus.edu.sg/cgi-bin/svmprot/svmprot.cgi.
Affiliation(s)
- Ying Hong Li
  - Innovative Drug Research and Bioinformatics Group, Innovative Drug Research Centre and School of Pharmaceutical Sciences, Chongqing University, Chongqing, 401331, China
- Jing Yu Xu
  - Innovative Drug Research and Bioinformatics Group, Innovative Drug Research Centre and School of Pharmaceutical Sciences, Chongqing University, Chongqing, 401331, China
  - School of Mathematics and Statistics, Beijing Institute of Technology, Beijing, China
- Lin Tao
  - Innovative Drug Research and Bioinformatics Group, Innovative Drug Research Centre and School of Pharmaceutical Sciences, Chongqing University, Chongqing, 401331, China
  - Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore
- Xiao Feng Li
  - Innovative Drug Research and Bioinformatics Group, Innovative Drug Research Centre and School of Pharmaceutical Sciences, Chongqing University, Chongqing, 401331, China
- Shuang Li
  - Innovative Drug Research and Bioinformatics Group, Innovative Drug Research Centre and School of Pharmaceutical Sciences, Chongqing University, Chongqing, 401331, China
- Xian Zeng
  - Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore
- Shang Ying Chen
  - Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore
- Peng Zhang
  - Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore
- Chu Qin
  - Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore
- Cheng Zhang
  - Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore
- Zhe Chen
  - Zhejiang Key Laboratory of Gastro-intestinal Pathophysiology, Zhejiang Hospital of Traditional Chinese Medicine, Zhejiang Chinese Medical University, Hangzhou, P. R. China
- Feng Zhu
  - Innovative Drug Research and Bioinformatics Group, Innovative Drug Research Centre and School of Pharmaceutical Sciences, Chongqing University, Chongqing, 401331, China
- Yu Zong Chen
  - Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore
|
18
|
Saravanan V, Lakshmi PTV. Fuzzy Logic for Personalized Healthcare and Diagnostics: FuzzyApp - A Fuzzy Logic Based Allergen-Protein Predictor. OMICS: A Journal of Integrative Biology 2014; 18:570-81. [DOI: 10.1089/omi.2014.0021] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 02/06/2023]
Affiliation(s)
- Vijayakumar Saravanan
  - Centre for Bioinformatics, School of Life Sciences, Pondicherry University, Pondicherry, India
- PTV Lakshmi
  - Centre for Bioinformatics, School of Life Sciences, Pondicherry University, Pondicherry, India
|
19
|
Dimitrov I, Bangov I, Flower DR, Doytchinova I. AllerTOP v.2--a server for in silico prediction of allergens. J Mol Model 2014; 20:2278. [PMID: 24878803 DOI: 10.1007/s00894-014-2278-5] [Citation(s) in RCA: 628] [Impact Index Per Article: 62.8] [Received: 12/11/2013] [Accepted: 04/25/2014] [Indexed: 11/29/2022]
Abstract
Allergy is an overreaction by the immune system to a previously encountered, ordinarily harmless substance--typically proteins--resulting in skin rash, swelling of mucous membranes, sneezing or wheezing, or other abnormal conditions. The use of modified proteins is increasingly widespread: their presence in food, commercial products, such as washing powder, and medical therapeutics and diagnostics, makes predicting and identifying potential allergens a crucial societal issue. The prediction of allergens has been explored widely using bioinformatics, with many tools being developed in the last decade; many of these are freely available online. Here, we report a set of novel models for allergen prediction utilizing amino acid E-descriptors, auto- and cross-covariance transformation, and several machine learning methods for classification, including logistic regression (LR), decision tree (DT), naïve Bayes (NB), random forest (RF), multilayer perceptron (MLP) and k nearest neighbours (kNN). The best performing method was kNN with 85.3% accuracy at 5-fold cross-validation. The resulting model has been implemented in a revised version of the AllerTOP server (http://www.ddg-pharmfac.net/AllerTOP).
Affiliation(s)
- Ivan Dimitrov
  - Faculty of Pharmacy, Medical University of Sofia, 2 Dunav st., 1000, Sofia, Bulgaria
|
20
|
Dang HX, Lawrence CB. Allerdictor: fast allergen prediction using text classification techniques. Bioinformatics 2014; 30:1120-1128. [PMID: 24403538 DOI: 10.1093/bioinformatics/btu004] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.3] [Received: 10/14/2013] [Accepted: 12/30/2013] [Indexed: 11/14/2022]
Abstract
MOTIVATION Accurately identifying and eliminating allergens from biotechnology-derived products are important for human health. From a biomedical research perspective, it is also important to identify allergens in sequenced genomes. Many allergen prediction tools have been developed during the past years. Although these tools have achieved certain levels of specificity, when applied to large-scale allergen discovery (e.g. at a whole-genome scale), they still yield many false positives and thus low precision (even at low recall) due to the extreme skewness of the data (allergens are rare). Moreover, the most accurate tools are relatively slow because they use protein sequence alignment to build feature vectors for allergen classifiers. Additionally, only web server implementations of the current allergen prediction tools are publicly available and are without the capability of large batch submission. These weaknesses make large-scale allergen discovery ineffective and inefficient in the public domain. RESULTS We developed Allerdictor, a fast and accurate sequence-based allergen prediction tool that models protein sequences as text documents and uses support vector machine in text classification for allergen prediction. Test results on multiple highly skewed datasets demonstrated that Allerdictor predicted allergens with high precision over high recall at fast speed. For example, Allerdictor only took ∼6 min on a single core PC to scan a whole Swiss-Prot database of ∼540 000 sequences and identified <1% of them as allergens. AVAILABILITY AND IMPLEMENTATION Allerdictor is implemented in Python and available as standalone and web server versions at http://allerdictor.vbi.vt.edu CONTACT: lawrence@vbi.vt.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ha X Dang
  - Virginia Bioinformatics Institute and Department of Biological Sciences, Virginia Tech, Blacksburg, VA 24061, USA
- Christopher B Lawrence
  - Virginia Bioinformatics Institute and Department of Biological Sciences, Virginia Tech, Blacksburg, VA 24061, USA
|
21
|
Abstract
A large volume of data relevant to immunology research has accumulated due to the sequencing of the genomes of humans and other model organisms. At the same time, huge amounts of clinical and epidemiologic data are being deposited in the scientific literature and clinical records. This accumulated information is a goldmine for researchers looking for mechanisms of immune function and disease pathogenesis. The need to handle this rapidly growing immunological resource has given rise to the field known as immunoinformatics. Immunoinformatics, otherwise known as computational immunology, is the interface between computer science and experimental immunology. It represents the use of computational methods and resources for the understanding of immunological information. It not only helps in dealing with huge amounts of data but also plays a major role in defining new hypotheses related to immune responses. This chapter reviews classical immunology, different databases, and prediction tools. Further, it briefly describes applications of immunoinformatics in reverse vaccinology, immune system modeling, and cancer diagnosis and therapy. It also explores the idea of integrating immunoinformatics with systems biology for the development of personalized medicine. All these efforts save time and cost to a great extent.
Affiliation(s)
- Namrata Tomar
  - Machine Intelligence Unit, Indian Statistical Institute, 203 B.T. Road, Kolkata, 700108, India
|
22
|
PREAL: prediction of allergenic protein by maximum Relevance Minimum Redundancy (mRMR) feature selection. BMC Systems Biology 2013; 7 Suppl 5:S9. [PMID: 24565053 PMCID: PMC4029432 DOI: 10.1186/1752-0509-7-s5-s9] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Indexed: 01/07/2023]
Abstract
BACKGROUND Assessment of the potential allergenicity of proteins is necessary whenever transgenic proteins are introduced into the food chain. Bioinformatics approaches to allergen prediction have evolved appreciably in recent years, increasing in sophistication and performance. However, which features are critical for a protein's allergenicity has not yet been fully investigated. RESULTS We present a more comprehensive model in a 128-feature space for allergenic protein prediction, integrating various properties of proteins, such as biochemical and physicochemical properties, sequential features and subcellular locations. The overall accuracy in cross-validation reached 93.42% to 100% with our new method. The Maximum Relevance Minimum Redundancy (mRMR) method and the Incremental Feature Selection (IFS) procedure were applied to determine which features are essential for allergenicity. Performance comparisons showed the superiority of our method over widely used existing methods. More importantly, it was observed that subcellular location and amino acid composition features played major roles in determining the allergenicity of proteins, particularly the extracellular/cell surface and vacuole locations for wheat and soybean. To facilitate allergen prediction, we implemented our computational method in a web application, which is available at http://gmobl.sjtu.edu.cn/PREAL/index.php. CONCLUSIONS Our new approach could improve the accuracy of allergen prediction, and the findings may provide novel insights into the mechanism of allergies.
|
23
|
Dimitrov I, Naneva L, Doytchinova I, Bangov I. AllergenFP: allergenicity prediction by descriptor fingerprints. Bioinformatics 2013; 30:846-51. [PMID: 24167156 DOI: 10.1093/bioinformatics/btt619] [Citation(s) in RCA: 420] [Impact Index Per Article: 38.2] [Indexed: 11/12/2022]
Abstract
MOTIVATION Allergenicity, like antigenicity and immunogenicity, is a property encoded linearly and non-linearly, and therefore alignment-based approaches are not able to identify this property unambiguously. A novel alignment-free descriptor-based fingerprint approach is presented here and applied to identify allergens and non-allergens. The approach was implemented as a four-step algorithm. Initially, the protein sequences are described by amino acid principal properties such as hydrophobicity, size, relative abundance, and helix and β-strand forming propensities. Then, the generated strings of different length are converted into vectors of equal length by auto- and cross-covariance (ACC). The vectors are transformed into binary fingerprints and compared in terms of the Tanimoto coefficient. RESULTS The approach was applied to a set of 2427 known allergens and 2427 non-allergens and correctly identified 88% of them, with a Matthews correlation coefficient of 0.759. The descriptor fingerprint approach presented here is universal. It could be applied to any classification problem in computational biology. The set of E-descriptors is able to capture the main structural and physicochemical properties of the amino acids building the proteins. The ACC transformation overcomes the main problem in alignment-based comparative studies, which arises from the different lengths of the aligned protein sequences. The conversion of protein ACC values into binary descriptor fingerprints allows similarity search and classification. AVAILABILITY AND IMPLEMENTATION The algorithm described in the present study was implemented in a specially designed Web site, named AllergenFP (FP stands for FingerPrint). AllergenFP is written in Python, with a GUI in HTML. It is freely accessible at http://ddg-pharmfac.net/AllergenFP. CONTACT idoytchinova@pharmfac.net or ivanbangov@shu-bg.net.
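The ACC, fingerprint, and Tanimoto steps named in this abstract can be sketched as below. This is a rough illustration, not the AllergenFP implementation: the per-residue descriptor values are toy numbers rather than real E-descriptors, the ACC normalisation by `n - lag` is the commonly used form and is assumed here, and thresholding at zero is just one hypothetical way to binarise the vector, since the abstract does not say how the fingerprints are derived.

```python
def auto_cross_covariance(descriptors, max_lag):
    """ACC-transform a variable-length sequence into a fixed-length vector.

    descriptors: one descriptor vector per residue, all of the same width.
    For every ordered descriptor pair (j, k) and every lag 1..max_lag, the
    output holds the mean of E_j[i] * E_k[i + lag] over the sequence, so the
    result always has width * width * max_lag entries.
    """
    n = len(descriptors)
    width = len(descriptors[0])
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    out = []
    for j in range(width):
        for k in range(width):
            for lag in range(1, max_lag + 1):
                total = sum(descriptors[i][j] * descriptors[i + lag][k]
                            for i in range(n - lag))
                out.append(total / (n - lag))
    return out


def to_fingerprint(vector, threshold=0.0):
    """ASSUMED binarisation: 1 where the ACC value exceeds the threshold."""
    return [1 if v > threshold else 0 for v in vector]


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two equal-length binary fingerprints."""
    both = sum(a & b for a, b in zip(fp_a, fp_b))
    either = sum(a | b for a, b in zip(fp_a, fp_b))
    # Two all-zero fingerprints share no set bits; 0.0 is returned by choice.
    return both / either if either else 0.0


# Toy two-descriptor profiles for sequences of different lengths.
seq_a = [[0.5, -1.0], [1.5, 0.2], [-0.7, 0.9], [0.3, -0.4]]
seq_b = [[0.4, -0.8], [1.1, 0.1], [-0.9, 1.2], [0.2, -0.3], [0.6, 0.5]]
fp_a = to_fingerprint(auto_cross_covariance(seq_a, max_lag=2))
fp_b = to_fingerprint(auto_cross_covariance(seq_b, max_lag=2))
similarity = tanimoto(fp_a, fp_b)
```

Both fingerprints have eight bits even though the input sequences have four and five residues, which is the length-equalising property the abstract attributes to the ACC transformation.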
Affiliation(s)
- Ivan Dimitrov
  - Medical University of Sofia, Faculty of Pharmacy, 2 Dunav st., 1000 Sofia and Konstantin Preslavski Shumen University, Faculty of Natural Sciences, 115 Universitetska st., 9712 Shumen, Bulgaria
|
24
|
Chen X, Xu H. JFeature. Bioinformatics 2013. [DOI: 10.4018/978-1-4666-3604-0.ch060] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
Prediction of various functional properties of proteins has long been a central theme of bioinformatics in the post-genomic era. Statistical learning, in addition to analysis based on sequence similarity, has proven successful at detecting complex sequence-function associations in many applications. JFeature is an integrated Java tool that facilitates extraction of global sequence features and preparation of example sets in statistical learning studies of sequence-function relationships. With a user-friendly graphical interface, it computes the composition, distribution, transition and auto-correlation features from a sequence. It also helps to assemble a negative example set based on the most-dissimilar principle. The Java package and supplementary documentation are available at http://www.cls.zju.edu.cn/rlibs/software/jfeature.html.
|
25
|
Wang J, Yu Y, Zhao Y, Zhang D, Li J. Evaluation and integration of existing methods for computational prediction of allergens. BMC Bioinformatics 2013; 14 Suppl 4:S1. [PMID: 23514097 PMCID: PMC3599076 DOI: 10.1186/1471-2105-14-s4-s1] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.1] [Indexed: 11/26/2022]
Abstract
Background Allergy involves a series of complex reactions and factors that contribute to the development of the disease and the triggering of symptoms, including rhinitis, asthma, atopic eczema, skin sensitivity, and even acute and fatal anaphylactic shock. Prediction and evaluation of potential allergenicity is important for the safety evaluation of foods and other environmental factors. Although several computational approaches for assessing the potential allergenicity of proteins have been developed, their performance and relative merits and shortcomings have not been compared systematically. Results To evaluate and improve the existing methods for allergen prediction, we collected an up-to-date definitive dataset consisting of 989 known allergens and a large number of putative non-allergens. The three most widely used computational approaches to allergen prediction, namely sequence-, motif- and SVM-based (Support Vector Machine) methods, were systematically compared using the defined parameters, and we found that the SVM-based method outperformed the other two methods with higher accuracy and specificity. The sequence-based method with the criteria defined by FAO/WHO (FAO: Food and Agriculture Organization of the United Nations; WHO: World Health Organization) has a higher sensitivity of over 98% but low specificity. The advantage of the motif-based method is its ability to visualize the key motif within the allergen. Notably, the performance of the sequence-based method defined by FAO/WHO and of the motif eliciting strategy could be improved by optimizing the parameters. To facilitate allergen prediction, we integrated these three methods into a web-based application, proAP, which provides a global search of the known allergens and a powerful tool for allergen prediction. Flexible parameter setting and batch prediction were also implemented. proAP can be accessed at http://gmobl.sjtu.edu.cn/proAP/main.html.
Conclusions This study comprehensively evaluated sequence-, motif- and SVM-based computational prediction approaches for allergens and optimized their parameters to obtain better performance. These findings may provide helpful guidance for researchers in allergen prediction. Furthermore, we integrated these methods into a web application, proAP, which makes customizable allergen search and prediction much easier for users.
Affiliation(s)
- Jing Wang
  - Bor Luh Food Safety Center, National Center for Molecular Characterization of Genetically Modified Organisms, State Key Laboratory of Hybrid Rice, School of Life Science and Biotechnology, Shanghai Jiao Tong University, China
|
26
|
Zhang L, Huang Y, Zou Z, He Y, Chen X, Tao A. SORTALLER: predicting allergens using substantially optimized algorithm on allergen family featured peptides. Bioinformatics 2012; 28:2178-9. [PMID: 22692221 DOI: 10.1093/bioinformatics/bts326] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Indexed: 12/14/2022]
Abstract
SORTALLER is an online allergen classifier based on an allergen family featured peptide (AFFP) dataset and normalized BLAST E-values, which form the feature vectors for a support vector machine (SVM). AFFPs are allergen-specific peptides panned from non-redundant allergens; they retain allergen-specific information, with noise fragments eliminated because of their similarity to non-allergens. SORTALLER performed significantly better than other existing software, achieving both high specificity (98.4%) and high sensitivity (98.6%) in discriminating allergenic proteins across several independent datasets of protein sequences from diverse sources, with a Matthews correlation coefficient (MCC) as high as 0.970, fast running speed, and batch prediction of amino acid sequences with a single click. AVAILABILITY AND IMPLEMENTATION http://sortaller.gzhmc.edu.cn/.
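As a reading aid for the figures in this abstract, the sketch below shows how MCC relates to sensitivity and specificity. It assumes equal numbers of allergens and non-allergens, which the abstract does not state; the function name and the default class sizes are hypothetical. Under that assumption, the reported sensitivity and specificity give an MCC close to the reported 0.970.

```python
import math


def mcc_from_rates(sensitivity, specificity, n_pos=1000, n_neg=1000):
    """Matthews correlation coefficient from sensitivity and specificity.

    n_pos and n_neg are ASSUMED class sizes (balanced by default); the
    abstract does not report the composition of the test sets.
    """
    tp = sensitivity * n_pos   # true positives
    fn = n_pos - tp            # false negatives
    tn = specificity * n_neg   # true negatives
    fp = n_neg - tn            # false positives
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Sensitivity 98.6% and specificity 98.4%, as reported in the abstract.
mcc = mcc_from_rates(0.986, 0.984)
```

With strongly imbalanced classes the same sensitivity and specificity would yield a different MCC, so the agreement here only holds under the balanced-class assumption.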
Affiliation(s)
- Lida Zhang
  - Plant Biotechnology Research Center, School of Agriculture and Biology, Shanghai Jiao Tong University, Shanghai 200030, China
|
27
|
Pfiffner P, Stadler BM, Rasi C, Scala E, Mari A. Cross-reactions vs co-sensitization evaluated by in silico motifs and in vitro IgE microarray testing. Allergy 2012; 67:210-6. [PMID: 22054025 DOI: 10.1111/j.1398-9995.2011.02743.x] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Indexed: 11/29/2022]
Abstract
BACKGROUND AND OBJECTIVE Using an in silico allergen clustering method, we have recently shown that allergen extracts are highly cross-reactive. Here we used serological data from a multi-array IgE test based on recombinant or highly purified natural allergens to evaluate whether co-reactions are true cross-reactions or co-sensitizations by allergens with the same motifs. METHODS The serum database consisted of 3142 samples, each tested against 103 highly purified natural or recombinant allergens. Cross-reactivity was predicted by an iterative motif-finding algorithm through sequence motifs identified in 2708 known allergens. RESULTS Allergen proteins containing the same motifs cross-reacted as predicted. However, proteins with identical motifs revealed a hierarchy in the degree of cross-reaction: The more frequent an allergen was positive in the allergic population, the less frequently it was cross-reacting and vice versa. Co-sensitization was analyzed by splitting the dataset into patient groups that were most likely sensitized through geographical occurrence of allergens. Interestingly, most co-reactions are cross-reactions but not co-sensitizations. CONCLUSIONS The observed hierarchy of cross-reactivity may play an important role for the future management of allergic diseases.
Affiliation(s)
- P Pfiffner
  - University Institute of Immunology, University of Bern, Switzerland
|
28
|
SProtP: a web server to recognize those short-lived proteins based on sequence-derived features in human cells. PLoS One 2011; 6:e27836. [PMID: 22114707 PMCID: PMC3218052 DOI: 10.1371/journal.pone.0027836] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Received: 07/04/2011] [Accepted: 10/26/2011] [Indexed: 11/19/2022]
Abstract
Protein turnover metabolism plays important roles in cell cycle progression, signal transduction, and differentiation. Proteins with short half-lives are involved in various regulatory processes. To better understand the regulation of cellular processes, it is important to study the key sequence-derived factors affecting short-lived protein degradation. Until now, most protein half-lives have remained unknown because of the difficulty of measuring protein half-lives in human cells with traditional experimental methods. To investigate the molecular determinants that affect short-lived proteins, a computational method is proposed in this work to recognize short-lived proteins in human cells based on sequence-derived features. In this study, we systematically analyzed many features that may be correlated with short-lived protein degradation. We found that a large fraction of proteins with signal peptides and transmembrane regions in human cells have short half-lives. Because short-lived proteins play pivotal roles in the control of various cellular processes, we constructed an SVM-based classifier to recognize them. By applying the SVM model to a human dataset, we achieved 80.8% average sensitivity and 79.8% average specificity on ten testing datasets (TE1-TE10). We also obtained average accuracies of 89.9%, 99% and 83.9% on the independent validation datasets iTE1, iTE2 and iTE3, respectively. The approach proposed in this paper provides a valuable alternative for recognizing short-lived proteins in human cells and is more accurate than the traditional N-end rule. Furthermore, the web server SProtP (http://reprod.njmu.edu.cn/sprotp) has been developed and is freely available to users.
|
29
|
Cobanoglu MC, Saygin Y, Sezerman U. Classification of GPCRs using family specific motifs. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2011; 8:1495-1508. [PMID: 20876934 DOI: 10.1109/tcbb.2010.101] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 05/29/2023]
Abstract
The classification of G-Protein Coupled Receptor (GPCR) sequences is an important problem that arises from the need to close the gap between the large number of orphan receptors and the relatively small number of annotated receptors. Equally important is the characterization of GPCR Class A subfamilies and gaining insight into the ligand interaction since GPCR Class A encompasses a very large number of drug-targeted receptors. In this work, we propose a method for Class A subfamily classification using sequence-derived motifs which characterizes the subfamilies by discovering receptor-ligand interaction sites. The motifs that best characterize a subfamily are selected by the Distinguishing Power Evaluation (DPE) technique we propose. The experiments performed on GPCR sequence databases show that our method outperforms state-of-the-art classification techniques for GPCR Class A subfamily prediction. An important contribution of our work is to discover key receptor-ligand interaction sites which is very important for drug design.
|
31
|
Tomar N, De RK. Immunoinformatics: an integrated scenario. Immunology 2010; 131:153-68. [PMID: 20722763 PMCID: PMC2967261 DOI: 10.1111/j.1365-2567.2010.03330.x] [Citation(s) in RCA: 98] [Impact Index Per Article: 7.0] [Received: 12/08/2009] [Revised: 06/12/2010] [Accepted: 06/21/2010] [Indexed: 12/11/2022] Open
Abstract
Genome sequencing of humans and other organisms has led to the accumulation of huge amounts of data, which include immunologically relevant data. A large volume of clinical data has been deposited in several immunological databases and as a result immunoinformatics has emerged as an important field which acts as an intersection between experimental immunology and computational approaches. It not only helps in dealing with the huge amount of data but also plays a role in defining new hypotheses related to immune responses. This article reviews classical immunology, different databases and prediction tools. It also describes applications of immunoinformatics in designing in silico vaccination and immune system modelling. All these efforts save time and reduce cost.
Affiliation(s)
- Namrata Tomar
- Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India
|
32
|
Scientific Opinion on the assessment of allergenicity of GM plants and microorganisms and derived food and feed. EFSA J 2010. [DOI: 10.2903/j.efsa.2010.1700] [Citation(s) in RCA: 243] [Impact Index Per Article: 17.4] [Indexed: 12/24/2022] Open
|
33
|
Strömbergsson H, Lapins M, Kleywegt GJ, Wikberg JES. Towards Proteome-Wide Interaction Models Using the Proteochemometrics Approach. Mol Inform 2010; 29:499-508. [PMID: 27463328 DOI: 10.1002/minf.201000052] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Received: 05/05/2010] [Accepted: 05/25/2010] [Indexed: 02/02/2023]
Abstract
A proteochemometrics model was induced from all interaction data in the BindingDB database, comprising in all 7078 protein-ligand complexes with representatives from all major drug target categories. Proteins were represented by alignment-independent sequence descriptors holding information on properties such as hydrophobicity, charge, and secondary structure. Ligands were represented by commonly used QSAR descriptors. The inhibition constant (pKi) values of protein-ligand complexes were discretized into "high" and "low" interaction activity. Different machine-learning techniques were used to induce models relating protein and ligand properties to the interaction activity. The best was decision trees, which gave an accuracy of 80% and an area under the ROC curve of 0.81. The tree pointed to the protein and ligand properties that are relevant for the interaction. As the approach requires neither alignments nor knowledge of protein 3D structures, virtually all available protein-ligand interaction data could be utilized, thus opening a way to completely general interaction models that may span entire proteomes.
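The data preparation described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the cut-off `PKI_THRESHOLD`, the function names, and the toy descriptor values are all hypothetical, since the abstract does not state where "high" and "low" were split or which descriptors were used.

```python
from typing import List, Tuple

# Hypothetical cut-off: the abstract does not say where "high" and "low" were split.
PKI_THRESHOLD = 7.0


def discretize_pki(pki: float, threshold: float = PKI_THRESHOLD) -> str:
    """Map a continuous pKi value to a 'high' or 'low' activity label."""
    return "high" if pki >= threshold else "low"


def pair_features(protein_desc: List[float], ligand_desc: List[float]) -> List[float]:
    """Concatenate protein and ligand descriptors into one feature vector."""
    return list(protein_desc) + list(ligand_desc)


def build_dataset(
    complexes: List[Tuple[List[float], List[float], float]]
) -> List[Tuple[List[float], str]]:
    """Turn (protein_desc, ligand_desc, pKi) records into (features, label) rows."""
    return [(pair_features(p, l), discretize_pki(pki)) for p, l, pki in complexes]


# Made-up records for illustration only; these are not BindingDB values.
toy_complexes = [
    ([0.2, -1.0], [310.4, 2.1], 8.3),
    ([0.9, 0.5], [150.2, -0.4], 5.1),
]
rows = build_dataset(toy_complexes)
```

Rows of this shape could then be passed to any classifier, such as a decision tree; because each row only needs descriptors computed per protein and per ligand, no alignment or 3D structure is required.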
Affiliation(s)
- Helena Strömbergsson
- The Linnaeus Centre for Bioinformatics, Department of Cell and Molecular Biology, Biomedical Centre, Box 598, SE-751 24, Uppsala, Sweden.
- Maris Lapins
- Department of Pharmaceutical Pharmacology, Biomedical Centre, Box 591, SE-751 24 Uppsala, Sweden
- Gerard J Kleywegt
- Department of Cell and Molecular Biology, Biomedical Centre, Box 596, SE-751 24, Uppsala, Sweden
- Jarl E S Wikberg
- Department of Pharmaceutical Pharmacology, Biomedical Centre, Box 591, SE-751 24 Uppsala, Sweden
|
34
|
Tang ZQ, Lin HH, Zhang HL, Han LY, Chen X, Chen YZ. Prediction of functional class of proteins and peptides irrespective of sequence homology by support vector machines. Bioinform Biol Insights 2009; 1:19-47. [PMID: 20066123 PMCID: PMC2789692 DOI: 10.4137/bbi.s315] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022] Open
Abstract
Various computational methods have been used for the prediction of protein and peptide function based on their sequences. A particular challenge is to derive functional properties from sequences that show low or no homology to proteins of known function. Recently, a machine learning method, support vector machines (SVM), has been explored for predicting the functional class of proteins and peptides from amino acid sequence-derived properties independent of sequence similarity. It has shown promising potential for a wide spectrum of protein and peptide classes, including some low- and non-homologous proteins. This method can thus be explored as a potential tool to complement alignment-based, clustering-based, and structure-based methods for predicting protein function. This article reviews the strategies, current progress, and underlying difficulties in using SVM for predicting the functional class of proteins. The relevant software and web servers are described. The reported prediction performances of these methods are also presented.
Affiliation(s)
- Zhi Qun Tang
- Department of Pharmacy and Department of Computational Science, National University of Singapore, Republic of Singapore, 117543
- Hong Huang Lin
- Department of Pharmacy and Department of Computational Science, National University of Singapore, Republic of Singapore, 117543
- Hai Lei Zhang
- Department of Pharmacy and Department of Computational Science, National University of Singapore, Republic of Singapore, 117543
- Lian Yi Han
- Department of Pharmacy and Department of Computational Science, National University of Singapore, Republic of Singapore, 117543
- Xin Chen
- Department of Biotechnology, Zhejiang University, Hang Zhou, Zhejiang Province, P. R. China, 310029
- Yu Zong Chen
- Department of Pharmacy and Department of Computational Science, National University of Singapore, Republic of Singapore, 117543
- Shanghai Center for Bioinformatics Technology, Shanghai, P. R. China, 201203
|
35
|
Muh HC, Tong JC, Tammi MT. AllerHunter: a SVM-pairwise system for assessment of allergenicity and allergic cross-reactivity in proteins. PLoS One 2009; 4:e5861. [PMID: 19516900 PMCID: PMC2689655 DOI: 10.1371/journal.pone.0005861] [Citation(s) in RCA: 81] [Impact Index Per Article: 5.4] [Received: 12/28/2008] [Accepted: 05/06/2009] [Indexed: 11/19/2022] Open
Abstract
Allergy is a major health problem in industrialized countries. The number of transgenic food crops is growing rapidly, creating the need for allergenicity assessment before they are introduced into the human food chain. While existing bioinformatic methods have achieved good accuracies for highly conserved sequences, the discrimination of allergens and non-allergens from allergen-like non-allergen sequences remains difficult. We describe AllerHunter, a web-based computational system for the assessment of potential allergenicity and allergic cross-reactivity in proteins. It combines an iterative pairwise sequence similarity encoding scheme with SVM as the discriminating engine. The pairwise vectorization framework allows the system to model essential features in allergens that are involved in cross-reactivity, but not limited to distinct sets of physicochemical properties. The system was rigorously trained and tested using 1,356 known allergen and 13,449 putative non-allergen sequences. Extensive testing was performed for validation of the prediction models. The system is effective for distinguishing allergens and non-allergens from allergen-like non-allergen sequences. Testing results showed that AllerHunter, with a sensitivity of 83.4% and specificity of 96.4% (accuracy = 95.3%, area under the receiver operating characteristic curve AROC = 0.928+/-0.004 and Matthews correlation coefficient MCC = 0.738), performs significantly better than a number of existing methods using an independent dataset of 1443 protein sequences. AllerHunter is available at (http://tiger.dbs.nus.edu.sg/AllerHunter).
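The idea of pairwise vectorization mentioned in this abstract, encoding a query protein by its similarity to each member of a reference set, can be illustrated with a small sketch. Everything here is an assumption for illustration: a k-mer Jaccard score stands in for the alignment-based similarity a real system would use, the function names are hypothetical, the sequences are made up, and the iterative part of the AllerHunter scheme is not reproduced.

```python
from typing import List, Set


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of k-mer sets (a stand-in for an alignment score)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def pairwise_vector(query: str, references: List[str], k: int = 3) -> List[float]:
    """Encode a query as its similarity to every reference sequence."""
    return [similarity(query, ref, k) for ref in references]


# Made-up toy sequences; not real allergens.
references = ["MKTAYIAK", "GGSGGSGG"]
vector = pairwise_vector("MKTAYIAK", references)
```

Each protein becomes a fixed-length vector with one component per reference sequence, which is the kind of input an SVM can consume regardless of the protein's length.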
Affiliation(s)
- Hon Cheng Muh
- Department of Biological Sciences, National University of Singapore, Singapore, Singapore
- Joo Chuan Tong
- Data Mining Department, Institute for Infocomm Research, Singapore, Singapore
- Department of Biochemistry, National University of Singapore, Singapore, Singapore
- Martti T. Tammi
- Department of Biological Sciences, National University of Singapore, Singapore, Singapore
- Department of Biochemistry, National University of Singapore, Singapore, Singapore
- Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet, Stockholm, Sweden
|
36
|
Lim SJ, Tong JC, Chew FT, Tammi MT. The value of position-specific scoring matrices for assessment of protein allegenicity. BMC Bioinformatics 2008; 9 Suppl 12:S21. [PMID: 19091021 PMCID: PMC2638161 DOI: 10.1186/1471-2105-9-s12-s21] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Bioinformatics tools are commonly used for assessing potential protein allergenicity. While these methods have achieved good accuracies for highly conserved sequences, they are less effective when the overall similarity is low. In this study, we assessed the feasibility of using position-specific scoring matrices as a basis for predicting potential allergenicity in proteins. RESULTS Two simple methods for predicting potential allergenicity in proteins, based on general and group-specific allergen profiles, are presented. Testing results indicate that the performances of both methods are comparable to the best results of other methods. The group-specific profile approach, with a sensitivity of 84.04% and specificity of 96.52%, gives results similar to those obtained using the general profile approach (sensitivity = 82.45%, specificity = 96.92%). CONCLUSION We show that position-specific scoring matrices are highly promising for constructing computational models suitable for allergenicity assessment. These data suggest it may be possible to apply a targeted approach for allergenicity assessment based on the profiles of allergens of interest.
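For readers unfamiliar with position-specific scoring matrices, a generic textbook-style construction is sketched below. It is not the profile-building procedure of this paper: the uniform background, the pseudocount, the function names, and the toy alignment are all assumptions made for illustration.

```python
import math
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Build log-odds scores per column from an ungapped alignment.

    Assumes equal-length sequences over the 20 standard residues and a
    uniform background distribution (both simplifying assumptions).
    """
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(len(aligned[0])):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in aligned:
            counts[seq[col]] += 1.0
        pssm.append(
            {aa: math.log2((counts[aa] / total) / background) for aa in AMINO_ACIDS}
        )
    return pssm


def score(pssm: List[Dict[str, float]], segment: str) -> float:
    """Sum the column scores for a segment as long as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


def best_window_score(pssm: List[Dict[str, float]], sequence: str) -> float:
    """Slide the profile along a sequence and return the best window score."""
    width = len(pssm)
    if len(sequence) < width:
        return float("-inf")
    return max(
        score(pssm, sequence[i:i + width]) for i in range(len(sequence) - width + 1)
    )


# Made-up three-column alignment for illustration only.
profile = build_pssm(["ACD", "ACD", "ACE"])
best = best_window_score(profile, "GGACDGG")
```

A query would be flagged when its best window score exceeds a chosen threshold; a group-specific variant would build one such profile per allergen group instead of a single general one.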
Affiliation(s)
- Shen Jean Lim
- Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, 8 Medical Drive, Singapore 117597.
|
37
|
Cui J, Liu Q, Puett D, Xu Y. Computational prediction of human proteins that can be secreted into the bloodstream. Bioinformatics 2008; 24:2370-5. [PMID: 18697770 DOI: 10.1093/bioinformatics/btn418] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.0] [Indexed: 11/12/2022]
Abstract
We present a novel computational method for predicting which proteins from highly and abnormally expressed genes in diseased human tissues, such as cancers, can be secreted into the bloodstream, suggesting possible marker proteins for follow-up serum proteomic studies. A major challenge in tackling this problem is that our understanding of the downstream localization of proteins after they are secreted outside the cells is very limited and not sufficient to provide useful hints about secretion to the bloodstream. To bypass this difficulty, we have taken a data mining approach by first collecting, through extensive literature searches, human proteins that are known to be secreted into the bloodstream due to various pathological conditions as detected by previous proteomic studies, and then asking the question: 'what do these secreted proteins have in common in terms of their physical and chemical properties, amino acid sequence and structural features that can be used to predict them?' We have identified a list of features, such as signal peptides, transmembrane domains, glycosylation sites, disordered regions, secondary structural content, hydrophobicity and polarity measures, that show relevance to protein secretion. Using these features, we have trained a support vector machine-based classifier to predict protein secretion to the bloodstream. On a large test set containing 98 secretory and 6601 non-secretory human proteins, our classifier achieved approximately 90% prediction sensitivity and approximately 98% prediction specificity. Several additional datasets were used to further assess the performance of our classifier. On a set of 122 proteins that were found to be of abnormally high abundance in human blood due to various cancers, our program predicted 62 as blood-secreted proteins. By applying our program to abnormally highly expressed genes in gastric cancer and lung cancer tissues detected through microarray gene expression studies, we predicted 13 and 31 as blood secreted, respectively, suggesting that they could serve as potential biomarkers for these two cancers. Our study demonstrated that our method can provide highly useful information to link genomic and proteomic studies for disease biomarker discovery. Our software can be accessed at http://csbl1.bmb.uga.edu/cgi-bin/Secretion/secretion.cgi.
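Because the test set in this abstract is heavily imbalanced (98 secretory versus 6601 non-secretory proteins), it can help to see what the reported rates imply in absolute counts. The sketch below is back-of-the-envelope arithmetic, not a result from the paper: it assumes the approximate 90% and 98% figures hold exactly, and the function names are hypothetical.

```python
from typing import Tuple


def implied_counts(
    n_pos: int, n_neg: int, sensitivity: float, specificity: float
) -> Tuple[int, int]:
    """Return (true positives, false positives) implied by the given rates."""
    tp = round(n_pos * sensitivity)
    fp = round(n_neg * (1.0 - specificity))
    return tp, fp


def precision(tp: int, fp: int) -> float:
    """Fraction of positive predictions that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Set sizes are from the abstract; the rates are its approximate values,
# treated here as exact purely for illustration.
tp, fp = implied_counts(98, 6601, 0.90, 0.98)
ppv = precision(tp, fp)
```

Under these assumptions roughly 88 true positives would sit alongside about 132 false positives, so only around 40% of predicted blood-secreted proteins would be correct even at 98% specificity. This is one reason follow-up experimental validation matters for rare-class predictions.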
Affiliation(s)
- Juan Cui
- Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA 30602, USA
|
39
|
Zhang HL, Lin HH, Tao L, Ma XH, Dai JL, Jia J, Cao ZW. Prediction of antibiotic resistance proteins from sequence-derived properties irrespective of sequence similarity. Int J Antimicrob Agents 2008; 32:221-6. [PMID: 18583101 DOI: 10.1016/j.ijantimicag.2008.03.006] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/28/2008] [Revised: 03/13/2008] [Accepted: 03/15/2008] [Indexed: 11/29/2022]
Abstract
Increasing antibiotic resistance has become a worldwide challenge to the clinical treatment of infectious diseases. The identification of antibiotic resistance proteins (ARPs) would be helpful in the discovery of new therapeutic targets and the design of novel drugs to control the potential spread of antibiotic resistance. In this work, a support vector machine (SVM)-based ARP prediction system was developed using 1308 ARPs and 15587 non-ARPs. Its performance was evaluated using 313 ARPs and 7156 non-ARPs. The computed prediction accuracy was 88.5% for ARPs and 99.2% for non-ARPs. A potential application of this method is the identification of ARPs non-homologous to proteins of known function. Further genome screening found that ca. 3.5% and 3.2% of proteins in Escherichia coli and Staphylococcus aureus, respectively, are potential ARPs. These results suggest the usefulness of SVMs for facilitating the identification of ARPs. The software can be accessed at SARPI (Server for Antibiotic Resistance Protein Identification).
Affiliation(s)
- H L Zhang
- Department of Pharmacy, 18 Science Drive 4, National University of Singapore, Singapore 117543, Singapore
|
40
|
Soeria-Atmadja D, Onell A, Kober A, Matsson P, Gustafsson MG, Hammerling U. Multivariate statistical analysis of large-scale IgE antibody measurements reveals allergen extract relationships in sensitized individuals. J Allergy Clin Immunol 2007; 120:1433-40. [PMID: 17825892 DOI: 10.1016/j.jaci.2007.07.021] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.3] [Received: 04/02/2007] [Revised: 06/28/2007] [Accepted: 07/16/2007] [Indexed: 10/22/2022]
Abstract
BACKGROUND Many allergenic sources are reportedly cross-reactive because of protein structural similarities. Although several aggregations are well characterized, no holistic mapping of IgE reactivity has hitherto been reported. OBJECTIVE The aim of this study was to disclose relevant associations within a large set of allergen preparations, as revealed by specific IgE antibody levels in blood sera of multireactive human donors. METHODS A dataset of recorded IgE antibody serum concentrations of 1011 nonidentifiable multireactive individuals (devoid of clinical records) to 89 allergen extracts was compiled for in silico analysis. Various algorithms were used to identify specific multivariate dependencies between the IgE antibody levels. RESULTS Exhaustive cluster analysis demonstrates that IgE antibody responses to the 89 extracts can be aggregated into 12 stable formations. These clusters hold both well-known relationships, unexpected patterns, and unknown patterns, the latter categories being exemplified by the coclustering of wasp and certain seafood and a clear differentiation among pollen allergens. CONCLUSION Identified relationships within several well-known groups of cross-reactive allergen extracts confirm the applicability of dedicated multivariate data analysis within the allergology field. Moreover, some of the unexpected IgE reactivity associations in sensitized human subjects might help in identifying new relationships with potential importance to allergy. CLINICAL IMPLICATIONS Although clinical implications from this study should be validated in subsequent investigations with documentation on symptoms included, we believe this seminal approach is a key step toward the development of new analysis tools for interpretation of allergy data generated by using high-throughput recording systems.
|
41
|
Martinez Barrio A, Soeria-Atmadja D, Nistér A, Gustafsson MG, Hammerling U, Bongcam-Rudloff E. EVALLER: a web server for in silico assessment of potential protein allergenicity. Nucleic Acids Res 2007; 35:W694-700. [PMID: 17537818 PMCID: PMC1933222 DOI: 10.1093/nar/gkm370] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Indexed: 11/13/2022] Open
Abstract
Bioinformatics testing approaches for protein allergenicity, involving amino acid sequence comparisons, have evolved appreciably over the last several years to increased sophistication and performance. EVALLER, the web server presented in this article is based on our recently published 'Detection based on Filtered Length-adjusted Allergen Peptides' (DFLAP) algorithm, which affords in silico determination of potential protein allergenicity of high sensitivity and excellent specificity. To strengthen bioinformatics risk assessment in allergology EVALLER provides a comprehensive outline of its judgment on a query protein's potential allergenicity. Each such textual output incorporates a scoring figure, a confidence numeral of the assignment and information on high- or low-scoring matches to identified allergen-related motifs, including their respective location in accordingly derived allergens. The interface, built on a modified Perl Open Source package, enables dynamic and color-coded graphic representation of key parts of the output. Moreover, pertinent details can be examined in great detail through zoomed views. The server can be accessed at http://bioinformatics.bmc.uu.se/evaller.html.
Affiliation(s)
- Alvaro Martinez Barrio
- Linnaeus Centre for Bioinformatics, Uppsala Biomedical Centre (BMC), Uppsala University, P.O. Box 598, SE-751 24 Uppsala, Sweden
|
42
|
Schein CH, Ivanciuc O, Braun W. Bioinformatics approaches to classifying allergens and predicting cross-reactivity. Immunol Allergy Clin North Am 2007; 27:1-27. [PMID: 17276876 PMCID: PMC1941676 DOI: 10.1016/j.iac.2006.11.005] [Citation(s) in RCA: 71] [Impact Index Per Article: 4.2] [Indexed: 11/19/2022]
Abstract
Allergenic proteins from very different environmental sources have similar sequences and structures. This fact may account for multiple allergen syndromes, whereby a myriad of diverse plants and foods may induce a similar IgE-based reaction in certain patients. Identifying the common triggering protein in these sources, in silico, can aid in designing individualized therapy for allergy sufferers. This article provides an overview of databases on allergenic proteins, and ways to identify common proteins that may be the cause of multiple allergy syndromes. The major emphasis is on the relational Structural Database of Allergenic Proteins (SDAP), which includes cross-referenced data on the sequence, structure, and IgE epitopes of over 800 allergenic proteins, coupled with specially developed bioinformatics tools to group all allergens and identify discrete areas that may account for cross-reactivity. SDAP is freely available on the Web to clinicians and patients.
Affiliation(s)
- Catherine H. Schein
- Sealy Center for Structural Biology and Molecular Biophysics, Departments of Biochemistry and Molecular Biology, University of Texas Medical Branch, 301 University Blvd., Galveston TX 77555-0857
- Sealy Center for Structural Biology and Molecular Biophysics, Departments of Microbiology and Immunology, University of Texas Medical Branch, 301 University Blvd., Galveston TX 77555-0857
- Ovidiu Ivanciuc
- Sealy Center for Structural Biology and Molecular Biophysics, Departments of Biochemistry and Molecular Biology, University of Texas Medical Branch, 301 University Blvd., Galveston TX 77555-0857
- Werner Braun
- Sealy Center for Structural Biology and Molecular Biophysics, Departments of Biochemistry and Molecular Biology, University of Texas Medical Branch, 301 University Blvd., Galveston TX 77555-0857
|