1
|
Prediction of High-Risk Types of Human Papillomaviruses Using Reduced Amino Acid Modes. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2021; 2020:5325304. [PMID: 32655680 PMCID: PMC7320279 DOI: 10.1155/2020/5325304] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/01/2020] [Accepted: 04/22/2020] [Indexed: 01/04/2023]
Abstract
The human papillomavirus (HPV) type plays an important role in the early diagnosis of cervical cancer. Most prediction methods use protein sequence and structure information, but reduced amino acid modes have not been used until now. In this paper, we introduce modes of reduced amino acids to predict high-risk HPV. We first reduced the 20 amino acids into several nonoverlapping groups and calculated their structural and physicochemical modes for high-risk HPV prediction; the method was tested and compared with existing methods on 68 samples of known HPV types. The experimental results indicate that the proposed method achieves better performance, with an accuracy of 96.49%, suggesting that reduced amino acid modes might be used to improve the prediction of high-risk HPV types.
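A minimal sketch of what reducing the 20 amino acids into nonoverlapping groups can look like. The six-group partition, the single-letter group labels, and the function names below are hypothetical choices made for illustration; the abstract does not state which groupings the authors used.

```python
from collections import Counter

# Hypothetical partition of the 20 standard amino acids into six
# non-overlapping groups (illustrative only; not taken from the paper).
GROUPS = {
    "a": "AVLIMC",  # aliphatic / sulfur-containing
    "r": "FWYH",    # aromatic
    "p": "STNQ",    # polar, uncharged
    "k": "KR",      # positively charged
    "d": "DE",      # negatively charged
    "g": "GP",      # glycine and proline
}

# Invert the partition: residue letter -> group label.
RESIDUE_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Rewrite a protein sequence in the reduced alphabet."""
    return "".join(RESIDUE_TO_GROUP[aa] for aa in seq.upper())


def group_composition(seq):
    """Fraction of residues falling in each group, in GROUPS order."""
    reduced = reduce_sequence(seq)
    counts = Counter(reduced)
    return [counts[g] / len(reduced) for g in GROUPS]
```

A composition vector like this is one simple feature that could be fed to a classifier; the "modes" described in the abstract presumably go further than plain composition.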
|
2
|
Abstract
Immunoinformatics is a discipline that applies methods of computer science to study and model the immune system. A fundamental question addressed by immunoinformatics is how to understand the rules of antigen presentation by MHC molecules to T cells, a process that is central to adaptive immune responses to infections and cancer. In the modern era of personalized medicine, the ability to model and predict which antigens can be presented by MHC is key to manipulating the immune system and designing strategies for therapeutic intervention. Since the MHC is both polygenic and extremely polymorphic, each individual possesses a personalized set of MHC molecules with different peptide-binding specificities, and collectively they present a unique individualized peptide imprint of the ongoing protein metabolism. Mapping all MHC allotypes is an enormous undertaking that cannot be achieved without a strong bioinformatics component. Computational tools for the prediction of peptide-MHC binding have thus become essential in most pipelines for T cell epitope discovery and an inescapable component of vaccine and cancer research. Here, we describe the development of several such tools, from pioneering efforts to the current state-of-the-art methods, that have allowed for accurate predictions of peptide binding of all MHC molecules, even including those that have not yet been characterized experimentally.
Affiliation(s)
- Morten Nielsen
- Department of Health Technology, Technical University of Denmark, DK-2800 Kongens Lyngby, Denmark
- Instituto de Investigaciones Biotecnológicas, Universidad Nacional de San Martín, CP 1650 San Martin, Buenos Aires, Argentina
| | - Massimo Andreatta
- Instituto de Investigaciones Biotecnológicas, Universidad Nacional de San Martín, CP 1650 San Martin, Buenos Aires, Argentina
| | - Bjoern Peters
- Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, California 92037, USA
- Department of Medicine, University of California, San Diego, La Jolla, California 92093, USA
| | - Søren Buus
- Department of Immunology and Microbiology, Faculty of Health Sciences, University of Copenhagen, DK-2200 Copenhagen, Denmark
| |
|
3
|
Abstract
Throughout the body, T cells monitor MHC-bound ligands expressed on the surface of essentially all cell types. MHC ligands that trigger a T cell immune response are referred to as T cell epitopes. Identifying such epitopes enables tracking, phenotyping, and stimulating T cells involved in immune responses in infectious disease, allergy, autoimmunity, transplantation, and cancer. The specific T cell epitopes recognized in an individual are determined by genetic factors such as the MHC molecules the individual expresses, in parallel to the individual's environmental exposure history. The complexity and importance of T cell epitope mapping have motivated the development of computational approaches that predict what T cell epitopes are likely to be recognized in a given individual or in a broader population. Such predictions guide experimental epitope mapping studies and enable computational analysis of the immunogenic potential of a given protein sequence region.
Affiliation(s)
- Bjoern Peters
- Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, California 92037, USA
- Department of Medicine, University of California San Diego, La Jolla, California 92093, USA
| | - Morten Nielsen
- Department of Health Technology, Technical University of Denmark, DK-2800 Kgs. Lyngby, Denmark
- Instituto de Investigaciones Biotecnológicas, Universidad Nacional de San Martín, B1650 Buenos Aires, Argentina
| | - Alessandro Sette
- Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, California 92037, USA
- Department of Medicine, University of California San Diego, La Jolla, California 92093, USA
| |
|
4
|
Xi B, Tao J, Liu X, Xu X, He P, Dai Q. RaaMLab: A MATLAB toolbox that generates amino acid groups and reduced amino acid modes. Biosystems 2019; 180:38-45. [PMID: 30904554 DOI: 10.1016/j.biosystems.2019.03.002] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 07/25/2018] [Revised: 12/25/2018] [Accepted: 03/06/2019] [Indexed: 01/31/2023]
Abstract
Amino acid (AA) classification and its different biophysical and chemical characteristics have been widely applied to analyze and predict the structural, functional, expression and interaction profiles of proteins and peptides. We present RaaMLab, a free and open-source MATLAB toolbox, to facilitate studies on proteins and peptides, to generate AA groups and to extract the structural and physicochemical features of reduced AAs (RedAA). This toolbox offers 4 kinds of databases, including the physicochemical properties of AAs and their groupings, 49 AA classification methods and 5 types of biophysicochemical features of RedAAs. These factors can be easily computed based on user-defined alphabet size and AA properties of AA groupings. RaaMLab is open source and freely available at https://github.com/bioinfo0706/RaaMLab. This website also contains a tutorial, extensive documentation and examples.
Affiliation(s)
- Baohang Xi
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
| | - Jin Tao
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
| | - Xiaoqing Liu
- College of Sciences, Hangzhou Dianzi University, Hangzhou 310018, People's Republic of China
| | - Xinnan Xu
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
| | - Pingan He
- College of Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
| | - Qi Dai
- College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China.
| |
|
5
|
Abstract
The rapidly increasing number of characterized allergens has created huge demands for advanced information storage, retrieval, and analysis. Bioinformatics and machine learning approaches provide useful tools for the study of allergens and for epitope prediction, which greatly complement traditional laboratory techniques. The specific applications mainly include identification of B- and T-cell epitopes, and assessment of allergenicity and cross-reactivity. In order to facilitate the work of clinical and basic researchers who are not familiar with bioinformatics, we review in this chapter the most important databases, bioinformatic tools, and methods with relevance to the study of allergens.
|
6
|
Zhang L, Zhang C, Gao R, Yang R. An Ensemble Method to Distinguish Bacteriophage Virion from Non-Virion Proteins Based on Protein Sequence Characteristics. Int J Mol Sci 2015; 16:21734-58. [PMID: 26370987 PMCID: PMC4613277 DOI: 10.3390/ijms160921734] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Received: 06/29/2015] [Revised: 08/16/2015] [Accepted: 08/25/2015] [Indexed: 11/16/2022] Open
Abstract
Bacteriophage virion proteins and non-virion proteins have distinct functions in biological processes, such as specificity determination for host bacteria, bacteriophage replication and transcription. Accurate identification of bacteriophage virion proteins from bacteriophage protein sequences is important for understanding the complex virulence mechanism in host bacteria and the influence of bacteriophages on the development of antibacterial drugs. In this study, an ensemble method for bacteriophage virion protein prediction from bacteriophage protein sequences is put forward with hybrid feature spaces incorporating CTD (composition, transition and distribution), bi-profile Bayes, PseAAC (pseudo-amino acid composition) and PSSM (position-specific scoring matrix). When evaluated by 10-fold cross-validation on the training dataset, the presented method achieves a satisfactory prediction result, with a sensitivity of 0.870, a specificity of 0.830, an accuracy of 0.850 and a Matthews correlation coefficient (MCC) of 0.701. To evaluate the prediction performance objectively, an independent testing dataset is also used. Encouragingly, our proposed method performs better than previous studies, with a sensitivity of 0.853, a specificity of 0.815, an accuracy of 0.831 and an MCC of 0.662 on the independent testing dataset. These results suggest that the proposed method can be a potential candidate for bacteriophage virion protein prediction, which may provide a useful tool to find novel antibacterial drugs and to understand the relationship between bacteriophage and host bacteria. For the convenience of the vast majority of experimental scientists, a user-friendly and publicly-accessible web-server for the proposed ensemble method is established.
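The four metrics quoted in this abstract all follow from a binary confusion matrix. The sketch below shows the standard formulas; the counts passed in at the end are hypothetical (100 positives and 100 negatives, chosen so the rates match the cross-validation figures) and are not the paper's data.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, accuracy, mcc


# Hypothetical balanced example: 100 virion and 100 non-virion proteins.
sens, spec, acc, mcc = binary_metrics(tp=87, fn=13, tn=83, fp=17)
print(f"sens={sens:.3f} spec={spec:.3f} acc={acc:.3f} mcc={mcc:.3f}")
```

Under this balanced-class assumption the MCC rounds to 0.701, so the reported cross-validation numbers are at least mutually consistent with classes of similar size.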
Affiliation(s)
- Lina Zhang
- School of Control Science and Engineering, Shandong University, Jinan 250061, China.
| | - Chengjin Zhang
- School of Control Science and Engineering, Shandong University, Jinan 250061, China.
- School of Mechanical, Electrical and Information Engineering, Shandong University, Weihai 264209, China.
| | - Rui Gao
- School of Control Science and Engineering, Shandong University, Jinan 250061, China.
| | - Runtao Yang
- School of Control Science and Engineering, Shandong University, Jinan 250061, China.
| |
|
7
|
Eng LP, Tan TW, Tong JC. Building MHC class II epitope predictor using machine learning approaches. Methods Mol Biol 2015; 1268:67-73. [PMID: 25555721 DOI: 10.1007/978-1-4939-2285-7_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2023]
Abstract
Identification of T-cell epitopes binding to MHC class II molecules is an important step in epitope-based vaccine development. This process has been accelerated by the use of bioinformatics tools that aid in the prediction of peptide binding to MHC class II molecules and systematically scan for candidate peptides in antigenic proteins. Many prediction programs have been developed over the years using various methods and algorithms, and they are becoming increasingly sophisticated. Here, we illustrate the use of machine learning algorithms to train on MHC class II peptide data represented by feature vectors describing their amino acid physicochemical properties. The developed prediction model can then be used to predict new peptide data.
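As a rough illustration of a feature vector built from amino acid physicochemical properties, the sketch below maps a peptide to three numbers: mean Kyte-Doolittle hydropathy and the fractions of positively and negatively charged residues. The choice of these three features and the function name are assumptions for illustration; the chapter's actual descriptors are not given in the abstract.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def peptide_features(peptide):
    """Return [mean hydropathy, fraction K/R, fraction D/E] for a peptide.

    The three features are an illustrative choice, not the chapter's set.
    """
    peptide = peptide.upper()
    n = len(peptide)
    mean_hydropathy = sum(KYTE_DOOLITTLE[aa] for aa in peptide) / n
    frac_positive = sum(aa in "KR" for aa in peptide) / n
    frac_negative = sum(aa in "DE" for aa in peptide) / n
    return [mean_hydropathy, frac_positive, frac_negative]
```

Because every peptide maps to a vector of the same length regardless of its own length, vectors like this can be passed directly to a standard classifier.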
Affiliation(s)
- Loan Ping Eng
- Department of Biochemistry, National University of Singapore, 14 Medical Drive #14-01T, Singapore, Singapore, 117599
| | | | | |
|
8
|
Eng CLP, Tong JC, Tan TW. Predicting host tropism of influenza A virus proteins using random forest. BMC Med Genomics 2014; 7 Suppl 3:S1. [PMID: 25521718 PMCID: PMC4290784 DOI: 10.1186/1755-8794-7-s3-s1] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.4] [Indexed: 12/21/2022] Open
Abstract
Background The majority of influenza A viruses reside and circulate among animal populations, seldom infecting humans due to host range restriction. Yet when some avian strains do acquire the ability to overcome the species barrier, they might become adapted to humans, replicating efficiently and causing disease, leading to a potential pandemic. With the huge influenza A virus reservoir in wild birds, it is a cause for concern when a new influenza strain emerges with the ability to cross the host species barrier, as shown by the recent H7N9 outbreak in China. Several influenza proteins have been shown to be major determinants in host tropism. Further understanding and determining host tropism would be important in identifying zoonotic influenza virus strains capable of crossing the species barrier and infecting humans. Results In this study, computational models for 11 influenza proteins have been constructed using the machine learning algorithm random forest for prediction of host tropism. The prediction models were trained on influenza protein sequences isolated from both avian and human samples, which were transformed into amino acid physicochemical properties feature vectors. The results were highly accurate prediction models (ACC>96.57; AUC>0.980; MCC>0.916) capable of determining host tropism of individual influenza proteins. In addition, features from all 11 proteins were used to construct a combined model to predict host tropism of influenza virus strains. This would help assess a novel influenza strain's host range capability. Conclusions All of the prediction models constructed achieved high prediction performance, indicating clear distinctions between avian and human proteins. When used together as a host tropism prediction system, zoonotic strains could potentially be identified based on different protein prediction results. Understanding and predicting host tropism of influenza proteins lay an important foundation for future work in constructing computational models capable of directly predicting interspecies transmission of influenza viruses. The models are available for prediction at http://fluleap.bic.nus.edu.sg.
|
9
|
Hwang I, Park S. Computational design of protein therapeutics. DRUG DISCOVERY TODAY. TECHNOLOGIES 2014; 5:e43-8. [PMID: 24981090 DOI: 10.1016/j.ddtec.2008.11.004] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Indexed: 10/21/2022]
Abstract
Computation is increasingly used to guide protein therapeutic designs. Some of the potential applications for computational, structure-based protein design include antibody affinity maturation, modulation of protein-protein interaction, stability improvement and minimization of protein aggregation. The versatility of a computational approach is that different biophysical properties can be analyzed on a common framework. Developing a coherent strategy to address various protein engineering objectives will promote synergy and exploration. Advances in computational structural analysis will thus have a transformative impact on how protein therapeutics are engineered in the future.
Affiliation(s)
- Inseong Hwang
- Department of Chemical and Biological Engineering, University at Buffalo, SUNY, Buffalo, NY, 14260, USA
| | - Sheldon Park
- Department of Chemical and Biological Engineering, University at Buffalo, SUNY, Buffalo, NY, 14260, USA.
| |
|
10
|
Hosseinzadeh F, Kayvanjoo AH, Ebrahimi M, Goliaei B. Prediction of lung tumor types based on protein attributes by machine learning algorithms. SPRINGERPLUS 2013; 2:238. [PMID: 23888262 PMCID: PMC3710575 DOI: 10.1186/2193-1801-2-238] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Received: 01/16/2013] [Accepted: 03/21/2013] [Indexed: 01/15/2023]
Abstract
Early diagnosis of lung cancers and distinction between the tumor types (Small Cell Lung Cancer (SCLC) and Non-Small Cell Lung Cancer (NSCLC)) are very important to increase the survival rate of patients. Herein, we propose a diagnostic system based on sequence-derived structural and physicochemical attributes of proteins that are involved in both types of tumors, via feature extraction, feature selection and prediction models. A total of 1497 protein attributes were computed, and important features were selected by 12 attribute weighting models. Finally, machine learning models consisting of seven SVM models, three ANN models and two NB models were applied to the original database and to newly created ones from the attribute weighting models; model accuracies were calculated through 10-fold cross-validation and wrapper validation (the latter for SVM algorithms only). In line with our previous findings, dipeptide composition, autocorrelation and distribution descriptors were the most important protein features selected by bioinformatics tools. The algorithms' performance in lung cancer tumor type prediction increased when they were applied to datasets created by attribute weighting models rather than to the original dataset. Wrapper-Validation performed better than X-Validation; the best cancer type prediction resulted from SVM and SVM Linear models (82%). The best ANN accuracy was obtained when the Neural Net model was applied to the SVM dataset (88%). This is the first report suggesting that the combination of protein features and attribute weighting models with machine learning algorithms can be effectively used to predict the type of lung cancer tumors (SCLC and NSCLC).
Affiliation(s)
- Faezeh Hosseinzadeh
- Laboratory of biophysics and molecular biology, Institute of Biophysics and Biochemistry (IBB), University of Tehran, Tehran, Iran
| | | | | | | |
|
11
|
Koch CP, Pillong M, Hiss JA, Schneider G. Computational Resources for MHC Ligand Identification. Mol Inform 2013; 32:326-36. [PMID: 27481589 DOI: 10.1002/minf.201300042] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 03/09/2012] [Accepted: 04/04/2013] [Indexed: 01/16/2023]
Abstract
Advances in the high-throughput determination of functional modulators of major histocompatibility complex (MHC) and improved computational predictions of MHC ligands have rendered the rational design of immunomodulatory peptides feasible. Proteome-derived peptides and 'reverse vaccinology' by computational means will play a driving role in future vaccine design. Here we review the molecular mechanisms of the MHC mediated immune response, present the computational approaches that have emerged in this area of biotechnology, and provide an overview of publicly available computational resources for predicting and designing new peptidic MHC ligands.
Affiliation(s)
- Christian P Koch
- ETH Zürich, Department of Chemistry and Applied Biosciences, Institute of Pharmaceutical Sciences, Wolfgang-Pauli-Str. 10, 8093 Zürich, Switzerland
| | - Max Pillong
- ETH Zürich, Department of Chemistry and Applied Biosciences, Institute of Pharmaceutical Sciences, Wolfgang-Pauli-Str. 10, 8093 Zürich, Switzerland
| | - Jan A Hiss
- ETH Zürich, Department of Chemistry and Applied Biosciences, Institute of Pharmaceutical Sciences, Wolfgang-Pauli-Str. 10, 8093 Zürich, Switzerland
| | - Gisbert Schneider
- ETH Zürich, Department of Chemistry and Applied Biosciences, Institute of Pharmaceutical Sciences, Wolfgang-Pauli-Str. 10, 8093 Zürich, Switzerland.
| |
|
12
|
Han B, Ma X, Zhao R, Zhang J, Wei X, Liu X, Liu X, Zhang C, Tan C, Jiang Y, Chen Y. Development and experimental test of support vector machines virtual screening method for searching Src inhibitors from large compound libraries. Chem Cent J 2012; 6:139. [PMID: 23173901 PMCID: PMC3538513 DOI: 10.1186/1752-153x-6-139] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 07/13/2012] [Accepted: 11/07/2012] [Indexed: 01/04/2023] Open
Abstract
BACKGROUND Src plays various roles in tumour progression, invasion, metastasis, angiogenesis and survival. It is one of the multiple targets of multi-target kinase inhibitors in clinical use and in trials for the treatment of leukemia and other cancers. These successes, together with the appearance of drug resistance in some patients, have raised significant interest and efforts in discovering new Src inhibitors. Various in-silico methods have been used in some of these efforts. It is desirable to explore additional in-silico methods, particularly those capable of searching large compound libraries at high yields and reduced false-hit rates. RESULTS We evaluated support vector machines (SVM) as virtual screening tools for searching Src inhibitors from large compound libraries. SVM trained and tested by 1,703 inhibitors and 63,318 putative non-inhibitors correctly identified 93.53% to 95.01% of inhibitors and 99.81% to 99.90% of non-inhibitors in 5-fold cross validation studies. SVM trained by 1,703 inhibitors reported before 2011 and 63,318 putative non-inhibitors correctly identified 70.45% of the 44 inhibitors reported since 2011, and predicted as inhibitors 44,843 (0.33%) of 13.56M PubChem, 1,496 (0.89%) of 168 K MDDR, and 719 (7.73%) of 9,305 MDDR compounds similar to the known inhibitors. CONCLUSIONS SVM showed comparable yield and reduced false hit rates in searching large compound libraries compared to the similarity-based and other machine-learning VS methods developed from the same set of training compounds and molecular descriptors. We tested three virtual hits of the same novel scaffold from in-house chemical libraries not reported as Src inhibitors, one of which showed moderate activity. SVM may be potentially explored for searching Src inhibitors from large compound libraries at low false-hit rates.
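The virtual-hit percentages in the RESULTS are simple ratios of predicted inhibitors to library size, and they can be re-derived as a sanity check. The sketch below assumes the rounded library sizes quoted in the abstract (13.56M and 168 K) are close enough to the true sizes for two-decimal agreement; the dictionary keys and function name are illustrative.

```python
# (predicted inhibitors, library size) as quoted in the abstract.
# 13.56M and 168 K are the abstract's rounded figures, used here as-is.
SCREENS = {
    "PubChem": (44_843, 13_560_000),
    "MDDR": (1_496, 168_000),
    "MDDR, similar to known inhibitors": (719, 9_305),
}


def hit_rate_percent(hits, library_size):
    """Share of a library predicted as inhibitors, in percent."""
    return 100.0 * hits / library_size


rates = {
    name: round(hit_rate_percent(hits, size), 2)
    for name, (hits, size) in SCREENS.items()
}
for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

The recomputed values agree with the quoted 0.33%, 0.89% and 7.73%. The much higher rate in the similarity-filtered MDDR subset is what one would expect if the model favours compounds resembling known inhibitors.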
Affiliation(s)
- Bucong Han
- The Key Laboratory of Chemical Biology, Guangdong Province, The Graduate School at Shenzhen, Tsinghua University, Shenzhen, Guangdong, 518055, People’s Republic of China
- Computation and Systems Biology, Singapore-MIT Alliance, National University of Singapore, E4-04-10, 4 Engineering Drive 3, Singapore, 117576, Singapore
- Bioinformatics and Drug Design Group, Department of Pharmacy, Centre for Computational Science and Engineering, National University of Singapore, Blk S16, Level 8, 3 Science Drive 2, Singapore, 117543, Singapore
| | - Xiaohua Ma
- Bioinformatics and Drug Design Group, Department of Pharmacy, Centre for Computational Science and Engineering, National University of Singapore, Blk S16, Level 8, 3 Science Drive 2, Singapore, 117543, Singapore
| | - Ruiying Zhao
- Central Research Institute of China Chemical Science and Technology, 20 Xueyuan Road, Haidian District, Beijing, 100083, People’s Republic of China
| | - Jingxian Zhang
- Bioinformatics and Drug Design Group, Department of Pharmacy, Centre for Computational Science and Engineering, National University of Singapore, Blk S16, Level 8, 3 Science Drive 2, Singapore, 117543, Singapore
| | - Xiaona Wei
- Computation and Systems Biology, Singapore-MIT Alliance, National University of Singapore, E4-04-10, 4 Engineering Drive 3, Singapore, 117576, Singapore
- Bioinformatics and Drug Design Group, Department of Pharmacy, Centre for Computational Science and Engineering, National University of Singapore, Blk S16, Level 8, 3 Science Drive 2, Singapore, 117543, Singapore
| | - Xianghui Liu
- Bioinformatics and Drug Design Group, Department of Pharmacy, Centre for Computational Science and Engineering, National University of Singapore, Blk S16, Level 8, 3 Science Drive 2, Singapore, 117543, Singapore
| | - Xin Liu
- Bioinformatics and Drug Design Group, Department of Pharmacy, Centre for Computational Science and Engineering, National University of Singapore, Blk S16, Level 8, 3 Science Drive 2, Singapore, 117543, Singapore
| | - Cunlong Zhang
- The Key Laboratory of Chemical Biology, Guangdong Province, The Graduate School at Shenzhen, Tsinghua University, Shenzhen, Guangdong, 518055, People’s Republic of China
| | - Chunyan Tan
- The Key Laboratory of Chemical Biology, Guangdong Province, The Graduate School at Shenzhen, Tsinghua University, Shenzhen, Guangdong, 518055, People’s Republic of China
| | - Yuyang Jiang
- The Key Laboratory of Chemical Biology, Guangdong Province, The Graduate School at Shenzhen, Tsinghua University, Shenzhen, Guangdong, 518055, People’s Republic of China
| | - Yuzong Chen
- The Key Laboratory of Chemical Biology, Guangdong Province, The Graduate School at Shenzhen, Tsinghua University, Shenzhen, Guangdong, 518055, People’s Republic of China
- Computation and Systems Biology, Singapore-MIT Alliance, National University of Singapore, E4-04-10, 4 Engineering Drive 3, Singapore, 117576, Singapore
- Bioinformatics and Drug Design Group, Department of Pharmacy, Centre for Computational Science and Engineering, National University of Singapore, Blk S16, Level 8, 3 Science Drive 2, Singapore, 117543, Singapore
| |
|
13
|
Hosseinzadeh F, Ebrahimi M, Goliaei B, Shamabadi N. Classification of lung cancer tumors based on structural and physicochemical properties of proteins by bioinformatics models. PLoS One 2012; 7:e40017. [PMID: 22829872 PMCID: PMC3400626 DOI: 10.1371/journal.pone.0040017] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.2] [Received: 03/27/2012] [Accepted: 05/30/2012] [Indexed: 12/03/2022] Open
Abstract
Rapid distinction between small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC) tumors is very important in the diagnosis of this disease. Furthermore, sequence-derived structural and physicochemical descriptors are very useful for machine learning prediction of protein structural and functional classes, for classifying proteins and for prediction performance. In this study, the classification of lung tumors based on 1497 attributes derived from structural and physicochemical properties of protein sequences (based on genes defined by microarray analysis) was investigated through a combination of attribute weighting and supervised and unsupervised clustering algorithms. Eighty percent of the weighting methods selected features such as autocorrelation, dipeptide composition and distribution of hydrophobicity as the most important protein attributes in the classification of SCLC, NSCLC and COMMON classes of lung tumors. The same results were observed with most tree induction algorithms, while descriptors of hydrophobicity distribution were high in protein sequences COMMON to both groups and the distribution of charge in these proteins was very low, showing that COMMON proteins were very hydrophobic. Furthermore, the composition of polar dipeptides in SCLC proteins was higher than in NSCLC proteins. Some clustering models (alone or in combination with attribute weighting algorithms) were able to nearly classify SCLC and NSCLC proteins. The Random Forest tree induction algorithm (evaluated by leave-one-out and 10-fold cross-validation) shows more than 86% accuracy in clustering and predicting three different lung cancer tumors. Here, for the first time, the application of data mining tools to effectively classify three classes of lung cancer tumors, with regard to the importance of dipeptide composition, autocorrelation and distribution descriptors, is reported.
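Dipeptide composition, one of the descriptor families this abstract singles out, is the frequency of each ordered pair of adjacent residues, giving 400 values per sequence. The sketch below is a minimal, assumed implementation of that standard definition; the function name and the handling of non-standard residues are illustrative choices, not the authors' code.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs, in a fixed order so vectors are comparable.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(seq):
    """Return 400 frequencies of adjacent residue pairs in `seq`.

    Pairs containing a non-standard letter are not counted, but they
    still contribute to the denominator (an illustrative choice).
    """
    seq = seq.upper()
    total = len(seq) - 1
    if total < 1:
        raise ValueError("sequence must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    return [counts[d] / total for d in DIPEPTIDES]
```

For a sequence made up only of standard residues the 400 values sum to 1, which makes the vector directly comparable across proteins of different lengths.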
Affiliation(s)
- Faezeh Hosseinzadeh
- Student at Laboratory of Biophysics and Molecular Biology, Institute of Biophysics and Biochemistry, University of Tehran, Tehran, Iran
| | - Mansour Ebrahimi
- Department of Biology at Basic science School & Bioinformatics Research Group, Green Research Center, University of Qom, Qom, Iran
| | - Bahram Goliaei
- Department of Medical Physics, Iran University of Medical Science, Tehran, Iran
| | - Narges Shamabadi
- Bioinformatics Research Group, Green Research Center, University of Qom, Qom, Iran
| |
|
14
|
He J, Yang G, Rao H, Li Z, Ding X, Chen Y. Prediction of human major histocompatibility complex class II binding peptides by continuous kernel discrimination method. Artif Intell Med 2011; 55:107-15. [PMID: 22134095 DOI: 10.1016/j.artmed.2011.10.005] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 08/22/2011] [Revised: 10/12/2011] [Accepted: 10/21/2011] [Indexed: 11/25/2022]
Abstract
OBJECTIVE Accurate prediction of major histocompatibility complex (MHC) class II binding peptides helps reduce the experimental cost of identifying helper T cell epitopes, which has been a challenging problem partly because of the variable length of the binding peptides. This work aims to develop an accurate model for predicting MHC-binding peptides using machine learning methods. METHODS In this work, a machine learning method, continuous kernel discrimination (CKD), was used for predicting MHC class II binders of variable lengths. The composition, transition and distribution features were used for encoding peptide sequences, and the Metropolis Monte Carlo simulated annealing approach was used for feature selection. RESULTS Feature selection was found to significantly improve the performance of the model. For benchmark dataset Dataset-1, the number of features is reduced from 147 to 24 and the area under the receiver operating characteristic curve (AUC) is improved from 0.8088 to 0.9034, while for benchmark dataset Dataset-2, the number of features is reduced from 147 to 44 and the AUC is improved from 0.7349 to 0.8499. An optimal CKD model was derived from the feature selection and bandwidth optimization using 10-fold cross-validation. Its AUC values are between 0.831 and 0.980 evaluated on benchmark datasets BM-Set1 and are between 0.806 and 0.949 on benchmark datasets BM-Set2 for MHC class II alleles. These results indicate a significantly better performance for our CKD model over other earlier models based on the training and testing of the same datasets. CONCLUSIONS Our study suggested that the CKD method outperforms other machine learning methods proposed earlier in the prediction of MHC class II binding peptides. Moreover, the choice of the cut-off for the CKD classifier is crucial for its performance.
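A sketch of composition, transition and distribution (CTD) encoding for a single property, assuming a commonly used three-class hydrophobicity grouping and one particular convention for the distribution quantiles. Both are assumptions: the abstract does not list its groupings or conventions. One property yields 3 + 3 + 15 = 21 descriptors, so the 147 features mentioned would be consistent with seven properties, although the abstract does not spell this out.

```python
import math

# Assumed three-class hydrophobicity grouping (polar, neutral, hydrophobic).
CLASSES = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}
AA_TO_CLASS = {aa: c for c, members in CLASSES.items() for aa in members}


def ctd_one_property(seq):
    """Return the 21 CTD descriptors of `seq` for one property."""
    enc = "".join(AA_TO_CLASS[aa] for aa in seq.upper())
    n = len(enc)

    # Composition: fraction of residues in each class.
    comp = [enc.count(c) / n for c in "123"]

    # Transition: fraction of adjacent pairs that switch between two classes.
    pairs = [enc[i:i + 2] for i in range(n - 1)]
    trans = []
    for a, b in (("1", "2"), ("1", "3"), ("2", "3")):
        hits = sum(1 for p in pairs if p in (a + b, b + a))
        trans.append(hits / (n - 1) if n > 1 else 0.0)

    # Distribution: relative position of the first, 25%, 50%, 75% and last
    # occurrence of each class (ceil-based quantile; one of several conventions).
    dist = []
    for c in "123":
        positions = [i + 1 for i, ch in enumerate(enc) if ch == c]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                dist.append(0.0)
                continue
            k = max(1, math.ceil(frac * len(positions)))
            dist.append(positions[k - 1] / n)

    return comp + trans + dist
```

Like dipeptide composition, CTD gives a fixed-length vector for peptides of any length, which is one way to sidestep the variable-length problem the abstract describes.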
Affiliation(s)
- Ju He
- College of Chemistry, Sichuan University, Chengdu 610064, People's Republic of China
| | | | | | | | | | | |
|
15
|
EL-Manzalawy Y, Dobbs D, Honavar V. Predicting MHC-II binding affinity using multiple instance regression. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 8:1067-1079. [PMID: 20855923 PMCID: PMC3400677 DOI: 10.1109/tcbb.2010.94] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 05/29/2023]
Abstract
Reliably predicting the ability of antigen peptides to bind to major histocompatibility complex class II (MHC-II) molecules is an essential step in developing new vaccines. Uncovering the amino acid sequence correlates of the binding affinity of MHC-II binding peptides is important for understanding pathogenesis and immune response. The task of predicting MHC-II binding peptides is complicated by the significant variability in their length. Most existing computational methods for predicting MHC-II binding peptides focus on identifying a nine-amino-acid core region in each binding peptide. We formulate the problems of qualitatively and quantitatively predicting flexible-length MHC-II peptides as multiple instance learning and multiple instance regression problems, respectively. Based on this formulation, we introduce MHCMIR, a novel method for predicting MHC-II binding affinity using multiple instance regression. We present results of experiments using several benchmark data sets that show that MHCMIR is competitive with the state-of-the-art methods for predicting MHC-II binding peptides. An online web server that implements the MHCMIR method for MHC-II binding affinity prediction is freely accessible at http://ailab.cs.iastate.edu/mhcmir.
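One way to picture the multiple instance formulation described above is to treat each flexible-length peptide as a "bag" whose instances are its overlapping nine-residue windows, with one affinity label attached to the whole bag. The sketch below only builds such bags; the function name and the choice to return an empty bag for peptides shorter than the window are assumptions, and it does not reproduce how MHCMIR represents or learns from the instances.

```python
def peptide_to_bag(peptide, core_len=9):
    """Return all overlapping windows of length core_len as the bag's instances.

    Peptides shorter than core_len yield an empty bag (an illustrative choice).
    """
    return [peptide[i:i + core_len] for i in range(len(peptide) - core_len + 1)]


# A labeled example pairs the whole bag with a single measured affinity;
# the affinity value here is made up purely to show the data layout.
example = (peptide_to_bag("ACDEFGHIKLMNPQR"), 0.5)
```

A 15-residue peptide yields seven candidate cores, only some of which may actually sit in the binding groove, which is why the label is attached to the bag and not to any single window.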
Affiliation(s)
- Yasser EL-Manzalawy
- Department of Systems and Computers Engineering, Al-Azhar University, Cairo, Egypt.
|
16
|
Rao HB, Zhu F, Yang GB, Li ZR, Chen YZ. Update of PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res 2011; 39:W385-90. [PMID: 21609959 PMCID: PMC3125735 DOI: 10.1093/nar/gkr284] [Citation(s) in RCA: 105] [Impact Index Per Article: 8.1] [Indexed: 12/04/2022] Open
Abstract
Sequence-derived structural and physicochemical features have been extensively used for analyzing and predicting structural, functional, expression and interaction profiles of proteins and peptides. PROFEAT has been developed as a web server for computing commonly used features of proteins and peptides from amino acid sequence. To facilitate more extensive studies of protein and peptides, numerous improvements and updates have been made to PROFEAT. We added new functions for computing descriptors of protein–protein and protein–small molecule interactions, segment descriptors for local properties of protein sequences, topological descriptors for peptide sequences and small molecule structures. We also added new feature groups for proteins and peptides (pseudo-amino acid composition, amphiphilic pseudo-amino acid composition, total amino acid properties and atomic-level topological descriptors) as well as for small molecules (atomic-level topological descriptors). Overall, PROFEAT computes 11 feature groups of descriptors for proteins and peptides, and a feature group of more than 400 descriptors for small molecules plus the derived features for protein–protein and protein–small molecule interactions. Our computational algorithms have been extensively tested and used in a number of published works for predicting proteins of specific structural or functional classes, protein–protein interactions, peptides of specific functions and quantitative structure activity relationships of small molecules. PROFEAT is accessible free of charge at http://bidd.cz3.nus.edu.sg/cgi-bin/prof/protein/profnew.cgi.
Affiliation(s)
- H B Rao
- College of Chemistry, Sichuan University, Chengdu, 610064, PR China
|
17
|
Nielsen M, Justesen S, Lund O, Lundegaard C, Buus S. NetMHCIIpan-2.0 - Improved pan-specific HLA-DR predictions using a novel concurrent alignment and weight optimization training procedure. Immunome Res 2010; 6:9. [PMID: 21073747 PMCID: PMC2994798 DOI: 10.1186/1745-7580-6-9] [Citation(s) in RCA: 116] [Impact Index Per Article: 8.3] [Received: 09/22/2010] [Accepted: 11/13/2010] [Indexed: 01/16/2023] Open
Abstract
Background Binding of peptides to Major Histocompatibility class II (MHC-II) molecules plays a central role in governing responses of the adaptive immune system. MHC-II molecules sample peptides from the extracellular space allowing the immune system to detect the presence of foreign microbes from this compartment. Predicting which peptides bind to an MHC-II molecule is therefore of pivotal importance for understanding the immune response and its effect on host-pathogen interactions. The experimental cost associated with characterizing the binding motif of an MHC-II molecule is significant and large efforts have therefore been placed in developing accurate computer methods capable of predicting this binding event. Prediction of peptide binding to MHC-II is complicated by the open binding cleft of the MHC-II molecule, allowing binding of peptides extending out of the binding groove. Moreover, the genes encoding the MHC molecules are immensely diverse leading to a large set of different MHC molecules each potentially binding a unique set of peptides. Characterizing each MHC-II molecule using peptide-screening binding assays is hence not a viable option. Results Here, we present an MHC-II binding prediction algorithm aiming at dealing with these challenges. The method is a pan-specific version of the earlier published allele-specific NN-align algorithm and does not require any pre-alignment of the input data. This allows the method to benefit also from information from alleles covered by limited binding data. The method is evaluated on a large and diverse set of benchmark data, and is shown to significantly out-perform state-of-the-art MHC-II prediction methods. In particular, the method is found to boost the performance for alleles characterized by limited binding data where conventional allele-specific methods tend to achieve poor prediction accuracy. Conclusions The method thus shows great potential for efficiently boosting the accuracy of MHC-II binding prediction, as accurate predictions can be obtained for novel alleles at highly reduced experimental costs. Pan-specific binding predictions can be obtained for all alleles with known protein sequence, and the method can benefit by including data in the training from alleles even where only few binders are known. The method and benchmark data are available at http://www.cbs.dtu.dk/services/NetMHCIIpan-2.0
Affiliation(s)
- Morten Nielsen
- Center for Biological Sequence Analysis, BioCentrum-DTU, Building 208, Technical University of Denmark, DK-2800 Lyngby, Denmark.
|
18
|
Zhang W, Liu J, Niu Y. Quantitative prediction of MHC-II binding affinity using particle swarm optimization. Artif Intell Med 2010; 50:127-32. [PMID: 20541921 DOI: 10.1016/j.artmed.2010.05.003] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 04/21/2009] [Revised: 03/31/2010] [Accepted: 05/12/2010] [Indexed: 01/13/2023]
Abstract
OBJECTIVE Helper T-cell epitopes (Th epitopes) are the basic units that activate the helper T-cell immune response, and they are helpful for understanding the immune mechanism and developing vaccines. Peptide and major histocompatibility complex class II (MHC-II) binding is an important prerequisite event for the helper T-cell immune response, and the binding peptides are usually recognized as Th epitopes; therefore, we can identify Th epitopes by predicting MHC-II binding peptides. Recently, instead of differentiating the peptides as binder or non-binder, researchers have become more interested in predicting binding affinities between MHC-II molecules and peptides. METHODOLOGY Motivated by the collective search strategy of the particle swarm optimization algorithm (PSO), a method was developed to make the direct prediction of peptide binding affinity. In our paper, PSO was utilized to search for the optimal position-specific scoring matrices (PSSM) from the experimentally derived allele-related peptides, and then the prediction models were constructed based on the matrices. Moreover, we evaluated several factors influencing the binding affinity, including peptide length and flanking residue length, and incorporated them into our models. RESULTS The performance of our models was evaluated on three MHC-II alleles from the AntiJen database and 14 MHC-II alleles from the IEDB database. When compared to the existing popular quantitative methods such as MHCPred, SVRMHC, ARB and SMM-align, our method achieves better performance in terms of correlation coefficient (r) and area under ROC curve (AUC). In addition, the results demonstrated that the performance of models was further improved by incorporating the global length information, achieving an average AUC value of 0.7534 and an average r value of 0.4707. CONCLUSIONS Quantitative prediction of MHC-II binding affinity can be modeled as an optimization problem. Our PSO-based method can find the optimal PSSM, which will then be used for identifying the binding cores and scoring the binding affinities of the peptides. The experiment results show that our method is promising for the prediction of MHC-II binding affinity.
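The last step described above, using a PSSM to pick a binding core and score a peptide, can be sketched as follows. The matrix values, the three-position width (real cores are usually taken as nine residues), and the function names are all made up for illustration; how the matrix is fitted (by PSO in the paper) and how length terms enter the model are not shown.

```python
def score_window(window, pssm):
    """Sum the per-position scores of a window; unlisted residues score 0.0."""
    return sum(pssm[i].get(aa, 0.0) for i, aa in enumerate(window))


def best_core(peptide, pssm):
    """Return (core, score) for the highest-scoring window of PSSM width."""
    width = len(pssm)
    windows = [peptide[i:i + width] for i in range(len(peptide) - width + 1)]
    if not windows:
        raise ValueError("peptide is shorter than the PSSM width")
    core = max(windows, key=lambda w: score_window(w, pssm))
    return core, score_window(core, pssm)


# Toy 3-position matrix: one dict of residue scores per core position.
toy_pssm = [{"A": 1.0}, {"C": 2.0}, {"D": 0.5}]
print(best_core("GGACDGG", toy_pssm))  # ('ACD', 3.5)
```

Taking the maximum over windows is what lets a fixed-width matrix handle peptides of any length: the flanking residues simply fall outside the chosen core.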
Affiliation(s)
- Wen Zhang
- School of Computer Science, Wuhan University, Wuhan 430072, People's Republic of China.
|
19
|
Rao H, Li Z, Li X, Ma X, Ung C, Li H, Liu X, Chen Y. Identification of small molecule aggregators from large compound libraries by support vector machines. J Comput Chem 2010; 31:752-63. [PMID: 19569201 DOI: 10.1002/jcc.21347] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 01/04/2023]
Abstract
Small molecule aggregators non-specifically inhibit multiple unrelated proteins, rendering them therapeutically useless. They frequently appear as false hits and thus need to be eliminated in high-throughput screening campaigns. Computational methods have been explored for identifying aggregators, but they have not been tested in screening large compound libraries. We used 1319 aggregators and 128,325 non-aggregators to develop a support vector machines (SVM) aggregator identification model, which was tested by four methods. The first is five-fold cross-validation, which showed comparable aggregator and significantly improved non-aggregator identification rates against earlier studies. The second is the independent test of 17 aggregators discovered independently from the training aggregators, 71% of which were correctly identified. The third is retrospective screening of 13M PUBCHEM and 168K MDDR compounds, which predicted 97.9% and 98.7% of the PUBCHEM and MDDR compounds as non-aggregators. The fourth is retrospective screening of 5527 MDDR compounds similar to the known aggregators, 1.14% of which were predicted as aggregators. SVM showed slightly better overall performance than two other machine learning methods based on five-fold cross-validation studies of the same settings. Molecular features of aggregation, extracted by a feature selection method, are consistent with published profiles. SVM showed substantial capability in identifying aggregators from large libraries at low false-hit rates.
Affiliation(s)
- Hanbing Rao
- College of Chemistry, Sichuan University, Chengdu 610064, People's Republic of China
|
20
|
Abstract
SUMMARY Major histocompatibility complex class II (MHC-II) molecules sample peptides from the extracellular space, allowing the immune system to detect the presence of foreign microbes from this compartment. To be able to predict the immune response to given pathogens, a number of methods have been developed to predict peptide-MHC binding. However, few methods other than the pioneering TEPITOPE/ProPred method have been developed for MHC-II. Despite recent progress in method development, the predictive performance for MHC-II remains significantly lower than what can be obtained for MHC-I. One reason for this is that the MHC-II molecule is open at both ends allowing binding of peptides extending out of the groove. The binding core of MHC-II-bound peptides is therefore not known a priori and the binding motif is hence not readily discernible. Recent progress has been obtained by including the flanking residues in the predictions. All attempts to make ab initio predictions based on protein structure have failed to reach predictive performances similar to those that can be obtained by data-driven methods. Thousands of different MHC-II alleles exist in humans. Recently developed pan-specific methods have been able to make reasonably accurate predictions for alleles that were not included in the training data. These methods can be used to define supertypes (clusters) of MHC-II alleles where alleles within each supertype have similar binding specificities. Furthermore, the pan-specific methods have been used to make a graphical atlas such as the MHCMotifviewer, which allows for visual comparison of specificities of different alleles.
Affiliation(s)
- Morten Nielsen
- Department of Systems Biology, Technical University of Denmark, Centre for Biological Sequence Analysis, Lyngby, Denmark.
|
21
|
Liu XH, Ma XH, Tan CY, Jiang YY, Go ML, Low BC, Chen YZ. Virtual screening of Abl inhibitors from large compound libraries by support vector machines. J Chem Inf Model 2009; 49:2101-10. [PMID: 19689138 DOI: 10.1021/ci900135u] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.2] [Indexed: 12/26/2022]
Abstract
Abl promotes cancers by regulating cell morphogenesis, motility, growth, and survival. Successes of several marketed and clinical trial Abl inhibitors against leukemia and other cancers, together with the appearance of reduced efficacy and drug resistance, have led to significant interest in and efforts for developing new Abl inhibitors. In silico methods of pharmacophore, fragment, and molecular docking have been used in some of these efforts. It is desirable to explore other in silico methods capable of searching large compound libraries at high yields and reduced false-hit rates. We evaluated support vector machines (SVM) as a virtual screening tool for searching Abl inhibitors from large compound libraries. SVM trained and tested by 708 inhibitors and 65,494 putative noninhibitors correctly identified 84.4 to 92.3% inhibitors and 99.96 to 99.99% noninhibitors in 5-fold cross validation studies. SVM trained by 708 pre-2008 inhibitors and 65,494 putative noninhibitors correctly identified 50.5% of the 91 inhibitors reported since 2008 and predicted as inhibitors 29,072 (0.21%) of 13.56M PubChem, 659 (0.39%) of 168K MDDR, and 330 (5.0%) of 6638 MDDR compounds similar to the known inhibitors. SVM showed comparable yields and substantially reduced false-hit rates against two similarity-based VS methods and another machine learning VS method based on the same training and testing data sets and molecular descriptors. These results suggest that SVM is capable of searching Abl inhibitors from large compound libraries at low false-hit rates.
Affiliation(s)
- X H Liu
- Bioinformatics and Drug Design Group, Department of Pharmacy, Centre for Computational Science and Engineering, National University of Singapore, Blk S16, Level 8, 3 Science Drive 2, Singapore 117543
|
22
|
Rao H, Yang G, Tan N, Li P, Li Z, Li X. Prediction of HIV-1 Protease Inhibitors Using Machine Learning Approaches. ACTA ACUST UNITED AC 2009. [DOI: 10.1002/qsar.200960021] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/07/2022]
|
23
|
Nielsen M, Lund O. NN-align. An artificial neural network-based alignment algorithm for MHC class II peptide binding prediction. BMC Bioinformatics 2009; 10:296. [PMID: 19765293 PMCID: PMC2753847 DOI: 10.1186/1471-2105-10-296] [Citation(s) in RCA: 380] [Impact Index Per Article: 25.3] [Received: 03/30/2009] [Accepted: 09/18/2009] [Indexed: 01/03/2023] Open
Abstract
BACKGROUND The major histocompatibility complex (MHC) molecule plays a central role in controlling the adaptive immune response to infections. MHC class I molecules present peptides derived from intracellular proteins to cytotoxic T cells, whereas MHC class II molecules stimulate cellular and humoral immunity through presentation of extracellularly derived peptides to helper T cells. Identification of which peptides will bind a given MHC molecule is thus of great importance for the understanding of host-pathogen interactions, and large efforts have been placed in developing algorithms capable of predicting this binding event. RESULTS Here, we present a novel artificial neural network-based method, NN-align that allows for simultaneous identification of the MHC class II binding core and binding affinity. NN-align is trained using a novel training algorithm that allows for correction of bias in the training data due to redundant binding core representation. Incorporation of information about the residues flanking the peptide-binding core is shown to significantly improve the prediction accuracy. The method is evaluated on a large-scale benchmark consisting of six independent data sets covering 14 human MHC class II alleles, and is demonstrated to outperform other state-of-the-art MHC class II prediction methods. CONCLUSION The NN-align method is competitive with the state-of-the-art MHC class II peptide binding prediction algorithms. The method is publicly available at http://www.cbs.dtu.dk/services/NetMHCII-2.0.
Affiliation(s)
- Morten Nielsen
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, DK-2800 Lyngby, Denmark.
|
24
|
Toussaint NC, Kohlbacher O. Towards in silico design of epitope-based vaccines. Expert Opin Drug Discov 2009; 4:1047-60. [PMID: 23480396 DOI: 10.1517/17460440903242283] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Indexed: 01/07/2023]
Abstract
BACKGROUND Epitope-based vaccines (EVs) make use of immunogenic peptides (epitopes) to trigger an immune response. Due to their manifold advantages, EVs have recently been attracting growing interest. The success of an EV is determined by the choice of epitopes used as a basis. However, the experimental discovery of candidate epitopes is expensive in terms of time and money. Furthermore, for the final choice of epitopes various immunological requirements have to be considered. METHODS Numerous in silico approaches exist that can guide the design of EVs. In particular, computational methods for MHC binding prediction have already become standard tools in immunology. Apart from binding prediction and prediction of antigen processing, methods for epitope design and selection have been suggested. We review these in silico approaches for epitope discovery and selection along with their strengths and weaknesses. Finally, we discuss some of the obvious problems in the design of EVs. CONCLUSION State-of-the-art in silico approaches to MHC binding prediction yield high accuracies. However, a more thorough understanding of the underlying biological processes and significant amounts of experimental data will be required for the validation and improvement of in silico approaches to the remaining aspects of EV design.
Affiliation(s)
- Nora C Toussaint
- Eberhard Karls University, Center for Bioinformatics Tübingen, Division for Simulation of Biological Systems, 72076 Tübingen, Germany
|
25
|
Yang X, Yu X. An introduction to epitope prediction methods and software. Rev Med Virol 2009; 19:77-96. [PMID: 19101924 DOI: 10.1002/rmv.602] [Citation(s) in RCA: 127] [Impact Index Per Article: 8.5] [Indexed: 11/06/2022]
Abstract
In this paper, current prediction methods and algorithms for both T- and B cell epitopes are reviewed, and a comprehensive summary of epitope prediction software and databases currently available online is also provided. This review can offer researchers in this field a sense of direction and insights for future work. However, our main purpose is to introduce clinical and basic biomedical researchers who are not familiar with these biological analysis tools and databases to these online resources and to provide guidance on how to use them effectively.
Affiliation(s)
- Xingdong Yang
- Department of Veterinary Medicine, Hunan Agricultural University, Changsha, Hunan, P. R. China
|
26
|
Jia J, Cui J, Liu X, Han J, Yang S, Wei Y, Chen Y. Genome-scale search of tumor-specific antigens by collective analysis of mutations, expressions and T-cell recognition. Mol Immunol 2009; 46:1824-9. [PMID: 19243822 DOI: 10.1016/j.molimm.2009.01.019] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 11/11/2008] [Revised: 01/05/2009] [Accepted: 01/12/2009] [Indexed: 11/17/2022]
Abstract
BACKGROUND Tumor-specific antigens (TSAs) are potential sources of cancer vaccines, some of which are derived from T-cell epitopes of over-expressed mutant proteins to elicit immunogenicity and overcome tolerance and evasion. The lack of effective vaccines for many cancers has prompted strong interest in improved TSA search methods. Recent progress in profiling somatic mutations and expressions of human cancer genomes, and in predicting T-cell epitopes, enables genome-scale TSA search by collectively analyzing these profiles. Such a collective approach has not been explored in spite of the availability and usage of individual methods. METHODOLOGY Genome-scale TSA search was conducted by genome-scale search of tumor-specific mutations in differentially over-expressed genes of specific cancers based on tumor-specific somatic mutation and microarray gene expression data, followed by T-cell recognition analysis of the identified mutant and over-expressed peptides to determine if they are substrates of proteasomal cleavage, TAP mediated transport and MHC-I alleles capable of eliciting immune response. The performance of our method was tested against 12 and 4 known T-cell defined melanoma and lung cancer TSAs in the Cancer Immunity database. CONCLUSIONS Our approach identified 50% and 75% of the 12 and 4 known TSAs and predicted from the human cancer genomes an additional 8-250 and 14-359 putative TSAs of 5 and 3 HLA alleles respectively. The known TSA hit rates (1.9% and 0.8%) are enriched by 29-fold and 35-fold over those of mutation analysis. The numbers of predicted TSAs are within the testing range of typical screening campaigns. Noise in expression data of small sample sizes appears to be a major factor for misidentification of known TSAs. With improved data quality and analysis methods, the collective approach is potentially useful for facilitating genome-scale TSA search.
Affiliation(s)
- Jia Jia
- Bioinformatics and Drug Design Group, Department of Pharmacy, and Centre for Computational Science and Engineering, National University of Singapore, Singapore 117543, Singapore
|
27
|
Zhang W, Liu J, Niu YQ, Wang L, Hu X. A Bayesian regression approach to the prediction of MHC-II binding affinity. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2008; 92:1-7. [PMID: 18562039 DOI: 10.1016/j.cmpb.2008.05.002] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Received: 02/05/2008] [Revised: 05/06/2008] [Accepted: 05/06/2008] [Indexed: 05/26/2023]
Abstract
Peptide-major histocompatibility complex (MHC) binding is an important prerequisite event and has immediate consequences for the immune response. Peptides that bind to MHC molecules can activate T-cell immunity, and they are useful for understanding the immune mechanism and developing vaccines for diseases. Recently, researchers have become interested in predicting binding affinity instead of differentiating the peptides as binder or non-binder. In this paper, we use the sparse Bayesian regression algorithm proposed by Tipping [M.E. Tipping, Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res. (2001)] to derive position-specific scoring matrices from allele-related peptides, and develop models for the prediction of MHC-II binding affinity. We explore the impact of peptide length and peptide flanking residue length on binding affinity, and incorporate these factors into our models to enhance prediction performance. When applied to the datasets from the AntiJen database and the IEDB database, our method performs better than several popular quantitative methods.
Affiliation(s)
- Wen Zhang
- School of Computer Science, Wuhan University, Wuhan 430079, China.
|
28
|
El-Manzalawy Y, Dobbs D, Honavar V. On evaluating MHC-II binding peptide prediction methods. PLoS One 2008; 3:e3268. [PMID: 18813344 PMCID: PMC2533399 DOI: 10.1371/journal.pone.0003268] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.3] [Received: 04/01/2008] [Accepted: 08/20/2008] [Indexed: 12/26/2022] Open
Abstract
Choice of one method over another for MHC-II binding peptide prediction is typically based on published reports of their estimated performance on standard benchmark datasets. We show that several standard benchmark datasets of unique peptides used in such studies contain a substantial number of peptides that share a high degree of sequence identity with one or more other peptide sequences in the same dataset. Thus, in a standard cross-validation setup, the test set and the training set are likely to contain sequences that share a high degree of sequence identity with each other, leading to overly optimistic estimates of performance. Hence, to more rigorously assess the relative performance of different prediction methods, we explore the use of similarity-reduced datasets. We introduce three similarity-reduced MHC-II benchmark datasets derived from MHCPEP, MHCBN, and IEDB databases. The results of our comparison of the performance of three MHC-II binding peptide prediction methods estimated using datasets of unique peptides with that obtained using their similarity-reduced counterparts show that the former can be rather optimistic relative to the performance of the same methods on similarity-reduced counterparts of the same datasets. Furthermore, our results demonstrate that conclusions regarding the superiority of one method over another drawn on the basis of performance estimates obtained using commonly used datasets of unique peptides are often contradicted by the observed performance of the methods on the similarity-reduced versions of the same datasets. These results underscore the importance of using similarity-reduced datasets in rigorously comparing the performance of alternative MHC-II peptide prediction methods.
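The idea of a similarity-reduced dataset can be sketched with a simple greedy filter: keep a peptide only if it is not too similar to any peptide already kept. The identity measure (best ungapped overlap of the shorter peptide along the longer one), the 0.8 threshold, and the function names are assumptions chosen for illustration; the abstract does not describe the paper's actual reduction procedures, which may differ.

```python
def identity(a, b):
    """Best fraction of matching positions when sliding the shorter peptide
    along the longer one without gaps (an illustrative similarity measure)."""
    short, long_ = sorted((a, b), key=len)
    if not short:
        return 0.0
    best = 0
    for offset in range(len(long_) - len(short) + 1):
        matches = sum(x == y for x, y in zip(short, long_[offset:]))
        best = max(best, matches)
    return best / len(short)


def similarity_reduce(peptides, threshold=0.8):
    """Greedily keep peptides whose identity to every kept peptide is below threshold."""
    kept = []
    for peptide in peptides:
        if all(identity(peptide, other) < threshold for other in kept):
            kept.append(peptide)
    return kept


print(similarity_reduce(["AAAAAAAAA", "AAAAAAAAC", "GGGGGGGGG"]))
# ['AAAAAAAAA', 'GGGGGGGGG']
```

Applying a filter like this before splitting into folds is what prevents near-duplicate peptides from landing on both sides of a train/test split. The result of a greedy filter depends on input order, so a fixed ordering should be used when comparing methods.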
Affiliation(s)
- Yasser El-Manzalawy
- Department of Computer Science, Center for Computational Intelligence, Learning, and Discovery, Iowa State University, Ames, Iowa, USA.
|
29
|
Lara J, Wohlhueter RM, Dimitrova Z, Khudyakov YE. Artificial neural network for prediction of antigenic activity for a major conformational epitope in the hepatitis C virus NS3 protein. Bioinformatics 2008; 24:1858-64. [DOI: 10.1093/bioinformatics/btn339] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 01/25/2023] Open
|
30
|
Nielsen M, Lundegaard C, Blicher T, Peters B, Sette A, Justesen S, Buus S, Lund O. Quantitative predictions of peptide binding to any HLA-DR molecule of known sequence: NetMHCIIpan. PLoS Comput Biol 2008; 4:e1000107. [PMID: 18604266 PMCID: PMC2430535 DOI: 10.1371/journal.pcbi.1000107] [Citation(s) in RCA: 215] [Impact Index Per Article: 13.4] [Received: 01/28/2008] [Accepted: 05/28/2008] [Indexed: 01/05/2023] Open
Abstract
CD4 positive T helper cells control many aspects of specific immunity. These cells are specific for peptides derived from protein antigens and presented by molecules of the extremely polymorphic major histocompatibility complex (MHC) class II system. The identification of peptides that bind to MHC class II molecules is therefore of pivotal importance for rational discovery of immune epitopes. HLA-DR is a prominent example of a human MHC class II. Here, we present a method, NetMHCIIpan, that allows for pan-specific predictions of peptide binding to any HLA-DR molecule of known sequence. The method is derived from a large compilation of quantitative HLA-DR binding events covering 14 of the more than 500 known HLA-DR alleles. Taking both peptide and HLA sequence information into account, the method can generalize and predict peptide binding also for HLA-DR molecules where experimental data is absent. Validation of the method includes identification of endogenously derived HLA class II ligands, cross-validation, leave-one-molecule-out, and binding motif identification for hitherto uncharacterized HLA-DR molecules. The validation shows that the method can successfully predict binding for HLA-DR molecules—even in the absence of specific data for the particular molecule in question. Moreover, when compared to TEPITOPE, currently the only other publicly available prediction method aiming at providing broad HLA-DR allelic coverage, NetMHCIIpan performs equivalently for alleles included in the training of TEPITOPE while outperforming TEPITOPE on novel alleles. We propose that the method can be used to identify those hitherto uncharacterized alleles, which should be addressed experimentally in future updates of the method to cover the polymorphism of HLA-DR most efficiently. 
We thus conclude that the presented method meets the challenge of keeping up with the MHC polymorphism discovery rate and that it can be used to sample the MHC “space,” enabling a highly efficient iterative process for improving MHC class II binding predictions.
CD4 positive T helper cells provide essential help for stimulation of both cellular and humoral immune reactions. T helper cells recognize peptides presented by molecules of the major histocompatibility complex (MHC) class II system. HLA-DR is a prominent example of a human MHC class II locus. The HLA molecules are extremely polymorphic, and more than 500 different HLA-DR protein sequences are known today. Each HLA-DR molecule potentially binds a unique set of antigenic peptides, and experimental characterization of the binding specificity for each molecule would be an immense and highly costly task. Only a very limited set of MHC molecules has been characterized experimentally. We have demonstrated earlier that it is possible to derive accurate predictions for MHC class I proteins by interpolating information from neighboring molecules. It is not straightforward to take a similar approach to derive pan-specific HLA-DR class II predictions because the HLA class II molecules can bind peptides of very different lengths. Here, we nonetheless show that this is indeed possible. We develop an HLA-DR pan-specific method that allows for prediction of binding to any HLA-DR molecule of known sequence—even in the absence of specific data for the particular molecule in question.
Collapse
Affiliation(s)
- Morten Nielsen
- Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Lyngby, Denmark.
|
31
|
Ma XH, Wang R, Yang SY, Li ZR, Xue Y, Wei YC, Low BC, Chen YZ. Evaluation of virtual screening performance of support vector machines trained by sparsely distributed active compounds. J Chem Inf Model 2008; 48:1227-37. [PMID: 18533644 DOI: 10.1021/ci800022e] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.0] [Indexed: 01/04/2023]
Abstract
The virtual screening performance of support vector machines (SVM) depends on the diversity of the active and inactive compounds used for training. While diverse inactive compounds can be routinely generated, the number and diversity of known actives are typically low. We evaluated the performance of SVM trained by sparsely distributed actives in six MDDR biological target classes composed of a high number of known actives (983-1645) of high, intermediate, and low structural diversity (muscarinic M1 receptor agonists, NMDA receptor antagonists, thrombin inhibitors, HIV protease inhibitors, cephalosporins, and renin inhibitors). SVM trained by regularly sparse data sets of 100 actives showed improved yields at substantially reduced false-hit rates compared to those of published studies and those of a Tanimoto-based similarity searching method using the same data sets and molecular descriptors. SVM trained by very sparse data sets of 40 actives (2.4%-4.1% of the known actives) predicted 17.5-39.5%, 23.0-48.1%, and 70.2-92.4% of the remaining 943-1605 actives in the high, intermediate, and low diversity classes, respectively, 13.8-68.7% of which are outside the training compound families. SVM predicted 99.97% and 97.1% of the 9.997 million PubChem and 167K remaining MDDR compounds as inactive, and 2.6%-8.3% of the 19,495-38,483 MDDR compounds similar to the known actives as active. These results suggest that SVM has substantial capability in identifying novel active compounds from sparse active data sets at low false-hit rates.
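The Tanimoto-based similarity search used as the baseline in this abstract can be illustrated with a minimal sketch. It assumes fingerprints are given as sets of "on" bit indices; the function names, the toy fingerprints, and the 0.9 cut-off are hypothetical and not taken from the paper.

```python
from typing import Dict, Iterable, List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient: shared 'on' bits divided by all 'on' bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints: treat as dissimilar
    return len(a & b) / union


def similarity_search(known_actives: Iterable[Set[int]],
                      library: Dict[str, Set[int]],
                      threshold: float = 0.9) -> List[str]:
    """Return ids of library compounds whose best similarity to any
    known active reaches the (assumed) threshold."""
    actives = list(known_actives)
    hits = []
    for compound_id, fingerprint in library.items():
        best = max(tanimoto(fingerprint, active) for active in actives)
        if best >= threshold:
            hits.append(compound_id)
    return hits


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    actives = [{1, 2, 3}]
    library = {"cmpd_a": {1, 2, 3}, "cmpd_b": {7, 8}}
    print(similarity_search(actives, library))
```

A similarity search of this kind can only retrieve compounds that resemble a known active, which is one reason the abstract reports separately on hits outside the training compound families.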
Affiliation(s)
- X H Ma
- Centre for Computational Science and Engineering, National University of Singapore, Singapore
|
32
|
Abstract
The prevailing methods to predict T-cell epitopes are reviewed. Motif matching, matrix, support vector machine (SVM), and empirical scoring function methods are the main focus, and the thermodynamic integration (TI) method using all-atom molecular dynamics (MD) simulation is mentioned briefly. The motif matching method appeared first and developed with the increased understanding of the characteristic structure of MHC-peptide complexes, that is, pockets aligned in the groove and the corresponding residues that fit into them. This method is now becoming outdated because the information it uses is insufficient and inaccurate. The matrix method, which generalizes the interaction between MHC pockets and residues of the bound peptide to all positions in the groove, is the most prevalent one. Its computational efficiency makes it appropriate for scanning for T-cell epitope candidates among all proteins expressed in an organ or even in the whole body. A large amount of experimental binding data is necessary to determine a matrix. SVM is a relative of the artificial neural network, specifically a direct generalization of the linear perceptron. By incorporating non-binder data and adopting an encoding that reflects the physical properties of amino acids, its performance becomes quite high. Empirical scoring functions appear to rest on a physical basis. However, estimates derived directly from the method using only structural data are far from practical use. Through regression against binding data for a series of ligands and receptors, this method predicts binding affinity with appropriate accuracy. The TI method using MD requires only structural data and a general atomic parameter set, that is, a force field, and is hence theoretically the most consistent; however, the extent of perturbation, the inaccuracy of the force field, the immense amount of calculation required, and the continued difficulty of sampling adequate structures hamper the application of this method in practice.
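The matrix method described in this abstract can be sketched as follows: each peptide position contributes an independent score for the residue found there, and a protein is scanned window by window. This is a minimal illustration, not the reviewed authors' implementation; the 3-position matrix and its values are invented (real matrices are fitted to experimental binding data and typically cover 9 positions).

```python
from typing import Dict, List, Tuple

# Hypothetical toy matrix: one dict of residue scores per peptide position.
TOY_MATRIX: List[Dict[str, float]] = [
    {"L": 2.0, "M": 1.5},
    {"A": 0.5},
    {"V": 2.0, "L": 1.0},
]


def score_peptide(peptide: str, matrix: List[Dict[str, float]],
                  default: float = 0.0) -> float:
    """Sum the position-specific scores; residues absent from the matrix
    receive the default score."""
    if len(peptide) != len(matrix):
        raise ValueError("peptide length must match the matrix length")
    return sum(pos.get(res, default) for pos, res in zip(matrix, peptide))


def scan_protein(protein: str, matrix: List[Dict[str, float]],
                 top_n: int = 3) -> List[Tuple[float, int, str]]:
    """Score every window of the protein and return the top candidates
    as (score, start index, peptide), best first."""
    k = len(matrix)
    scored = [
        (score_peptide(protein[i:i + k], matrix), i, protein[i:i + k])
        for i in range(len(protein) - k + 1)
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top_n]


if __name__ == "__main__":
    for score, start, peptide in scan_protein("GLAVKMAL", TOY_MATRIX, top_n=2):
        print(start, peptide, score)
```

Because scoring a window is a handful of dictionary lookups and additions, scanning whole proteomes is cheap, which is the efficiency argument the abstract makes.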
Affiliation(s)
- Hiromichi Tsurui
- Department of Pathology, Juntendo University School of Medicine, Hongo, Tokyo, Japan.
|
33
|
Zhang W, Liu J, Niu Y. Quantitative prediction of MHC-II peptide binding affinity using relevance vector machine. Appl Intell 2008. [DOI: 10.1007/s10489-008-0121-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 12/23/2022]
|
34
|
A probabilistic meta-predictor for the MHC class II binding peptides. Immunogenetics 2007; 60:25-36. [PMID: 18092156 DOI: 10.1007/s00251-007-0266-y] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Received: 05/10/2007] [Accepted: 11/21/2007] [Indexed: 12/27/2022]
Abstract
Several computational methods for the prediction of major histocompatibility complex (MHC) class II binding peptides embodying different strengths and weaknesses have been developed. To provide reliable prediction, it is important to design a system that enables the integration of outcomes from various predictors. The construction of a meta-predictor of this type based on a probabilistic approach is introduced in this paper. The design permits the easy incorporation of results obtained from any number of individual predictors. It is demonstrated that this integrated method outperforms six state-of-the-art individual predictors based on computational studies using MHC class II peptides from 13 HLA alleles and three mouse MHC alleles obtained from the Immune Epitope Database and Analysis Resource. It is concluded that this integrative approach provides a clearly enhanced reliability of prediction. Moreover, this computational framework can be directly extended to MHC class I binding predictions.
|
35
|
Han LY, Ma XH, Lin HH, Jia J, Zhu F, Xue Y, Li ZR, Cao ZW, Ji ZL, Chen YZ. A support vector machines approach for virtual screening of active compounds of single and multiple mechanisms from large libraries at an improved hit-rate and enrichment factor. J Mol Graph Model 2007; 26:1276-86. [PMID: 18218332 DOI: 10.1016/j.jmgm.2007.12.002] [Citation(s) in RCA: 65] [Impact Index Per Article: 3.8] [Received: 04/29/2007] [Revised: 12/05/2007] [Accepted: 12/05/2007] [Indexed: 01/04/2023]
Abstract
Support vector machines (SVM) and other machine-learning (ML) methods have been explored as ligand-based virtual screening (VS) tools for facilitating lead discovery. While exhibiting good hit selection performance, in screening large compound libraries these methods tend to produce lower hit-rates than the best performing VS tools, partly because their training-sets contain a limited spectrum of inactive compounds. We tested whether the performance of SVM can be improved by using training-sets of diverse inactive compounds. In retrospective database screening of active compounds of single mechanism (HIV protease inhibitors, DHFR inhibitors, dopamine antagonists) and multiple mechanisms (CNS active agents) from large libraries of 2.986 million compounds, the yields, hit-rates, and enrichment factors of our SVM models are 52.4-78.0%, 4.7-73.8%, and 214-10,543, respectively, compared to 62-95%, 0.65-35%, and 20-1200 by structure-based VS and 55-81%, 0.2-0.7%, and 110-795 by other ligand-based VS tools in screening libraries of ≥1 million compounds. The hit-rates are comparable to, and the enrichment factors are substantially better than, the best results of other VS tools. 24.3-87.6% of the predicted hits are outside the known hit families. SVM appears to be potentially useful for facilitating lead discovery in VS of large compound libraries.
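The three figures of merit quoted in this abstract can be computed from four counts. The sketch below assumes the definitions commonly used in the virtual screening literature (yield = fraction of known actives recovered, hit-rate = fraction of predicted hits that are known actives, enrichment factor = hit-rate relative to random selection); the abstract itself does not spell them out, and the function name and example counts are hypothetical.

```python
from typing import Dict


def screening_metrics(n_library: int, n_actives: int,
                      n_predicted_hits: int, n_true_hits: int) -> Dict[str, float]:
    """Yield, hit-rate and enrichment factor for one screening run.

    n_library        : total compounds screened
    n_actives        : known actives hidden in the library
    n_predicted_hits : compounds the model flagged as active
    n_true_hits      : flagged compounds that are known actives
    """
    if min(n_library, n_actives, n_predicted_hits) <= 0:
        raise ValueError("library, actives and predicted hits must be positive")
    yield_ = n_true_hits / n_actives
    hit_rate = n_true_hits / n_predicted_hits
    random_hit_rate = n_actives / n_library  # expected hit-rate of random picking
    return {
        "yield": yield_,
        "hit_rate": hit_rate,
        "enrichment_factor": hit_rate / random_hit_rate,
    }


if __name__ == "__main__":
    # Invented counts: 100 actives in a library of one million compounds,
    # 500 compounds flagged, 60 of them true actives.
    print(screening_metrics(1_000_000, 100, 500, 60))
```

Under these definitions the enrichment factor grows with library size for a fixed hit-rate, so very large values are expected when a few hundred actives are hidden among millions of compounds.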
Affiliation(s)
- L Y Han
- Bioinformatics and Drug Design Group, Department of Pharmacy, National University of Singapore, Blk S16, Level 8, 3 Science Drive 2, Singapore 117543, Singapore
|
36
|
Ray S, Kepler TB. Amino acid biophysical properties in the statistical prediction of peptide-MHC class I binding. Immunome Res 2007; 3:9. [PMID: 17967170 PMCID: PMC2186325 DOI: 10.1186/1745-7580-3-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 05/11/2007] [Accepted: 10/29/2007] [Indexed: 11/10/2022]
Abstract
Background: A key step in the development of an adaptive immune response to pathogens or vaccines is the binding of short peptides to molecules of the Major Histocompatibility Complex (MHC) for presentation to T lymphocytes, which are thereby activated and differentiate into effector and memory cells. The rational design of vaccines consists in part in the identification of appropriate peptides to effect this process. There are several algorithms currently in use for making such predictions, but these are limited to a small number of MHC molecules and have good but imperfect prediction power.
Results: We have undertaken an exploration of the power gained by taking advantage of a natural representation of the amino acids in terms of their biophysical properties. We applied several well-known statistical classifiers with either a naive encoding of amino acids by name or an encoding by biophysical properties. In all cases, the encoding by biophysical properties leads to substantially lower misclassification error.
Conclusion: Representing amino acids by a few important biophysical and physicochemical properties provides a natural basis for representing peptides and greatly improves peptide-MHC class I binding prediction.
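The contrast between the two encodings compared in this abstract can be shown with a minimal sketch. The abstract does not list which biophysical properties were used, so the choice here (Kyte-Doolittle hydropathy and a simplified side-chain charge at neutral pH) is an assumption made only for illustration, as are the function names.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index (assumed here as one illustrative property).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Simplified formal side-chain charge at neutral pH (histidine treated as neutral).
CHARGE = {"D": -1.0, "E": -1.0, "K": 1.0, "R": 1.0}


def encode_by_name(peptide: str) -> List[float]:
    """Naive encoding: 20 indicator values per residue (one-hot by name)."""
    vector: List[float] = []
    for residue in peptide:
        vector.extend(1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS)
    return vector


def encode_by_properties(peptide: str) -> List[float]:
    """Property encoding: a few real-valued descriptors per residue."""
    vector: List[float] = []
    for residue in peptide:
        vector.extend([HYDROPATHY[residue], CHARGE.get(residue, 0.0)])
    return vector


if __name__ == "__main__":
    peptide = "LLFGYPVYV"  # example 9-mer
    print(len(encode_by_name(peptide)), len(encode_by_properties(peptide)))
```

The property encoding is much shorter and places chemically similar residues close together, which is one plausible reason a classifier trained on limited data would generalize better with it.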
Affiliation(s)
- Surajit Ray
- Department of Mathematics and Statistics, Boston University, Boston, MA, USA.
|
37
|
Ong SAK, Lin HH, Chen YZ, Li ZR, Cao Z. Efficacy of different protein descriptors in predicting protein functional families. BMC Bioinformatics 2007; 8:300. [PMID: 17705863 PMCID: PMC1997217 DOI: 10.1186/1471-2105-8-300] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.0] [Received: 11/01/2006] [Accepted: 08/17/2007] [Indexed: 02/02/2023]
Abstract
Background: Sequence-derived structural and physicochemical descriptors have frequently been used in machine learning prediction of protein functional families. There is therefore a need to comparatively evaluate the effectiveness of these descriptor-sets by using the same method and parameter optimization algorithm, and to examine whether the combined use of these descriptor-sets helps to improve predictive performance. Six individual descriptor-sets and four combination-sets were evaluated in support vector machines (SVM) prediction of six protein functional families.
Results: The performance of these descriptor-sets was ranked by Matthews correlation coefficient (MCC) and categorized into two groups. While there is no overwhelmingly favourable choice of descriptor-sets, certain trends were found. The combination-sets tend to give slightly but consistently higher MCC values and thus the best overall performance, such that three out of four combination-sets show slightly better performance compared to one out of six individual descriptor-sets.
Conclusion: Our study suggests that currently used descriptor-sets are generally useful for classifying proteins and that prediction performance may be enhanced by exploring combinations of descriptors.
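The Matthews correlation coefficient used here to rank descriptor-sets is computed from the four cells of a binary confusion matrix. The sketch below uses the standard formula; returning 0.0 when a row or column of the matrix is empty is a common convention and an assumption of this example, not something stated in the abstract.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Ranges from -1 (total disagreement) through 0 (no better than chance)
    to +1 (perfect prediction).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when any marginal total is zero
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    # Invented counts for a family with few members among many non-members.
    print(matthews_corrcoef(tp=40, tn=900, fp=50, fn=10))
```

Unlike plain accuracy, MCC stays informative when one class is much larger than the other, which is the usual situation when a protein family is classified against all other proteins.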
Affiliation(s)
- Serene AK Ong
- Department of Pharmacy, National University of Singapore, Blk S16, Level 8, 08-14, 3 Science Drive 2, Singapore 117543, Singapore
- Hong Huang Lin
- Department of Pharmacy, National University of Singapore, Blk S16, Level 8, 08-14, 3 Science Drive 2, Singapore 117543, Singapore
- Yu Zong Chen
- Department of Pharmacy, National University of Singapore, Blk S16, Level 8, 08-14, 3 Science Drive 2, Singapore 117543, Singapore
- Ze Rong Li
- College of Chemistry, Sichuan University, Chengdu, 610064, P.R. China
- Zhiwei Cao
- Shanghai Center for Bioinformatics Technology, 100, Qinzhou Road, Shanghai 200235 P.R. China
|
38
|
Zhang GL, Bozic I, Kwoh CK, August JT, Brusic V. Prediction of supertype-specific HLA class I binding peptides using support vector machines. J Immunol Methods 2007; 320:143-54. [PMID: 17303158 PMCID: PMC2806231 DOI: 10.1016/j.jim.2006.12.011] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.1] [Received: 11/27/2006] [Accepted: 12/20/2006] [Indexed: 12/13/2022]
Abstract
Experimental approaches for identifying T-cell epitopes are time-consuming, costly, and not applicable to large-scale screening. Computer modeling methods can help to minimize the number of experiments required, enable systematic scanning for candidate major histocompatibility complex (MHC) binding peptides, and thus speed up vaccine development. We developed a prediction system based on a novel data representation of peptide/MHC interaction and support vector machines (SVM) for prediction of peptides that promiscuously bind to multiple Human Leukocyte Antigen (HLA, human MHC) alleles belonging to an HLA supertype. Ten-fold cross-validation results showed that the overall performance of the SVM models is improved in comparison to our previously published methods based on hidden Markov models (HMM) and artificial neural networks (ANN), which was also confirmed by blind testing. At specificity 0.90, the sensitivity values of the SVM models were 0.90 and 0.92 for the HLA-A2 and -A3 datasets, respectively. The average areas under the receiver operating characteristic curve (A(ROC)) of the SVM models in blind testing were 0.89 and 0.92 for the HLA-A2 and -A3 datasets. The A(ROC) values of the HLA-A2 and -A3 SVM models were 0.94 and 0.95, validated using a full overlapping study of 9-mer peptides from human papillomavirus type 16 E6 and E7 proteins. In addition, a large-scale experimental dataset has been used to validate the HLA-A2 and -A3 SVM models. The SVM prediction models were integrated into a web-based computational system MULTIPRED1, accessible at antigen.i2r.a-star.edu.sg/multipred1/.
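The two evaluation figures reported in this abstract, sensitivity at a fixed specificity and the area under the ROC curve, can be computed from classifier scores as in the sketch below. It assumes that higher scores mean "more likely binder" and that a peptide is predicted positive when its score is at or above the threshold; the function names and the scores are invented for illustration.

```python
from typing import Sequence


def roc_auc(binder_scores: Sequence[float],
            nonbinder_scores: Sequence[float]) -> float:
    """A(ROC) as the probability that a random binder outscores a random
    non-binder, with ties counted as half."""
    wins = 0.0
    for b in binder_scores:
        for n in nonbinder_scores:
            if b > n:
                wins += 1.0
            elif b == n:
                wins += 0.5
    return wins / (len(binder_scores) * len(nonbinder_scores))


def sensitivity_at_specificity(binder_scores: Sequence[float],
                               nonbinder_scores: Sequence[float],
                               specificity: float = 0.90) -> float:
    """Best sensitivity among thresholds that reject at least the requested
    fraction of non-binders."""
    candidates = sorted(set(binder_scores) | set(nonbinder_scores))
    candidates.append(float("inf"))  # threshold that rejects everything
    best = 0.0
    for threshold in candidates:
        true_negatives = sum(s < threshold for s in nonbinder_scores)
        if true_negatives / len(nonbinder_scores) >= specificity:
            true_positives = sum(s >= threshold for s in binder_scores)
            best = max(best, true_positives / len(binder_scores))
    return best


if __name__ == "__main__":
    binders = [0.9, 0.8, 0.7, 0.4]
    nonbinders = [0.6, 0.5, 0.3, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
    print(roc_auc(binders, nonbinders))
    print(sensitivity_at_specificity(binders, nonbinders, 0.90))
```

Fixing specificity before reading off sensitivity makes models comparable at the same false-positive cost, whereas A(ROC) summarizes performance over all thresholds at once.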
Affiliation(s)
- Guang Lan Zhang
- Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613, Singapore
- School of Computer Engineering, Nanyang Technological University, Block N4, Nanyang Avenue, Singapore 639798, Singapore
- Ivana Bozic
- Faculty of Mathematics, University of Belgrade, Belgrade, Serbia and Montenegro
- Chee Keong Kwoh
- School of Computer Engineering, Nanyang Technological University, Block N4, Nanyang Avenue, Singapore 639798, Singapore
- J. Thomas August
- Department of Pharmacology and Molecular Sciences, Johns Hopkins School of Medicine, Baltimore, MD 21205, USA
- Vladimir Brusic
- Cancer Vaccine Center, Dana-Farber Cancer Institute, Boston, MA 02115, USA
- Corresponding author. Tel.: +1 617 632 3824; fax: +1 617 632 3351. (V. Brusic)
|