1
Prabhu H, Bhosale H, Sane A, Dhadwal R, Ramakrishnan V, Valadi J. Protein feature engineering framework for AMPylation site prediction. Sci Rep 2024; 14:8695. [PMID: 38622194 DOI: 10.1038/s41598-024-58450-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/13/2023] [Accepted: 03/29/2024] [Indexed: 04/17/2024] Open
Abstract
AMPylation is a biologically significant yet understudied post-translational modification in which an adenosine monophosphate (AMP) group is added, primarily to tyrosine and threonine residues. While recent work has illuminated the prevalence and functional impacts of AMPylation, experimental identification of AMPylation sites remains challenging. Computational prediction techniques provide a faster alternative. The predictive performance of machine learning models depends strongly on the features used to represent the raw amino acid sequences. In this work, we introduce a novel feature extraction pipeline to encode the key properties relevant to AMPylation site prediction. We use a recently published dataset of curated AMPylation sites to develop our feature generation framework. We demonstrate the utility of the extracted features by training several machine learning classifiers on different numerical representations of the raw sequences produced by our framework. Tenfold cross-validation is used to evaluate each model's ability to distinguish between AMPylated and non-AMPylated sites. The top-performing feature set achieved an MCC of 0.58, an accuracy of 0.8, an AUC-ROC of 0.85 and an F1 score of 0.73. Further, we use SHapley Additive exPlanations to elucidate the behaviour of the model on the feature set consisting of monogram and bigram counts for various representations.
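The scores reported above (MCC, accuracy, F1) can all be derived from the confusion matrix of a binary classifier. A minimal sketch in Python, assuming hypothetical counts that are not taken from the paper; the function name `binary_metrics` is illustrative only:

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return (accuracy, F1, MCC) from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0
    # MCC uses all four cells, so it stays informative on imbalanced data.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return accuracy, f1, mcc


# Hypothetical counts for illustration; not results from the study.
acc, f1, mcc = binary_metrics(tp=40, fp=10, tn=40, fn=10)
print(f"accuracy={acc:.2f} F1={f1:.2f} MCC={mcc:.2f}")
```

In a cross-validation setting, these counts would typically be accumulated per fold or pooled over all folds before computing the scores.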
Affiliation(s)
- Hardik Prabhu
- Computing and Data Sciences, FLAME University, Pune, 412115, India
- Robert Bosch Centre for Cyber Physical Systems, Indian Institute of Science, Bengaluru, 560012, India
- Aamod Sane
- Computing and Data Sciences, FLAME University, Pune, 412115, India
- Renu Dhadwal
- Computing and Data Sciences, FLAME University, Pune, 412115, India
- Vigneshwar Ramakrishnan
- Bioinformatics Center, School of Chemical and Biotechnology, SASTRA Deemed to be University, Thanjavur, 613401, India
- Jayaraman Valadi
- Computing and Data Sciences, FLAME University, Pune, 412115, India.
2
Dutta S, Zunjare RU, Sil A, Mishra DC, Arora A, Gain N, Chand G, Chhabra R, Muthusamy V, Hossain F. Prediction of matrilineal specific patatin-like protein governing in-vivo maternal haploid induction in maize using support vector machine and di-peptide composition. Amino Acids 2024; 56:20. [PMID: 38460024 DOI: 10.1007/s00726-023-03368-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/17/2023] [Accepted: 12/05/2023] [Indexed: 03/11/2024]
Abstract
The mutant matrilineal (mtl) gene encoding patatin-like phospholipase activity is involved in in-vivo maternal haploid induction in maize. Doubling of chromosomes in haploids by colchicine treatment leads to complete fixation of inbreds in just one generation, compared to 6-7 generations of selfing. Thus, knowledge of patatin-like proteins in other crops assumes great significance for in-vivo haploid induction. So far, no online tool is available that can classify unknown proteins as patatin-like proteins. Here, we aimed to optimize a machine learning-based algorithm to predict the patatin-like phospholipase activity of unknown proteins. Four different kernels [radial basis function (RBF), sigmoid, polynomial, and linear] were used for building support vector machine (SVM) classifiers using six different sequence-based compositional features (AAC, DPC, GDPC, CTDC, CTDT, and GAAC). A total of 1170 protein sequences were analyzed: 585 patatin-like proteins from various monocots, dicots, and microbes, and 585 non-patatin-like proteins from different subspecies of Zea mays. RBF and polynomial kernels were quite promising in the prediction of patatin-like proteins. Among the six sequence-based compositional features, di-peptide composition attained prediction accuracies above 90% using RBF and polynomial kernels. Using mutual information, the dipeptides that contributed most to the prediction were identified. The knowledge generated in this study can be utilized in other crops prior to the initiation of any experiment. The developed SVM model opens a new paradigm for scientists working on in-vivo haploid induction in commercial crops. This is the first report of machine learning being used to identify proteins with patatin-like activity.
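Di-peptide composition (DPC) is commonly computed as the relative frequency of each of the 400 ordered amino acid pairs in a sequence. A minimal sketch under that assumption; the example sequence and the function name are hypothetical, and the paper's exact feature extraction may differ:

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs, in a fixed order so feature vectors are comparable.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-dimensional vector of dipeptide relative frequencies."""
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    n_pairs = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            n_pairs += 1
    if n_pairs == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / n_pairs for d in DIPEPTIDES]


# Hypothetical toy sequence, used only to show the shape of the output.
features = dipeptide_composition("MKTAYIAKQR")
print(len(features), features[DIPEPTIDES.index("MK")])
```

A vector of this form could then be passed to a classifier such as an SVM, one row per protein.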
Affiliation(s)
- Suman Dutta
- ICAR-Indian Agricultural Research Institute, New Delhi, India
- Anirban Sil
- ICAR-Indian Agricultural Research Institute, New Delhi, India
- Alka Arora
- ICAR-Indian Agricultural Statistical Research Institute, New Delhi, India
- Nisrita Gain
- ICAR-Indian Agricultural Research Institute, New Delhi, India
- Gulab Chand
- ICAR-Indian Agricultural Research Institute, New Delhi, India
- Rashmi Chhabra
- ICAR-Indian Agricultural Research Institute, New Delhi, India
- Firoz Hossain
- ICAR-Indian Agricultural Research Institute, New Delhi, India.
3
Lahorkar A, Bhosale H, Sane A, Ramakrishnan V, Jayaraman VK. Identification of Phase Separating Proteins With Distributed Reduced Alphabet Representations of Sequences. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:410-420. [PMID: 35139023 DOI: 10.1109/tcbb.2022.3149310] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Phase separation of proteins plays key roles in cellular physiology, including bacterial division and tumorigenesis. Consequently, understanding the molecular forces that drive phase separation has gained considerable attention, and several factors, including hydrophobicity and protein dynamics, have been implicated. Data-driven identification of new phase separating proteins can enable an in-depth understanding of cellular physiology and may pave the way towards novel methods of tackling disease progression. In this work, we exploit the existing wealth of data on phase separating proteins to develop a sequence-based machine learning method for their prediction. We use reduced alphabet schemes based on hydrophobicity and conformational similarity, along with distributed representations of protein sequences and biochemical properties, as input features to Support Vector Machine (SVM) and Random Forest (RF) machine learning algorithms. We used both curated and balanced datasets for building the models. An RF trained on the balanced dataset with hydropathy and conformational similarity embeddings and biochemical properties achieved an accuracy of 97%. Our work highlights the use of conformational similarity, a feature that reflects amino acid flexibility, and hydrophobicity for predicting phase separating proteins. Use of such "interpretable" features obtained from the ever-growing knowledge base of phase separation is likely to improve prediction performance further.
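A reduced alphabet maps the 20 amino acids onto a handful of group symbols, and the reduced sequence is then split into overlapping k-mer "words" of the kind embedding methods typically consume. A minimal sketch in Python; the three-group hydropathy-style scheme below is an illustrative assumption and not the grouping used in the paper:

```python
# Illustrative grouping only (assumption): hydrophobic, polar, charged.
REDUCED_GROUPS = {
    "h": "AVLIMFWC",
    "p": "GSTPYHNQ",
    "c": "DEKR",
}
RESIDUE_TO_GROUP = {
    residue: label
    for label, members in REDUCED_GROUPS.items()
    for residue in members
}


def reduce_sequence(sequence, unknown="x"):
    """Rewrite a protein sequence in the reduced alphabet."""
    return "".join(RESIDUE_TO_GROUP.get(aa, unknown) for aa in sequence.upper())


def kmers(sequence, k=3):
    """Split a sequence into overlapping words of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Hypothetical toy sequence.
reduced = reduce_sequence("MKTAYIAKQR")
words = kmers(reduced, k=3)
print(reduced, words)
```

With three groups and k = 3 there are only 27 possible words, which is one reason reduced alphabets keep the vocabulary for a distributed representation small.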
4
Lv N, Zhou Z, He S, Shao X, Zhou X, Feng X, Qian Z, Zhang Y, Liu M. Identification of osteoporosis based on gene biomarkers using support vector machine. Open Med (Wars) 2022; 17:1216-1227. [PMID: 35859791 PMCID: PMC9263892 DOI: 10.1515/med-2022-0507] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2021] [Revised: 04/19/2022] [Accepted: 05/15/2022] [Indexed: 11/26/2022] Open
Abstract
Osteoporosis is a major health concern worldwide. The present study aimed to identify effective biomarkers for osteoporosis detection. In osteoporosis, 559 differentially expressed genes (DEGs) were enriched in the PI3K-Akt signaling pathway and the Foxo signaling pathway. Weighted gene co-expression network analysis showed that the green, pink, and tan modules were clinically significant, and six genes (VEGFA, DDX5, SOD2, HNRNPD, EIF5B, and HSP90B1) were identified as “real” hub genes in the protein–protein interaction network, the co-expression network, and the 559 DEGs. The sensitivity and specificity of the support vector machine (SVM) for identifying patients with osteoporosis were both 100%, with an area under the curve of 1 in both the training and validation datasets. Our results indicate that the current system using the SVM method can identify patients with osteoporosis.
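An area under the ROC curve of 1 means that every patient sample receives a higher classifier score than every control sample. A minimal sketch of AUC computed as the fraction of correctly ordered positive/negative pairs, using hypothetical scores rather than the study's data:

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ordering
    return wins / (len(pos) * len(neg))


# Hypothetical scores: perfectly separated, then partly overlapping.
perfect = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
overlap = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
print(perfect, overlap)
```

The pairwise form is quadratic in the number of samples, which is acceptable for small cohorts; rank-based formulas give the same value more efficiently.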
Affiliation(s)
- Nanning Lv
- Department of Orthopedic Surgery, The Second People's Hospital of Lianyungang, Lianyungang, Jiangsu 222003, China
- Zhangzhe Zhou
- Department of Orthopedic Surgery, The First Affiliated Hospital of Soochow University, Suzhou, Jiangsu 215000, China
- Shuangjun He
- Department of Orthopedic Surgery, Affiliated Danyang Hospital of Nantong University, The People's Hospital of Danyang, Danyang, Jiangsu 212300, China
- Xiaofeng Shao
- Department of Orthopedic Surgery, The First Affiliated Hospital of Soochow University, Suzhou, Jiangsu 215000, China
- Xinfeng Zhou
- Department of Orthopedic Surgery, The First Affiliated Hospital of Soochow University, Suzhou, Jiangsu 215000, China
- Xiaoxiao Feng
- Department of Orthopedic Surgery, The Second People's Hospital of Lianyungang, Lianyungang, Jiangsu 222003, China
- Zhonglai Qian
- Department of Orthopedic Surgery, The First Affiliated Hospital of Soochow University, Suzhou, Jiangsu 215000, China
- Yijian Zhang
- Department of Orthopedic Surgery, The First Affiliated Hospital of Soochow University, Suzhou, Jiangsu 215000, China
- Mingming Liu
- Department of Orthopedic Surgery, The Second People's Hospital of Lianyungang, Lianyungang, Jiangsu 222003, China
5
Identification of Type 2 Diabetes Based on a Ten-Gene Biomarker Prediction Model Constructed Using a Support Vector Machine Algorithm. BIOMED RESEARCH INTERNATIONAL 2022; 2022:1230761. [PMID: 35281591 PMCID: PMC8916865 DOI: 10.1155/2022/1230761] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/30/2021] [Revised: 11/24/2021] [Accepted: 02/20/2022] [Indexed: 11/17/2022]
Abstract
Background
Type 2 diabetes is a major health concern worldwide. The present study is aimed at discovering effective biomarkers for an efficient diagnosis of type 2 diabetes.
Methods
Differentially expressed genes (DEGs) between type 2 diabetes patients and normal controls were identified by analyses of integrated microarray data obtained from the Gene Expression Omnibus database using the Limma package. Functional analysis of genes was performed using the R software package clusterProfiler. Analyses of protein-protein interaction (PPI) performed using Cytoscape with the CytoHubba plugin were used to determine the most sensitive diagnostic gene biomarkers for type 2 diabetes in our study. The support vector machine (SVM) classification model was used to validate the gene biomarkers used for the diagnosis of type 2 diabetes.
Results
GSE164416 dataset analysis revealed 499 genes that were differentially expressed between type 2 diabetes patients and normal controls, and these DEGs were found to be enriched in the regulation of the immune effector pathway, type 1 diabetes mellitus, and fatty acid degradation. PPI analysis data showed that five MCODE clusters could be considered as clinically significant modules and that 10 genes (IL1B, ITGB2, ITGAX, COL1A1, CSF1, CXCL12, SPP1, FN1, C3, and MMP2) were identified as “real” hub genes in the PPI network using algorithms such as Degree, MNC, and Closeness. The sensitivity and specificity of the SVM model for identifying patients with type 2 diabetes were 100%, with an area under the curve of 1 in the training as well as the validation dataset.
Conclusion
Our results indicate that the SVM-based model developed by us can facilitate accurate diagnosis of type 2 diabetes.
6
Bhosale H, Ramakrishnan V, Jayaraman VK. Support vector machine-based prediction of pore-forming toxins (PFT) using distributed representation of reduced alphabets. J Bioinform Comput Biol 2021; 19:2150028. [PMID: 34693886 DOI: 10.1142/s0219720021500281] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 12/20/2022]
Abstract
Bacterial virulence can be attributed to a wide variety of factors, including toxins that harm the host. Pore-forming toxins are one class of toxins that confer virulence on bacteria and are a promising target for therapeutic intervention. In this work, we develop a sequence-based machine learning framework for the prediction of pore-forming toxins. For this, we use a distributed representation of the protein sequence, encoded by reduced alphabet schemes based on conformational similarity and hydropathy index, as input features to Support Vector Machines (SVMs). The choice of conformational similarity and hydropathy indices is based on the functional mechanism of pore-forming toxins. Our methodology achieves about 81% accuracy, indicating that conformational similarity, an indicator of the flexibility of amino acids, along with the hydropathy index can capture the intrinsic features of pore-forming toxins that distinguish them from other types of transporter proteins. Increased understanding of the mechanisms of pore-forming toxins can contribute to the use of such "mechanism-informed" features, which may increase the prediction accuracy further.
Affiliation(s)
- Hrushikesh Bhosale
- Department of Computer Science, FLAME University, Pune, Maharashtra, India
- Vigneshwar Ramakrishnan
- School of Chemical & Biotechnology, SASTRA Deemed-to-be University, Thanjavur, Tamilnadu, India
- Valadi K Jayaraman
- Department of Computer Science, FLAME University, Pune, Maharashtra, India
7
Sohrawordi M, Hossain MA. Prediction of lysine formylation sites using support vector machine based on the sample selection from majority classes and synthetic minority over-sampling techniques. Biochimie 2021; 192:125-135. [PMID: 34627982 DOI: 10.1016/j.biochi.2021.10.001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/15/2021] [Revised: 10/03/2021] [Accepted: 10/05/2021] [Indexed: 12/22/2022]
Abstract
Lysine formylation is a newly discovered post-translational modification (PTM) of considerable interest that is generally found on core and linker histone proteins of prokaryotes and eukaryotes and plays important roles in the regulation of various cellular mechanisms. Hence, accurate identification of formylation sites in proteins is urgently needed to understand the molecular mechanism of formylation and to design drugs for related diseases. Because experimental identification of formylation sites using traditional methods is expensive and time-consuming, a simple and fast computational model for accurately predicting lysine formylation sites is highly desirable. In this study, a computational model named PLF_SVM is designed and proposed that uses binary encoding (BE), amino acid composition (AAC), reverse position relative incidence matrix (RPRIM), position relative incidence matrix (PRIM), and position specific amino acid propensity (PSAAP) feature generation methods to predict formylated and non-formylated lysine sites. In addition, the Synthetic Minority Oversampling Technique (SMOTE) and a proposed sample selection strategy named EnSVM are applied to handle the imbalanced training dataset. The optimal number of features is then selected by the F-score method to train the model. Finally, PLF_SVM outperforms the state-of-the-art approaches in validation and independent testing, with accuracies of 98.61% and 98.77%, respectively. A user-friendly web tool for identifying formylation sites is also available at https://plf-svm.herokuapp.com/. The proposed method may therefore be a helpful guide for the analysis and prediction of formylated lysine and for understanding the process of cellular regulation.
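SMOTE balances a training set by creating synthetic minority samples through interpolation between a minority sample and one of its nearest minority-class neighbours. A minimal sketch of that single step, assuming Euclidean distance and toy two-dimensional points; `smote_sample` is a hypothetical helper and not part of PLF_SVM:

```python
import random


def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and a neighbour."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = rng or random.Random()
    base = rng.choice(minority)
    # Rank the other minority samples by squared Euclidean distance to base.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # position along the segment base -> neighbour
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


# Hypothetical minority-class feature vectors.
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
synthetic = smote_sample(minority_points, k=2, rng=random.Random(0))
print(synthetic)
```

Because each synthetic point lies on a segment between two real minority samples, oversampling stays within the region the minority class already occupies.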
Affiliation(s)
- Md Sohrawordi
- Dept. of Computer Science and Engineering, Rajshahi University of Engineering and Technology, Rajshahi, Bangladesh; Dept. of Computer Science and Engineering, Hajee Mohammad Danesh Science and Technology University, Dinajpur, Bangladesh.
- Md Ali Hossain
- Dept. of Computer Science and Engineering, Rajshahi University of Engineering and Technology, Rajshahi, Bangladesh
8
Hou Q, Kwasigroch JM, Rooman M, Pucci F. SOLart: a structure-based method to predict protein solubility and aggregation. Bioinformatics 2020; 36:1445-1452. [PMID: 31603466 DOI: 10.1093/bioinformatics/btz773] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 06/03/2019] [Revised: 08/31/2019] [Accepted: 10/08/2019] [Indexed: 12/12/2022] Open
Abstract
Motivation
The solubility of a protein is often decisive for its proper functioning. Lack of solubility is a major bottleneck in high-throughput structural genomic studies and in high-concentration protein production, and the formation of protein aggregates causes a wide variety of diseases. Since solubility measurements are time-consuming and expensive, there is a strong need for solubility prediction tools.
Results
We have recently introduced solubility-dependent distance potentials that are able to unravel the role of residue-residue interactions in promoting or decreasing protein solubility. Here, we extended their construction by defining solubility-dependent potentials based on backbone torsion angles and solvent accessibility, and integrated them, together with other structure- and sequence-based features, into a random forest model trained on a set of Escherichia coli proteins with experimental structures and solubility values. We thus obtained the SOLart protein solubility predictor, whose most informative features turned out to be folding free energy differences computed from our solubility-dependent statistical potentials. SOLart performs well, with a Pearson correlation coefficient between experimental and predicted solubility values of almost 0.7 both in cross-validation on the training dataset and in an independent set of Saccharomyces cerevisiae proteins. On test sets of modeled structures, only a limited drop in performance is observed. SOLart can thus be used with both high-resolution and low-resolution structures, and clearly outperforms state-of-the-art solubility predictors. It is available through a user-friendly webserver that is easy for non-expert scientists to use.
Availability and implementation
The SOLart webserver is freely available at http://babylone.ulb.ac.be/SOLART/.
Supplementary information
Supplementary data are available at Bioinformatics online.
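The Pearson correlation coefficient used above to compare experimental and predicted solubility values measures the strength of their linear relationship. A minimal sketch with hypothetical values, not SOLart outputs:

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical experimental vs. predicted solubility values.
experimental = [10.0, 35.0, 50.0, 80.0, 95.0]
predicted = [20.0, 30.0, 65.0, 70.0, 90.0]
r = pearson_r(experimental, predicted)
print(round(r, 3))
```

A coefficient near 0.7 indicates a clear but imperfect linear agreement; it says nothing about systematic offset, so it is usually read alongside an error measure.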
Affiliation(s)
- Qingzhen Hou
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, Avenue Roosevelt 50, 1050 Brussels, Belgium; Interuniversity Institute of Bioinformatics in Brussels, Boulevard du Triomphe, 1050 Brussels, Belgium
- Jean Marc Kwasigroch
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, Avenue Roosevelt 50, 1050 Brussels, Belgium; Interuniversity Institute of Bioinformatics in Brussels, Boulevard du Triomphe, 1050 Brussels, Belgium
- Marianne Rooman
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, Avenue Roosevelt 50, 1050 Brussels, Belgium; Interuniversity Institute of Bioinformatics in Brussels, Boulevard du Triomphe, 1050 Brussels, Belgium
- Fabrizio Pucci
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, Avenue Roosevelt 50, 1050 Brussels, Belgium; Interuniversity Institute of Bioinformatics in Brussels, Boulevard du Triomphe, 1050 Brussels, Belgium; John von Neumann Institute for Computing, Jülich Supercomputer Centre, Forschungszentrum Jülich, 52428 Jülich, Germany
9
Vormittag P, Klamp T, Hubbuch J. Optimization of a Soft Ensemble Vote Classifier for the Prediction of Chimeric Virus-Like Particle Solubility and Other Biophysical Properties. Front Bioeng Biotechnol 2020; 8:881. [PMID: 32850736 PMCID: PMC7411134 DOI: 10.3389/fbioe.2020.00881] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/11/2020] [Accepted: 07/09/2020] [Indexed: 01/24/2023] Open
Abstract
Chimeric virus-like particles (cVLPs) are protein-based nanostructures applied as investigational vaccines against infectious diseases, cancer, and immunological disorders. Low solubility of cVLP vaccine candidates is a challenge that can prevent development of these very substances. Solubility of cVLPs is typically assessed empirically, leading to high time and material requirements. Prediction of cVLP solubility in silico can aid in reducing this effort. Protein aggregation by hydrophobic interaction is an important factor driving protein insolubility. In this article, a recently developed soft ensemble vote classifier (sEVC) for the prediction of cVLP solubility was used based on 91 literature amino acid hydrophobicity scales. Optimization algorithms were developed to boost model performance, and the model was redesigned as a regression tool for the ammonium sulfate concentration required for cVLP precipitation. The present dataset consists of 568 cVLPs, created by insertion of 71 different peptide sequences using eight different insertion strategies. Two optimization algorithms were developed that (I) modified the sEVC with regard to systematic misclassification based on the different insertion strategies, and (II) modified the amino acid hydrophobicity scale tables to improve classification. The second algorithm was additionally used to synthesize scales from random vectors. Compared to the unmodified model, the Matthews correlation coefficient (MCC) and accuracy of the test set predictions increased from 0.63 and 0.81 to 0.77 and 0.88, respectively, for the best models. This improved performance compared to literature scales was suggested to be due to a decreased correlation between synthesized scales. In these, tryptophan was identified as the most hydrophobic amino acid, i.e., the amino acid most problematic for cVLP solubility, supported by previous literature findings.
As a case study, the sEVC was redesigned as a regression tool and applied to determine ammonium sulfate concentrations for the precipitation of cVLPs. This was evaluated with a small dataset of ten cVLPs, resulting in an R² of 0.69. In summary, we propose optimization algorithms that improve sEVC model performance for the prediction of cVLP solubility, allow for the synthesis of amino acid scale tables, and further evaluate the sEVC as a regression tool to predict cVLP-precipitating ammonium sulfate concentrations.
Affiliation(s)
- Philipp Vormittag
- Institute of Engineering in Life Sciences, Section IV: Biomolecular Separation Engineering, Karlsruhe Institute of Technology, Karlsruhe, Germany
- Jürgen Hubbuch
- Institute of Engineering in Life Sciences, Section IV: Biomolecular Separation Engineering, Karlsruhe Institute of Technology, Karlsruhe, Germany
10
Vormittag P, Klamp T, Hubbuch J. Ensembles of Hydrophobicity Scales as Potent Classifiers for Chimeric Virus-Like Particle Solubility - An Amino Acid Sequence-Based Machine Learning Approach. Front Bioeng Biotechnol 2020; 8:395. [PMID: 32432098 PMCID: PMC7217080 DOI: 10.3389/fbioe.2020.00395] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/07/2020] [Accepted: 04/08/2020] [Indexed: 11/13/2022] Open
Abstract
Virus-like particles (VLPs) are protein-based nanoscale structures that show high potential as immunotherapeutics or cargo delivery vehicles. Chimeric VLPs are decorated with foreign peptides resulting in structures that confer immune responses against the displayed epitope. However, insertion of foreign sequences often results in insoluble proteins, calling for methods capable of assessing a VLP candidate's solubility in silico. The prediction of VLP solubility requires a model that can identify critical hydrophobicity-related parameters, distinguishing between VLP-forming aggregation and aggregation leading to insoluble virus protein clusters. Therefore, we developed and implemented a soft ensemble vote classifier (sEVC) framework based on chimeric hepatitis B core antigen (HBcAg) amino acid sequences and 91 publicly available hydrophobicity scales. Based on each hydrophobicity scale, an individual decision tree was induced as a classifier in the sEVC. An embedded feature selection algorithm and stratified sampling proved beneficial for model construction. With a learning experiment, model performance in the space of model training set size and number of included classifiers in the sEVC was explored. Additionally, seven models were created from training data of 24-384 chimeric HBcAg constructs, which were validated by 100-fold Monte Carlo cross-validation. The models predicted external test sets of 184-544 chimeric HBcAg constructs. The best models showed a Matthews correlation coefficient of >0.6 on the validation and the external test set. Feature selection was evaluated for classifiers with best and worst performance in the chimeric HBcAg VLP solubility scenario. Analysis of the associated hydrophobicity scales allowed for retrieval of biological information related to the mechanistic backgrounds of VLP solubility, suggesting a special role of arginine for VLP assembly and solubility.
In the future, the developed sEVC could further be applied to hydrophobicity-related problems in other domains, such as monoclonal antibodies.
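In soft voting, each member classifier contributes a class probability and the ensemble averages these before thresholding, in contrast to a hard majority vote over predicted labels. A minimal sketch with hypothetical probabilities; how the sEVC induces, selects or weights its decision trees is not reproduced here:

```python
def soft_vote(probabilities, threshold=0.5):
    """Average member probabilities for the positive class, then threshold."""
    if not probabilities:
        raise ValueError("need at least one member probability")
    mean_p = sum(probabilities) / len(probabilities)
    return mean_p, int(mean_p >= threshold)


def hard_vote(probabilities, threshold=0.5):
    """Majority vote over thresholded member predictions, for comparison."""
    votes = [int(p >= threshold) for p in probabilities]
    return int(sum(votes) * 2 > len(votes))


# Hypothetical outputs of three member classifiers for one construct
# (probability of the class "soluble").
member_probs = [0.9, 0.4, 0.4]
mean_p, soft_label = soft_vote(member_probs)
hard_label = hard_vote(member_probs)
print(round(mean_p, 3), soft_label, hard_label)
```

In this toy case one confident member outweighs two weakly dissenting ones under soft voting, while the hard vote goes the other way, which illustrates why the two schemes can disagree.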
Affiliation(s)
- Philipp Vormittag
- Institute of Engineering in Life Sciences, Section IV: Biomolecular Separation Engineering, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
- Jürgen Hubbuch
- Institute of Engineering in Life Sciences, Section IV: Biomolecular Separation Engineering, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
11
Effect of restricted dissolved oxygen on expression of Clostridium difficile toxin A subunit from E. coli. Sci Rep 2020; 10:3059. [PMID: 32080292 PMCID: PMC7033237 DOI: 10.1038/s41598-020-59978-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/08/2019] [Accepted: 02/06/2020] [Indexed: 12/11/2022] Open
Abstract
The repeating unit of the C. difficile Toxin A C-terminal region (rARU, also known as CROPS [combined repetitive oligopeptides]) was shown to elicit protective immunity against C. difficile and is under consideration as a possible vaccine against this pathogen. However, expression of recombinant rARU in E. coli using the standard vaccine production process was very low. Transcriptome and proteome analyses showed that at restricted dissolved oxygen (DO) the number of differentially expressed genes (DEGs) was 2.5 times lower than at unrestricted oxygen. Additionally, 7.4 times fewer ribosome formation genes (needed for translation) were down-regulated compared with unrestricted DO. Higher rARU expression at restricted DO was associated with up-regulation of 24 heat shock chaperones involved in protein folding and with up-regulation of the global regulator RNA chaperone hfq. Cellular stress responses leading to down-regulation of transcription, translation, and energy generating pathways at unrestricted DO were associated with lower rARU expression. Investigation of the C. difficile DNA sequence revealed the presence of cell wall binding profiles which, based on structural similarity prediction by BLASTp, can possibly interact with cellular proteins of E. coli such as the transcriptional repressor ulaR and the ankyrin repeat proteins. At restricted DO, rARU mRNA was 5-fold higher and protein expression 27-fold higher compared with unrestricted DO. The report shows a strategy for improved production of a C. difficile vaccine candidate in E. coli by using restricted DO growth. This strategy could improve the expression of recombinant proteins of anaerobic origin or those with cell wall binding profiles.
12
In Silico Study of Different Signal Peptides to Express Recombinant Glutamate Decarboxylase in the Outer Membrane of Escherichia coli. Int J Pept Res Ther 2019. [DOI: 10.1007/s10989-019-09986-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/25/2022]
13
Prediction of Breast Cancer from Imbalance Respect Using Cluster-Based Undersampling Method. JOURNAL OF HEALTHCARE ENGINEERING 2019; 2019:7294582. [PMID: 31737241 PMCID: PMC6817921 DOI: 10.1155/2019/7294582] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 11/17/2018] [Revised: 04/03/2019] [Accepted: 06/10/2019] [Indexed: 11/18/2022]
Abstract
To overcome the two-class imbalance problem in the diagnosis of breast cancer, a hybrid of K-means and Boosted C5.0 (K-Boosted C5.0), based on undersampling, is proposed. K-means is used to select the informative samples near the boundary. During the training phase, the K-means algorithm clusters the majority and minority instances and selects a similar number of instances from each cluster. Boosted C5.0 is then used as the classifier. Because instance selection via clustering encourages diversity of the training subspace in K-Boosted C5.0, it is expected to improve performance. To test the performance of the new hybrid classifier, it is applied to 12 small-scale and 2 large-scale datasets that are commonly used in class-imbalanced learning. The extensive experimental results show that the proposed hybrid method outperforms most of the competing algorithms in terms of Matthews correlation coefficient (MCC) and accuracy. It can be a good alternative to well-known machine learning methods.
14
Zhang J, Chen L. Clustering-based undersampling with random over sampling examples and support vector machine for imbalanced classification of breast cancer diagnosis. Comput Assist Surg (Abingdon) 2019; 24:62-72. [PMID: 31403330 DOI: 10.1080/24699322.2019.1649074] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Indexed: 10/26/2022] Open
Abstract
To overcome the two-class imbalanced classification problem existing in the diagnosis of breast cancer, a hybrid of Random Over Sampling Example, K-means and Support vector machine (RK-SVM) model is proposed which is based on sample selection. Random Over Sampling Example (ROSE) is utilized to balance the dataset and further improve the diagnosis accuracy by Support Vector Machine (SVM). Clustering adds a further sample selection factor that encourages selecting the samples near the class boundary. The purpose of clustering here is to reduce the risk of removing useful samples and improve the efficiency of sample selection. To test the performance of the new hybrid classifier, it is implemented on breast cancer datasets and three other datasets from the University of California Irvine (UCI) machine learning repository, which are commonly used datasets in class imbalanced learning. The extensive experimental results show that our proposed hybrid method outperforms most of the competitive algorithms in terms of G-mean and accuracy indices. Additionally, experimental results show that this method also achieves superior performance on binary problems.
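For readers unfamiliar with the G-mean index mentioned in this abstract, the sketch below computes it as the geometric mean of sensitivity and specificity, alongside plain accuracy. The confusion-matrix counts are invented for illustration and are not taken from the paper.

```python
from math import sqrt


def accuracy(tp, fn, tn, fp):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (recall on positives) and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sqrt(sensitivity * specificity)


# Hypothetical imbalanced test set: 10 positives, 100 negatives.
# A classifier that labels everything negative looks accurate but has G-mean 0.
trivial = dict(tp=0, fn=10, tn=100, fp=0)
balanced = dict(tp=8, fn=2, tn=90, fp=10)
```

Because the G-mean collapses to zero when either class is entirely misclassified, it penalizes the majority-only predictor that accuracy rewards, which is why it is commonly reported for imbalanced problems.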
Affiliation(s)
- Jue Zhang
- School of Information and Technology, Northwest University, Xi'an, China; School of Information Engineering, Yulin University, Yulin, China
- Li Chen
- School of Information and Technology, Northwest University, Xi'an, China
|
15
|
Han X, Wang X, Zhou K. Develop machine learning-based regression predictive models for engineering protein solubility. Bioinformatics 2019; 35:4640-4646. [DOI: 10.1093/bioinformatics/btz294] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 12/12/2018] [Revised: 03/09/2019] [Accepted: 04/17/2019] [Indexed: 11/14/2022] Open
Abstract
Motivation
Protein activity is a significant characteristic for recombinant proteins which can be used as biocatalysts. High activity of proteins reduces the cost of biocatalysts. A model that can predict protein activity from amino acid sequence is highly desired, as it aids experimental improvement of proteins. However, only limited data for protein activity are currently available, which prevents the development of such models. Since protein activity and solubility are correlated for some proteins, the publicly available solubility dataset may be adopted to develop models that can predict protein solubility from sequence. The models could serve as a tool to indirectly predict protein activity from sequence. In literature, predicting protein solubility from sequence has been intensively explored, but the predicted solubility represented in binary values from all the developed models was not suitable for guiding experimental designs to improve protein solubility. Here we propose new machine learning (ML) models for improving protein solubility in vivo.
Results
We first implemented a novel approach that predicted protein solubility in continuous numerical values instead of binary ones. After combining it with various ML algorithms, we achieved an R² of 0.4115 when the support vector machine algorithm was used. Continuous values of solubility are more meaningful in protein engineering, as they enable researchers to choose proteins with higher predicted solubility for experimental validation, while binary values fail to distinguish proteins with the same value: there are only two possible values, so many proteins share the same one.
Availability and implementation
We present the ML workflow as a series of IPython notebooks hosted on GitHub (https://github.com/xiaomizhou616/protein_solubility). The workflow can be used as a template for analysis of other expression and solubility datasets.
Supplementary information
Supplementary data are available at Bioinformatics online.
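The R² figure quoted in the Results of this abstract is the coefficient of determination of a regression model. A minimal sketch of how it is computed is given below; the solubility values are made up for illustration and do not come from the paper or its notebooks.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical measured vs. predicted solubility values.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 3.0, 3.5]
score = r_squared(measured, predicted)
```

A model that always predicts the mean of the measured values scores 0, and a perfect model scores 1, so a value near 0.41 indicates that a substantial share of the variance remains unexplained.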
Affiliation(s)
- Xi Han
- Department of Chemical and Biomolecular Engineering, National University of Singapore, 117585 Singapore
- Xiaonan Wang
- Department of Chemical and Biomolecular Engineering, National University of Singapore, 117585 Singapore
- Kang Zhou
- Department of Chemical and Biomolecular Engineering, National University of Singapore, 117585 Singapore
|
16
|
Hou Q, Bourgeas R, Pucci F, Rooman M. Computational analysis of the amino acid interactions that promote or decrease protein solubility. Sci Rep 2018; 8:14661. [PMID: 30279585 PMCID: PMC6168528 DOI: 10.1038/s41598-018-32988-w] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 06/04/2018] [Accepted: 09/11/2018] [Indexed: 11/24/2022] Open
Abstract
The solubility of globular proteins is a basic biophysical property that is usually a prerequisite for their functioning. In this study, we probed the solubility of globular proteins with the help of the statistical potential formalism, in view of objectifying the connection of solubility with structural and energetic properties and of the solubility-dependence of specific amino acid interactions. We started by setting up two independent datasets containing either soluble or aggregation-prone proteins with known structures. From these two datasets, we computed solubility-dependent distance potentials that are by construction biased towards the solubility of the proteins from which they are derived. Their analysis showed the clear preference of amino acid interactions such as Lys-containing salt bridges and aliphatic interactions to promote protein solubility, whereas others such as aromatic, His-π, cation-π, amino-π and anion-π interactions rather tend to reduce it. These results indicate that interactions involving delocalized π-electrons favor aggregation, unlike those involving no (or few) dispersion forces. Furthermore, using our potentials derived from either highly or weakly soluble proteins to compute protein folding free energies, we found that the difference between these two energies correlates better with solubility than other properties analyzed before such as protein length, isoelectric point and aliphatic index. This is, to the best of our knowledge, the first comprehensive in silico study of the impact of residue-residue interactions on protein solubility properties. The results of this analysis provide new insights that will facilitate future rational protein design applications aimed at modulating the solubility of targeted proteins.
Affiliation(s)
- Qingzhen Hou
- Department of BioModeling BioInformatics & BioProcesses, Université Libre de Bruxelles, Brussels, 1050, Belgium
- Raphaël Bourgeas
- Department of BioModeling BioInformatics & BioProcesses, Université Libre de Bruxelles, Brussels, 1050, Belgium
- Fabrizio Pucci
- Department of BioModeling BioInformatics & BioProcesses, Université Libre de Bruxelles, Brussels, 1050, Belgium
- Marianne Rooman
- Department of BioModeling BioInformatics & BioProcesses, Université Libre de Bruxelles, Brussels, 1050, Belgium.
|
17
|
Yang Y, Liu G, Liu M, Bai Z, Liu X, Dai X, Guo W. Correlation Between Protein Primary Structure and Soluble Expression Level of HSA dAb in Escherichia coli. Food Technol Biotechnol 2018; 56:101-109. [PMID: 29796003 DOI: 10.17113/ftb.56.01.18.5445] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 11/12/2022] Open
Abstract
It is widely accepted that features such as pI, length, molecular mass and amino acid (AA) sequence have a significant influence on protein solubility. Here, we mainly focused on AA composition and explored those that most affected the soluble expression level of human serum albumin (HSA) domain antibody (dAb). The soluble expression and sequence of 65 dAb variants were analysed using clustering and linear modelling. Certain AAs significantly affected the soluble expression level of dAb, with the specific AA combinations being (S, R, N, D, Q), (G, R, C, N, S) and (R, S, G); these combinations respectively affected the dAb expression level in the broth supernatant, the level in the pellet lysate and total soluble dAb. Among the 20 AAs, R displayed a negative influence on the soluble expression level, whereas G and S showed positive effects. A linear model was built to predict the soluble expression level from the sequence; this model had a prediction accuracy of 80%. In summary, increasing the content of polar AAs, especially G and S, and decreasing the content of R, was helpful to improve the soluble expression level of HSA dAb.
Affiliation(s)
- Yankun Yang
- The Key Laboratory of Carbohydrate Chemistry and Biotechnology, School of Biotechnology, Jiangnan University, Ministry of Education, 1800 Lihu Avenue, 214122 Wuxi, PR China; National Engineering Laboratory for Cereal Fermentation Technology, Jiangnan University, 1800 Lihu Avenue, 214122 Wuxi, PR China
- Guoqiang Liu
- The Key Laboratory of Carbohydrate Chemistry and Biotechnology, School of Biotechnology, Jiangnan University, Ministry of Education, 1800 Lihu Avenue, 214122 Wuxi, PR China; National Engineering Laboratory for Cereal Fermentation Technology, Jiangnan University, 1800 Lihu Avenue, 214122 Wuxi, PR China
- Meng Liu
- National Engineering Laboratory for Cereal Fermentation Technology, Jiangnan University, 1800 Lihu Avenue, 214122 Wuxi, PR China
- Zhonghu Bai
- National Engineering Laboratory for Cereal Fermentation Technology, Jiangnan University, 1800 Lihu Avenue, 214122 Wuxi, PR China; Jiangsu Provincial Research Center for Bioactive Product Processing Technology, Jiangnan University, 1800 Lihu Avenue, 214122 Wuxi, PR China
- Xiuxia Liu
- National Engineering Laboratory for Cereal Fermentation Technology, Jiangnan University, 1800 Lihu Avenue, 214122 Wuxi, PR China; Jiangsu Provincial Research Center for Bioactive Product Processing Technology, Jiangnan University, 1800 Lihu Avenue, 214122 Wuxi, PR China
- Xiaofeng Dai
- National Engineering Laboratory for Cereal Fermentation Technology, Jiangnan University, 1800 Lihu Avenue, 214122 Wuxi, PR China; Jiangsu Provincial Research Center for Bioactive Product Processing Technology, Jiangnan University, 1800 Lihu Avenue, 214122 Wuxi, PR China
- Wenwen Guo
- Jiangsu Provincial Research Center for Bioactive Product Processing Technology, Jiangnan University, 1800 Lihu Avenue, 214122 Wuxi, PR China; The Key Laboratory of Industrial Biotechnology, Ministry of Education, School of Biotechnology, Jiangnan University, 1800 Lihu Avenue, 214122 Wuxi, PR China
|
18
|
Sastry A, Monk J, Tegel H, Uhlen M, Palsson BO, Rockberg J, Brunk E. Machine learning in computational biology to accelerate high-throughput protein expression. Bioinformatics 2018; 33:2487-2495. [PMID: 28398465 DOI: 10.1093/bioinformatics/btx207] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 12/08/2016] [Accepted: 04/05/2017] [Indexed: 01/21/2023] Open
Abstract
Motivation
The Human Protein Atlas (HPA) enables the simultaneous characterization of thousands of proteins across various tissues to pinpoint their spatial location in the human body. This has been achieved through transcriptomics and high-throughput immunohistochemistry-based approaches, where over 40 000 unique human protein fragments have been expressed in E. coli. These datasets enable quantitative tracking of entire cellular proteomes and present new avenues for understanding molecular-level properties influencing expression and solubility.
Results
Combining computational biology and machine learning identifies protein properties that hinder the HPA high-throughput antibody production pipeline. We predict protein expression and solubility with accuracies of 70% and 80%, respectively, based on a subset of key properties (aromaticity, hydropathy and isoelectric point). We guide the selection of protein fragments based on these characteristics to optimize high-throughput experimentation.
Availability and implementation
We present the machine learning workflow as a series of IPython notebooks hosted on GitHub (https://github.com/SBRG/Protein_ML). The workflow can be used as a template for analysis of further expression and solubility datasets.
Contact
ebrunk@ucsd.edu or johanr@biotech.kth.se.
Supplementary information
Supplementary data are available at Bioinformatics online.
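Two of the sequence properties named in this abstract, aromaticity and hydropathy, are simple to compute from a sequence. The sketch below is an illustration only and is not taken from the authors' notebooks: it assumes aromaticity is the fraction of F, W and Y residues and that hydropathy is the mean Kyte-Doolittle value (GRAVY). Isoelectric point is omitted because it needs a pKa table and an iterative solve.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def aromaticity(seq):
    """Fraction of aromatic residues (F, W, Y) in the sequence."""
    return sum(seq.count(aa) for aa in "FWY") / len(seq)


def gravy(seq):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


# Hypothetical peptide used only to exercise the two functions.
peptide = "FWYA"
features = {"aromaticity": aromaticity(peptide), "gravy": gravy(peptide)}
```

Both functions expect upper-case one-letter codes for the standard residues; `gravy` raises `KeyError` on anything else, which makes unexpected characters visible rather than silently skewing the average.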
Affiliation(s)
- Anand Sastry
- Department of Bioengineering, University of California, San Diego, CA, USA
- Jonathan Monk
- Department of Bioengineering, University of California, San Diego, CA, USA
- Hanna Tegel
- KTH - Royal Institute of Technology, Department of Proteomics and Nanobiotechnology, SE-106 91 Stockholm, Sweden
- Mathias Uhlen
- KTH - Royal Institute of Technology, Department of Proteomics and Nanobiotechnology, SE-106 91 Stockholm, Sweden; The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2800 Lyngby, Denmark
- Bernhard O Palsson
- Department of Bioengineering, University of California, San Diego, CA, USA; The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2800 Lyngby, Denmark
- Johan Rockberg
- KTH - Royal Institute of Technology, Department of Proteomics and Nanobiotechnology, SE-106 91 Stockholm, Sweden
- Elizabeth Brunk
- Department of Bioengineering, University of California, San Diego, CA, USA; The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2800 Lyngby, Denmark
|
19
|
Chang CCH, Li C, Webb GI, Tey B, Song J, Ramanan RN. Periscope: quantitative prediction of soluble protein expression in the periplasm of Escherichia coli. Sci Rep 2016; 6:21844. [PMID: 26931649 PMCID: PMC4773868 DOI: 10.1038/srep21844] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 11/02/2015] [Accepted: 01/28/2016] [Indexed: 12/20/2022] Open
Abstract
Periplasmic expression of soluble proteins in Escherichia coli not only offers a much-simplified downstream purification process, but also enhances the probability of obtaining correctly folded and biologically active proteins. Different combinations of signal peptides and target proteins lead to different soluble protein expression levels, ranging from negligible to several grams per litre. Accurate algorithms for rational selection of promising candidates can serve as a powerful tool to complement current trial-and-error approaches. Accordingly, proteomics studies can be conducted with greater efficiency and cost-effectiveness. Here, we developed a predictor with a two-stage architecture to predict the real-valued expression level of a target protein in the periplasm. The output of the first-stage support vector machine (SVM) classifier determines which second-stage support vector regression (SVR) model is used. When tested on an independent test dataset, the predictor achieved an overall prediction accuracy of 78% and a Pearson's correlation coefficient (PCC) of 0.77. We further illustrate the relative importance of various features with respect to different models. The results indicate that the occurrence of the dipeptide glutamine and aspartic acid is the most important feature for the classification model. Finally, we provide access to the implemented predictor through the Periscope webserver, freely accessible at http://lightning.med.monash.edu/periscope/.
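The two-stage architecture described in this abstract, where a classifier's output selects which regressor produces the final value, can be sketched as a simple dispatch. Everything below is hypothetical: the class labels, the stand-in models and the function name `two_stage_predict` are illustrative assumptions, not Periscope's actual SVM or SVR models.

```python
from typing import Callable, Dict, Sequence

Features = Sequence[float]


def two_stage_predict(
    features: Features,
    classifier: Callable[[Features], str],
    regressors: Dict[str, Callable[[Features], float]],
) -> float:
    """Stage 1 picks a class label; stage 2 runs the regressor for that label."""
    label = classifier(features)
    return regressors[label](features)


# Stand-in models (assumptions): a threshold rule and two linear maps.
def toy_classifier(x: Features) -> str:
    return "high" if x[0] > 0.5 else "low"


toy_regressors = {
    "low": lambda x: 0.1 * x[0],
    "high": lambda x: 1.0 + 2.0 * x[0],
}

low_level = two_stage_predict([0.2], toy_classifier, toy_regressors)
high_level = two_stage_predict([1.0], toy_classifier, toy_regressors)
```

Splitting the problem this way lets each regressor specialize on a narrower range of expression levels, at the cost that a stage-one misclassification sends the input to the wrong regressor.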
Affiliation(s)
- Catherine Ching Han Chang
- Chemical Engineering Discipline, School of Engineering, Monash University, Jalan Lagoon Selatan 46150, Bandar Sunway, Selangor, Malaysia
- Department of Biochemistry and Molecular Biology, Monash University, Melbourne VIC 3800, Australia
- Chen Li
- Department of Biochemistry and Molecular Biology, Monash University, Melbourne VIC 3800, Australia
- Geoffrey I. Webb
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne VIC 3800, Australia
- BengTi Tey
- Chemical Engineering Discipline, School of Engineering, Monash University, Jalan Lagoon Selatan 46150, Bandar Sunway, Selangor, Malaysia
- Advanced Engineering Platform, School of Engineering, Monash University, Jalan Lagoon Selatan 46150, Bandar Sunway, Selangor, Malaysia
- Jiangning Song
- Department of Biochemistry and Molecular Biology, Monash University, Melbourne VIC 3800, Australia
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne VIC 3800, Australia
- National Engineering Laboratory for Industrial Enzymes, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
- Ramakrishnan Nagasundara Ramanan
- Chemical Engineering Discipline, School of Engineering, Monash University, Jalan Lagoon Selatan 46150, Bandar Sunway, Selangor, Malaysia
- Advanced Engineering Platform, School of Engineering, Monash University, Jalan Lagoon Selatan 46150, Bandar Sunway, Selangor, Malaysia
- School of Chemistry, Monash University, Melbourne VIC 3800, Australia
|
20
|
Ranganarayanan P, Thanigesan N, Ananth V, Jayaraman VK, Ramakrishnan V. Identification of Glucose-Binding Pockets in Human Serum Albumin Using Support Vector Machine and Molecular Dynamics Simulations. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2016; 13:148-157. [PMID: 26886739 DOI: 10.1109/tcbb.2015.2415806] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 06/05/2023]
Abstract
Human Serum Albumin (HSA) has been suggested to be an alternate biomarker to the existing Hemoglobin-A1c (HbA1c) marker for glycemic monitoring. Development and usage of HSA as an alternate biomarker requires the identification of glycation sites, or equivalently, glucose-binding pockets. In this work, we combine molecular dynamics simulations of HSA and the state-of-the-art machine learning method Support Vector Machine (SVM) to predict glucose-binding pockets in HSA. SVM uses the three-dimensional arrangement of atoms and their chemical properties to predict the glucose-binding ability of a pocket. Feature selection reveals that the arrangement of atoms and their chemical properties within the first 4 Å from the centroid of the pocket play an important role in the binding of glucose. With a 10-fold cross-validation accuracy of 84 percent, our SVM model reveals seven new potential glucose-binding sites in HSA, of which two are exposed only during the dynamics of HSA. The predictions are further corroborated using docking studies. These findings can complement studies directed towards the development of HSA as an alternate biomarker for glycemic monitoring.
|
21
|
|
22
|
Chen YF, Huang PC, Lin KC, Lin HH, Wang LE, Cheng CC, Chen TP, Chan YK, Chiang JY. Semi-automatic segmentation and classification of Pap smear cells. IEEE J Biomed Health Inform 2014; 18:94-108. [PMID: 24403407 DOI: 10.1109/jbhi.2013.2250984] [Citation(s) in RCA: 43] [Impact Index Per Article: 4.3] [Indexed: 11/09/2022]
Abstract
Cytologic screening has been widely used for detecting cervical cancer. In this study, a semiautomatic PC-based cellular image analysis system was developed for segmenting nuclear and cytoplasmic contours and for computing morphometric and textual features to train support vector machine (SVM) classifiers to classify four different types of cells and to discriminate dysplastic from normal cells. A software program incorporating functions such as image reviewing and standardized denomination of file names was also designed to facilitate and standardize the workflow of cell analyses. Two experiments were conducted to verify the classification performance. The cross-validation results of the first experiment showed that average accuracies of 97.16% and 98.83%, respectively, for differentiating four different types of cells and for discriminating dysplastic from normal cells were achieved using salient features (8 for four-cluster and 7 for two-cluster classifiers) selected with SVM recursive feature addition. In the second experiment, 70% (837) of the cell images were used for training and 30% (361) for testing, achieving accuracies of 96.12% and 98.61% for four-cluster and two-cluster classifiers, respectively. The proposed system provides a feasible and effective tool for evaluating cytologic specimens.
|
23
|
Prediction of soluble heterologous protein expression levels in Escherichia coli from sequence-based features and its potential in biopharmaceutical process development. ACTA ACUST UNITED AC 2014. [DOI: 10.4155/pbp.14.23] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/17/2022]
|
24
|
A review of machine learning methods to predict the solubility of overexpressed recombinant proteins in Escherichia coli. BMC Bioinformatics 2014; 15:134. [PMID: 24885721 PMCID: PMC4098780 DOI: 10.1186/1471-2105-15-134] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.5] [Received: 09/04/2013] [Accepted: 03/25/2014] [Indexed: 12/14/2022] Open
Abstract
Background
Over the last 20 years in biotechnology, the production of recombinant proteins has been a crucial bioprocess in both the biopharmaceutical and research arenas in terms of human health, scientific impact and economic volume. Although logical strategies of genetic engineering have been established, protein overexpression is still an art. In particular, heterologous expression is often hindered by low levels of production and frequent failure for opaque reasons. The problem is accentuated because there is no generic solution available to enhance heterologous overexpression. For a given protein, the extent of its solubility can indicate the quality of its function. Over 30% of synthesized proteins are not soluble. In certain experimental circumstances, including temperature, expression host, etc., protein solubility is a feature eventually defined by its sequence. Until now, numerous methods based on machine learning have been proposed to predict the solubility of a protein merely from its amino acid sequence. In spite of the 20 years of research on the matter, no comprehensive review is available on the published methods.
Results
This paper presents an extensive review of the existing models to predict protein solubility in the Escherichia coli recombinant protein overexpression system. The models are investigated and compared regarding the datasets used, features, feature selection methods, machine learning techniques and accuracy of prediction. A discussion on the models is provided at the end.
Conclusions
This study aims to investigate extensively the machine learning based methods to predict recombinant protein solubility, so as to offer a general as well as a detailed understanding for researchers in the field. Some of the models present acceptable prediction performances and convenient user interfaces. These models can be considered as valuable tools to predict recombinant protein overexpression results before performing real laboratory experiments, thus saving labour, time and cost.
|
25
|
Chang CCH, Tey BT, Song J, Ramanan RN. Towards more accurate prediction of protein folding rates: a review of the existing web-based bioinformatics approaches. Brief Bioinform 2014; 16:314-24. [DOI: 10.1093/bib/bbu007] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Indexed: 01/13/2023] Open
|
26
|
Hirose S, Noguchi T. ESPRESSO: a system for estimating protein expression and solubility in protein expression systems. Proteomics 2013; 13:1444-56. [PMID: 23436767 DOI: 10.1002/pmic.201200175] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.1] [Received: 04/29/2012] [Revised: 01/27/2013] [Accepted: 02/06/2013] [Indexed: 11/11/2022]
Abstract
Recombinant protein technology is essential for conducting protein science and using proteins as materials in pharmaceutical or industrial applications. Although obtaining soluble proteins is still a major experimental obstacle, knowledge about protein expression/solubility under standard conditions may increase the efficiency and reduce the cost of proteomics studies. In this study, we present a computational approach to estimate the probability of protein expression and solubility for two different protein expression systems: in vivo Escherichia coli and wheat germ cell-free, from only the sequence information. It implements two kinds of methods: a sequence/predicted structural property-based method that uses both the sequence and predicted structural features, and a sequence pattern-based method that utilizes the occurrence frequencies of sequence patterns. In the benchmark test, the proposed methods obtained F-scores of around 70%, and outperformed publicly available servers. Applying the proposed methods to genomic data revealed that proteins associated with translation or transcription have a strong tendency to be expressed as soluble proteins by the in vivo E. coli expression system. The sequence pattern-based method also has the potential to indicate a candidate region for modification, to increase protein solubility. All methods are available for free at the ESPRESSO server (http://mbs.cbrc.jp/ESPRESSO).
Affiliation(s)
- Shuichi Hirose
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan.
|
27
|
AcalPred: a sequence-based tool for discriminating between acidic and alkaline enzymes. PLoS One 2013; 8:e75726. [PMID: 24130738 PMCID: PMC3794003 DOI: 10.1371/journal.pone.0075726] [Citation(s) in RCA: 81] [Impact Index Per Article: 7.4] [Received: 06/19/2013] [Accepted: 08/16/2013] [Indexed: 11/19/2022] Open
Abstract
The structure and activity of enzymes are influenced by the pH value of their surroundings. Although many enzymes work well in the pH range from 6 to 8, some specific enzymes have good efficiencies only in acidic (pH<5) or alkaline (pH>9) solution. Studies have demonstrated that the activities of enzymes correlate with their primary sequences. Judging an enzyme's adaptation to an acidic or alkaline environment from its amino acid sequence is crucial for clarifying molecular mechanisms and designing highly efficient enzymes. In this study, we developed a sequence-based method to discriminate acidic enzymes from alkaline enzymes. The analysis of variance was used to choose the optimized discriminating features derived from g-gap dipeptide compositions, and a support vector machine was utilized to establish the prediction model. In the rigorous jackknife cross-validation, an overall accuracy of 96.7% was achieved. The method can correctly predict 96.3% of acidic and 97.1% of alkaline enzymes. Comparison between the proposed method and previous methods demonstrates that the proposed method is more accurate. On the basis of this proposed method, we have built an online web-server called AcalPred which can be freely accessed from the website (http://lin.uestc.edu.cn/server/AcalPred). We believe that AcalPred will become a powerful tool to study enzyme adaptation to acidic or alkaline environments.
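The g-gap dipeptide composition used as the feature source in this abstract counts pairs of residues separated by g intervening positions and normalizes by the number of such pairs. The sketch below is a minimal illustration under that definition; the function name and the toy sequence are hypothetical and it is not the AcalPred code.

```python
from collections import Counter


def g_gap_dipeptide_composition(seq, g):
    """Frequency of residue pairs (seq[i], seq[i + g + 1]); g = 0 gives adjacent dipeptides."""
    n_pairs = len(seq) - g - 1
    if n_pairs <= 0:
        return {}
    counts = Counter(seq[i] + seq[i + g + 1] for i in range(n_pairs))
    return {pair: count / n_pairs for pair, count in counts.items()}


# Toy sequence: with g = 1 every pair skips exactly one residue.
adjacent = g_gap_dipeptide_composition("ACACA", 0)
one_gap = g_gap_dipeptide_composition("ACACA", 1)
```

For the 20 standard amino acids this yields up to 400 features per value of g, which is why a feature selection step such as the analysis of variance mentioned in the abstract is typically applied before training.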
|
28
|
Xiaohui N, Nana L, Jingbo X, Dingyan C, Yuehua P, Yang X, Weiquan W, Dongming W, Zengzhen W. Using the concept of Chou's pseudo amino acid composition to predict protein solubility: An approach with entropies in information theory. J Theor Biol 2013; 332:211-7. [DOI: 10.1016/j.jtbi.2013.03.010] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.1] [Received: 08/18/2012] [Revised: 03/10/2013] [Accepted: 03/11/2013] [Indexed: 11/15/2022]
|
29
|
Chang CCH, Song J, Tey BT, Ramanan RN. Bioinformatics approaches for improved recombinant protein production in Escherichia coli: protein solubility prediction. Brief Bioinform 2013; 15:953-62. [DOI: 10.1093/bib/bbt057] [Citation(s) in RCA: 52] [Impact Index Per Article: 4.7] [Indexed: 11/13/2022] Open
|
30
|
Guilloux A, Caudron B, Jestin JL. A method to predict edge strands in beta-sheets from protein sequences. Comput Struct Biotechnol J 2013; 7:e201305001. [PMID: 24688737 PMCID: PMC3962219 DOI: 10.5936/csbj.201305001] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 02/22/2013] [Revised: 05/27/2013] [Accepted: 05/30/2013] [Indexed: 12/15/2022] Open
Abstract
There is a need for rules allowing three-dimensional structure information to be derived from protein sequences. In this work, consideration of an elementary protein folding step allows protein sub-sequences which optimize folding to be derived for any given protein sequence. Classical mechanics applied to this system and the energy conservation law during the elementary folding step yields an equation whose solutions are taken over the field of rational numbers. This formalism is applied to beta-sheets containing two edge strands and at least two central strands. The number of protein sub-sequences optimized for folding per amino acid in beta-strands is shown in particular to predict edge strands from protein sequences. Topological information on beta-strands and loops connecting them is derived for protein sequences with a prediction accuracy of 75%. The statistical significance of the finding is given. Applications in protein structure prediction are envisioned such as for the quality assessment of protein structure models.
Affiliation(s)
- Antonin Guilloux
- Analyse algébrique, Institut de Mathématiques de Jussieu, Université Pierre et Marie Curie, Paris VI, France
- Bernard Caudron
- Centre d'Informatique pour la Biologie, Institut Pasteur, Paris, France
|
31
|
Singh GP, Dash D. Electrostatic mis-interactions cause overexpression toxicity of proteins in E. coli. PLoS One 2013; 8:e64893. [PMID: 23734225 PMCID: PMC3667126 DOI: 10.1371/journal.pone.0064893] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 02/23/2013] [Accepted: 04/19/2013] [Indexed: 01/28/2023] Open
Abstract
A majority of E. coli proteins when overexpressed inhibit its growth, but the reasons behind overexpression toxicity of proteins remain unknown. Understanding the mechanism of overexpression toxicity is important from evolutionary, biotechnological and possibly clinical perspectives. Here we study sequence and functional features of cytosolic proteins of E. coli associated with overexpression toxicity to understand its mechanism. We find that the number of positively charged residues is significantly higher in proteins showing overexpression toxicity. Very long proteins also show high overexpression toxicity. Among the functional classes, transcription factors and regulatory proteins are enriched in toxic proteins, while catalytic proteins are depleted. Overexpression toxicity could be predicted with reasonable accuracy using these few properties. The importance of charged residues in overexpression toxicity indicates that nonspecific electrostatic interactions resulting from protein overexpression cause toxicity of these proteins and suggests ways to improve the expression level of native and foreign proteins in E. coli for basic research and biotechnology. These results might also be applicable to other bacterial species.
Affiliation(s)
- Gajinder Pal Singh
- G. N. Ramachandran Knowledge Center for Genome Informatics, Institute of Genomics and Integrative Biology (Council of Scientific and Industrial Research), Delhi, India.
32
Fang Y, Fang J. Discrimination of soluble and aggregation-prone proteins based on sequence information. Mol Biosyst 2013; 9:806-11. [PMID: 23440081 PMCID: PMC3627541 DOI: 10.1039/c3mb70033j] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Indexed: 01/05/2023]
Abstract
Understanding the factors governing protein solubility is a key to grasp the mechanisms of protein solubility and may provide insight into protein aggregation and misfolding related diseases such as Alzheimer's disease. In this work, we attempt to identify factors important to protein solubility using feature selection. Firstly, we calculate 1438 features including physicochemical properties and statistics for each protein. Random Forest algorithm is used to select the most informative and the minimal subset of features based on their predictive performance. A predictive model is built based on 17 selected features. Compared with previous models, our model achieves better performance with a sensitivity of 0.82, specificity 0.85, ACC 0.84, AUC 0.91 and MCC 0.67. Furthermore, a model using a redundancy-reduced dataset (sequence identity <= 30%) achieves the same performance as the model without redundancy reduction. Our results provide not only a reliable model for predicting protein solubility but also a list of features important to protein solubility. The predictive model is implemented as a freely available web application at .
Affiliation(s)
- Yaping Fang
- Applied Bioinformatics Laboratory, The University of Kansas, 2034 Becker Dr., Lawrence, Kansas 66047, USA.
33
Current state and recent advances in biopharmaceutical production in Escherichia coli, yeasts and mammalian cells. J Ind Microbiol Biotechnol 2013; 40:257-74. [PMID: 23385853 DOI: 10.1007/s10295-013-1235-0] [Citation(s) in RCA: 139] [Impact Index Per Article: 12.6] [Received: 11/05/2012] [Accepted: 01/22/2013] [Indexed: 12/28/2022]
Abstract
Almost all of the 200 or so approved biopharmaceuticals have been produced in one of three host systems: the bacterium Escherichia coli, yeasts (Saccharomyces cerevisiae, Pichia pastoris) and mammalian cells. We describe the most widely used methods for the expression of recombinant proteins in the cytoplasm or periplasm of E. coli, as well as strategies for secreting the product to the growth medium. Recombinant expression in E. coli influences the cell physiology and triggers a stress response, which has to be considered in process development. Increased expression of a functional protein can be achieved by optimizing the gene, plasmid, host cell, and fermentation process. Relevant properties of two yeast expression systems, S. cerevisiae and P. pastoris, are summarized. Optimization of expression in S. cerevisiae has focused mainly on increasing the secretion, which is otherwise limiting. P. pastoris was recently approved as a host for biopharmaceutical production for the first time. It enables high-level protein production and secretion. Additionally, genetic engineering has resulted in its ability to produce recombinant proteins with humanized glycosylation patterns. Several mammalian cell lines of either rodent or human origin are also used in biopharmaceutical production. Optimization of their expression has focused on clonal selection, interference with epigenetic factors and genetic engineering. Systemic optimization approaches are applied to all cell expression systems. They feature parallel high-throughput techniques, such as DNA microarray, next-generation sequencing and proteomics, and enable simultaneous monitoring of multiple parameters. Systemic approaches, together with technological advances such as disposable bioreactors and microbioreactors, are expected to lead to increased quality and quantity of biopharmaceuticals, as well as to reduced product development times.
34
Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition. BMC Bioinformatics 2012; 13 Suppl 17:S3. [PMID: 23282103 PMCID: PMC3521471 DOI: 10.1186/1471-2105-13-s17-s3] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.9] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Existing methods for predicting protein solubility on overexpression in Escherichia coli advance performance by using ensemble classifiers such as two-stage support vector machine (SVM) based classifiers and a number of feature types such as physicochemical properties, amino acid and dipeptide composition, accompanied with feature selection. It is desirable to develop a simple and easily interpretable method for predicting protein solubility, compared to existing complex SVM-based methods. RESULTS This study proposes a novel scoring card method (SCM) by using dipeptide composition only to estimate solubility scores of sequences for predicting protein solubility. SCM calculates the propensities of 400 individual dipeptides to be soluble using statistic discrimination between soluble and insoluble proteins of a training data set. Consequently, the propensity scores of all dipeptides are further optimized using an intelligent genetic algorithm. The solubility score of a sequence is determined by the weighted sum of all propensity scores and dipeptide composition. To evaluate SCM by performance comparisons, four data sets with different sizes and variation degrees of experimental conditions were used. The results show that the simple method SCM with interpretable propensities of dipeptides has promising performance, compared with existing SVM-based ensemble methods with a number of feature types. Furthermore, the propensities of dipeptides and solubility scores of sequences can provide insights to protein solubility. For example, the analysis of dipeptide scores shows high propensity of α-helix structure and thermophilic proteins to be soluble. CONCLUSIONS The propensities of individual dipeptides to be soluble are varied for proteins under altered experimental conditions. 
To predict protein solubility accurately using SCM, it is better to customize the score card of dipeptide propensities by using a training data set obtained under the same specified experimental conditions. The proposed method SCM with solubility scores and dipeptide propensities can be easily applied to protein function prediction problems in which dipeptide composition features play an important role. AVAILABILITY The datasets used, source codes of SCM, and supplementary files are available at http://iclab.life.nctu.edu.tw/SCM/.
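The scoring idea in this abstract (a weighted sum of dipeptide propensities and dipeptide composition) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the starting score card is assumed here to be the difference in mean dipeptide composition between the soluble and insoluble training sets, and the genetic-algorithm optimization of the scores is omitted.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(seq):
    """Relative frequency of each of the 400 dipeptides in a sequence."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts[d] for d in DIPEPTIDES)
    if total == 0:
        return {d: 0.0 for d in DIPEPTIDES}
    return {d: counts[d] / total for d in DIPEPTIDES}


def initial_propensities(soluble, insoluble):
    """Assumed starting score card: mean composition in soluble proteins
    minus mean composition in insoluble proteins, per dipeptide."""
    def mean_composition(seqs):
        comps = [dipeptide_composition(s) for s in seqs]
        return {d: sum(c[d] for c in comps) / len(comps) for d in DIPEPTIDES}

    pos = mean_composition(soluble)
    neg = mean_composition(insoluble)
    return {d: pos[d] - neg[d] for d in DIPEPTIDES}


def solubility_score(seq, propensities):
    """Weighted sum of propensity scores and dipeptide composition."""
    comp = dipeptide_composition(seq)
    return sum(propensities[d] * comp[d] for d in DIPEPTIDES)
```

A sequence would then be called soluble when its score exceeds a threshold chosen on the training data; the threshold selection is likewise left out of this sketch.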
35
O’Malley CJ, Montague GA, Martin EB, Liddell JM, Kara B, Titchener-Hooker NJ. Utilisation of key descriptors from protein sequence data to aid bioprocess route selection. Food Bioprod Process 2012. [DOI: 10.1016/j.fbp.2012.01.005] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 12/01/2022]
36
Joseph S, Karnik S, Nilawe P, Jayaraman VK, Idicula-Thomas S. ClassAMP: a prediction tool for classification of antimicrobial peptides. IEEE/ACM Trans Comput Biol Bioinform 2012; 9:1535-1538. [PMID: 22732690 DOI: 10.1109/tcbb.2012.89] [Citation(s) in RCA: 89] [Impact Index Per Article: 7.4] [Indexed: 06/01/2023]
Abstract
Antimicrobial peptides (AMPs) are gaining popularity as anti-infective agents. Information on sequence features that contribute to target specificity of AMPs will aid in accelerating drug discovery programs involving them. In this study, an algorithm called ClassAMP using Random Forests (RFs) and Support Vector Machines (SVMs) has been developed to predict the propensity of a protein sequence to have antibacterial, antifungal, or antiviral activity. ClassAMP is available at http://www.bicnirrh.res.in/classamp/.
Affiliation(s)
- Shaini Joseph
- Biomedical Informatics Center of Indian Council of Medical Research, National Institute for Research in Reproductive Health, Parel, Mumbai, Maharashtra, India.
37
Lin HH, Tseng LY. Prediction of disulfide bonding pattern based on a support vector machine and multiple trajectory search. Inf Sci (N Y) 2012. [DOI: 10.1016/j.ins.2012.02.035] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 10/28/2022]
38
Predict mycobacterial proteins subcellular locations by incorporating pseudo-average chemical shift into the general form of Chou’s pseudo amino acid composition. J Theor Biol 2012; 304:88-95. [DOI: 10.1016/j.jtbi.2012.03.017] [Citation(s) in RCA: 89] [Impact Index Per Article: 7.4] [Received: 12/17/2011] [Revised: 03/13/2012] [Accepted: 03/14/2012] [Indexed: 11/18/2022]
39
Tokmakov AA, Kurotani A, Takagi T, Toyama M, Shirouzu M, Fukami Y, Yokoyama S. Multiple post-translational modifications affect heterologous protein synthesis. J Biol Chem 2012; 287:27106-16. [PMID: 22674579 DOI: 10.1074/jbc.m112.366351] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.1] [Indexed: 12/31/2022] Open
Abstract
Post-translational modifications (PTMs) are required for proper folding of many proteins. The low capacity for PTMs hinders the production of heterologous proteins in the widely used prokaryotic systems of protein synthesis. Until now, a systematic and comprehensive study concerning the specific effects of individual PTMs on heterologous protein synthesis has not been presented. To address this issue, we expressed 1488 human proteins and their domains in a bacterial cell-free system, and we examined the correlation of the expression yields with the presence of multiple PTM sites bioinformatically predicted in these proteins. This approach revealed a number of previously unknown statistically significant correlations. Prediction of some PTMs, such as myristoylation, glycosylation, palmitoylation, and disulfide bond formation, was found to significantly worsen protein amenability to soluble expression. The presence of other PTMs, such as aspartyl hydroxylation, C-terminal amidation, and Tyr sulfation, did not correlate with the yield of heterologous protein expression. Surprisingly, the predicted presence of several PTMs, such as phosphorylation, ubiquitination, SUMOylation, and prenylation, was associated with the increased production of properly folded soluble proteins. The plausible rationales for the existence of the observed correlations are presented. Our findings suggest that identification of potential PTMs in polypeptide sequences can be of practical use for predicting expression success and optimizing heterologous protein synthesis. In sum, this study provides the most compelling evidence so far for the role of multiple PTMs in the stability and solubility of heterologously expressed recombinant proteins.
Affiliation(s)
- Alexander A Tokmakov
- RIKEN Systems and Structural Biology Center, University of Tokyo, Bunkyo, Tokyo 113-0033, Japan.
40
Smialowski P, Doose G, Torkler P, Kaufmann S, Frishman D. PROSO II--a new method for protein solubility prediction. FEBS J 2012; 279:2192-200. [PMID: 22536855 DOI: 10.1111/j.1742-4658.2012.08603.x] [Citation(s) in RCA: 129] [Impact Index Per Article: 10.8] [Indexed: 11/29/2022]
Abstract
Many fields of science and industry depend on efficient production of active protein using heterologous expression in Escherichia coli. The solubility of proteins upon expression is dependent on their amino acid sequence. Prediction of solubility from sequence is therefore highly valuable. We present a novel machine-learning-based model called PROSO II which makes use of new classification methods and growth in experimental data to improve coverage and accuracy of solubility predictions. The classification algorithm is organized as a two-layered structure in which the output of a primary Parzen window model for sequence similarity and a logistic regression classifier of amino acid k-mer composition serve as input for a second-level logistic regression classifier. Compared with previously published research, our model is trained on five times more data than used by any other method before (82 000 proteins). When tested on a separate holdout set not used at any point of method development, our server attained the best results in comparison with other currently available methods: accuracy 75.4%, Matthews correlation coefficient 0.39, sensitivity 0.731, specificity 0.759, gain (soluble) 2.263. In summary, due to utilization of cutting-edge machine learning technologies combined with the largest currently available experimental data set, the PROSO II server constitutes a substantial improvement in protein solubility predictions. PROSO II is available at http://mips.helmholtz-muenchen.de/prosoII.
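The accuracy, Matthews correlation coefficient, sensitivity and specificity quoted in this abstract are all functions of a binary confusion matrix. The sketch below shows the standard definitions in Python; the function name and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration, and no confusion-matrix counts from the paper are used.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    tp/fn: soluble proteins predicted soluble/insoluble;
    tn/fp: insoluble proteins predicted insoluble/soluble.
    """
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    # MCC stays informative on imbalanced data; it is undefined when any
    # marginal count is zero, in which case 0.0 is returned by convention.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }
```

Reporting MCC alongside accuracy is useful because accuracy alone can look high when one class dominates the test set.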
Affiliation(s)
- Pawel Smialowski
- Department of Genome Oriented Bioinformatics, Technische Universität Muenchen, Freising, Germany.
41
Programmable bacterial catalysis - designing cells for biosynthesis of value-added compounds. FEBS Lett 2012; 586:2184-90. [DOI: 10.1016/j.febslet.2012.02.030] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 01/15/2012] [Revised: 02/16/2012] [Accepted: 02/20/2012] [Indexed: 12/26/2022]
42
Mehta CM, White ET, Litster JD. Correlation of second virial coefficient with solubility for proteins in salt solutions. Biotechnol Prog 2011; 28:163-70. [PMID: 22002946 DOI: 10.1002/btpr.724] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Received: 05/23/2011] [Revised: 08/30/2011] [Indexed: 11/08/2022]
Abstract
In this work, osmotic second virial coefficients (B(22)) were determined and correlated with the measured solubilities for the proteins α-amylase, ovalbumin, and lysozyme. The B(22) values and solubilities were determined in similar solution conditions using two salts, sodium chloride and ammonium sulfate, in an acidic pH range. An overall decrease in the solubility of the proteins (salting out) was observed at high concentrations of ammonium sulfate and sodium chloride solutions. However, for α-amylase, salting-in behavior was also observed in low concentration sodium chloride solutions. In ammonium sulfate solutions, the B(22) values are small and close to zero below 2.4 M. As the ammonium sulfate concentrations were further increased, B(22) values decreased for all systems studied. The effect of sodium chloride on B(22) varies with concentration, solution pH, and the type of protein studied. Theoretical models show a reasonable fit to the experimentally derived data of B(22) and solubility. B(22) is also directly proportional to the logarithm of the solubility values for individual proteins in salt solutions, so the log-linear empirical models developed in this work can also be used to rapidly predict solubility and B(22) values for given protein-salt systems.
Affiliation(s)
- Chirag M Mehta
- School of Chemical Engineering, The University of Queensland, St Lucia, Brisbane, QLD 4072, Australia.
43
Overton IM, Barton GJ. Computational approaches to selecting and optimising targets for structural biology. Methods 2011; 55:3-11. [PMID: 21906678 PMCID: PMC3202631 DOI: 10.1016/j.ymeth.2011.08.014] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 06/27/2011] [Revised: 08/18/2011] [Accepted: 08/22/2011] [Indexed: 11/29/2022] Open
Abstract
Selection of protein targets for study is central to structural biology and may be influenced by numerous factors. A key aim is to maximise returns for effort invested by identifying proteins with the balance of biophysical properties that are conducive to success at all stages (e.g. solubility, crystallisation) in the route towards a high resolution structural model. Selected targets can be optimised through construct design (e.g. to minimise protein disorder), switching to a homologous protein, and selection of experimental methodology (e.g. choice of expression system) to prime for efficient progress through the structural proteomics pipeline. Here we discuss computational techniques in target selection and optimisation, with more detailed focus on tools developed within the Scottish Structural Proteomics Facility (SSPF); namely XANNpred, ParCrys, OB-Score (target selection) and TarO (target optimisation). TarO runs a large number of algorithms, searching for homologues and annotating the pool of possible alternative targets. This pool of putative homologues is presented in a ranked, tabulated format and results are also visualised as an automatically generated and annotated multiple sequence alignment. The target selection algorithms each predict the propensity of a selected protein target to progress through the experimental stages leading to diffracting crystals. This single predictor approach has advantages for target selection, when compared with an approach using two or more predictors that each predict for success at a single experimental stage. The tools described here helped SSPF achieve a high (21%) success rate in progressing cloned targets to diffraction-quality crystals.
Affiliation(s)
- Ian M Overton
- MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, Western General Hospital, Crewe Road, Edinburgh EH4 2XU, United Kingdom.
44
Restrepo-Montoya D, Pino C, Nino LF, Patarroyo ME, Patarroyo MA. NClassG+: A classifier for non-classically secreted Gram-positive bacterial proteins. BMC Bioinformatics 2011; 12:21. [PMID: 21235786 PMCID: PMC3025837 DOI: 10.1186/1471-2105-12-21] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Received: 07/25/2010] [Accepted: 01/14/2011] [Indexed: 11/16/2022] Open
Abstract
Background Most predictive methods currently available for the identification of protein secretion mechanisms have focused on classically secreted proteins. In fact, only two methods have been reported for predicting non-classically secreted proteins of Gram-positive bacteria. This study describes the implementation of a sequence-based classifier, denoted as NClassG+, for identifying non-classically secreted Gram-positive bacterial proteins. Results Several feature-based classifiers were trained using different sequence transformation vectors (frequencies, dipeptides, physicochemical factors and PSSM) and Support Vector Machines (SVMs) with Linear, Polynomial and Gaussian kernel functions. Nested k-fold cross-validation (CV) was applied to select the best models, using the inner CV loop to tune the model parameters and the outer CV group to compute the error. The parameters and Kernel functions and the combinations between all possible feature vectors were optimized using grid search. Conclusions The final model was tested against an independent set not previously seen by the model, obtaining better predictive performance compared to SecretomeP V2.0 and SecretPV2.0 for the identification of non-classically secreted proteins. NClassG+ is freely available on the web at http://www.biolisi.unal.edu.co/web-servers/nclassgpositive/
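The nested k-fold cross-validation described in this abstract (inner loop for parameter tuning by grid search, outer loop for error estimation) can be sketched in plain Python as follows. This is a generic illustration, not the NClassG+ code: `fit` and `score` are hypothetical caller-supplied callables standing in for SVM training and evaluation, and the fold counts and seeds are arbitrary assumptions.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and split it into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nested_cv(X, y, param_grid, fit, score, outer_k=5, inner_k=3):
    """Mean outer-fold score, with parameters tuned only on inner folds.

    fit(X_train, y_train, params) -> model
    score(model, X_test, y_test)  -> float (higher is better)
    """
    def subset(indices):
        return [X[i] for i in indices], [y[i] for i in indices]

    outer_folds = k_fold_indices(len(X), outer_k)
    outer_scores = []
    for o, test_idx in enumerate(outer_folds):
        train_idx = [i for f, fold in enumerate(outer_folds) if f != o for i in fold]
        # Inner folds are positions into train_idx, so the outer test fold
        # never influences parameter selection.
        inner_folds = k_fold_indices(len(train_idx), inner_k, seed=o + 1)

        def inner_score(params):
            total = 0.0
            for j, val_pos in enumerate(inner_folds):
                val_idx = [train_idx[p] for p in val_pos]
                fit_idx = [train_idx[p]
                           for f, fold in enumerate(inner_folds) if f != j
                           for p in fold]
                model = fit(*subset(fit_idx), params)
                total += score(model, *subset(val_idx))
            return total / inner_k

        best_params = max(param_grid, key=inner_score)  # grid search
        model = fit(*subset(train_idx), best_params)
        outer_scores.append(score(model, *subset(test_idx)))
    return sum(outer_scores) / len(outer_scores)
```

Keeping parameter selection inside the inner loop is what prevents the outer estimate from being optimistically biased by the tuning step.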
Affiliation(s)
- Daniel Restrepo-Montoya
- School of Medicine and Health Sciences, Universidad del Rosario, Carrera 24 No, 63C-69, Bogotá DC, Colombia
45
Magnan CN, Zeller M, Kayala MA, Vigil A, Randall A, Felgner PL, Baldi P. High-throughput prediction of protein antigenicity using protein microarray data. Bioinformatics 2010; 26:2936-43. [PMID: 20934990 PMCID: PMC2982151 DOI: 10.1093/bioinformatics/btq551] [Citation(s) in RCA: 301] [Impact Index Per Article: 21.5] [Received: 08/11/2010] [Revised: 09/08/2010] [Accepted: 09/23/2010] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Discovery of novel protective antigens is fundamental to the development of vaccines for existing and emerging pathogens. Most computational methods for predicting protein antigenicity rely directly on homology with previously characterized protective antigens; however, homology-based methods will fail to discover truly novel protective antigens. Thus, there is a significant need for homology-free methods capable of screening entire proteomes for the antigens most likely to generate a protective humoral immune response. RESULTS Here we begin by curating two types of positive data: (i) antigens that elicit a strong antibody response in protected individuals but not in unprotected individuals, using human immunoglobulin reactivity data obtained from protein microarray analyses; and (ii) known protective antigens from the literature. The resulting datasets are used to train a sequence-based prediction model, ANTIGENpro, to predict the likelihood that a protein is a protective antigen. ANTIGENpro correctly classifies 82% of the known protective antigens when trained using only the protein microarray datasets. The accuracy on the combined dataset is estimated at 76% by cross-validation experiments. Finally, ANTIGENpro performs well when evaluated on an external pathogen proteome for which protein microarray data were obtained after the initial development of ANTIGENpro. AVAILABILITY ANTIGENpro is integrated in the SCRATCH suite of predictors available at http://scratch.proteomics.ics.uci.edu. CONTACT pfbaldi@ics.uci.edu
Affiliation(s)
- Christophe N Magnan
- Institute for Genomics and Bioinformatics, School of Information and Computer Sciences, University of California, Irvine, CA 92697, USA
46
Tian Y, Deutsch C, Krishnamoorthy B. Scoring function to predict solubility mutagenesis. Algorithms Mol Biol 2010; 5:33. [PMID: 20929563 PMCID: PMC2958853 DOI: 10.1186/1748-7188-5-33] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Received: 06/01/2010] [Accepted: 10/07/2010] [Indexed: 11/16/2022] Open
Abstract
BACKGROUND Mutagenesis is commonly used to engineer proteins with desirable properties not present in the wild type (WT) protein, such as increased or decreased stability, reactivity, or solubility. Experimentalists often have to choose a small subset of mutations from a large number of candidates to obtain the desired change, and computational techniques are invaluable to make the choices. While several such methods have been proposed to predict stability and reactivity mutagenesis, solubility has not received much attention. RESULTS We use concepts from computational geometry to define a three body scoring function that predicts the change in protein solubility due to mutations. The scoring function captures both sequence and structure information. By exploring the literature, we have assembled a substantial database of 137 single- and multiple-point solubility mutations. Our database is the largest such collection with structural information known so far. We optimize the scoring function using linear programming (LP) methods to derive its weights based on training. Starting with default values of 1, we find weights in the range [0,2] so that predictions of increase or decrease in solubility are optimized. We compare the LP method to the standard machine learning techniques of support vector machines (SVM) and the Lasso. Using statistics for leave-one-out (LOO), 10-fold, and 3-fold cross validations (CV) for training and prediction, we demonstrate that the LP method performs the best overall. For the LOOCV, the LP method has an overall accuracy of 81%. AVAILABILITY Executables of programs, tables of weights, and datasets of mutants are available from the following web page: http://www.wsu.edu/~kbala/OptSolMut.html.
Affiliation(s)
- Ye Tian
- Department of Mathematics, Washington State University, Pullman, WA 99164, USA
- Bala Krishnamoorthy
- Department of Mathematics, Washington State University, Pullman, WA 99164, USA
47
He HQ, Hu JP, Liu B, Chen WZ, Wang CX. Activity, Solubility Comparison and Molecular Dynamics Simulation Analysis of Wild Type and F185K Mutant Type HIV-1 Integrase Catalytic Domain. Prog Biochem Biophys 2010. [DOI: 10.3724/sp.j.1206.2009.00126] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 02/07/2023]
48
Diaz AA, Tomba E, Lennarson R, Richard R, Bagajewicz MJ, Harrison RG. Prediction of protein solubility in Escherichia coli using logistic regression. Biotechnol Bioeng 2010; 105:374-83. [DOI: 10.1002/bit.22537] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.4] [Indexed: 11/05/2022]
49
Chan WC, Liang PH, Shih YP, Yang UC, Lin WC, Hsu CN. Learning to predict expression efficacy of vectors in recombinant protein production. BMC Bioinformatics 2010; 11 Suppl 1:S21. [PMID: 20122193 PMCID: PMC3009492 DOI: 10.1186/1471-2105-11-s1-s21] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.5] [Indexed: 11/13/2022] Open
Abstract
Background Recombinant protein production is a useful biotechnology to produce a large quantity of highly soluble proteins. Currently, the most widely used production system is to fuse a target protein into different vectors in Escherichia coli (E. coli). However, the production efficacy of different vectors varies for different target proteins. Trial-and-error is still the common practice to find out the efficacy of a vector for a given target protein. Previous studies are limited in that they assumed that proteins would be over-expressed and focused only on the solubility of expressed proteins. In fact, many pairings of vectors and proteins result in no expression. Results In this study, we applied machine learning to train prediction models to predict whether a pairing of vector-protein will express or not express in E. coli. For expressed cases, the models further predict whether the expressed proteins would be soluble. We collected a set of real cases from the clients of our recombinant protein production core facility, where six different vectors were designed and studied. This set of cases is used in both training and evaluation of our models. We evaluate three different models based on the support vector machines (SVM) and their ensembles. Unlike many previous works, these models consider the sequence of the target protein as well as the sequence of the whole fusion vector as the features. We show that a model that classifies a case into one of the three classes (no expression, inclusion body and soluble) outperforms a model that considers the nested structure of the three classes, while a model that can take advantage of the hierarchical structure of the three classes performs slightly worse but comparably to the best model. Meanwhile, compared to previous works, we show that our best method still achieves the highest prediction accuracy. Lastly, we briefly present two methods to use the trained model in the design of the recombinant protein production systems to improve the chance of highly soluble protein production. Conclusion In this paper, we show that a machine learning approach to the prediction of the efficacy of a vector for a target protein in a recombinant protein production system is promising and may complement traditional knowledge-driven study of the efficacy. We will release our program to share with other labs in the public domain when this paper is published.
Affiliation(s)
- Wen-Ching Chan
- Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan.
50
Magnan CN, Randall A, Baldi P. SOLpro: accurate sequence-based prediction of protein solubility. Bioinformatics 2009; 25:2200-7. [PMID: 19549632 DOI: 10.1093/bioinformatics/btp386] [Citation(s) in RCA: 343] [Impact Index Per Article: 22.9] [Indexed: 11/13/2022]
Abstract
MOTIVATION Protein insolubility is a major obstacle for many experimental studies. A sequence-based prediction method able to accurately predict the propensity of a protein to be soluble on overexpression could be used, for instance, to prioritize targets in large-scale proteomics projects and to identify mutations likely to increase the solubility of insoluble proteins. RESULTS Here, we first curate a large, non-redundant and balanced training set of more than 17 000 proteins. Next, we extract and study 23 groups of features computed directly or predicted (e.g. secondary structure) from the primary sequence. The data and the features are used to train a two-stage support vector machine (SVM) architecture. The resulting predictor, SOLpro, is compared directly with existing methods and shows significant improvement according to standard evaluation metrics, with an overall accuracy of over 74% estimated using multiple runs of 10-fold cross-validation.
Affiliation(s)
- Christophe N Magnan
- Institute for Genomics and Bioinformatics, School of Information and Computer Sciences, University of California, Irvine, CA, USA