Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Idicula-Thomas S, Kulkarni AJ, Kulkarni BD, Jayaraman VK, Balaji PV. A support vector machine-based method for predicting the propensity of a protein to be soluble or to form inclusion body on overexpression in Escherichia coli. Bioinformatics 2005;22:278-84. [PMID: 16332713 DOI: 10.1093/bioinformatics/bti810] [Citation(s) in RCA: 77] [Impact Index Per Article: 4.1] [Indexed: 11/13/2022]



Cited by Other Article(s)

Prabhu H, Bhosale H, Sane A, Dhadwal R, Ramakrishnan V, Valadi J. Protein feature engineering framework for AMPylation site prediction. Sci Rep 2024;14:8695. [PMID: 38622194 DOI: 10.1038/s41598-024-58450-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/13/2023] [Accepted: 03/29/2024] [Indexed: 04/17/2024]

Dutta S, Zunjare RU, Sil A, Mishra DC, Arora A, Gain N, Chand G, Chhabra R, Muthusamy V, Hossain F. Prediction of matrilineal specific patatin-like protein governing in-vivo maternal haploid induction in maize using support vector machine and di-peptide composition. Amino Acids 2024;56:20. [PMID: 38460024 DOI: 10.1007/s00726-023-03368-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/17/2023] [Accepted: 12/05/2023] [Indexed: 03/11/2024]

Lahorkar A, Bhosale H, Sane A, Ramakrishnan V, Jayaraman VK. Identification of Phase Separating Proteins With Distributed Reduced Alphabet Representations of Sequences. IEEE/ACM Trans Comput Biol Bioinform 2023;20:410-420. [PMID: 35139023 DOI: 10.1109/tcbb.2022.3149310] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]

Lv N, Zhou Z, He S, Shao X, Zhou X, Feng X, Qian Z, Zhang Y, Liu M. Identification of osteoporosis based on gene biomarkers using support vector machine. Open Med (Wars) 2022;17:1216-1227. [PMID: 35859791 PMCID: PMC9263892 DOI: 10.1515/med-2022-0507] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2021] [Revised: 04/19/2022] [Accepted: 05/15/2022] [Indexed: 11/26/2022]

Identification of Type 2 Diabetes Based on a Ten-Gene Biomarker Prediction Model Constructed Using a Support Vector Machine Algorithm. Biomed Res Int 2022;2022:1230761. [PMID: 35281591 PMCID: PMC8916865 DOI: 10.1155/2022/1230761] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/30/2021] [Revised: 11/24/2021] [Accepted: 02/20/2022] [Indexed: 11/17/2022]

Bhosale H, Ramakrishnan V, Jayaraman VK. Support vector machine-based prediction of pore-forming toxins (PFT) using distributed representation of reduced alphabets. J Bioinform Comput Biol 2021;19:2150028. [PMID: 34693886 DOI: 10.1142/s0219720021500281] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 12/20/2022]

Sohrawordi M, Hossain MA. Prediction of lysine formylation sites using support vector machine based on the sample selection from majority classes and synthetic minority over-sampling techniques. Biochimie 2021;192:125-135. [PMID: 34627982 DOI: 10.1016/j.biochi.2021.10.001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/15/2021] [Revised: 10/03/2021] [Accepted: 10/05/2021] [Indexed: 12/22/2022]

Hou Q, Kwasigroch JM, Rooman M, Pucci F. SOLart: a structure-based method to predict protein solubility and aggregation. Bioinformatics 2020;36:1445-1452. [PMID: 31603466 DOI: 10.1093/bioinformatics/btz773] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 06/03/2019] [Revised: 08/31/2019] [Accepted: 10/08/2019] [Indexed: 12/12/2022]

Abstract

MOTIVATION

The solubility of a protein is often decisive for its proper functioning. Lack of solubility is a major bottleneck in high-throughput structural genomic studies and in high-concentration protein production, and the formation of protein aggregates causes a wide variety of diseases. Since solubility measurements are time-consuming and expensive, there is a strong need for solubility prediction tools.

RESULTS

We have recently introduced solubility-dependent distance potentials that are able to unravel the role of residue-residue interactions in promoting or decreasing protein solubility. Here, we extended their construction by defining solubility-dependent potentials based on backbone torsion angles and solvent accessibility, and integrated them, together with other structure- and sequence-based features, into a random forest model trained on a set of Escherichia coli proteins with experimental structures and solubility values. We thus obtained the SOLart protein solubility predictor, whose most informative features turned out to be folding free energy differences computed from our solubility-dependent statistical potentials. SOLart performances are very good, with a Pearson correlation coefficient between experimental and predicted solubility values of almost 0.7 both in cross-validation on the training dataset and in an independent set of Saccharomyces cerevisiae proteins. On test sets of modeled structures, only a limited drop in performance is observed. SOLart can thus be used with both high-resolution and low-resolution structures, and clearly outperforms state-of-art solubility predictors. It is available through a user-friendly webserver, which is easy to use by non-expert scientists.

AVAILABILITY AND IMPLEMENTATION

The SOLart webserver is freely available at http://babylone.ulb.ac.be/SOLART/.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.


Vormittag P, Klamp T, Hubbuch J. Optimization of a Soft Ensemble Vote Classifier for the Prediction of Chimeric Virus-Like Particle Solubility and Other Biophysical Properties. Front Bioeng Biotechnol 2020;8:881. [PMID: 32850736 PMCID: PMC7411134 DOI: 10.3389/fbioe.2020.00881] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/11/2020] [Accepted: 07/09/2020] [Indexed: 01/24/2023]

Abstract

Chimeric virus-like particles (cVLPs) are protein-based nanostructures applied as investigational vaccines against infectious diseases, cancer, and immunological disorders. Low solubility of cVLP vaccine candidates is a challenge that can prevent development of these very substances. Solubility of cVLPs is typically assessed empirically, leading to high time and material requirements. Prediction of cVLP solubility in silico can aid in reducing this effort. Protein aggregation by hydrophobic interaction is an important factor driving protein insolubility. In this article, a recently developed soft ensemble vote classifier (sEVC) for the prediction of cVLP solubility was used based on 91 literature amino acid hydrophobicity scales. Optimization algorithms were developed to boost model performance, and the model was redesigned as a regression tool for ammonium sulfate concentration required for cVLP precipitation. The present dataset consists of 568 cVLPs, created by insertion of 71 different peptide sequences using eight different insertion strategies. Two optimization algorithms were developed that (I) modified the sEVC with regard to systematic misclassification based on the different insertion strategies, and (II) modified the amino acid hydrophobicity scale tables to improve classification. The second algorithm was additionally used to synthesize scales from random vectors. Compared to the unmodified model, Matthew’s Correlation Coefficient (MCC), and accuracy of the test set predictions could be elevated from 0.63 and 0.81 to 0.77 and 0.88, respectively, for the best models. This improved performance compared to literature scales was suggested to be due to a decreased correlation between synthesized scales. In these, tryptophan was identified as the most hydrophobic amino acid, i.e., the amino acid most problematic for cVLP solubility, supported by previous literature findings. 
As a case study, the sEVC was redesigned as a regression tool and applied to determine ammonium sulfate concentrations for the precipitation of cVLPs. This was evaluated with a small dataset of ten cVLPs resulting in an R² of 0.69. In summary, we propose optimization algorithms that improve sEVC model performance for the prediction of cVLP solubility, allow for the synthesis of amino acid scale tables, and further evaluate the sEVC as regression tool to predict cVLP-precipitating ammonium sulfate concentrations.
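
The abstract above reports both Matthews correlation coefficient (MCC) and accuracy. As a reminder of how the two metrics are computed from a binary confusion matrix, here is a minimal sketch. The function names and the example counts are illustrative assumptions and are not taken from the cited article.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from confusion-matrix counts; returns 0.0 when it is undefined."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts, chosen only to illustrate the arithmetic.
example = dict(tp=40, tn=40, fp=10, fn=10)
print(matthews_corrcoef(**example), accuracy(**example))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it is less easily inflated when the soluble and insoluble classes are imbalanced.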


Vormittag P, Klamp T, Hubbuch J. Ensembles of Hydrophobicity Scales as Potent Classifiers for Chimeric Virus-Like Particle Solubility - An Amino Acid Sequence-Based Machine Learning Approach. Front Bioeng Biotechnol 2020;8:395. [PMID: 32432098 PMCID: PMC7217080 DOI: 10.3389/fbioe.2020.00395] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/07/2020] [Accepted: 04/08/2020] [Indexed: 11/13/2022]

Abstract

Virus-like particles (VLPs) are protein-based nanoscale structures that show high potential as immunotherapeutics or cargo delivery vehicles. Chimeric VLPs are decorated with foreign peptides resulting in structures that confer immune responses against the displayed epitope. However, insertion of foreign sequences often results in insoluble proteins, calling for methods capable of assessing a VLP candidate's solubility in silico. The prediction of VLP solubility requires a model that can identify critical hydrophobicity-related parameters, distinguishing between VLP-forming aggregation and aggregation leading to insoluble virus protein clusters. Therefore, we developed and implemented a soft ensemble vote classifier (sEVC) framework based on chimeric hepatitis B core antigen (HBcAg) amino acid sequences and 91 publicly available hydrophobicity scales. Based on each hydrophobicity scale, an individual decision tree was induced as classifier in the sEVC. An embedded feature selection algorithm and stratified sampling proved beneficial for model construction. With a learning experiment, model performance in the space of model training set size and number of included classifiers in the sEVC was explored. Additionally, seven models were created from training data of 24-384 chimeric HBcAg constructs, which were validated by 100-fold Monte Carlo cross-validation. The models predicted external test sets of 184-544 chimeric HBcAg constructs. Best models showed a Matthew's correlation coefficient of >0.6 on the validation and the external test set. Feature selection was evaluated for classifiers with best and worst performance in the chimeric HBcAg VLP solubility scenario. Analysis of the associated hydrophobicity scales allowed for retrieval of biological information related to the mechanistic backgrounds of VLP solubility, suggesting a special role of arginine for VLP assembly and solubility. 
In the future, the developed sEVC could further be applied to hydrophobicity-related problems in other domains, such as monoclonal antibodies.


Effect of restricted dissolved oxygen on expression of Clostridium difficile toxin A subunit from E. coli. Sci Rep 2020;10:3059. [PMID: 32080292 PMCID: PMC7033237 DOI: 10.1038/s41598-020-59978-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/08/2019] [Accepted: 02/06/2020] [Indexed: 12/11/2022]

In Silico Study of Different Signal Peptides to Express Recombinant Glutamate Decarboxylase in the Outer Membrane of Escherichia coli. Int J Pept Res Ther 2019. [DOI: 10.1007/s10989-019-09986-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/25/2022]

Prediction of Breast Cancer from Imbalance Respect Using Cluster-Based Undersampling Method. J Healthc Eng 2019;2019:7294582. [PMID: 31737241 PMCID: PMC6817921 DOI: 10.1155/2019/7294582] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 11/17/2018] [Revised: 04/03/2019] [Accepted: 06/10/2019] [Indexed: 11/18/2022]

Zhang J, Chen L. Clustering-based undersampling with random over sampling examples and support vector machine for imbalanced classification of breast cancer diagnosis. Comput Assist Surg (Abingdon) 2019;24:62-72. [PMID: 31403330 DOI: 10.1080/24699322.2019.1649074] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Indexed: 10/26/2022]

Han X, Wang X, Zhou K. Develop machine learning-based regression predictive models for engineering protein solubility. Bioinformatics 2019;35:4640-4646. [DOI: 10.1093/bioinformatics/btz294] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 12/12/2018] [Revised: 03/09/2019] [Accepted: 04/17/2019] [Indexed: 11/14/2022]

Abstract

Motivation

Protein activity is a significant characteristic for recombinant proteins which can be used as biocatalysts. High activity of proteins reduces the cost of biocatalysts. A model that can predict protein activity from amino acid sequence is highly desired, as it aids experimental improvement of proteins. However, only limited data for protein activity are currently available, which prevents the development of such models. Since protein activity and solubility are correlated for some proteins, the publicly available solubility dataset may be adopted to develop models that can predict protein solubility from sequence. The models could serve as a tool to indirectly predict protein activity from sequence. In the literature, predicting protein solubility from sequence has been intensively explored, but the predicted solubility, represented in binary values by all the developed models, was not suitable for guiding experimental designs to improve protein solubility. Here we propose new machine learning (ML) models for improving protein solubility in vivo.

Results

We first implemented a novel approach that predicted protein solubility in continuous numerical values instead of binary ones. After combining it with various ML algorithms, we achieved an R² of 0.4115 when the support vector machine algorithm was used. Continuous values of solubility are more meaningful in protein engineering, as they enable researchers to choose proteins with higher predicted solubility for experimental validation, while binary values fail to distinguish proteins with the same value, since there are only two possible values and many proteins share the same one.

Availability and implementation

We present the ML workflow as a series of IPython notebooks hosted on GitHub (https://github.com/xiaomizhou616/protein_solubility). The workflow can be used as a template for analysis of other expression and solubility datasets.

Supplementary information

Supplementary data are available at Bioinformatics online.

Hou Q, Bourgeas R, Pucci F, Rooman M. Computational analysis of the amino acid interactions that promote or decrease protein solubility. Sci Rep 2018;8:14661. [PMID: 30279585 PMCID: PMC6168528 DOI: 10.1038/s41598-018-32988-w] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 06/04/2018] [Accepted: 09/11/2018] [Indexed: 11/24/2022]

Yang Y, Liu G, Liu M, Bai Z, Liu X, Dai X, Guo W. Correlation Between Protein Primary Structure and Soluble Expression Level of HSA dAb in Escherichia coli. Food Technol Biotechnol 2018;56:101-109. [PMID: 29796003 DOI: 10.17113/ftb.56.01.18.5445] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 11/12/2022]

Affiliation(s)

Yankun Yang: The Key Laboratory of Carbohydrate Chemistry and Biotechnology, School of Biotechnology, Jiangnan University, Ministry of Education, 1800 Lihu Avenue, 214122 Wuxi, PR China; National Engineering Laboratory for Cereal Fermentation Technology, Jiangnan University, 1800 Lihu Avenue, 214122 Wuxi, PR China
Guoqiang Liu: The Key Laboratory of Carbohydrate Chemistry and Biotechnology, School of Biotechnology, Jiangnan University, Ministry of Education, 1800 Lihu Avenue, 214122 Wuxi, PR China; National Engineering Laboratory for Cereal Fermentation Technology, Jiangnan University, 1800 Lihu Avenue, 214122 Wuxi, PR China
Meng Liu: National Engineering Laboratory for Cereal Fermentation Technology, Jiangnan University, 1800 Lihu Avenue, 214122 Wuxi, PR China
Zhonghu Bai: National Engineering Laboratory for Cereal Fermentation Technology, Jiangnan University, 1800 Lihu Avenue, 214122 Wuxi, PR China; Jiangsu Provincial Research Center for Bioactive Product Processing Technology, Jiangnan University, 1800 Lihu Avenue, 214122 Wuxi, PR China
Xiuxia Liu: National Engineering Laboratory for Cereal Fermentation Technology, Jiangnan University, 1800 Lihu Avenue, 214122 Wuxi, PR China; Jiangsu Provincial Research Center for Bioactive Product Processing Technology, Jiangnan University, 1800 Lihu Avenue, 214122 Wuxi, PR China
Xiaofeng Dai: National Engineering Laboratory for Cereal Fermentation Technology, Jiangnan University, 1800 Lihu Avenue, 214122 Wuxi, PR China; Jiangsu Provincial Research Center for Bioactive Product Processing Technology, Jiangnan University, 1800 Lihu Avenue, 214122 Wuxi, PR China
Wenwen Guo: Jiangsu Provincial Research Center for Bioactive Product Processing Technology, Jiangnan University, 1800 Lihu Avenue, 214122 Wuxi, PR China; The Key Laboratory of Industrial Biotechnology, Ministry of Education, School of Biotechnology, Jiangnan University, 1800 Lihu Avenue, 214122 Wuxi, PR China


Sastry A, Monk J, Tegel H, Uhlen M, Palsson BO, Rockberg J, Brunk E. Machine learning in computational biology to accelerate high-throughput protein expression. Bioinformatics 2018;33:2487-2495. [PMID: 28398465 DOI: 10.1093/bioinformatics/btx207] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 12/08/2016] [Accepted: 04/05/2017] [Indexed: 01/21/2023]

Chang CCH, Li C, Webb GI, Tey B, Song J, Ramanan RN. Periscope: quantitative prediction of soluble protein expression in the periplasm of Escherichia coli. Sci Rep 2016;6:21844. [PMID: 26931649 PMCID: PMC4773868 DOI: 10.1038/srep21844] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 11/02/2015] [Accepted: 01/28/2016] [Indexed: 12/20/2022]

Ranganarayanan P, Thanigesan N, Ananth V, Jayaraman VK, Ramakrishnan V. Identification of Glucose-Binding Pockets in Human Serum Albumin Using Support Vector Machine and Molecular Dynamics Simulations. IEEE/ACM Trans Comput Biol Bioinform 2016;13:148-157. [PMID: 26886739 DOI: 10.1109/tcbb.2015.2415806] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 06/05/2023]

Predicting recombinant protein expression experiments using molecular dynamics simulation. Chem Eng Sci 2015. [DOI: 10.1016/j.ces.2014.09.044] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 01/13/2023]

Chen YF, Huang PC, Lin KC, Lin HH, Wang LE, Cheng CC, Chen TP, Chan YK, Chiang JY. Semi-automatic segmentation and classification of Pap smear cells. IEEE J Biomed Health Inform 2014;18:94-108. [PMID: 24403407 DOI: 10.1109/jbhi.2013.2250984] [Citation(s) in RCA: 43] [Impact Index Per Article: 4.3] [Indexed: 11/09/2022]

Prediction of soluble heterologous protein expression levels in Escherichia coli from sequence-based features and its potential in biopharmaceutical process development. 2014. [DOI: 10.4155/pbp.14.23] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/17/2022]

A review of machine learning methods to predict the solubility of overexpressed recombinant proteins in Escherichia coli. BMC Bioinformatics 2014;15:134. [PMID: 24885721 PMCID: PMC4098780 DOI: 10.1186/1471-2105-15-134] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.5] [Received: 09/04/2013] [Accepted: 03/25/2014] [Indexed: 12/14/2022]

Abstract

Background

Over the last 20 years in biotechnology, the production of recombinant proteins has been a crucial bioprocess in both biopharmaceutical and research arena in terms of human health, scientific impact and economic volume. Although logical strategies of genetic engineering have been established, protein overexpression is still an art. In particular, heterologous expression is often hindered by low level of production and frequent fail due to opaque reasons. The problem is accentuated because there is no generic solution available to enhance heterologous overexpression. For a given protein, the extent of its solubility can indicate the quality of its function. Over 30% of synthesized proteins are not soluble. In certain experimental circumstances, including temperature, expression host, etc., protein solubility is a feature eventually defined by its sequence. Until now, numerous methods based on machine learning are proposed to predict the solubility of protein merely from its amino acid sequence. In spite of the 20 years of research on the matter, no comprehensive review is available on the published methods.

Results

This paper presents an extensive review of the existing models to predict protein solubility in Escherichia coli recombinant protein overexpression system. The models are investigated and compared regarding the datasets used, features, feature selection methods, machine learning techniques and accuracy of prediction. A discussion on the models is provided at the end.

Conclusions

This study aims to investigate extensively the machine learning based methods to predict recombinant protein solubility, so as to offer a general as well as a detailed understanding for researches in the field. Some of the models present acceptable prediction performances and convenient user interfaces. These models can be considered as valuable tools to predict recombinant protein overexpression results before performing real laboratory experiments, thus saving labour, time and cost.


Chang CCH, Tey BT, Song J, Ramanan RN. Towards more accurate prediction of protein folding rates: a review of the existing web-based bioinformatics approaches. Brief Bioinform 2014;16:314-24. [DOI: 10.1093/bib/bbu007] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Indexed: 01/13/2023]

Hirose S, Noguchi T. ESPRESSO: a system for estimating protein expression and solubility in protein expression systems. Proteomics 2013;13:1444-56. [PMID: 23436767 DOI: 10.1002/pmic.201200175] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.1] [Received: 04/29/2012] [Revised: 01/27/2013] [Accepted: 02/06/2013] [Indexed: 11/11/2022]

AcalPred: a sequence-based tool for discriminating between acidic and alkaline enzymes. PLoS One 2013;8:e75726. [PMID: 24130738 PMCID: PMC3794003 DOI: 10.1371/journal.pone.0075726] [Citation(s) in RCA: 81] [Impact Index Per Article: 7.4] [Received: 06/19/2013] [Accepted: 08/16/2013] [Indexed: 11/19/2022]

Xiaohui N, Nana L, Jingbo X, Dingyan C, Yuehua P, Yang X, Weiquan W, Dongming W, Zengzhen W. Using the concept of Chou's pseudo amino acid composition to predict protein solubility: An approach with entropies in information theory. J Theor Biol 2013;332:211-7. [DOI: 10.1016/j.jtbi.2013.03.010] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.1] [Received: 08/18/2012] [Revised: 03/10/2013] [Accepted: 03/11/2013] [Indexed: 11/15/2022]

Chang CCH, Song J, Tey BT, Ramanan RN. Bioinformatics approaches for improved recombinant protein production in Escherichia coli: protein solubility prediction. Brief Bioinform 2013;15:953-62. [DOI: 10.1093/bib/bbt057] [Citation(s) in RCA: 52] [Impact Index Per Article: 4.7] [Indexed: 11/13/2022]

Guilloux A, Caudron B, Jestin JL. A method to predict edge strands in beta-sheets from protein sequences. Comput Struct Biotechnol J 2013;7:e201305001. [PMID: 24688737 PMCID: PMC3962219 DOI: 10.5936/csbj.201305001] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 02/22/2013] [Revised: 05/27/2013] [Accepted: 05/30/2013] [Indexed: 12/15/2022]

Singh GP, Dash D. Electrostatic mis-interactions cause overexpression toxicity of proteins in E. coli. PLoS One 2013;8:e64893. [PMID: 23734225 PMCID: PMC3667126 DOI: 10.1371/journal.pone.0064893] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 02/23/2013] [Accepted: 04/19/2013] [Indexed: 01/28/2023]

Fang Y, Fang J. Discrimination of soluble and aggregation-prone proteins based on sequence information. Mol Biosyst 2013;9:806-11. [PMID: 23440081 PMCID: PMC3627541 DOI: 10.1039/c3mb70033j] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Indexed: 01/05/2023]

Current state and recent advances in biopharmaceutical production in Escherichia coli, yeasts and mammalian cells. J Ind Microbiol Biotechnol 2013;40:257-74. [PMID: 23385853 DOI: 10.1007/s10295-013-1235-0] [Citation(s) in RCA: 139] [Impact Index Per Article: 12.6] [Received: 11/05/2012] [Accepted: 01/22/2013] [Indexed: 12/28/2022]

Abstract

Almost all of the 200 or so approved biopharmaceuticals have been produced in one of three host systems: the bacterium Escherichia coli, yeasts (Saccharomyces cerevisiae, Pichia pastoris) and mammalian cells. We describe the most widely used methods for the expression of recombinant proteins in the cytoplasm or periplasm of E. coli, as well as strategies for secreting the product to the growth medium. Recombinant expression in E. coli influences the cell physiology and triggers a stress response, which has to be considered in process development. Increased expression of a functional protein can be achieved by optimizing the gene, plasmid, host cell, and fermentation process. Relevant properties of two yeast expression systems, S. cerevisiae and P. pastoris, are summarized. Optimization of expression in S. cerevisiae has focused mainly on increasing the secretion, which is otherwise limiting. P. pastoris was recently approved as a host for biopharmaceutical production for the first time. It enables high-level protein production and secretion. Additionally, genetic engineering has resulted in its ability to produce recombinant proteins with humanized glycosylation patterns. Several mammalian cell lines of either rodent or human origin are also used in biopharmaceutical production. Optimization of their expression has focused on clonal selection, interference with epigenetic factors and genetic engineering. Systemic optimization approaches are applied to all cell expression systems. They feature parallel high-throughput techniques, such as DNA microarray, next-generation sequencing and proteomics, and enable simultaneous monitoring of multiple parameters. Systemic approaches, together with technological advances such as disposable bioreactors and microbioreactors, are expected to lead to increased quality and quantity of biopharmaceuticals, as well as to reduced product development times.


Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition. BMC Bioinformatics 2012;13 Suppl 17:S3. [PMID: 23282103 PMCID: PMC3521471 DOI: 10.1186/1471-2105-13-s17-s3] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.9] [Indexed: 11/10/2022]

Abstract

BACKGROUND

Existing methods for predicting protein solubility on overexpression in Escherichia coli advance performance by using ensemble classifiers such as two-stage support vector machine (SVM) based classifiers and a number of feature types such as physicochemical properties, amino acid and dipeptide composition, accompanied with feature selection. It is desirable to develop a simple and easily interpretable method for predicting protein solubility, compared to existing complex SVM-based methods.

RESULTS

This study proposes a novel scoring card method (SCM) that uses dipeptide composition only to estimate solubility scores of sequences for predicting protein solubility. SCM calculates the propensities of 400 individual dipeptides to be soluble using statistical discrimination between the soluble and insoluble proteins of a training data set. The propensity scores of all dipeptides are then further optimized using an intelligent genetic algorithm. The solubility score of a sequence is determined by the weighted sum of all propensity scores and dipeptide composition. To evaluate SCM by performance comparisons, four data sets with different sizes and variation degrees of experimental conditions were used. The results show that the simple method SCM, with interpretable propensities of dipeptides, has promising performance compared with existing SVM-based ensemble methods that use a number of feature types. Furthermore, the propensities of dipeptides and solubility scores of sequences can provide insights into protein solubility. For example, the analysis of dipeptide scores shows a high propensity of α-helix structure and thermophilic proteins to be soluble.

CONCLUSIONS

The propensities of individual dipeptides to be soluble vary for proteins under altered experimental conditions. To predict protein solubility accurately using SCM, it is better to customize the score card of dipeptide propensities by using a training data set obtained under the same specified experimental conditions. The proposed method SCM, with its solubility scores and dipeptide propensities, can be easily applied to protein function prediction problems in which dipeptide composition features play an important role.

AVAILABILITY

The datasets used, the source code of SCM, and supplementary files are available at http://iclab.life.nctu.edu.tw/SCM/.
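
The scoring idea described in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' SCM code: the function names, the toy sequences, and the use of a simple difference of mean dipeptide compositions as the initial propensity are assumptions made for illustration, and the genetic-algorithm optimization step is omitted entirely.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
DIPEPTIDE_SET = set(DIPEPTIDES)


def dipeptide_composition(seq):
    """Return the fraction of each dipeptide among the overlapping pairs of seq."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    # Pairs containing non-standard residues are ignored.
    counts = Counter(p for p in pairs if p in DIPEPTIDE_SET)
    total = sum(counts.values())
    return {d: (counts[d] / total if total else 0.0) for d in DIPEPTIDES}


def mean_composition(seqs):
    """Average the dipeptide composition over a list of sequences."""
    comps = [dipeptide_composition(s) for s in seqs]
    return {d: sum(c[d] for c in comps) / len(comps) for d in DIPEPTIDES}


def initial_propensities(soluble, insoluble):
    """Assumed initial score card: mean composition in soluble minus insoluble."""
    sol = mean_composition(soluble)
    ins = mean_composition(insoluble)
    return {d: sol[d] - ins[d] for d in DIPEPTIDES}


def solubility_score(seq, propensities):
    """Weighted sum of propensity scores and the sequence's dipeptide composition."""
    comp = dipeptide_composition(seq)
    return sum(propensities[d] * comp[d] for d in DIPEPTIDES)


# Toy training data, invented purely for illustration.
toy_soluble = ["AKAKAKAK", "KAKAKA"]
toy_insoluble = ["WFWFWFWF", "FWFWFW"]
score_card = initial_propensities(toy_soluble, toy_insoluble)

soluble_like = solubility_score("AKAKAK", score_card)
insoluble_like = solubility_score("WFWFWF", score_card)
```

In this sketch a sequence would be called soluble when its score exceeds a threshold chosen on the training set; the threshold choice is likewise an assumption and is not specified in the abstract.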

O’Malley CJ, Montague GA, Martin EB, Liddell JM, Kara B, Titchener-Hooker NJ. Utilisation of key descriptors from protein sequence data to aid bioprocess route selection. Food and Bioproducts Processing 2012. [DOI: 10.1016/j.fbp.2012.01.005] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/01/2022]

Joseph S, Karnik S, Nilawe P, Jayaraman VK, Idicula-Thomas S. ClassAMP: a prediction tool for classification of antimicrobial peptides. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2012;9:1535-1538. [PMID: 22732690 DOI: 10.1109/tcbb.2012.89] [Citation(s) in RCA: 89] [Impact Index Per Article: 7.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/01/2023]

Lin HH, Tseng LY. Prediction of disulfide bonding pattern based on a support vector machine and multiple trajectory search. Inf Sci (N Y) 2012. [DOI: 10.1016/j.ins.2012.02.035] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/28/2022]

Predict mycobacterial proteins subcellular locations by incorporating pseudo-average chemical shift into the general form of Chou’s pseudo amino acid composition. J Theor Biol 2012;304:88-95. [DOI: 10.1016/j.jtbi.2012.03.017] [Citation(s) in RCA: 89] [Impact Index Per Article: 7.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/17/2011] [Revised: 03/13/2012] [Accepted: 03/14/2012] [Indexed: 11/18/2022]

Tokmakov AA, Kurotani A, Takagi T, Toyama M, Shirouzu M, Fukami Y, Yokoyama S. Multiple post-translational modifications affect heterologous protein synthesis. J Biol Chem 2012;287:27106-16. [PMID: 22674579 DOI: 10.1074/jbc.m112.366351] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/31/2022] Open

Smialowski P, Doose G, Torkler P, Kaufmann S, Frishman D. PROSO II--a new method for protein solubility prediction. FEBS J 2012;279:2192-200. [PMID: 22536855 DOI: 10.1111/j.1742-4658.2012.08603.x] [Citation(s) in RCA: 129] [Impact Index Per Article: 10.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]

Programmable bacterial catalysis - designing cells for biosynthesis of value-added compounds. FEBS Lett 2012;586:2184-90. [DOI: 10.1016/j.febslet.2012.02.030] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2012] [Revised: 02/16/2012] [Accepted: 02/20/2012] [Indexed: 12/26/2022]

Mehta CM, White ET, Litster JD. Correlation of second virial coefficient with solubility for proteins in salt solutions. Biotechnol Prog 2011;28:163-70. [PMID: 22002946 DOI: 10.1002/btpr.724] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2011] [Revised: 08/30/2011] [Indexed: 11/08/2022]

Overton IM, Barton GJ. Computational approaches to selecting and optimising targets for structural biology. Methods 2011;55:3-11. [PMID: 21906678 PMCID: PMC3202631 DOI: 10.1016/j.ymeth.2011.08.014] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2011] [Revised: 08/18/2011] [Accepted: 08/22/2011] [Indexed: 11/29/2022] Open

Restrepo-Montoya D, Pino C, Nino LF, Patarroyo ME, Patarroyo MA. NClassG+: A classifier for non-classically secreted Gram-positive bacterial proteins. BMC Bioinformatics 2011;12:21. [PMID: 21235786 PMCID: PMC3025837 DOI: 10.1186/1471-2105-12-21] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2010] [Accepted: 01/14/2011] [Indexed: 11/16/2022] Open

Magnan CN, Zeller M, Kayala MA, Vigil A, Randall A, Felgner PL, Baldi P. High-throughput prediction of protein antigenicity using protein microarray data. Bioinformatics 2010;26:2936-43. [PMID: 20934990 PMCID: PMC2982151 DOI: 10.1093/bioinformatics/btq551] [Citation(s) in RCA: 301] [Impact Index Per Article: 21.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/11/2010] [Revised: 09/08/2010] [Accepted: 09/23/2010] [Indexed: 11/14/2022] Open

Tian Y, Deutsch C, Krishnamoorthy B. Scoring function to predict solubility mutagenesis. Algorithms Mol Biol 2010;5:33. [PMID: 20929563 PMCID: PMC2958853 DOI: 10.1186/1748-7188-5-33] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2010] [Accepted: 10/07/2010] [Indexed: 11/16/2022] Open

He HQ, Hu JP, Liu B, Chen WZ, Wang CX. Activity, Solubility Comparison and Molecular Dynamics Simulation Analysis of Wild Type and F185K Mutant Type HIV-1 Integrase Catalytic Domain. Prog Biochem Biophys 2010. [DOI: 10.3724/sp.j.1206.2009.00126] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023]

Diaz AA, Tomba E, Lennarson R, Richard R, Bagajewicz MJ, Harrison RG. Prediction of protein solubility in Escherichia coli using logistic regression. Biotechnol Bioeng 2010;105:374-83. [DOI: 10.1002/bit.22537] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/05/2022]

Chan WC, Liang PH, Shih YP, Yang UC, Lin WC, Hsu CN. Learning to predict expression efficacy of vectors in recombinant protein production. BMC Bioinformatics 2010;11 Suppl 1:S21. [PMID: 20122193 PMCID: PMC3009492 DOI: 10.1186/1471-2105-11-s1-s21] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open

Abstract

Background

Recombinant protein production is a useful biotechnology to produce a large quantity of highly soluble proteins. Currently, the most widely used production system is to fuse a target protein into different vectors in Escherichia coli (E. coli). However, the production efficacy of different vectors varies for different target proteins. Trial-and-error is still the common practice to find out the efficacy of a vector for a given target protein. Previous studies are limited in that they assumed that proteins would be over-expressed and focused only on the solubility of expressed proteins. In fact, many pairings of vectors and proteins result in no expression.

Results

In this study, we applied machine learning to train prediction models to predict whether a pairing of vector-protein will express or not express in E. coli. For expressed cases, the models further predict whether the expressed proteins would be soluble. We collected a set of real cases from the clients of our recombinant protein production core facility, where six different vectors were designed and studied. This set of cases is used in both training and evaluation of our models. We evaluate three different models based on the support vector machines (SVM) and their ensembles. Unlike many previous works, these models consider the sequence of the target protein as well as the sequence of the whole fusion vector as the features. We show that a model that classifies a case into one of the three classes (no expression, inclusion body and soluble) outperforms a model that considers the nested structure of the three classes, while a model that can take advantage of the hierarchical structure of the three classes performs slightly worse than, but comparably to, the best model. Meanwhile, we show that our best method still achieves the highest prediction accuracy compared with previous works. Lastly, we briefly present two methods to use the trained model in the design of recombinant protein production systems to improve the chance of producing highly soluble protein.

Conclusion

In this paper, we show that a machine learning approach to the prediction of the efficacy of a vector for a target protein in a recombinant protein production system is promising and may complement the traditional knowledge-driven study of efficacy. We will release our program to share with other labs in the public domain when this paper is published.

Magnan CN, Randall A, Baldi P. SOLpro: accurate sequence-based prediction of protein solubility. Bioinformatics 2009;25:2200-7. [PMID: 19549632 DOI: 10.1093/bioinformatics/btp386] [Citation(s) in RCA: 343] [Impact Index Per Article: 22.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]