1
Wong DPH, Wong KH, Park S, Boël G, Hunt JF, Aalberts DP. OPT: Codon optimize gene sequences for E. coli protein overexpression. J Mol Biol 2025:168965. [PMID: 40133777 DOI: 10.1016/j.jmb.2025.168965] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2024] [Revised: 01/23/2025] [Accepted: 01/23/2025] [Indexed: 03/27/2025]
Abstract
The ability to overexpress proteins is valuable for biotechnology, but not all sequences are compatible with high yield. We previously analyzed the sequence features and mRNA folding stability of a large data set of 6,384 distinct gene constructs, and developed a model for protein yield. Our OPT.williams.edu server (1) predicts the probability an input sequence will produce protein at a high level when overexpressed in E. coli, and (2) returns optimized synonymous sequences designed to boost protein expression. Here we also present experimental evidence of the high yields of our OPT constructs for eight commercially produced proteins.
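An optimized sequence is only useful if it is synonymous, that is, if it encodes the same protein as the input. The following is a minimal sketch of that check, assuming the standard genetic code (NCBI translation table 1); the function names `translate` and `is_synonymous` are illustrative and are not part of the OPT server.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), with codons enumerated
# in T, C, A, G order at each position. '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence (reading frame 0) into one-letter amino acids."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def is_synonymous(original: str, candidate: str) -> bool:
    """Return True if both DNA sequences encode the same protein."""
    return translate(original) == translate(candidate)
```

For example, `is_synonymous("ATGGCTAAA", "ATGGCGAAG")` holds because GCT/GCG both encode alanine and AAA/AAG both encode lysine.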
Affiliation(s)
- Daniel P H Wong
  - Physics Department, Williams College, Williamstown, MA 01267, USA
- Kam-Ho Wong
  - Department of Biological Sciences, Columbia University, New York, NY 10027, USA
- Sunjae Park
  - Department of Biological Sciences, Columbia University, New York, NY 10027, USA
- Grégory Boël
  - Expression Génétique Microbienne, CNRS, Université Paris Cité, Institut de Biologie Physico-Chimique, F-75005 Paris, France
- John F Hunt
  - Department of Biological Sciences, Columbia University, New York, NY 10027, USA

2
Huang F, Gao Q, Zhou X, Guo W, Feng K, Zhu L, Huang T, Cai YD. Prediction of Solubility of Proteins in Escherichia coli Based on Functional and Structural Features Using Machine Learning Methods. Protein J 2024; 43:983-996. [PMID: 39243320 DOI: 10.1007/s10930-024-10230-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 08/21/2024] [Indexed: 09/09/2024]
Abstract
Protein solubility is a critical parameter that determines the stability, activity, and functionality of proteins, with broad and far-reaching implications in biotechnology and biochemistry. Accurate prediction and control of protein solubility are essential for successful protein expression and purification in research and industrial settings. This study gathered information on soluble and insoluble proteins. In characterizing the proteins, they were mapped to STRING and characterized by functional and structural features. All functional/structural features were integrated to create a 5768-dimensional binary vector to encode proteins. Seven feature-ranking algorithms were employed to analyze the functional/structural features, yielding seven feature lists. These lists were subjected to the incremental feature selection, incorporating four classification algorithms, one by one to build effective classification models and identify functional/structural features with classification-related importance. Some essential functional/structural features used to differentiate between soluble and insoluble proteins were identified, including GO:0009987 (intercellular communication) and GO:0022613 (ribonucleoprotein complex biogenesis). The best classification model using support vector machine as the classification algorithm and 295 optimized functional/structural features generated the F1 score of 0.825, which can be a powerful tool to differentiate soluble proteins from insoluble proteins.
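The incremental feature selection step described in this abstract can be sketched generically: given a ranked feature list, evaluate a classifier on the top-k features for increasing k and keep the best-scoring prefix. This is an illustrative outline only; the `evaluate` callback (for instance, a cross-validated F1 score from an SVM) and the function name are assumptions, not the authors' code.

```python
from typing import Callable, List, Sequence, Tuple


def incremental_feature_selection(
    ranked_features: Sequence[str],
    evaluate: Callable[[List[str]], float],
    step: int = 1,
) -> Tuple[List[str], float]:
    """Score growing prefixes of a ranked feature list and return the best one.

    `evaluate` is a hypothetical callback that trains and scores a classifier
    on the given feature subset and returns a score where higher is better.
    """
    best_subset: List[str] = []
    best_score = float("-inf")
    for k in range(step, len(ranked_features) + 1, step):
        subset = list(ranked_features[:k])
        score = evaluate(subset)
        # Strict comparison keeps the smallest subset among tied scores.
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score
```

Running the procedure once per ranking algorithm and per classifier, as the abstract describes, would then amount to calling this function for each (ranking, classifier) pair and comparing the returned scores.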
Affiliation(s)
- Feiming Huang
  - School of Life Sciences, Shanghai University, Shanghai, 200444, People's Republic of China
- Qian Gao
  - Department of Pharmacy, Shanghai Children's Medical Center, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, China
- XianChao Zhou
  - Center for Single-Cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, China
- Wei Guo
  - Key Laboratory of Stem Cell Biology, Shanghai Institutes for Biological Sciences (SIBS), Shanghai Jiao Tong University School of Medicine (SJTUSM), Chinese Academy of Sciences (CAS), Shanghai, 200030, China
- KaiYan Feng
  - Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou, 510507, China
- Lin Zhu
  - School of Information Science, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Tao Huang
  - CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Bio-Med Big Data Center, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
  - CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
- Yu-Dong Cai
  - School of Life Sciences, Shanghai University, Shanghai, 200444, People's Republic of China

3
Fraga KJ, Huang YJ, Ramelot TA, Swapna GVT, Lashawn Anak Kendary A, Li E, Korf I, Montelione GT. SpecDB: A relational database for archiving biomolecular NMR spectral data. J Magn Reson 2022; 342:107268. [PMID: 35930941 PMCID: PMC9922030 DOI: 10.1016/j.jmr.2022.107268] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/25/2022] [Revised: 06/16/2022] [Accepted: 07/06/2022] [Indexed: 05/11/2023]
Abstract
NMR is a valuable experimental tool in the structural biologist's toolkit to elucidate the structures, functions, and motions of biomolecules. The progress of machine learning, particularly in structural biology, reveals the critical importance of large, diverse, and reliable datasets in developing new methods and understanding in structural biology and science more broadly. Biomolecular NMR research groups produce large amounts of data, and there is renewed interest in organizing these data to train new, sophisticated machine learning architectures and to improve biomolecular NMR analysis pipelines. The foundational data type in NMR is the free-induction decay (FID). There are opportunities to build sophisticated machine learning methods to tackle long-standing problems in NMR data processing, resonance assignment, dynamics analysis, and structure determination using NMR FIDs. Our goal in this study is to provide a lightweight, broadly available tool for archiving FID data as it is generated at the spectrometer, and grow a new resource of FID data and associated metadata. This study presents a relational schema for storing and organizing the metadata items that describe an NMR sample and FID data, which we call Spectral Database (SpecDB). SpecDB is implemented in SQLite and includes a Python software library providing a command-line application to create, organize, query, backup, share, and maintain the database. This set of software tools and database schema allow users to store, organize, share, and learn from NMR time domain data. SpecDB is freely available under an open source license at https://github.rpi.edu/RPIBioinformatics/SpecDB.
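To illustrate the general idea of a relational store for sample and FID metadata in SQLite, here is a small sketch using Python's standard `sqlite3` module. The table names, columns, and example rows are hypothetical and much simpler than anything a real archive would need; they are not the SpecDB schema.

```python
import sqlite3

# Hypothetical two-table schema: one row per sample, one row per FID
# recorded on that sample. Not the actual SpecDB schema.
SCHEMA = """
CREATE TABLE sample (
    id     INTEGER PRIMARY KEY,
    name   TEXT NOT NULL UNIQUE,
    buffer TEXT
);
CREATE TABLE fid (
    id         INTEGER PRIMARY KEY,
    sample_id  INTEGER NOT NULL REFERENCES sample(id),
    experiment TEXT NOT NULL,
    path       TEXT NOT NULL
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema and demo rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO sample (id, name, buffer) VALUES (?, ?, ?)",
        (1, "demo_protein", "phosphate pH 6.5"),
    )
    conn.executemany(
        "INSERT INTO fid (sample_id, experiment, path) VALUES (?, ?, ?)",
        [(1, "1H-15N HSQC", "fids/hsqc/fid"), (1, "HNCA", "fids/hnca/fid")],
    )
    conn.commit()
    return conn


def experiments_for_sample(conn: sqlite3.Connection, sample_name: str) -> list:
    """List the experiment names recorded for one sample, sorted by name."""
    rows = conn.execute(
        "SELECT fid.experiment FROM fid "
        "JOIN sample ON sample.id = fid.sample_id "
        "WHERE sample.name = ? ORDER BY fid.experiment",
        (sample_name,),
    ).fetchall()
    return [row[0] for row in rows]
```

Keeping FID file paths in the database and the raw time-domain data on disk is one common design choice for this kind of archive; whether SpecDB does the same is not stated in the abstract.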
Affiliation(s)
- Keith J Fraga
  - Department of Molecular and Cellular Biology, University of California, Davis, CA 95616, USA
- Yuanpeng J Huang
  - Department of Chemistry and Chemical Biology, Center for Biotechnology and Interdisciplinary Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
- Theresa A Ramelot
  - Department of Chemistry and Chemical Biology, Center for Biotechnology and Interdisciplinary Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
- G V T Swapna
  - Department of Chemistry and Chemical Biology, Center for Biotechnology and Interdisciplinary Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180, USA; Department of Pharmacology, Robert Wood Johnson Medical School, Rutgers The State University of New Jersey, Piscataway, NJ 08854, USA
- Ethan Li
  - Department of Chemistry and Chemical Biology, Center for Biotechnology and Interdisciplinary Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
- Ian Korf
  - Department of Molecular and Cellular Biology, University of California, Davis, CA 95616, USA
- Gaetano T Montelione
  - Department of Chemistry and Chemical Biology, Center for Biotechnology and Interdisciplinary Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180, USA

4
Masnoddin M, Ling CMWV, Yusof NA. Functional Analysis of Conserved Hypothetical Proteins from the Antarctic Bacterium, Pedobacter cryoconitis Strain BG5 Reveals Protein Cold Adaptation and Thermal Tolerance Strategies. Microorganisms 2022; 10:1654. [PMID: 36014072 PMCID: PMC9415557 DOI: 10.3390/microorganisms10081654] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2022] [Revised: 08/04/2022] [Accepted: 08/12/2022] [Indexed: 11/16/2022]
Abstract
Pedobacter cryoconitis BG5 is an obligate psychrophilic bacterium that was first isolated on King George Island, Antarctica. Over the last 50 years, the West Antarctic, including King George Island, has been one of the most rapidly warming places on Earth, hence making it an excellent area to measure the resilience of living species in warmed areas exposed to the constantly changing environment due to climate change. This bacterium encodes a genome of approximately 5694 protein-coding genes. However, 35% of the gene models for this species are found to be hypothetical proteins (HP). In this study, three conserved HP genes of P. cryoconitis, designated pcbg5hp1, pcbg5hp2 and pcbg5hp12, were cloned and the proteins were expressed, purified and their functions and structures were evaluated. Real-time quantitative PCR analysis revealed that these genes were expressed constitutively, suggesting a potentially important role where the expression of these genes under an almost constant demand might have some regulatory functions in thermal stress tolerance. Functional analysis showed that these proteins maintained their activities at low and moderate temperatures. Meanwhile, a low citrate synthase aggregation at 43 °C in the presence of PCBG5HP1 suggested the characteristics of chaperone activity. Furthermore, our comparative structural analysis demonstrated that the HPs exhibited cold-adapted traits, most notably increased flexibility in their 3D structures compared to their counterparts. Concurrently, the presence of a disulphide bridge and aromatic clusters was attributed to PCBG5HP1’s unusual protein stability and chaperone activity. Thus, this suggested that the HPs examined in this study acquired strategies to maintain a balance between molecular stability and structural flexibility. Conclusively, this study has established the structure–function relationships of the HPs produced by P. cryoconitis and provided crucial experimental evidence indicating their importance in thermal stress response.
Affiliation(s)
- Makdi Masnoddin
  - Biotechnology Research Institute, Universiti Malaysia Sabah, Jalan UMS, Kota Kinabalu 88400, Sabah, Malaysia
  - Preparatory Centre for Science and Technology, Universiti Malaysia Sabah, Jalan UMS, Kota Kinabalu 88400, Sabah, Malaysia
- Nur Athirah Yusof
  - Biotechnology Research Institute, Universiti Malaysia Sabah, Jalan UMS, Kota Kinabalu 88400, Sabah, Malaysia

5
Cardoso V, Brás JLA, Costa IF, Ferreira LMA, Gama LT, Vincentelli R, Henrissat B, Fontes CMGA. Generation of a Library of Carbohydrate-Active Enzymes for Plant Biomass Deconstruction. Int J Mol Sci 2022; 23:4024. [PMID: 35409382 PMCID: PMC8999789 DOI: 10.3390/ijms23074024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2022] [Revised: 03/29/2022] [Accepted: 04/03/2022] [Indexed: 01/27/2023]
Abstract
In nature, the deconstruction of plant carbohydrates is carried out by carbohydrate-active enzymes (CAZymes). A high-throughput (HTP) strategy was used to isolate and clone 1476 genes obtained from a diverse library of recombinant CAZymes covering a variety of sequence-based families, enzyme classes, and source organisms. All genes were successfully isolated by either PCR (61%) or gene synthesis (GS) (39%) and were subsequently cloned into Escherichia coli expression vectors. Most proteins (79%) were obtained at a good yield during recombinant expression. A significantly lower number (p < 0.01) of proteins from eukaryotic (57.7%) and archaeal (53.3%) origin were soluble compared to bacteria (79.7%). Genes obtained by GS gave a significantly lower number (p = 0.04) of soluble proteins while the green fluorescent protein tag improved protein solubility (p = 0.05). Finally, a relationship between the amino acid composition and protein solubility was observed. Thus, a lower percentage of non-polar and higher percentage of negatively charged amino acids in a protein may be a good predictor for higher protein solubility in E. coli. The HTP approach presented here is a powerful tool for producing recombinant CAZymes that can be used for future studies of plant cell wall degradation. Successful production and expression of soluble recombinant proteins at a high rate opens new possibilities for the high-throughput production of targets from limitless sources.
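The composition-based observation in this abstract (fewer non-polar and more negatively charged residues associated with higher solubility) can be made concrete by computing those two fractions for a sequence. This is a sketch under stated assumptions: the residue groupings below are one common textbook convention and may differ from the classification the authors used, and no solubility threshold is implied.

```python
# Assumed residue classes (one-letter codes). The paper's exact grouping
# is not given in the abstract, so treat these sets as illustrative.
NONPOLAR = set("AVLIMFWPG")
NEGATIVE = set("DE")


def composition_fractions(protein: str) -> dict:
    """Return the fractions of non-polar and negatively charged residues."""
    seq = protein.upper()
    if not seq:
        raise ValueError("protein sequence must not be empty")
    n = len(seq)
    return {
        "nonpolar": sum(aa in NONPOLAR for aa in seq) / n,
        "negative": sum(aa in NEGATIVE for aa in seq) / n,
    }
```

For the toy peptide `"MDEAK"`, two of five residues (M, A) are non-polar and two (D, E) are negatively charged, so both fractions are 0.4.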
Affiliation(s)
- Vânia Cardoso
  - Centro de Investigação Interdisciplinar em Sanidade Animal—Faculdade de Medicina Veterinária, Universidade de Lisboa, Pólo Universitário do Alto da Ajuda, Avenida da Universidade Técnica, 1300-477 Lisboa, Portugal
  - NZYTech Ltd., Estrada do Paço do Lumiar, Campus do Lumiar, 1649-038 Lisboa, Portugal
  - Correspondence: (V.C.); (C.M.G.A.F.)
- Joana L. A. Brás
  - NZYTech Ltd., Estrada do Paço do Lumiar, Campus do Lumiar, 1649-038 Lisboa, Portugal
- Inês F. Costa
  - NZYTech Ltd., Estrada do Paço do Lumiar, Campus do Lumiar, 1649-038 Lisboa, Portugal
- Luís M. A. Ferreira
  - Centro de Investigação Interdisciplinar em Sanidade Animal—Faculdade de Medicina Veterinária, Universidade de Lisboa, Pólo Universitário do Alto da Ajuda, Avenida da Universidade Técnica, 1300-477 Lisboa, Portugal
- Luís T. Gama
  - Centro de Investigação Interdisciplinar em Sanidade Animal—Faculdade de Medicina Veterinária, Universidade de Lisboa, Pólo Universitário do Alto da Ajuda, Avenida da Universidade Técnica, 1300-477 Lisboa, Portugal
- Renaud Vincentelli
  - Centre National de la Recherche Scientifique, Unité Mixte de Recherche 7257, Université Aix-Marseille, 13288 Marseille, France
  - Institut National de la Recherche Agronomique, Unité sous Contrat 1408 Architecture et Fonction des Macromolécules Biologiques, 13288 Marseille, France
- Bernard Henrissat
  - Centre National de la Recherche Scientifique, Unité Mixte de Recherche 7257, Université Aix-Marseille, 13288 Marseille, France
  - Institut National de la Recherche Agronomique, Unité sous Contrat 1408 Architecture et Fonction des Macromolécules Biologiques, 13288 Marseille, France
  - Department of Biological Sciences, King Abdulaziz University, Jeddah 21589, Saudi Arabia
- Carlos M. G. A. Fontes
  - Centro de Investigação Interdisciplinar em Sanidade Animal—Faculdade de Medicina Veterinária, Universidade de Lisboa, Pólo Universitário do Alto da Ajuda, Avenida da Universidade Técnica, 1300-477 Lisboa, Portugal
  - NZYTech Ltd., Estrada do Paço do Lumiar, Campus do Lumiar, 1649-038 Lisboa, Portugal
  - Correspondence: (V.C.); (C.M.G.A.F.)

6
6
|
Wu X, Yu L. EPSOL: sequence-based protein solubility prediction using multidimensional embedding. Bioinformatics 2021; 37:4314-4320. [PMID: 34145885 DOI: 10.1093/bioinformatics/btab463] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 03/16/2021] [Revised: 05/18/2021] [Accepted: 06/17/2021] [Indexed: 11/14/2022]
Abstract
Motivation: The heterologous expression of recombinant protein requires host cells, such as Escherichia coli, and the solubility of protein greatly affects the protein yield. A novel and highly accurate solubility predictor that concurrently improves the production yield and minimizes production cost, and that forecasts protein solubility in an E. coli expression system before the actual experimental work, is highly sought. Results: In this paper, EPSOL, a novel deep learning architecture for the prediction of protein solubility in an E. coli expression system, which automatically obtains comprehensive protein feature representations using multidimensional embedding, is presented. EPSOL outperformed all existing sequence-based solubility predictors and achieved 0.79 in accuracy and 0.58 in Matthews correlation coefficient. The higher performance of EPSOL permits large-scale screening for sequence variants with enhanced manufacturability and predicts the solubility of new recombinant proteins in an E. coli expression system with greater reliability. Availability and implementation: EPSOL's best model and results can be downloaded from GitHub (https://github.com/LiangYu-Xidian/EPSOL). Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xiang Wu
  - School of Computer Science and Technology, Xidian University, Xi'an 710071, Shaanxi, China
- Liang Yu
  - School of Computer Science and Technology, Xidian University, Xi'an 710071, Shaanxi, China

7
Khurana S, Rawi R, Kunji K, Chuang GY, Bensmail H, Mall R. DeepSol: a deep learning framework for sequence-based protein solubility prediction. Bioinformatics 2018; 34:2605-2613. [PMID: 29554211 DOI: 10.1093/bioinformatics/bty166] [Citation(s) in RCA: 112] [Impact Index Per Article: 18.7] [Received: 11/14/2017] [Accepted: 03/13/2018] [Indexed: 01/09/2023]
Abstract
Motivation: Protein solubility plays a vital role in pharmaceutical research and production yield. For a given protein, the extent of its solubility can represent the quality of its function, and is ultimately defined by its sequence. Thus, it is imperative to develop novel, highly accurate in silico sequence-based protein solubility predictors. In this work, we propose DeepSol, a novel Deep Learning-based protein solubility predictor. The backbone of our framework is a convolutional neural network that exploits k-mer structure and additional sequence and structural features extracted from the protein sequence. Results: DeepSol outperformed all known sequence-based state-of-the-art solubility prediction methods and attained an accuracy of 0.77 and Matthews correlation coefficient of 0.55. The superior prediction accuracy of DeepSol allows screening for sequences with enhanced production capacity and more reliable prediction of the solubility of novel proteins. Availability and implementation: DeepSol's best performing models and results are publicly deposited at https://doi.org/10.5281/zenodo.1162886 (Khurana and Mall, 2018). Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sameer Khurana
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Reda Rawi
  - Vaccine Research Center, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD, USA
- Khalid Kunji
  - Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
- Gwo-Yu Chuang
  - Vaccine Research Center, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD, USA
- Halima Bensmail
  - Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
- Raghvendra Mall
  - Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar

8
Rawi R, Mall R, Kunji K, Shen CH, Kwong PD, Chuang GY. PaRSnIP: sequence-based protein solubility prediction using gradient boosting machine. Bioinformatics 2018; 34:1092-1098. [PMID: 29069295 DOI: 10.1093/bioinformatics/btx662] [Citation(s) in RCA: 67] [Impact Index Per Article: 11.2] [Received: 08/01/2017] [Accepted: 10/17/2017] [Indexed: 11/13/2022]
Abstract
Motivation: Protein solubility can be a decisive factor in both research and production efficiency, and in silico sequence-based predictors that can accurately estimate solubility outcomes are highly sought. Results: In this study, we present a novel approach termed PRotein SolubIlity Predictor (PaRSnIP), which uses a gradient boosting machine algorithm as well as an approximation of sequence and structural features of the protein of interest. Based on an independent test set, PaRSnIP outperformed other state-of-the-art sequence-based methods by more than 9% in accuracy and 0.17 in Matthews correlation coefficient, with an overall accuracy of 74% and Matthews correlation coefficient of 0.48. Additionally, PaRSnIP provides importance scores for all features used in training. We observed higher fractions of exposed residues to associate positively with protein solubility and tripeptide stretches with multiple histidines to associate negatively with solubility. The improved prediction accuracy of PaRSnIP should enable it to predict protein solubility with greater reliability and to screen for sequence variants with enhanced manufacturability. Availability and implementation: PaRSnIP software is available for download under GitHub (https://github.com/RedaRawi/PaRSnIP). Contact: gwo-yu.chuang@nih.gov. Supplementary information: Supplementary data are available at Bioinformatics online.
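Several entries in this list report accuracy together with the Matthews correlation coefficient (MCC). As a reminder of how the two relate to a binary confusion matrix, here is a short sketch; the function names are illustrative, and the counts in the usage note are toy values, not results from any of the cited papers.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when any marginal sum is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

With toy counts TP = 3, TN = 3, FP = 1, FN = 1, accuracy is 6/8 = 0.75 and MCC is (9 - 1)/16 = 0.5. Unlike accuracy, MCC stays near zero for a classifier that only predicts the majority class, which is why solubility predictors are often compared on both.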
Affiliation(s)
- Reda Rawi
  - Vaccine Research Center, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, USA
- Raghvendra Mall
  - Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, 34110, Qatar
- Khalid Kunji
  - Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, 34110, Qatar
- Chen-Hsiang Shen
  - Vaccine Research Center, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, USA
- Peter D Kwong
  - Vaccine Research Center, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, USA
- Gwo-Yu Chuang
  - Vaccine Research Center, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, USA

9
Sharafi E, Farmani J, Parizi AP, Dehestani A. In Search of Engineered Prokaryotic Chlorophyllases: A Bioinformatics Approach. BIOTECHNOL BIOPROC E 2018. [DOI: 10.1007/s12257-018-0143-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 10/27/2022]

10
Wang H, Feng L, Webb GI, Kurgan L, Song J, Lin D. Critical evaluation of bioinformatics tools for the prediction of protein crystallization propensity. Brief Bioinform 2018; 19:838-852. [PMID: 28334201 PMCID: PMC6171492 DOI: 10.1093/bib/bbx018] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 10/19/2016] [Revised: 01/19/2017] [Indexed: 12/11/2022]
Abstract
X-ray crystallography is the main tool for structural determination of proteins. Yet, the underlying crystallization process is costly, has a high attrition rate and involves a series of trial-and-error attempts to obtain diffraction-quality crystals. The Structural Genomics Consortium aims to systematically solve representative structures of major protein-fold classes using primarily high-throughput X-ray crystallography. The attrition rate of these efforts can be improved by selecting proteins that are potentially easier to crystallize. In this context, bioinformatics approaches have been developed to predict crystallization propensities based on protein sequences. These approaches are used to facilitate prioritization of the most promising target proteins, search for alternative structural orthologues of the target proteins and suggest designs of constructs capable of potentially enhancing the likelihood of successful crystallization. We reviewed and compared nine predictors of protein crystallization propensity. Moreover, we demonstrated that integrating selected outputs from multiple predictors as candidate input features to build the predictive model results in a significantly higher predictive performance when compared to using these predictors individually. Furthermore, we also introduced a new and accurate predictor of protein crystallization propensity, Crysf, which uses functional features extracted from UniProt as inputs. This comprehensive review will assist structural biologists in selecting the most appropriate predictor, and will also help bioinformaticians develop a new generation of predictive algorithms.
Affiliation(s)
- Huilin Wang
  - Department of Chemical Biology, College of Chemistry and Chemical Engineering, Xiamen University, China
- Geoffrey I Webb
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Australia
- Lukasz Kurgan
  - Department of Computer Science, Virginia Commonwealth University, USA
- Jiangning Song
  - Department of Biochemistry and Molecular Biology, Monash University, Australia
- Donghai Lin
  - Department of Chemical Biology, College of Chemistry and Chemical Engineering, Xiamen University, China

11
Saladi SM, Javed N, Müller A, Clemons WM. A statistical model for improved membrane protein expression using sequence-derived features. J Biol Chem 2018; 293:4913-4927. [PMID: 29378850 PMCID: PMC5880134 DOI: 10.1074/jbc.ra117.001052] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 11/22/2017] [Revised: 01/22/2018] [Indexed: 11/06/2022]
Abstract
The heterologous expression of integral membrane proteins (IMPs) remains a major bottleneck in the characterization of this important protein class. IMP expression levels are currently unpredictable, which renders the pursuit of IMPs for structural and biophysical characterization challenging and inefficient. Experimental evidence demonstrates that changes within the nucleotide or amino acid sequence for a given IMP can dramatically affect expression levels, yet these observations have not resulted in generalizable approaches to improve expression levels. Here, we develop a data-driven statistical predictor named IMProve that, using only sequence information, increases the likelihood of selecting an IMP that expresses in Escherichia coli. The IMProve model, trained on experimental data, combines a set of sequence-derived features resulting in an IMProve score, where higher values have a higher probability of success. The model is rigorously validated against a variety of independent data sets that contain a wide range of experimental outcomes from various IMP expression trials. The results demonstrate that use of the model can more than double the number of successfully expressed targets at any experimental scale. IMProve can immediately be used to identify favorable targets for characterization. Most notably, IMProve demonstrates for the first time that IMP expression levels can be predicted directly from sequence.
Affiliation(s)
- Shyam M Saladi
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125
- Nauman Javed
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125
- Axel Müller
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125
- William M Clemons
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125

12
Grant TD, Luft JR, Carter LG, Matsui T, Weiss TM, Martel A, Snell EH. The accurate assessment of small-angle X-ray scattering data. Acta Crystallogr D Biol Crystallogr 2015; 71:45-56. [PMID: 25615859 PMCID: PMC4304685 DOI: 10.1107/s1399004714010876] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Received: 02/11/2014] [Accepted: 05/12/2014] [Indexed: 12/05/2022]
Abstract
A set of quantitative techniques is suggested for assessing SAXS data quality. These are applied in the form of a script, SAXStats, to a test set of 27 proteins, showing that these techniques are more sensitive than manual assessment of data quality. Small-angle X-ray scattering (SAXS) has grown in popularity in recent times with the advent of bright synchrotron X-ray sources, powerful computational resources and algorithms enabling the calculation of increasingly complex models. However, the lack of standardized data-quality metrics presents difficulties for the growing user community in accurately assessing the quality of experimental SAXS data. Here, a series of metrics to quantitatively describe SAXS data in an objective manner using statistical evaluations are defined. These metrics are applied to identify the effects of radiation damage, concentration dependence and interparticle interactions on SAXS data from a set of 27 previously described targets for which high-resolution structures have been determined via X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy. The studies show that these metrics are sufficient to characterize SAXS data quality on a small sample set with statistical rigor and sensitivity similar to or better than manual analysis. The development of data-quality analysis strategies such as these initial efforts is needed to enable the accurate and unbiased assessment of SAXS data quality.
Affiliation(s)
- Thomas D Grant
  - Hauptman-Woodward Medical Research Institute, 700 Ellicott Street, Buffalo, NY 14203, USA
- Joseph R Luft
  - Hauptman-Woodward Medical Research Institute, 700 Ellicott Street, Buffalo, NY 14203, USA
- Lester G Carter
  - Stanford Synchrotron Radiation Lightsource, 2575 Sand Hill Road, MS69, Menlo Park, CA 94025, USA
- Tsutomu Matsui
  - Stanford Synchrotron Radiation Lightsource, 2575 Sand Hill Road, MS69, Menlo Park, CA 94025, USA
- Thomas M Weiss
  - Stanford Synchrotron Radiation Lightsource, 2575 Sand Hill Road, MS69, Menlo Park, CA 94025, USA
- Anne Martel
  - Stanford Synchrotron Radiation Lightsource, 2575 Sand Hill Road, MS69, Menlo Park, CA 94025, USA
- Edward H Snell
  - Hauptman-Woodward Medical Research Institute, 700 Ellicott Street, Buffalo, NY 14203, USA

13

14
Wang H, Wang M, Tan H, Li Y, Zhang Z, Song J. PredPPCrys: accurate prediction of sequence cloning, protein production, purification and crystallization propensity from protein sequences using multi-step heterogeneous feature fusion and selection. PLoS One 2014; 9:e105902. [PMID: 25148528 PMCID: PMC4141844 DOI: 10.1371/journal.pone.0105902] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Received: 04/23/2014] [Accepted: 07/25/2014] [Indexed: 01/14/2023]
Abstract
X-ray crystallography is the primary approach to solve the three-dimensional structure of a protein. However, a major bottleneck of this method is the failure of multi-step experimental procedures to yield diffraction-quality crystals, including sequence cloning, protein material production, purification, crystallization and ultimately, structural determination. Accordingly, prediction of the propensity of a protein to successfully undergo these experimental procedures based on the protein sequence may help narrow down laborious experimental efforts and facilitate target selection. A number of bioinformatics methods based on protein sequence information have been developed for this purpose. However, our knowledge on the important determinants of propensity for a protein sequence to produce high diffraction-quality crystals remains largely incomplete. In practice, most of the existing methods display poorer performance when evaluated on larger and updated datasets. To address this problem, we constructed an up-to-date dataset as the benchmark, and subsequently developed a new approach termed ‘PredPPCrys’ using the support vector machine (SVM). Using a comprehensive set of multifaceted sequence-derived features in combination with a novel multi-step feature selection strategy, we identified and characterized the relative importance and contribution of each feature type to the prediction performance of five individual experimental steps required for successful crystallization. The resulting optimal candidate features were used as inputs to build the first-level SVM predictor (PredPPCrys I). Next, prediction outputs of PredPPCrys I were used as the input to build second-level SVM classifiers (PredPPCrys II), which led to significantly enhanced prediction performance. Benchmarking experiments indicated that our PredPPCrys method outperforms most existing procedures on both up-to-date and previous datasets. 
In addition, the predicted crystallization targets of currently non-crystallizable proteins were provided as compendium data, which are anticipated to facilitate target selection and design for the worldwide structural genomics consortium. PredPPCrys is freely available at http://www.structbioinfor.org/PredPPCrys.
Affiliation(s)
- Huilin Wang
- National Engineering Laboratory for Industrial Enzymes and Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China
- Mingjun Wang
- National Engineering Laboratory for Industrial Enzymes and Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China
- Hao Tan
- Department of Biochemistry and Molecular Biology, Faculty of Medicine, Monash University, Melbourne, Victoria, Australia
- Yuan Li
- National Engineering Laboratory for Industrial Enzymes and Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China
- Ziding Zhang
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing, China
- * E-mail: (JS); (ZZ)
- Jiangning Song
- National Engineering Laboratory for Industrial Enzymes and Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China
- Department of Biochemistry and Molecular Biology, Faculty of Medicine, Monash University, Melbourne, Victoria, Australia
- ARC Centre of Excellence in Structural and Functional Microbial Genomics, Monash University, Melbourne, Victoria, Australia
- * E-mail: (JS); (ZZ)
|
15
|
Tokmakov AA. Identification of multiple physicochemical and structural properties associated with soluble expression of eukaryotic proteins in cell-free bacterial extracts. Front Microbiol 2014; 5:295. [PMID: 24999341 PMCID: PMC4064534 DOI: 10.3389/fmicb.2014.00295] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 04/10/2014] [Accepted: 05/29/2014] [Indexed: 11/17/2022]
Abstract
Bacterial extracts are widely used to synthesize recombinant proteins. Vast volumes of data have accumulated in cell-free expression databases, covering a whole range of existing proteins. This makes possible a comprehensive bioinformatics analysis and the identification of multiple features associated with protein solubility and aggregation. In the present paper, an approach to identify the multiple physicochemical and structural properties of amino acid sequences associated with soluble expression of eukaryotic proteins in cell-free bacterial extracts is presented. The method includes: (1) categorical assessment of expression data; (2) calculation and prediction of multiple properties of expressed sequences; (3) correlation of the individual properties with the expression scores; and (4) evaluation of statistical significance of the observed correlations. Using this method, a number of significant correlations between calculated and predicted properties of amino acid sequences and their propensity for soluble cell-free expression have been revealed.
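Steps (3) and (4) of the method can be illustrated with a minimal Python sketch. The abstract does not say which correlation statistic or significance test was used, so the Pearson coefficient and the permutation test below are assumptions, and the property values and expression scores are made-up illustrative numbers rather than data from the paper.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Two-sided p-value: how often shuffled scores correlate at least as strongly."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical data: one calculated sequence property per protein and a
# categorical expression score (0 = no soluble expression ... 3 = high).
property_values = [0.12, 0.35, 0.41, 0.58, 0.63, 0.77, 0.81, 0.90]
expression_scores = [3, 3, 2, 2, 1, 1, 0, 0]

r = pearson_r(property_values, expression_scores)
p = permutation_p_value(property_values, expression_scores)
print(f"r = {r:.3f}, permutation p = {p:.4f}")
```

In a real analysis this would be repeated for every property, so a multiple-testing correction would also be needed.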
|
16
|
Prediction of soluble heterologous protein expression levels in Escherichia coli from sequence-based features and its potential in biopharmaceutical process development. 2014. [DOI: 10.4155/pbp.14.23] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/17/2022]
|
17
|
A review of machine learning methods to predict the solubility of overexpressed recombinant proteins in Escherichia coli. BMC Bioinformatics 2014; 15:134. [PMID: 24885721 PMCID: PMC4098780 DOI: 10.1186/1471-2105-15-134] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Received: 09/04/2013] [Accepted: 03/25/2014] [Indexed: 12/14/2022]
Abstract
Background: Over the last 20 years in biotechnology, the production of recombinant proteins has been a crucial bioprocess in both the biopharmaceutical and research arenas in terms of human health, scientific impact and economic volume. Although logical strategies of genetic engineering have been established, protein overexpression is still an art. In particular, heterologous expression is often hindered by low levels of production and frequent failure for opaque reasons. The problem is accentuated because there is no generic solution available to enhance heterologous overexpression. For a given protein, the extent of its solubility can indicate the quality of its function. Over 30% of synthesized proteins are not soluble. In certain experimental circumstances, including temperature, expression host, etc., protein solubility is a feature eventually defined by its sequence. To date, numerous machine learning methods have been proposed to predict the solubility of a protein merely from its amino acid sequence. In spite of 20 years of research on the matter, no comprehensive review is available on the published methods. Results: This paper presents an extensive review of the existing models to predict protein solubility in the Escherichia coli recombinant protein overexpression system. The models are investigated and compared regarding the datasets used, features, feature selection methods, machine learning techniques and accuracy of prediction. A discussion on the models is provided at the end. Conclusions: This study aims to investigate extensively the machine learning based methods to predict recombinant protein solubility, so as to offer a general as well as a detailed understanding for researchers in the field. Some of the models present acceptable prediction performances and convenient user interfaces. These models can be considered as valuable tools to predict recombinant protein overexpression results before performing real laboratory experiments, thus saving labour, time and cost.
|
18
|
Tokmakov AA, Kurotani A, Shirouzu M, Fukami Y, Yokoyama S. Bioinformatics analysis and optimization of cell-free protein synthesis. Methods Mol Biol 2014; 1118:17-33. [PMID: 24395407 DOI: 10.1007/978-1-62703-782-2_2] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 12/12/2022]
Abstract
Cell-free protein synthesis offers substantial advantages over cell-based expression, allowing direct access to the protein synthetic reaction and meticulous control over the reaction conditions. Recently, we identified a number of statistically significant correlations between calculated and predicted properties of amino acid sequences and their amenability to heterologous cell-free expression. These correlations can be of practical use for predicting expression success and optimizing cell-free protein synthesis. In this chapter, we describe our approach and demonstrate how computational and predictive bioinformatics can be used to analyze and optimize cell-free protein expression.
|
19
|
Hirose S, Noguchi T. ESPRESSO: a system for estimating protein expression and solubility in protein expression systems. Proteomics 2013; 13:1444-56. [PMID: 23436767 DOI: 10.1002/pmic.201200175] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.8] [Received: 04/29/2012] [Revised: 01/27/2013] [Accepted: 02/06/2013] [Indexed: 11/11/2022]
Abstract
Recombinant protein technology is essential for conducting protein science and using proteins as materials in pharmaceutical or industrial applications. Although obtaining soluble proteins is still a major experimental obstacle, knowledge about protein expression/solubility under standard conditions may increase the efficiency and reduce the cost of proteomics studies. In this study, we present a computational approach to estimate the probability of protein expression and solubility for two different protein expression systems: in vivo Escherichia coli and wheat germ cell-free, from only the sequence information. It implements two kinds of methods: a sequence/predicted structural property-based method that uses both the sequence and predicted structural features, and a sequence pattern-based method that utilizes the occurrence frequencies of sequence patterns. In the benchmark test, the proposed methods obtained F-scores of around 70%, and outperformed publicly available servers. Applying the proposed methods to genomic data revealed that proteins associated with translation or transcription have a strong tendency to be expressed as soluble proteins by the in vivo E. coli expression system. The sequence pattern-based method also has the potential to indicate a candidate region for modification, to increase protein solubility. All methods are available for free at the ESPRESSO server (http://mbs.cbrc.jp/ESPRESSO).
Affiliation(s)
- Shuichi Hirose
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan.
|
20
|
Protein structure validation and identification from unassigned residual dipolar coupling data using 2D-PDPA. Molecules 2013; 18:10162-88. [PMID: 23973992 PMCID: PMC4090686 DOI: 10.3390/molecules180910162] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 05/23/2013] [Revised: 08/10/2013] [Accepted: 08/13/2013] [Indexed: 11/22/2022]
Abstract
More than 90% of protein structures submitted to the PDB each year are homologous to some previously characterized protein structure. The extensive resources that are required for structural characterization of proteins can be justified for the 10% of the novel structures, but not for the remaining 90%. This report presents the 2D-PDPA method, which utilizes unassigned residual dipolar coupling in order to address the economics of structure determination of routine proteins by reducing the data acquisition and processing time. 2D-PDPA has been demonstrated to successfully identify the correct structure of an array of proteins that range from 46 to 445 residues in size from a library of 619 decoy structures by using unassigned simulated RDC data. When using experimental data, 2D-PDPA successfully identified the correct NMR structures from the same library of decoy structures. In addition, the most homologous X-ray structure was also identified as the second best structural candidate. Finally, success of 2D-PDPA in identifying and evaluating the most appropriate structure from a set of computationally predicted structures in the case of a previously uncharacterized protein Pf2048.1 has been demonstrated. This protein exhibits less than 20% sequence identity to any protein with known structure and therefore presents a compelling and practical application of our proposed work.
|
21
|
Zhu D, Zhong Y, Wu H, Ye L, Wang J, Li Y, Wei Y, Ren L, Xu B, Xu J, Qin X. Predicting metachronous liver metastasis from colorectal cancer using serum proteomic fingerprinting. J Surg Res 2013; 184:861-6. [PMID: 23721930 DOI: 10.1016/j.jss.2013.04.065] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 12/18/2012] [Revised: 03/25/2013] [Accepted: 04/25/2013] [Indexed: 02/06/2023]
Abstract
BACKGROUND: There are currently no accurate predictive markers of metachronous liver metastasis (MLM) from colorectal cancer. METHODS: Magnetic bead-based fractionation coupled with mass spectrometry analysis was used to compare serum samples from 64 patients with MLM and 64 without recurrence or metastasis for at least 3 years after radical colorectal surgery (NM). A total of 40 MLM and 40 NM serum samples were randomly selected to build a decision tree, and the remainder were tested as blinded samples. Selected peptides were identified. RESULTS: The patients in the two groups were matched for gender, age, tumor location, TNM staging, and histologic differentiation grade. Preoperative serum carcinoembryonic antigen retained no independent power to predict MLM. The decision tree model with eight proteomic features (m/z 3315, 6637, 1207, 1466, 4167, 4210, 2660, and 4186) correctly classified 33 of 40 NM sera (82.5%) and 32 of 40 MLM sera (80%) in the training set and 19 of 24 NM sera (79.2%) and 17 of 24 MLM sera (70.8%) in the test set. The peptides were identified as fragments of alpha-fetoprotein, complement C4-A, fibrinogen alpha, eukaryotic peptide chain release factor GTP-binding subunit ERF3B, and angiotensinogen. CONCLUSIONS: In patients matched for gender, age, tumor location, TNM staging, and histologic differentiation grade, preoperative carcinoembryonic antigen retained no independent power to predict MLM. The decision tree model of eight proteomic features demonstrated promising value for predicting MLM in patients who underwent radical resection of colorectal cancer.
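The classification rates quoted in the abstract can be re-derived from the reported counts. The short Python sketch below does this arithmetic; treating MLM as the positive class (so that the MLM rate is the sensitivity and the NM rate is the specificity) and pooling both classes into an overall accuracy are interpretive assumptions, since the abstract reports only the per-class percentages.

```python
def rate(correct, total):
    """Percentage of samples classified correctly."""
    return 100.0 * correct / total


# Counts taken from the abstract: (correctly classified, total).
results = {
    "training": {"NM": (33, 40), "MLM": (32, 40)},
    "test": {"NM": (19, 24), "MLM": (17, 24)},
}


def summarize(counts):
    """Per-class rates plus pooled accuracy (MLM assumed to be the positive class)."""
    nm_correct, nm_total = counts["NM"]
    mlm_correct, mlm_total = counts["MLM"]
    return {
        "specificity": rate(nm_correct, nm_total),
        "sensitivity": rate(mlm_correct, mlm_total),
        "accuracy": rate(nm_correct + mlm_correct, nm_total + mlm_total),
    }


for name, counts in results.items():
    summary = summarize(counts)
    print(name, {key: round(value, 1) for key, value in summary.items()})
```

The pooled accuracy falls from 81.25% on the training set to 75.0% on the blinded test set, which is the gap a reader would want to see when judging how well the decision tree generalizes.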
Affiliation(s)
- Dexiang Zhu
- Department of General Surgery, Zhongshan Hospital, Fudan University, Shanghai, China
|
22
|
Santner AA, Croy CH, Vasanwala FH, Uversky VN, Van YYJ, Dunker AK. Sweeping away protein aggregation with entropic bristles: intrinsically disordered protein fusions enhance soluble expression. Biochemistry 2012; 51:7250-62. [PMID: 22924672 DOI: 10.1021/bi300653m] [Citation(s) in RCA: 93] [Impact Index Per Article: 7.2] [Indexed: 11/30/2022]
Abstract
Intrinsically disordered, highly charged protein sequences act as entropic bristles (EBs), which, when translationally fused to partner proteins, serve as effective solubilizers by creating both a large favorable surface area for water interactions and large excluded volumes around the partner. By extending away from the partner and sweeping out large molecules, EBs can allow the target protein to fold free from interference. Using both naturally occurring and artificial polypeptides, we demonstrate the successful implementation of intrinsically disordered fusions as protein solubilizers. The artificial fusions discussed herein have a low level of sequence complexity and a high net charge but are diversified by means of distinctive amino acid compositions and lengths. Using 6xHis fusions as controls, soluble protein expression enhancements from 65% (EB60A) to 100% (EB250) were observed for a 20-protein portfolio. Additionally, these EBs were able to more effectively solubilize targets compared to frequently used fusions such as maltose-binding protein, glutathione S-transferase, thioredoxin, and N utilization substance A. Finally, although these EBs possess very distinct physicochemical properties, they did not perturb the structure, conformational stability, or function of the green fluorescent protein or the glutathione S-transferase protein. This work thus illustrates the successful de novo design of intrinsically disordered fusions and presents a promising technology and complementary resource for researchers attempting to solubilize recalcitrant proteins.
Affiliation(s)
- Aaron A Santner
- Molecular Kinetics Inc., Indianapolis, Indiana 46268, United States
|
23
|
Tokmakov AA, Kurotani A, Takagi T, Toyama M, Shirouzu M, Fukami Y, Yokoyama S. Multiple post-translational modifications affect heterologous protein synthesis. J Biol Chem 2012; 287:27106-16. [PMID: 22674579 DOI: 10.1074/jbc.m112.366351] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.1] [Indexed: 12/31/2022]
Abstract
Post-translational modifications (PTMs) are required for proper folding of many proteins. The low capacity for PTMs hinders the production of heterologous proteins in the widely used prokaryotic systems of protein synthesis. Until now, a systematic and comprehensive study concerning the specific effects of individual PTMs on heterologous protein synthesis has not been presented. To address this issue, we expressed 1488 human proteins and their domains in a bacterial cell-free system, and we examined the correlation of the expression yields with the presence of multiple PTM sites bioinformatically predicted in these proteins. This approach revealed a number of previously unknown statistically significant correlations. Prediction of some PTMs, such as myristoylation, glycosylation, palmitoylation, and disulfide bond formation, was found to significantly worsen protein amenability to soluble expression. The presence of other PTMs, such as aspartyl hydroxylation, C-terminal amidation, and Tyr sulfation, did not correlate with the yield of heterologous protein expression. Surprisingly, the predicted presence of several PTMs, such as phosphorylation, ubiquitination, SUMOylation, and prenylation, was associated with the increased production of properly folded soluble proteins. The plausible rationales for the existence of the observed correlations are presented. Our findings suggest that identification of potential PTMs in polypeptide sequences can be of practical use for predicting expression success and optimizing heterologous protein synthesis. In sum, this study provides the most compelling evidence so far for the role of multiple PTMs in the stability and solubility of heterologously expressed recombinant proteins.
Affiliation(s)
- Alexander A Tokmakov
- RIKEN Systems and Structural Biology Center, University of Tokyo, Bunkyo, Tokyo 113-0033, Japan.
|
24
|
Gifford LK, Carter LG, Gabanyi MJ, Berman HM, Adams PD. The Protein Structure Initiative Structural Biology Knowledgebase Technology Portal: a structural biology web resource. J Struct Funct Genomics 2012; 13:57-62. [PMID: 22527514 PMCID: PMC3588887 DOI: 10.1007/s10969-012-9133-7] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.5] [Received: 10/18/2011] [Accepted: 03/05/2012] [Indexed: 02/01/2023]
Abstract
The Technology Portal of the Protein Structure Initiative Structural Biology Knowledgebase (PSI SBKB; http://technology.sbkb.org/portal/ ) is a web resource providing information about methods and tools that can be used to relieve bottlenecks in many areas of protein production and structural biology research. Several useful features are available on the web site, including multiple ways to search the database of over 250 technological advances, a link to videos of methods on YouTube, and access to a technology forum where scientists can connect, ask questions, get news, and develop collaborations. The Technology Portal is a component of the PSI SBKB ( http://sbkb.org ), which presents integrated genomic, structural, and functional information for all protein sequence targets selected by the Protein Structure Initiative. Created in collaboration with the Nature Publishing Group, the SBKB offers an array of resources for structural biologists, such as a research library, editorials about new research advances, a featured biological system each month, and a functional sleuth for searching protein structures of unknown function. An overview of the various features and examples of user searches highlight the information, tools, and avenues for scientific interaction available through the Technology Portal.
Affiliation(s)
- Lida K. Gifford
- Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720
- Margaret J. Gabanyi
- Department of Chemistry & Chemical Biology, Rutgers – The State University of New Jersey, Piscataway, NJ 08854
- Helen M. Berman
- Department of Chemistry & Chemical Biology, Rutgers – The State University of New Jersey, Piscataway, NJ 08854
- Paul D. Adams
- Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720
|
25
|
Smialowski P, Doose G, Torkler P, Kaufmann S, Frishman D. PROSO II--a new method for protein solubility prediction. FEBS J 2012; 279:2192-200. [PMID: 22536855 DOI: 10.1111/j.1742-4658.2012.08603.x] [Citation(s) in RCA: 141] [Impact Index Per Article: 10.8] [Indexed: 11/29/2022]
Abstract
Many fields of science and industry depend on efficient production of active protein using heterologous expression in Escherichia coli. The solubility of proteins upon expression is dependent on their amino acid sequence. Prediction of solubility from sequence is therefore highly valuable. We present a novel machine-learning-based model called PROSO II which makes use of new classification methods and growth in experimental data to improve coverage and accuracy of solubility predictions. The classification algorithm is organized as a two-layered structure in which the output of a primary Parzen window model for sequence similarity and a logistic regression classifier of amino acid k-mer composition serve as input for a second-level logistic regression classifier. Compared with previously published research, our model is trained on five times more data than used by any other method before (82 000 proteins). When tested on a separate holdout set not used at any point of method development, our server attained the best results in comparison with other currently available methods: accuracy 75.4%, Matthews correlation coefficient 0.39, sensitivity 0.731, specificity 0.759, gain (soluble) 2.263. In summary, due to utilization of cutting edge machine learning technologies combined with the largest currently available experimental data set, the PROSO II server constitutes a substantial improvement in protein solubility predictions. PROSO II is available at http://mips.helmholtz-muenchen.de/prosoII.
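To make the "amino acid k-mer composition" input concrete, here is a minimal Python sketch of one way such a feature vector could be computed. The abstract does not give the values of k, the normalization, or the handling of non-standard residues used by PROSO II, so all of those choices (and the toy sequence) are assumptions for illustration only.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_composition(sequence, k=2):
    """Return the relative frequency of every possible k-mer in a protein sequence."""
    sequence = sequence.upper()
    all_kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    # Skip windows containing non-standard symbols such as X or B.
    counts = Counter(w for w in windows if set(w) <= set(AMINO_ACIDS))
    total = sum(counts.values())
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


# Toy sequence, not taken from the paper.
features = kmer_composition("MKTAYIAKQRQISFVKSHFSRQ", k=2)
print(len(features))  # one feature per possible dipeptide (20 ** 2)
```

A fixed-length vector like this is what a logistic regression classifier needs as input, regardless of how long the original protein is.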
Affiliation(s)
- Pawel Smialowski
- Department of Genome Oriented Bioinformatics, Technische Universität Muenchen, Freising, Germany.
|
26
|
Mehta CM, White ET, Litster JD. Correlation of second virial coefficient with solubility for proteins in salt solutions. Biotechnol Prog 2011; 28:163-70. [PMID: 22002946 DOI: 10.1002/btpr.724] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.4] [Received: 05/23/2011] [Revised: 08/30/2011] [Indexed: 11/08/2022]
Abstract
In this work, osmotic second virial coefficients (B(22)) were determined and correlated with the measured solubilities for the proteins α-amylase, ovalbumin, and lysozyme. The B(22) values and solubilities were determined in similar solution conditions using two salts, sodium chloride and ammonium sulfate, in an acidic pH range. An overall decrease in the solubility of the proteins (salting out) was observed at high concentrations of ammonium sulfate and sodium chloride solutions. However, for α-amylase, salting-in behavior was also observed in low concentration sodium chloride solutions. In ammonium sulfate solutions, the B(22) values are small and close to zero below 2.4 M. As the ammonium sulfate concentrations were further increased, B(22) values decreased for all systems studied. The effect of sodium chloride on B(22) varies with concentration, solution pH, and the type of protein studied. Theoretical models show a reasonable fit to the experimentally derived data of B(22) and solubility. B(22) is also directly proportional to the logarithm of the solubility values for individual proteins in salt solutions, so the log-linear empirical models developed in this work can also be used to rapidly predict solubility and B(22) values for given protein-salt systems.
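The log-linear relationship described in the abstract can be sketched as an ordinary least-squares fit of ln(solubility) against B(22). The Python example below is only an illustration: the data points, the units, and the use of the natural logarithm are assumptions, and the fitted coefficients have no connection to those reported in the paper.

```python
from math import exp, log


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical measurements for one protein-salt system.
b22_values = [-6.0, -4.0, -2.0, 0.0, 1.0]       # assumed units: 1e-4 mol mL g^-2
solubilities = [2.0, 5.5, 14.0, 40.0, 65.0]     # assumed units: mg/mL

slope, intercept = fit_line(b22_values, [log(s) for s in solubilities])


def predict_solubility(b22):
    """Solubility predicted from B22 by the fitted log-linear model."""
    return exp(intercept + slope * b22)


print(f"ln(S) = {intercept:.2f} + {slope:.2f} * B22")
print(f"predicted solubility at B22 = -3: {predict_solubility(-3.0):.1f}")
```

Because the model is linear in ln(S), the same fitted line can be inverted to estimate B(22) from a measured solubility.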
Affiliation(s)
- Chirag M Mehta
- School of Chemical Engineering, The University of Queensland, St Lucia, Brisbane, QLD 4072, Australia.
|
27
|
Hirose S, Kawamura Y, Yokota K, Kuroita T, Natsume T, Komiya K, Tsutsumi T, Suwa Y, Isogai T, Goshima N, Noguchi T. Statistical analysis of features associated with protein expression/solubility in an in vivo Escherichia coli expression system and a wheat germ cell-free expression system. 2011; 150:73-81. [DOI: 10.1093/jb/mvr042] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Indexed: 01/10/2023]
|
28
|
Overton IM, van Niekerk CAJ, Barton GJ. XANNpred: neural nets that predict the propensity of a protein to yield diffraction-quality crystals. Proteins 2011; 79:1027-33. [PMID: 21246630 PMCID: PMC3084997 DOI: 10.1002/prot.22914] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.4] [Received: 07/28/2010] [Revised: 09/22/2010] [Accepted: 10/07/2010] [Indexed: 11/08/2022]
Abstract
Production of diffracting crystals is a critical step in determining the three-dimensional structure of a protein by X-ray crystallography. Computational techniques to rank proteins by their propensity to yield diffraction-quality crystals can improve efficiency in obtaining structural data by guiding both protein selection and construct design. XANNpred comprises a pair of artificial neural networks that each predict the propensity of a selected protein sequence to produce diffraction-quality crystals by current structural biology techniques. Blind tests show XANNpred has accuracy and Matthews correlation values ranging from 75% to 81% and 0.50 to 0.63 respectively; values of area under the receiver operator characteristic (ROC) curve range from 0.81 to 0.88. On blind test data XANNpred outperforms the other available algorithms XtalPred, PXS, OB-Score, and ParCrys. XANNpred also guides construct design by presenting graphs of predicted propensity for diffraction-quality crystals against residue sequence position. The XANNpred-SG algorithm is likely to be most useful to target selection in structural genomics consortia, while the XANNpred-PDB algorithm is more suited to the general structural biology community. XANNpred predictions that include sliding window graphs are freely available from http://www.compbio.dundee.ac.uk/xannpred.
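The idea of a sliding-window graph of propensity against sequence position can be sketched in a few lines of Python. This is not XANNpred's implementation: the window length, the reporting of scores at the window centre, the toy sequence, and above all the placeholder scoring function (fraction of charged residues, standing in for the neural network) are assumptions made only to show the mechanics.

```python
def sliding_window_scores(sequence, score_fn, window=21):
    """Score every full-length window; return (centre position, score) pairs, 1-based."""
    if window > len(sequence):
        return []
    half = window // 2
    return [
        (start + half + 1, score_fn(sequence[start:start + window]))
        for start in range(len(sequence) - window + 1)
    ]


def charged_fraction(segment):
    """Placeholder scorer (NOT XANNpred's neural network): fraction of D/E/K/R."""
    return sum(residue in "DEKR" for residue in segment) / len(segment)


# Toy sequence, not taken from the paper.
profile = sliding_window_scores(
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", charged_fraction, window=11
)
for position, score in profile[:3]:
    print(position, round(score, 2))
```

Plotting the resulting pairs gives a per-position profile in which low-scoring stretches mark candidate regions to trim when designing a construct.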
Affiliation(s)
- Ian M Overton
- School of Life Sciences Research, College of Life Sciences, University of Dundee, Dundee, UK
|
29
|
Acton TB, Xiao R, Anderson S, Aramini J, Buchwald WA, Ciccosanti C, Conover K, Everett J, Hamilton K, Huang YJ, Janjua H, Kornhaber G, Lau J, Lee DY, Liu G, Maglaqui M, Ma L, Mao L, Patel D, Rossi P, Sahdev S, Shastry R, Swapna GVT, Tang Y, Tong S, Wang D, Wang H, Zhao L, Montelione GT. Preparation of protein samples for NMR structure, function, and small-molecule screening studies. Methods Enzymol 2011; 493:21-60. [PMID: 21371586 DOI: 10.1016/b978-0-12-381274-2.00002-9] [Citation(s) in RCA: 81] [Impact Index Per Article: 5.8] [Indexed: 12/03/2022]
Abstract
In this chapter, we concentrate on the production of high-quality protein samples for nuclear magnetic resonance (NMR) studies. In particular, we provide an in-depth description of recent advances in the production of NMR samples and their synergistic use with recent advancements in NMR hardware. We describe the protein production platform of the Northeast Structural Genomics Consortium and outline our high-throughput strategies for producing high-quality protein samples for NMR studies. Our strategy is based on the cloning, expression, and purification of 6×-His-tagged proteins using T7-based Escherichia coli systems and isotope enrichment in minimal media. We describe 96-well ligation-independent cloning and analytical expression systems, parallel preparative scale fermentation, and high-throughput purification protocols. The 6×-His affinity tag allows for a similar two-step purification procedure implemented in a parallel high-throughput fashion that routinely results in purity levels sufficient for NMR studies (>97% homogeneity). Using this platform, the protein open reading frames of over 17,500 different targeted proteins (or domains) have been cloned as over 28,000 constructs. Nearly 5000 of these proteins have been purified to homogeneity in tens of milligram quantities (see Summary Statistics, http://nesg.org/statistics.html), resulting in more than 950 new protein structures, including more than 400 NMR structures, deposited in the Protein Data Bank. The Northeast Structural Genomics Consortium pipeline has been effective in producing protein samples of both prokaryotic and eukaryotic origin. Although this chapter describes our entire pipeline for producing isotope-enriched protein samples, it focuses on the major updates introduced during the last 5 years (Phase 2 of the National Institute of General Medical Sciences Protein Structure Initiative). 
Our advanced automated and/or parallel cloning, expression, purification, and biophysical screening technologies are suitable for implementation in a large individual laboratory or by a small group of collaborating investigators for structural biology, functional proteomics, ligand screening, and structural genomics research.
Affiliation(s)
- Thomas B Acton
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, Northeast Structural Genomics Consortium, Rutgers University, Piscataway, New Jersey, USA
|
30
|
Farmani J, Safari M, Roohvand F, Razavi SH, Aghasadeghi MR, Noorbazargan H. Conjugated linoleic acid-producing enzymes: A bioinformatics study. Eur J Lipid Sci Technol 2010. [DOI: 10.1002/ejlt.201000360] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 11/05/2022]
|
31
|
Xiao R, Anderson S, Aramini J, Belote R, Buchwald WA, Ciccosanti C, Conover K, Everett JK, Hamilton K, Huang YJ, Janjua H, Jiang M, Kornhaber GJ, Lee DY, Locke JY, Ma LC, Maglaqui M, Mao L, Mitra S, Patel D, Rossi P, Sahdev S, Sharma S, Shastry R, Swapna GVT, Tong SN, Wang D, Wang H, Zhao L, Montelione GT, Acton TB. The high-throughput protein sample production platform of the Northeast Structural Genomics Consortium. J Struct Biol 2010; 172:21-33. [PMID: 20688167 DOI: 10.1016/j.jsb.2010.07.011] [Citation(s) in RCA: 108] [Impact Index Per Article: 7.2] [Received: 04/04/2010] [Revised: 07/24/2010] [Accepted: 07/28/2010] [Indexed: 11/15/2022]
Abstract
We describe the core Protein Production Platform of the Northeast Structural Genomics Consortium (NESG) and outline the strategies used for producing high-quality protein samples. The platform is centered on the cloning, expression and purification of 6X-His-tagged proteins using T7-based Escherichia coli systems. The 6X-His tag allows for similar purification procedures for most targets and implementation of high-throughput (HTP) parallel methods. In most cases, the 6X-His-tagged proteins are sufficiently purified (>97% homogeneity) using an HTP two-step purification protocol for most structural studies. Using this platform, the open reading frames of over 16,000 different targeted proteins (or domains) have been cloned as >26,000 constructs. Over the past 10 years, more than 16,000 of these expressed protein, and more than 4400 proteins (or domains) have been purified to homogeneity in tens of milligram quantities (see Summary Statistics, http://nesg.org/statistics.html). Using these samples, the NESG has deposited more than 900 new protein structures to the Protein Data Bank (PDB). The methods described here are effective in producing eukaryotic and prokaryotic protein samples in E. coli. This paper summarizes some of the updates made to the protein production pipeline in the last 5 years, corresponding to phase 2 of the NIGMS Protein Structure Initiative (PSI-2) project. The NESG Protein Production Platform is suitable for implementation in a large individual laboratory or by a small group of collaborating investigators. These advanced automated and/or parallel cloning, expression, purification, and biophysical screening technologies are of broad value to the structural biology, functional proteomics, and structural genomics communities.
Affiliation(s)
- Rong Xiao
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey and Robert Wood Johnson Medical School, and Northeast Structural Genomics Consortium, Piscataway, NJ 08854, USA
|
32
|
Zucker FH, Stewart C, dela Rosa J, Kim J, Zhang L, Xiao L, Ross J, Napuli AJ, Mueller N, Castaneda LJ, Nakazawa Hewitt SR, Arakaki TL, Larson ET, Subramanian E, Verlinde CLMJ, Fan E, Buckner FS, Van Voorhis WC, Merritt EA, Hol WGJ. Prediction of protein crystallization outcome using a hybrid method. J Struct Biol 2010; 171:64-73. [PMID: 20347992 DOI: 10.1016/j.jsb.2010.03.016] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Received: 12/11/2009] [Revised: 03/18/2010] [Accepted: 03/23/2010] [Indexed: 10/19/2022]
Abstract
The great power of protein crystallography to reveal biological structure is often limited by the tremendous effort required to produce suitable crystals. A hybrid crystal growth predictive model is presented that combines both experimental and sequence-derived data from target proteins, including novel variables derived from physico-chemical characterization such as R(30), the ratio between a protein's DSF intensity at 30°C and at T(m). This hybrid model is shown to be more powerful than sequence-based prediction alone - and more likely to be useful for prioritizing and directing the efforts of structural genomics and individual structural biology laboratories.
Affiliation(s)
- Frank H Zucker
- Medical Structural Genomics of Pathogenic Protozoa, School of Medicine, University of Washington, Seattle, WA 98195-7742, United States
|
33
|
Raman S, Huang YJ, Mao B, Rossi P, Aramini JM, Liu G, Montelione GT, Baker D. Accurate automated protein NMR structure determination using unassigned NOESY data. J Am Chem Soc 2010; 132:202-7. [PMID: 20000319 PMCID: PMC2841443 DOI: 10.1021/ja905934c] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.6] [Indexed: 11/29/2022]
Abstract
Conventional NMR structure determination requires nearly complete assignment of the cross peaks of a refined NOESY peak list. Depending on the size of the protein and quality of the spectral data, this can be a time-consuming manual process requiring several rounds of peak list refinement and structure determination. Programs such as Aria, CYANA, and AutoStructure can generate models using unassigned NOESY data but are very sensitive to the quality of the input peak lists and can converge to inaccurate structures if the signal-to-noise of the peak lists is low. Here, we show that models with high accuracy and reliability can be produced by combining the strengths of the high-resolution structure prediction program Rosetta with global measures of the agreement between structure models and experimental data. A first round of models generated using CS-Rosetta (Rosetta supplemented with backbone chemical shift information) are filtered on the basis of their goodness-of-fit with unassigned NOESY peak lists using the DP-score, and the best fitting models are subjected to high resolution refinement with the Rosetta rebuild-and-refine protocol. This hybrid approach uses both local backbone chemical shift and the unassigned NOESY data to direct Rosetta trajectories toward the native structure and produces more accurate models than AutoStructure/CYANA or CS-Rosetta alone, particularly when using raw unedited NOESY peak lists. We also show that when accurate manually refined NOESY peak lists are available, Rosetta refinement can consistently increase the accuracy of models generated using CYANA and AutoStructure.
Affiliation(s)
- Srivatsan Raman
- Department of Biochemistry, University of Washington, Seattle, Washington 98195, USA
|
34
|
Babnigg G, Joachimiak A. Predicting protein crystallization propensity from protein sequence. J Struct Funct Genomics 2010; 11:71-80. [PMID: 20177794 DOI: 10.1007/s10969-010-9080-0] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.5] [Received: 11/25/2009] [Accepted: 02/05/2010] [Indexed: 10/19/2022]
Abstract
The high-throughput structure determination pipelines developed by structural genomics programs offer a unique opportunity for data mining. One important question is how protein properties derived from a primary sequence correlate with the protein's propensity to yield X-ray quality crystals (crystallizability) and 3D X-ray structures. A set of protein properties were computed for over 1,300 proteins that expressed well but were insoluble, and for approximately 720 unique proteins that resulted in X-ray structures. The correlation of the protein's iso-electric point and grand average hydropathy (GRAVY) with crystallizability was analyzed for full length and domain constructs of protein targets. In a second step, several additional properties that can be calculated from the protein sequence were added and evaluated. Using statistical analyses we have identified a set of the attributes correlating with a protein's propensity to crystallize and implemented a Support Vector Machine (SVM) classifier based on these. We have created applications to analyze and provide optimal boundary information for query sequences and to visualize the data. These tools are available via the web site http://bioinformatics.anl.gov/cgi-bin/tools/pdpredictor .
Affiliation(s)
- György Babnigg
- Midwest Center for Structural Genomics, Biosciences Division, Argonne National Laboratory, 9700 S Cass Ave., Argonne, IL 60439, USA.
|
35
|
Chan WC, Liang PH, Shih YP, Yang UC, Lin WC, Hsu CN. Learning to predict expression efficacy of vectors in recombinant protein production. BMC Bioinformatics 2010; 11 Suppl 1:S21. [PMID: 20122193 PMCID: PMC3009492 DOI: 10.1186/1471-2105-11-s1-s21] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.4] [Indexed: 11/13/2022] Open
Abstract
Background Recombinant protein production is a useful biotechnology to produce a large quantity of highly soluble proteins. Currently, the most widely used production system is to fuse a target protein into different vectors in Escherichia coli (E. coli). However, the production efficacy of different vectors varies for different target proteins. Trial-and-error is still the common practice to find out the efficacy of a vector for a given target protein. Previous studies are limited in that they assumed that proteins would be over-expressed and focused only on the solubility of expressed proteins. In fact, many pairings of vectors and proteins result in no expression. Results In this study, we applied machine learning to train prediction models to predict whether a pairing of vector-protein will express or not express in E. coli. For expressed cases, the models further predict whether the expressed proteins would be soluble. We collected a set of real cases from the clients of our recombinant protein production core facility, where six different vectors were designed and studied. This set of cases is used in both training and evaluation of our models. We evaluate three different models based on the support vector machines (SVM) and their ensembles. Unlike many previous works, these models consider the sequence of the target protein as well as the sequence of the whole fusion vector as the features. We show that a model that classifies a case into one of the three classes (no expression, inclusion body and soluble) outperforms a model that considers the nested structure of the three classes, while a model that can take advantage of the hierarchical structure of the three classes performs slightly worse but comparably to the best model. Meanwhile, compared to previous works, we show that the prediction accuracy of our best method still performs the best.
Lastly, we briefly present two methods to use the trained model in the design of the recombinant protein production systems to improve the chance of high soluble protein production. Conclusion In this paper, we show that a machine learning approach to the prediction of the efficacy of a vector for a target protein in a recombinant protein production system is promising and may complement traditional knowledge-driven study of the efficacy. We will release our program to share with other labs in the public domain when this paper is published.
Affiliation(s)
- Wen-Ching Chan
- Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan.
|
36
|
Rossi P, Swapna GVT, Huang YJ, Aramini JM, Anklin C, Conover K, Hamilton K, Xiao R, Acton TB, Ertekin A, Everett JK, Montelione GT. A microscale protein NMR sample screening pipeline. J Biomol NMR 2010; 46:11-22. [PMID: 19915800 PMCID: PMC2797623 DOI: 10.1007/s10858-009-9386-z] [Citation(s) in RCA: 75] [Impact Index Per Article: 5.0] [Received: 10/09/2009] [Accepted: 10/14/2009] [Indexed: 05/14/2023]
Abstract
As part of efforts to develop improved methods for NMR protein sample preparation and structure determination, the Northeast Structural Genomics Consortium (NESG) has implemented an NMR screening pipeline for protein target selection, construct optimization, and buffer optimization, incorporating efficient microscale NMR screening of proteins using a micro-cryoprobe. The process is feasible because the newest generation probe requires only small amounts of protein, typically 30-200 µg in 8-35 µl volume. Extensive automation has been made possible by the combination of database tools, mechanization of key process steps, and the use of a micro-cryoprobe that gives excellent data while requiring little optimization and manual setup. In this perspective, we describe the overall process used by the NESG for screening NMR samples as part of a sample optimization process, assessing optimal construct design and solution conditions, as well as for determining protein rotational correlation times in order to assess protein oligomerization states. Database infrastructure has been developed to allow for flexible implementation of new screening protocols and harvesting of the resulting output. The NESG micro NMR screening pipeline has also been used for detergent screening of membrane proteins. Descriptions of the individual steps in the NESG NMR sample design, production, and screening pipeline are presented in the format of a standard operating procedure.
Affiliation(s)
- Paolo Rossi
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey, 679 Hoes Lane, Piscataway, NJ 08854 USA
- Northeast Structural Genomics Consortium, Piscataway, NJ USA
- G. V. T. Swapna
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey, 679 Hoes Lane, Piscataway, NJ 08854 USA
- Northeast Structural Genomics Consortium, Piscataway, NJ USA
- Yuanpeng J. Huang
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey, 679 Hoes Lane, Piscataway, NJ 08854 USA
- Northeast Structural Genomics Consortium, Piscataway, NJ USA
- James M. Aramini
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey, 679 Hoes Lane, Piscataway, NJ 08854 USA
- Northeast Structural Genomics Consortium, Piscataway, NJ USA
- Clemens Anklin
- Bruker Biospin Corporation, 15 Fortune Drive, Billerica, MA 01821 USA
- Kenith Conover
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey, 679 Hoes Lane, Piscataway, NJ 08854 USA
- Northeast Structural Genomics Consortium, Piscataway, NJ USA
- Keith Hamilton
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey, 679 Hoes Lane, Piscataway, NJ 08854 USA
- Northeast Structural Genomics Consortium, Piscataway, NJ USA
- Rong Xiao
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey, 679 Hoes Lane, Piscataway, NJ 08854 USA
- Northeast Structural Genomics Consortium, Piscataway, NJ USA
- Thomas B. Acton
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey, 679 Hoes Lane, Piscataway, NJ 08854 USA
- Northeast Structural Genomics Consortium, Piscataway, NJ USA
- Asli Ertekin
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey, 679 Hoes Lane, Piscataway, NJ 08854 USA
- Northeast Structural Genomics Consortium, Piscataway, NJ USA
- John K. Everett
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey, 679 Hoes Lane, Piscataway, NJ 08854 USA
- Northeast Structural Genomics Consortium, Piscataway, NJ USA
- Gaetano T. Montelione
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey, 679 Hoes Lane, Piscataway, NJ 08854 USA
- Northeast Structural Genomics Consortium, Piscataway, NJ USA
- Department of Biochemistry, Robert Wood Johnson Medical School, UMDNJ, Piscataway, NJ 08854 USA
|
37
|
Kurotani A, Takagi T, Toyama M, Shirouzu M, Yokoyama S, Fukami Y, Tokmakov AA. Comprehensive bioinformatics analysis of cell-free protein synthesis: identification of multiple protein properties that correlate with successful expression. FASEB J 2009; 24:1095-104. [PMID: 19940260 DOI: 10.1096/fj.09-139527] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Indexed: 11/11/2022]
Abstract
High-throughput cell-free protein synthesis is being used increasingly in structural/functional genomics projects. However, the factors determining expression success are poorly understood. Here, we evaluated the expression of 3066 human proteins and their domains in a bacterial cell-free system and analyzed the correlation of protein expression with 39 physicochemical and structural properties of proteins. As a result of the bioinformatics analysis performed, we determined the 18 most influential features that affect protein amenability to cell-free expression. They include protein length; hydrophobicity; pI; content of charged, nonpolar, and aromatic residues; cysteine content; solvent accessibility; presence of coiled coil; content of intrinsically disordered and structured (alpha-helix and beta-sheet) sequence; number of disulfide bonds and functional domains; presence of transmembrane regions; PEST motifs; and signaling sequences. This study represents the first comprehensive bioinformatics analysis of heterologous protein synthesis in a cell-free system. The rules and correlations revealed here provide a plethora of important insights into rationalization of cell-free protein production and can be of practical use for protein engineering with the aim of increasing expression success.
|
38
|
Magnan CN, Randall A, Baldi P. SOLpro: accurate sequence-based prediction of protein solubility. Bioinformatics 2009; 25:2200-7. [PMID: 19549632 DOI: 10.1093/bioinformatics/btp386] [Citation(s) in RCA: 404] [Impact Index Per Article: 25.3] [Indexed: 11/13/2022]
Abstract
MOTIVATION Protein insolubility is a major obstacle for many experimental studies. A sequence-based prediction method able to accurately predict the propensity of a protein to be soluble on overexpression could be used, for instance, to prioritize targets in large-scale proteomics projects and to identify mutations likely to increase the solubility of insoluble proteins. RESULTS Here, we first curate a large, non-redundant and balanced training set of more than 17 000 proteins. Next, we extract and study 23 groups of features computed directly or predicted (e.g. secondary structure) from the primary sequence. The data and the features are used to train a two-stage support vector machine (SVM) architecture. The resulting predictor, SOLpro, is compared directly with existing methods and shows significant improvement according to standard evaluation metrics, with an overall accuracy of over 74% estimated using multiple runs of 10-fold cross-validation.
Affiliation(s)
- Christophe N Magnan
- Institute for Genomics and Bioinformatics, School of Information and Computer Sciences, University of California, Irvine, CA, USA
|
39
|
Price WN, Chen Y, Handelman SK, Neely H, Manor P, Karlin R, Nair R, Liu J, Baran M, Everett J, Tong SN, Forouhar F, Swaminathan SS, Acton T, Xiao R, Luft JR, Lauricella A, DeTitta GT, Rost B, Montelione GT, Hunt JF. Understanding the physical properties that control protein crystallization by analysis of large-scale experimental data. Nat Biotechnol 2009; 27:51-7. [PMID: 19079241 DOI: 10.1038/nbt.1514] [Citation(s) in RCA: 107] [Impact Index Per Article: 6.7] [Indexed: 11/09/2022]
Abstract
Crystallization is the most serious bottleneck in high-throughput protein-structure determination by diffraction methods. We have used data mining of the large-scale experimental results of the Northeast Structural Genomics Consortium and experimental folding studies to characterize the biophysical properties that control protein crystallization. This analysis leads to the conclusion that crystallization propensity depends primarily on the prevalence of well-ordered surface epitopes capable of mediating interprotein interactions and is not strongly influenced by overall thermodynamic stability. We identify specific sequence features that correlate with crystallization propensity and that can be used to estimate the crystallization probability of a given construct. Analyses of entire predicted proteomes demonstrate substantial differences in the amino acid-sequence properties of human versus eubacterial proteins, which likely reflect differences in biophysical properties, including crystallization propensity. Our thermodynamic measurements do not generally support previous claims regarding correlations between sequence properties and protein stability.
Affiliation(s)
- W Nicholson Price
- Northeast Structural Genomics Consortium, Columbia University, New York, New York 10027, USA
|
40
|
Ngoka LCM. Sample prep for proteomics of breast cancer: proteomics and gene ontology reveal dramatic differences in protein solubilization preferences of radioimmunoprecipitation assay and urea lysis buffers. Proteome Sci 2008; 6:30. [PMID: 18950484 PMCID: PMC2600628 DOI: 10.1186/1477-5956-6-30] [Citation(s) in RCA: 72] [Impact Index Per Article: 4.2] [Received: 03/10/2008] [Accepted: 10/24/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND An important step in the proteomics of solid tumors, including breast cancer, consists of efficiently extracting most of the proteins in the tumor specimen. For this purpose, Radio-Immunoprecipitation Assay (RIPA) buffer is widely employed. RIPA buffer's rapid and highly efficient cell lysis and good solubilization of a wide range of proteins is further augmented by its compatibility with protease and phosphatase inhibitors, ability to minimize non-specific protein binding leading to a lower background in immunoprecipitation, and its suitability for protein quantitation. RESULTS In this work, the insoluble matter left after RIPA buffer extraction of proteins from breast tumors is subjected to another extraction step, using a urea-based buffer. It is shown that RIPA and urea lysis buffers fractionate breast tissue proteins primarily on the basis of molecular weights. The average molecular weight of proteins that dissolve exclusively in urea buffer is up to 60% higher than in RIPA. Gene Ontology (GO) and Directed Acyclic Graphs (DAG) are used to map the collective biological and biophysical attributes of the RIPA and urea proteomes. The Cellular Component and Molecular Function annotations reveal protein solubilization preferences of the buffers, especially the compartmentalization and functional distributions. It is shown that nearly all extracellular matrix proteins (ECM) in the breast tumors and matched normal tissues are found, nearly exclusively, in the urea fraction, while they are mostly insoluble in RIPA buffer. Additionally, it is demonstrated that cytoskeletal and extracellular region proteins are more soluble in urea than in RIPA, whereas for nuclear, cytoplasmic and mitochondrial proteins, RIPA buffer is preferred. Extracellular matrix proteins are highly implicated in cancer, including their proteinase-mediated degradation and remodelling, tumor development, progression, adhesion and metastasis.
Thus, if they are not efficiently extracted by RIPA buffer, important information may be missed in cancer research. CONCLUSION For proteomics of solid tumors, a two-step extraction process is recommended. First, proteins in the tumor specimen should be extracted with RIPA buffer. Second, the RIPA-insoluble material should be extracted with the urea-based buffer employed in this work.
Affiliation(s)
- Lambert C M Ngoka
- Department of Chemistry, Virginia Commonwealth University, Richmond, 23284-2006, USA.
|
41
|
Qi L, Cazares L, Johnson C, de Alarcon P, Kupfer GM, Semmes OJ. Serum protein expression profiling in pediatric Hodgkin lymphoma: a report from the Children's Oncology Group. Pediatr Blood Cancer 2008; 51:216-21. [PMID: 18421715 DOI: 10.1002/pbc.21581] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Indexed: 11/08/2022]
Abstract
BACKGROUND The prognosis for children with Hodgkin lymphoma (HL) treated with a risk adjusted combination of radiation therapy and multi-drug chemotherapy has markedly improved. There remains a group of patients whose disease either recurs or does not respond to therapy. Protein expression profiling has been used to define protein characteristics of serum from adult patients in order to improve screening for early diagnosis. However, profiling for the purpose of staging and defining prognostic characteristics of childhood diseases is not well studied. The current stage-based risk assignment of HL cannot predict the patients within a risk group that are destined to recur or do not respond to therapy. Thus, a need exists to develop new methodologies to better stratify the risk classification of pediatric HL. PROCEDURE We have completed a preliminary project to identify characteristic serum protein peaks determined by protein expression profiling in serum of 22 subjects with HL, 13 with stage II HL and 9 with stage III or IV. RESULTS Protein profiling successfully discriminated between high grade (III/IV) HL and low grade (II) HL. CONCLUSION These data lay the basis for prospective studies to identify protein expression profiles useful for diagnosis, prognosis, treatment stratification, and the follow-up of minimal residual disease.
Affiliation(s)
- Lining Qi
- Department of Microbiology and Molecular Cell Biology, Eastern Virginia Medical School, Norfolk, Virginia, USA
|
42
|
Abstract
Although graft-versus-host disease (GVHD) is a life-threatening complication of hematopoietic stem-cell transplantation (HSCT), its current diagnosis depends mainly on clinical manifestations and invasive biopsies. Specific biomarkers for GVHD would facilitate early and accurate recognition of this grave condition. Using proteomics, we screened for plasma proteins specific for GVHD in a mouse model. One peak with 8972-Da molecular mass (m/z) retained a discriminatory value in 2 diagnostic groups (GVHD and normal controls) with increased expression in the disease and decreased expression during cyclosporin A treatment, and was barely detectable in syngeneic transplantation. Purification and mass analysis identified this molecule as CCL8, a member of a large chemokine family. In human samples, the serum concentration of CCL8 correlated closely with GVHD severity. All non-GVHD samples contained less than 48 pg/mL (mean +/- SE: 22.5 +/- 5.5 pg/mL, range: 12.6-48.0 pg/mL, n = 7). In sharp contrast, CCL8 was highly up-regulated in GVHD sera ranging from 52.0 to 333.6 pg/mL (mean +/- SE: 165.0 +/- 39.8 pg/mL, n = 7). Strikingly, 2 patients with severe fatal GVHD had extremely high levels of CCL8 (333.6 and 290.4 pg/mL). CCL8 is a promising specific serum marker for the early and accurate diagnosis of GVHD.
|
43
|
Slabinski L, Jaroszewski L, Rodrigues APC, Rychlewski L, Wilson IA, Lesley SA, Godzik A. The challenge of protein structure determination--lessons from structural genomics. Protein Sci 2008; 16:2472-82. [PMID: 17962404 DOI: 10.1110/ps.073037907] [Citation(s) in RCA: 95] [Impact Index Per Article: 5.6] [Indexed: 10/22/2022]
Abstract
The process of experimental determination of protein structure is marred with a high ratio of failures at many stages. With availability of large quantities of data from high-throughput structure determination in structural genomics centers, we can now learn to recognize protein features correlated with failures; thus, we can recognize proteins more likely to succeed and eventually learn how to modify those that are less likely to succeed. Here, we identify several protein features that correlate strongly with successful protein production and crystallization and combine them into a single score that assesses "crystallization feasibility." The formula derived here was tested with a jackknife procedure and validated on independent benchmark sets. The "crystallization feasibility" score described here is being applied to target selection in the Joint Center for Structural Genomics, and is now contributing to increasing the success rate, lowering the costs, and shortening the time for protein structure determination. Analyses of PDB depositions suggest that very similar features also play a role in non-high-throughput structure determination, suggesting that this crystallization feasibility score would also be of significant interest to structural biology, as well as to molecular and biochemistry laboratories.
Affiliation(s)
- Lukasz Slabinski
- Joint Center for Structural Genomics, Bioinformatics Core, Burnham Institute for Medical Research, La Jolla, CA 92037, USA
|
44
|
Slabinski L, Jaroszewski L, Rychlewski L, Wilson IA, Lesley SA, Godzik A. XtalPred: a web server for prediction of protein crystallizability. Bioinformatics 2007; 23:3403-5. [PMID: 17921170 DOI: 10.1093/bioinformatics/btm477] [Citation(s) in RCA: 223] [Impact Index Per Article: 12.4] [Indexed: 11/14/2022]
Abstract
XtalPred is a web server for prediction of protein crystallizability. The prediction is made by comparing several features of the protein with distributions of these features in TargetDB and combining the results into an overall probability of crystallization. XtalPred provides: (1) a detailed comparison of the protein's features to the corresponding distribution from TargetDB; (2) a summary of protein features and predictions that indicate problems that are likely to be encountered during protein crystallization; (3) prediction of ligands; and (4) (optional) lists of close homologs from complete microbial genomes that are more likely to crystallize. AVAILABILITY: The XtalPred web server is freely available for academic users on http://ffas.burnham.org/XtalPred
|
45
|
Mirkovic N, Li Z, Parnassa A, Murray D. Strategies for high-throughput comparative modeling: applications to leverage analysis in structural genomics and protein family organization. Proteins 2007; 66:766-77. [PMID: 17154423 DOI: 10.1002/prot.21191] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Indexed: 11/09/2022]
Abstract
The technological breakthroughs in structural genomics were designed to facilitate the solution of a sufficient number of structures, so that as many protein sequences as possible can be structurally characterized with the aid of comparative modeling. The leverage of a solved structure is the number and quality of the models that can be produced using the structure as a template for modeling and may be viewed as the "currency" with which the success of a structural genomics endeavor can be measured. Moreover, the models obtained in this way should be valuable to all biologists. To this end, at the Northeast Structural Genomics Consortium (NESG), a modular computational pipeline for automated high-throughput leverage analysis was devised and used to assess the leverage of the 186 unique NESG structures solved during the first phase of the Protein Structure Initiative (January 2000 to July 2005). Here, the results of this analysis are presented. The number of sequences in the nonredundant protein sequence database covered by quality models produced by the pipeline is approximately 39,000, so that the average leverage is approximately 210 models per structure. Interestingly, only 7900 of these models fulfill the stringent modeling criterion of being at least 30% sequence-identical to the corresponding NESG structures. This study shows how high-throughput modeling increases the efficiency of structure determination efforts by providing enhanced coverage of protein structure space. In addition, the approach is useful in refining the boundaries of structural domains within larger protein sequences, subclassifying sequence diverse protein families, and defining structure-based strategies specific to a particular family.
Affiliation(s)
- Nebojsa Mirkovic
- Department of Microbiology and Immunology, Weill Medical College of Cornell University, New York, New York 10021, USA
|
46
|
Seibold E, Bogumil R, Vorderwülbecke S, Al Dahouk S, Buckendahl A, Tomaso H, Splettstoesser W. Optimized application of surface-enhanced laser desorption/ionization time-of-flight MS to differentiate Francisella tularensis at the level of subspecies and individual strains. FEMS Immunol Med Microbiol 2007; 49:364-73. [PMID: 17378900 DOI: 10.1111/j.1574-695x.2007.00216.x] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.5] [Indexed: 11/30/2022]
Abstract
Francisella tularensis, the causative agent of tularaemia, is a potential agent of bioterrorism. The phenotypic discrimination of the closely related F. tularensis subspecies and individual strains with traditional methods is difficult and time consuming, often producing ambiguous results. Surface-enhanced laser desorption/ionization time-of-flight MS (SELDI-TOF MS) was used in this study to discriminate the different species and subspecies of the genus Francisella. We tested 18 Francisella strains including at least one representative of each species/subspecies on four different types of chromatographic chip surfaces. Multivariate analysis (hierarchical clustering and principal component analysis) allowed grouping of the strains according to their designated subspecies. Furthermore, single strains within F. tularensis subspecies could be discriminated.
Affiliation(s)
- Eric Seibold
- Bundeswehr Institute of Microbiology, Neuherbergstr, München, Germany.
|
47
|
Abola E, Carlton DD, Kuhn P, Stevens RC. Five Years of Increasing Structural Biology Throughput - A Retrospective Analysis. Structure-Based Drug Discovery 2007. [PMCID: PMC7122022 DOI: 10.1007/1-4020-4407-0_1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/05/2022]
|
48
|
Smialowski P, Martin-Galiano AJ, Mikolajka A, Girschick T, Holak TA, Frishman D. Protein solubility: sequence based prediction and experimental verification. Bioinformatics 2006; 23:2536-42. [PMID: 17150993 DOI: 10.1093/bioinformatics/btl623] [Citation(s) in RCA: 106] [Impact Index Per Article: 5.6] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Obtaining soluble proteins in sufficient concentrations is a recurring limiting factor in various experimental studies. Solubility is an individual trait of proteins which, under a given set of experimental conditions, is determined by their amino acid sequence. Accurate theoretical prediction of solubility from sequence is instrumental for setting priorities on targets in large-scale proteomics projects. RESULTS: We present a machine-learning approach called PROSO to assess the chance of a protein to be soluble upon heterologous expression in Escherichia coli based on its amino acid composition. The classification algorithm is organized as a two-layered structure in which the output of primary support vector machine (SVM) classifiers serves as input for a secondary Naive Bayes classifier. Experimental progress information from the TargetDB database as well as previously published datasets were used as the source of training data. In comparison with previously published methods our classification algorithm possesses improved discriminatory capacity characterized by the Matthews Correlation Coefficient (MCC) of 0.434 between predicted and known solubility states and the overall prediction accuracy of 72% (75 and 68% for positive and negative class, respectively). We also provide experimental verification of our predictions using solubility measurements for 31 mutational variants of two different proteins.
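For readers unfamiliar with the metric, the Matthews Correlation Coefficient quoted in this abstract can be computed from a binary confusion matrix as in the minimal sketch below. The counts are hypothetical and are not taken from the paper: they assume a balanced set of 100 soluble and 100 insoluble proteins with the per-class accuracies quoted above (75% and 68%). The function name `matthews_corrcoef_from_counts` is likewise only illustrative.

```python
import math


def matthews_corrcoef_from_counts(tp: int, fp: int, tn: int, fn: int) -> float:
    """Return the MCC given the four cells of a binary confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is reported as 0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts (assumption, NOT data from the paper):
# 100 soluble proteins, 75 predicted correctly; 100 insoluble, 68 correct.
tp, fn = 75, 25
tn, fp = 68, 32

accuracy = (tp + tn) / (tp + fp + tn + fn)
mcc = matthews_corrcoef_from_counts(tp, fp, tn, fn)
print(f"accuracy = {accuracy:.3f}, MCC = {mcc:.3f}")
```

Under this balanced-class assumption the sketch gives an accuracy of 71.5% and an MCC of about 0.43, close to but not identical with the reported 72% and 0.434; the true class proportions in the benchmark set are not stated in the abstract.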
Affiliation(s)
- Pawel Smialowski
- Department of Genome Oriented Bioinformatics, Technische Universität München, Wissenschaftszentrum Weihenstephan, 85350 Freising, Germany
|
49
|
Abstract
Proteomic studies involve the identification as well as qualitative and quantitative comparison of proteins expressed under different conditions, and elucidation of their properties and functions, usually in a large-scale, high-throughput format. The high dimensionality of data generated from these studies will require the development of improved bioinformatics tools and data-mining approaches for efficient and accurate data analysis of biological specimens from healthy and diseased individuals. Mining large proteomics data sets provides a better understanding of the complexities between the normal and abnormal cell proteome of various biological systems, including environmental hazards, infectious agents (bioterrorism) and cancers. This review will shed light on recent developments in bioinformatics and data-mining approaches, and their limitations when applied to proteomics data sets, in order to strengthen the interdependence between proteomic technologies and bioinformatics tools.
Affiliation(s)
- Abdelali Haoudi
- Eastern Virginia Medical School, Department of Microbiology & Molecular Cell Biology, George L Wright Center for Biomedical Proteomics, Lewis Hall 3011, Norfolk, VA 23501, USA.
|
50
|
Abstract
Many classification schemes for proteins and domains are either hierarchical or semi-hierarchical yet most databases, especially those offering genome-wide analysis, only provide assignments to sequences at one level of their hierarchy. Given an established hierarchy, the problem of assigning new sequences to lower levels of that existing hierarchy is less hard (but no less important) than the initial top level assignment which requires the detection of the most distant relationships. A solution to this problem is described here in the form of a new procedure which can be thought of as a hybrid between pairwise and profile methods. The hybrid method is a general procedure that can be applied to any pre-defined hierarchy, at any level, including in principle multiple sub-levels. It has been tested on the SCOP classification via the SUPERFAMILY database and performs significantly better than either pairwise or profile methods alone. Perhaps the greatest advantage of the hybrid method over other possible approaches to the problem is that within the framework of an existing profile library, the assignments are fully automatic and come at almost no additional computational cost. Hence it has already been applied at the SCOP family level to all genomes in the SUPERFAMILY database, providing a wealth of new data to the biological and bioinformatics communities.
Affiliation(s)
- Julian Gough
- Unite de Bioinformatique Structurale, Institut Pasteur, 25-28 Rue du Docteur Roux, 75724 Paris Cedex 15, Paris, France.
|