1
Le NQK, Li W, Cao Y. Sequence-based prediction model of protein crystallization propensity using machine learning and two-level feature selection. Brief Bioinform 2023; 24:bbad319. [PMID: 37649385 DOI: 10.1093/bib/bbad319] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 03/09/2023] [Revised: 07/09/2023] [Accepted: 08/16/2023] [Indexed: 09/01/2023]
Abstract
Protein crystallization is crucial for biology, but the steps involved are complex and demanding in terms of external factors and internal structure. To save on experimental costs and time, the tendency of proteins to crystallize can be initially determined and screened by modeling. This study therefore created a new pipeline that uses protein sequence to predict crystallization propensity at the protein material production stage, the purification stage and the crystal production stage. The pipeline proposes a new feature selection method that combines Chi-square (${\chi }^{2}$) and recursive feature elimination together with the 12 selected features, followed by linear discriminant analysis for dimensionality reduction; finally, a support vector machine algorithm with hyperparameter tuning and 10-fold cross-validation is used to train the model and test the results. The pipeline has been tested on three different datasets, and its accuracy rates are higher than those of existing pipelines. In conclusion, our model provides a new solution for predicting multistage protein crystallization propensity, which is a major challenge in computational biology.
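The Chi-square part of the feature selection described in this abstract can be illustrated with a minimal sketch. It assumes the common formulation for non-negative features, in which the per-class sum of a feature is compared with the sum expected from the class frequencies. The function names `chi2_scores` and `top_k_features` and the toy data are hypothetical, and this is not the authors' implementation.

```python
def chi2_scores(X, y):
    """Chi-square score of each non-negative feature column in X against labels y."""
    classes = sorted(set(y))
    n_samples = len(y)
    n_features = len(X[0])
    # Fraction of samples that belong to each class.
    class_prob = {c: sum(1 for label in y if label == c) / n_samples for c in classes}
    scores = []
    for j in range(n_features):
        total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = class_prob[c] * total
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def top_k_features(X, y, k):
    """Indices of the k features with the highest Chi-square score."""
    scores = chi2_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Toy data (illustrative only): feature 0 separates the classes, feature 1 does not.
X = [[2, 1], [2, 1], [0, 1], [0, 1]]
y = [1, 1, 0, 0]
print(chi2_scores(X, y))
print(top_k_features(X, y, 1))
```

In a two-level scheme, a ranking like this would typically act as a cheap filter before a costlier wrapper method such as recursive feature elimination.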
Affiliation(s)
- Nguyen Quoc Khanh Le
  - Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, 250 Wuxing Street, 110, Taipei, Taiwan
  - AIBioMed Research Group, Taipei Medical University, 250 Wuxing Street, 110, Taipei, Taiwan
  - Research Center for Artificial Intelligence in Medicine, Taipei Medical University, 250 Wuxing Street, 110, Taipei, Taiwan
  - Translational Imaging Research Center, Taipei Medical University Hospital, 252 Wuxing Street, 110, Taipei, Taiwan
- Wanru Li
  - NUS-ISS, National University of Singapore, 25 Heng Mui Keng Terrace, 119615, Singapore, Singapore
- Yanshuang Cao
  - NUS-ISS, National University of Singapore, 25 Heng Mui Keng Terrace, 119615, Singapore, Singapore
2
Wang PH, Zhu YH, Yang X, Yu DJ. GCmapCrys: Integrating graph attention network with predicted contact map for multi-stage protein crystallization propensity prediction. Anal Biochem 2023; 663:115020. [PMID: 36521558 DOI: 10.1016/j.ab.2022.115020] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/25/2022] [Revised: 12/05/2022] [Accepted: 12/10/2022] [Indexed: 12/14/2022]
Abstract
X-ray crystallography is the major approach for atomic-level protein structure determination. Since not all proteins can be easily crystallized, accurate prediction of protein crystallization propensity is critical to guiding the experimental design and improving the success rate of X-ray crystallography experiments. In this work, we proposed a new deep learning pipeline, GCmapCrys, for multi-stage crystallization propensity prediction through integrating a graph attention network with predicted protein contact maps. Experimental results on 1548 proteins with known crystallization records demonstrated that GCmapCrys increased the value of the Matthews correlation coefficient by 37.0% on average compared to state-of-the-art protein crystallization propensity predictors. Detailed analyses show that the major advantages of GCmapCrys lie in the efficiency of the graph attention network with predicted contact map, which effectively associates the residue-interaction knowledge with crystallization pattern. Meanwhile, the four designed sequence-based features are complementary and further enhance crystallization propensity prediction.
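Several abstracts in this list report gains in the Matthews correlation coefficient (MCC). As a reminder of what that metric measures, the following minimal sketch computes MCC from binary confusion-matrix counts and expresses a gain relative to a baseline. The function names and the example values are hypothetical and are not taken from any of the cited papers.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Conventional fallback when any marginal count is zero.
        return 0.0
    return (tp * tn - fp * fn) / denominator


def relative_gain_percent(new_value, baseline):
    """Relative improvement of new_value over baseline, in percent."""
    return (new_value - baseline) / abs(baseline) * 100.0


print(mcc(50, 50, 0, 0))      # perfect prediction
print(mcc(25, 25, 25, 25))    # no better than chance
# Illustrative numbers only: a baseline MCC of 0.5 rising to 0.685.
print(relative_gain_percent(0.685, 0.5))
```

Because the gain is relative, a large percentage can correspond to a modest absolute change when the baseline MCC is small.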
Affiliation(s)
- Peng-Hao Wang
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, PR China
- Yi-Heng Zhu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, PR China
- Xibei Yang
  - School of Computer, Jiangsu University of Science and Technology, Zhenjiang, 212100, PR China
- Dong-Jun Yu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, PR China
3
Jin C, Shi Z, Kang C, Lin K, Zhang H. TLCrys: Transfer Learning Based Method for Protein Crystallization Prediction. Int J Mol Sci 2022; 23:972. [PMID: 35055158 PMCID: PMC8778968 DOI: 10.3390/ijms23020972] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2021] [Revised: 01/05/2022] [Accepted: 01/14/2022] [Indexed: 11/17/2022]
Abstract
The X-ray diffraction technique is one of the most common methods of ascertaining protein structures, yet only 2-10% of proteins can produce diffraction-quality crystals. Several computational methods have been proposed so far to predict protein crystallization. Nevertheless, the current state-of-the-art computational methods are limited by the scarcity of experimental data. Thus, the prediction accuracy of existing models has not reached the ideal level. To address the problems above, we propose a novel transfer-learning-based framework for protein crystallization prediction, named TLCrys. The framework proceeds in two steps: pre-training and fine-tuning. The pre-training step adopts an attention mechanism to extract both global and local information from the protein sequences. The representation learned from the pre-training step is regarded as knowledge to be transferred and fine-tuned to enhance the performance of crystallization prediction. During pre-training, TLCrys adopts a multi-task learning method, which not only improves the learning ability of protein encoding, but also enhances the robustness and generalization of the protein representation. The multi-head self-attention layer guarantees that different levels of the protein representation can be extracted by the fine-tuning step. During transfer learning, the fine-tuning strategy used by TLCrys improves the task-specialized learning ability of the network. Our method outperforms all previous predictors significantly in five crystallization stages of prediction. Furthermore, the proposed methodology can be well generalized to other protein sequence classification tasks.
Affiliation(s)
- Chen Jin
  - College of Computer Science, Nankai University, Tianjin 300350, China
- Zhuangwei Shi
  - College of Artificial Intelligence, Nankai University, Tianjin 300350, China
- Chuanze Kang
  - College of Computer Science, Nankai University, Tianjin 300350, China
- Ken Lin
  - College of Artificial Intelligence, Nankai University, Tianjin 300350, China
- Han Zhang
  - College of Artificial Intelligence, Nankai University, Tianjin 300350, China
4
Robinson SL, Piel J, Sunagawa S. A roadmap for metagenomic enzyme discovery. Nat Prod Rep 2021; 38:1994-2023. [PMID: 34821235 PMCID: PMC8597712 DOI: 10.1039/d1np00006c] [Citation(s) in RCA: 66] [Impact Index Per Article: 16.5] [Received: 01/31/2021] [Indexed: 12/13/2022]
Abstract
Covering: up to 2021. Metagenomics has yielded massive amounts of sequencing data offering a glimpse into the biosynthetic potential of the uncultivated microbial majority. While genome-resolved information about microbial communities from nearly every environment on earth is now available, the ability to accurately predict biocatalytic functions directly from sequencing data remains challenging. Compared to primary metabolic pathways, enzymes involved in secondary metabolism often catalyze specialized reactions with diverse substrates, making these pathways rich resources for the discovery of new enzymology. To date, functional insights gained from studies on environmental DNA (eDNA) have largely relied on PCR- or activity-based screening of eDNA fragments cloned in fosmid or cosmid libraries. As an alternative, shotgun metagenomics holds underexplored potential for the discovery of new enzymes directly from eDNA by avoiding common biases introduced through PCR- or activity-guided functional metagenomics workflows. However, inferring new enzyme functions directly from eDNA is similar to searching for a 'needle in a haystack' without direct links between genotype and phenotype. The goal of this review is to provide a roadmap to navigate shotgun metagenomic sequencing data and identify new candidate biosynthetic enzymes. We cover both computational and experimental strategies to mine metagenomes and explore protein sequence space with a spotlight on natural product biosynthesis. Specifically, we compare in silico methods for enzyme discovery including phylogenetics, sequence similarity networks, genomic context, 3D structure-based approaches, and machine learning techniques. We also discuss various experimental strategies to test computational predictions including heterologous expression and screening. Finally, we provide an outlook for future directions in the field with an emphasis on meta-omics, single-cell genomics, cell-free expression systems, and sequence-independent methods.
Affiliation(s)
- Jörn Piel
  - Eidgenössische Technische Hochschule (ETH), Zürich, Switzerland
5
Ding Y, Tang J, Guo F. Protein Crystallization Identification via Fuzzy Model on Linear Neighborhood Representation. IEEE/ACM Trans Comput Biol Bioinform 2021; 18:1986-1995. [PMID: 31751248 DOI: 10.1109/tcbb.2019.2954826] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Indexed: 06/10/2023]
Abstract
X-ray crystallography is the most popular approach for analyzing protein 3D structure. However, the success rate of protein crystallization is very low (2-10 percent). To reduce the cost in time and resources, many computational methods have been developed to predict protein crystallization. Improving the accuracy of predicting protein crystallization is very important for the determination of protein structure by X-ray crystallography. At present, many machine learning methods are used to predict protein crystallization. In this article, we propose a Fuzzy Support Vector Machine based on Linear Neighborhood Representation (FSVM-LNR) to predict the crystallization propensity of proteins. Proteins are represented by three types of features (PsePSSM, PSSM-DWT, MMI-PS), and these features are serially combined and fed into FSVM-LNR. FSVM-LNR can filter outliers by membership score, which is calculated via reconstruction residuals of the k nearest samples. To evaluate the performance of our predictive model, we test FSVM-LNR on the datasets TRAIN3587, TEST3585 and TEST500. Our method achieves a better Matthews correlation coefficient (MCC) on TRAIN3587 (MCC: 0.56) and TEST3585 (MCC: 0.58). Although its performance on the independent test set TEST500 is not the best, FSVM-LNR still shows predictive ability (MCC: 0.70) in the identification of protein crystallization. The good performance on the datasets proves the effectiveness of our method, and the better performance on large datasets further demonstrates the stability and superiority of our method.
6
Sequence-Based Prediction of Transmembrane Protein Crystallization Propensity. Interdiscip Sci 2021; 13:693-702. [PMID: 34143353 DOI: 10.1007/s12539-021-00448-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2020] [Revised: 05/31/2021] [Accepted: 06/04/2021] [Indexed: 10/21/2022]
Abstract
Transmembrane proteins play a vital role in cell life activities. There are several techniques to determine transmembrane protein structures, and X-ray crystallography is the primary methodology. However, due to the special properties of transmembrane proteins, it is still hard to determine their structures by X-ray crystallography. To reduce experimental consumption and improve experimental efficiency, it is of great significance to develop computational methods for predicting the crystallization propensity of transmembrane proteins. In this work, we proposed a sequence-based machine learning method, namely Prediction of TransMembrane protein Crystallization propensity (PTMC), to predict the propensity of transmembrane protein crystallization. First, we obtained several general sequence features and the specific encoded features of relative solvent accessibility and hydrophobicity. Second, feature selection was employed to filter out redundant and irrelevant features, and the optimal feature subset is composed of hydrophobicity, amino acid composition and relative solvent accessibility. Finally, we chose extreme gradient boosting after comparing it with several other machine learning methods. Comparative results on the independent test set indicate that PTMC outperforms state-of-the-art sequence-based methods in terms of sensitivity, specificity, accuracy, Matthews Correlation Coefficient (MCC) and Area Under the receiver operating characteristic Curve (AUC). In comparison with two competitors, Bcrystal and TMCrys, PTMC achieves an improvement of 0.132 and 0.179 for sensitivity, 0.014 and 0.127 for specificity, 0.037 and 0.192 for accuracy, 0.128 and 0.362 for MCC, and 0.027 and 0.125 for AUC, respectively.
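Two of the feature families named in this abstract, amino acid composition and hydrophobicity, can be computed directly from a sequence. The sketch below is illustrative only: it assumes the Kyte-Doolittle hydropathy scale, which the abstract does not specify, and the function names are hypothetical rather than part of PTMC.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values (assumed scale; the abstract names none).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def amino_acid_composition(sequence):
    """Fraction of each of the 20 standard residues; other symbols are ignored."""
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: residues.count(aa) / len(residues) for aa in AMINO_ACIDS}


def mean_hydrophobicity(sequence):
    """Average hydropathy over the standard residues of the sequence."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not values:
        raise ValueError("sequence contains no standard amino acids")
    return sum(values) / len(values)


composition = amino_acid_composition("AAGL")
print(composition["A"], composition["G"], composition["L"])
print(mean_hydrophobicity("AL"))
```

The 20 composition fractions plus a hydrophobicity summary give a fixed-length vector regardless of sequence length, which is what a tree-based classifier such as gradient boosting expects as input.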
7
Wang Y, Ding Y, Tang J, Dai Y, Guo F. CrystalM: A Multi-View Fusion Approach for Protein Crystallization Prediction. IEEE/ACM Trans Comput Biol Bioinform 2021; 18:325-335. [PMID: 31027046 DOI: 10.1109/tcbb.2019.2912173] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Indexed: 06/09/2023]
Abstract
Improving the accuracy of predicting protein crystallization is very important for protein crystallization projects, which are a critical step in the determination of protein structure by X-ray crystallography. At present, many machine learning methods are used to predict protein crystallization. Here, we use a novel feature combination to construct an SVM model for the prediction of protein crystallization, called CrystalM. In this work, we extract six features to represent protein sequences, namely Average Block-Position specific scoring matrix (AVBlock-PSSM), Average Block-Secondary Structure (AVBlock-SS), Global Encoding (GE), Pseudo-Position specific scoring matrix (PsePSSM), Protscale, and Discrete Wavelet Transform-Position specific scoring matrix (DWT-PSSM). Moreover, we employ two training datasets (TRAIN3587 and TRAIN1500) and their corresponding independent test datasets (TEST3585 and TEST500) to evaluate CrystalM by feeding multi-view features into a Support Vector Machine (SVM) classifier. The two training datasets are used for five-fold cross validation, and the two test datasets are used separately as independent tests for the corresponding training datasets. Finally, we compare the performance of CrystalM with other existing methods. For the datasets TRAIN3587 and TEST3585, CrystalM achieves the best Accuracy (ACC), the best Specificity (SP), and the same Matthews correlation coefficient (MCC) as the previous best-performing methods in the five-fold cross validation. In particular, ACC, SP, and MCC surpass the existing methods in the independent test, which proves the effectiveness of CrystalM. Meanwhile, ACC, SP, and MCC are higher than those of existing methods in the five-fold cross validation for TRAIN1500. Although the performance on the independent test for TEST500 is not the best, CrystalM still shows predictive ability in the prediction of protein crystallization. In addition, we find that choosing only the first four features can improve the performance of prediction for TRAIN1500 and TEST500, not only in independent tests but also in five-fold cross validation. This phenomenon indicates that the latter two features cannot effectively represent the proteins of TRAIN1500 and TEST500. CrystalM is a sequence-based protein crystallization prediction method. The good performance on the datasets proves the effectiveness of CrystalM, and the better performance on large datasets further demonstrates the stability and superiority of CrystalM.
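The five-fold cross validation mentioned in this abstract partitions a training set into five disjoint folds, each of which serves once as the held-out part. The following minimal sketch shows such a split for a dataset the size of TRAIN3587. The function name `k_fold_indices` and the seed are hypothetical, and the authors' actual splitting procedure (for example, whether it was stratified) is not stated in the abstract.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# A dataset with 3587 samples, split five ways.
for train, test in k_fold_indices(3587, 5):
    print(len(train), len(test))
```

Every sample appears in exactly one held-out fold, so metrics averaged over the five folds use each training example once for evaluation.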
8
Souza LFDF, Silva ICL, Marques AG, Silva FHDS, Nunes VX, Hassan MM, de Albuquerque VHC, Filho PPR. Internet of Medical Things: An Effective and Fully Automatic IoT Approach Using Deep Learning and Fine-Tuning to Lung CT Segmentation. Sensors 2020; 20:s20236711. [PMID: 33255308 PMCID: PMC7727680 DOI: 10.3390/s20236711] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 10/13/2020] [Revised: 11/16/2020] [Accepted: 11/17/2020] [Indexed: 12/17/2022]
Abstract
Several pathologies have a direct impact on society, causing public health problems. Pulmonary diseases such as chronic obstructive pulmonary disease (COPD) are already the third leading cause of death in the world, leaving tuberculosis at ninth with 1.7 million deaths and over 10.4 million new occurrences. The detection of lung regions in images is a classic medical challenge. Studies show that computational methods contribute significantly to the medical diagnosis of lung pathologies by Computerized Tomography (CT), as well as through Internet of Things (IoT) methods in the context of the health of things. The present work proposes a new model based on IoT for classification and segmentation of pulmonary CT images, applying the transfer learning technique in deep learning methods combined with Parzen’s probability density. The proposed model uses an Application Programming Interface (API) based on the Internet of Medical Things to classify lung images. The approach was very effective, with results above 98% accuracy for classification in pulmonary images. Then the model proceeds to the lung segmentation stage, using the Mask R-CNN network to create a pulmonary map and fine-tuning to find the pulmonary borders on the CT image. The experiment was a success: the proposed method performed better than other works in the literature, reaching high segmentation metric values such as an accuracy of 98.34%. Besides reaching 5.43 s in segmentation time and outperforming other transfer learning models, our methodology stands out among the others because it is fully automatic. The proposed approach has simplified the segmentation process using transfer learning. It has introduced a faster and more effective method for better-performing lung segmentation, making our model fully automatic and robust.
Affiliation(s)
- Luís Fabrício de Freitas Souza
  - Department of Computer Science, Federal Institute of Education, Science and Technology of Ceará, Fortaleza CE 60040-215, Brazil
  - Department of Teleinformatics Engineering, Federal University of Ceará, Fortaleza CE 60020-181, Brazil
- Iágson Carlos Lima Silva
  - Department of Computer Science, Federal Institute of Education, Science and Technology of Ceará, Fortaleza CE 60040-215, Brazil
- Adriell Gomes Marques
  - Department of Computer Science, Federal Institute of Education, Science and Technology of Ceará, Fortaleza CE 60040-215, Brazil
- Francisco Hércules dos S. Silva
  - Department of Computer Science, Federal Institute of Education, Science and Technology of Ceará, Fortaleza CE 60040-215, Brazil
- Virgínia Xavier Nunes
  - Department of Computer Science, Federal Institute of Education, Science and Technology of Ceará, Fortaleza CE 60040-215, Brazil
- Mohammad Mehedi Hassan
  - Information Systems Department, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
  - Correspondence:
- Victor Hugo C. de Albuquerque
  - Department of Computer Science, Federal Institute of Education, Science and Technology of Ceará, Fortaleza CE 60040-215, Brazil
  - Department of Teleinformatics Engineering, Federal University of Ceará, Fortaleza CE 60020-181, Brazil
- Pedro P. Rebouças Filho
  - Department of Computer Science, Federal Institute of Education, Science and Technology of Ceará, Fortaleza CE 60040-215, Brazil
  - Department of Teleinformatics Engineering, Federal University of Ceará, Fortaleza CE 60020-181, Brazil
9
Zhu YH, Hu J, Ge F, Li F, Song J, Zhang Y, Yu DJ. Accurate multistage prediction of protein crystallization propensity using deep-cascade forest with sequence-based features. Brief Bioinform 2020; 22:5839971. [PMID: 32436937 DOI: 10.1093/bib/bbaa076] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 02/03/2020] [Revised: 04/09/2020] [Accepted: 04/13/2020] [Indexed: 11/13/2022]
Abstract
X-ray crystallography is the major approach for determining atomic-level protein structures. Because not all proteins can be easily crystallized, accurate prediction of protein crystallization propensity provides critical help in guiding experimental design and improving the success rate of X-ray crystallography experiments. This study has developed a new machine-learning-based pipeline that uses a newly developed deep-cascade forest (DCF) model with multiple types of sequence-based features to predict protein crystallization propensity. Based on the developed pipeline, two new protein crystallization propensity predictors, denoted as DCFCrystal and MDCFCrystal, have been implemented. DCFCrystal is a multistage predictor that can estimate the success propensities of the three individual steps (production of protein material, purification and production of crystals) in the protein crystallization process. MDCFCrystal is a single-stage predictor that aims to estimate the probability that a protein will pass through the entire crystallization process. Moreover, DCFCrystal is designed for general proteins, whereas MDCFCrystal is specially designed for membrane proteins, which are notoriously difficult to crystallize. DCFCrystal and MDCFCrystal were separately tested on two benchmark datasets consisting of 12 289 and 950 proteins, respectively, with known crystallization results from various experimental records. The experimental results demonstrated that DCFCrystal and MDCFCrystal increased the value of the Matthews correlation coefficient by 199.7% and 77.8%, respectively, compared to the best of other state-of-the-art protein crystallization propensity predictors. Detailed analyses show that the major advantages of DCFCrystal and MDCFCrystal lie in the efficiency of the DCF model and the sensitivity of the sequence-based features used, especially the newly designed pseudo-predicted hybrid solvent accessibility (PsePHSA) feature, which improves crystallization recognition by incorporating sequence-order information with solvent accessibility of residues. Meanwhile, the new crystal-dataset constructions help to train the models with more comprehensive crystallization knowledge.
10
Abstract
The process of macromolecular crystallisation almost always begins by setting up crystallisation trials using commercial or other premade screens, followed by cycles of optimisation where the crystallisation cocktails are focused towards a particular small region of chemical space. The screening process is relatively straightforward, but still requires an understanding of the plethora of commercially available screens. Optimisation is complicated by requiring both the design and preparation of the appropriate secondary screens. Software has been developed in the C3 lab to aid the process of choosing initial screens, to analyse the results of the initial trials, and to design and describe how to prepare optimisation screens.
11
MacGowan SA, Madeira F, Britto‐Borges T, Warowny M, Drozdetskiy A, Procter JB, Barton GJ. The Dundee Resource for Sequence Analysis and Structure Prediction. Protein Sci 2020; 29:277-297. [PMID: 31710725 PMCID: PMC6933851 DOI: 10.1002/pro.3783] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 08/31/2019] [Revised: 11/07/2019] [Accepted: 11/07/2019] [Indexed: 11/06/2022]
Abstract
The Dundee Resource for Sequence Analysis and Structure Prediction (DRSASP; http://www.compbio.dundee.ac.uk/drsasp.html) is a collection of web services provided by the Barton Group at the University of Dundee. DRSASP's flagship services are the JPred4 webserver for secondary structure and solvent accessibility prediction and the JABAWS 2.2 webserver for multiple sequence alignment, disorder prediction, amino acid conservation calculations, and specificity-determining site prediction. DRSASP resources are available through conventional web interfaces and APIs but are also integrated into the Jalview sequence analysis workbench, which enables the composition of multitool interactive workflows. Other existing Barton Group tools are being brought under the banner of DRSASP, including NoD (Nucleolar localization sequence detector) and 14-3-3-Pred. New resources are being developed that enable the analysis of population genetic data in evolutionary and 3D structural contexts. Existing resources are actively developed to exploit new technologies and maintain parity with evolving web standards. DRSASP provides substantial computational resources for public use, and since 2016 DRSASP services have completed over 1.5 million jobs.
Affiliation(s)
- Stuart A. MacGowan
  - Division of Computational Biology, College of Life Sciences, University of Dundee, UK
- Fábio Madeira
  - Division of Computational Biology, College of Life Sciences, University of Dundee, UK
- Thiago Britto‐Borges
  - Division of Computational Biology, College of Life Sciences, University of Dundee, UK
- Mateusz Warowny
  - Division of Computational Biology, College of Life Sciences, University of Dundee, UK
- Alexey Drozdetskiy
  - Division of Computational Biology, College of Life Sciences, University of Dundee, UK
- James B. Procter
  - Division of Computational Biology, College of Life Sciences, University of Dundee, UK
- Geoffrey J. Barton
  - Division of Computational Biology, College of Life Sciences, University of Dundee, UK
12
Wang H, Feng L, Webb GI, Kurgan L, Song J, Lin D. Critical evaluation of bioinformatics tools for the prediction of protein crystallization propensity. Brief Bioinform 2018; 19:838-852. [PMID: 28334201 PMCID: PMC6171492 DOI: 10.1093/bib/bbx018] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 10/19/2016] [Revised: 01/19/2017] [Indexed: 12/11/2022]
Abstract
X-ray crystallography is the main tool for structural determination of proteins. Yet, the underlying crystallization process is costly, has a high attrition rate and involves a series of trial-and-error attempts to obtain diffraction-quality crystals. The Structural Genomics Consortium aims to systematically solve representative structures of major protein-fold classes using primarily high-throughput X-ray crystallography. The attrition rate of these efforts can be improved by selection of proteins that are potentially easier to be crystallized. In this context, bioinformatics approaches have been developed to predict crystallization propensities based on protein sequences. These approaches are used to facilitate prioritization of the most promising target proteins, search for alternative structural orthologues of the target proteins and suggest designs of constructs capable of potentially enhancing the likelihood of successful crystallization. We reviewed and compared nine predictors of protein crystallization propensity. Moreover, we demonstrated that integrating selected outputs from multiple predictors as candidate input features to build the predictive model results in a significantly higher predictive performance when compared to using these predictors individually. Furthermore, we also introduced a new and accurate predictor of protein crystallization propensity, Crysf, which uses functional features extracted from UniProt as inputs. This comprehensive review will assist structural biologists in selecting the most appropriate predictor, and is also beneficial for bioinformaticians to develop a new generation of predictive algorithms.
Affiliation(s)
- Huilin Wang
  - Department of Chemical Biology, College of Chemistry and Chemical Engineering, Xiamen University, China
- Geoffrey I Webb
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Australia
- Lukasz Kurgan
  - Department of Computer Science, Virginia Commonwealth University, USA
- Jiangning Song
  - Department of Biochemistry and Molecular Biology, Monash University, Australia
- Donghai Lin
  - Department of Chemical Biology, College of Chemistry and Chemical Engineering, Xiamen University, China
13
Fahim A, Rehman Z, Bhatti MF, Ali A, Virk N, Rashid A, Paracha RZ. Structural insights and characterization of human Npas4 protein. PeerJ 2018; 6:e4978. [PMID: 29915698 PMCID: PMC6004298 DOI: 10.7717/peerj.4978] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 02/24/2018] [Accepted: 05/15/2018] [Indexed: 12/25/2022]
Abstract
Npas4 is an activity-dependent transcription factor which is responsible for gearing the expression of target genes involved in neuro-transmission. Despite the importance of Npas4 in many neuronal diseases, knowledge of the tertiary structure of the Npas4 protein and of its physico-chemical properties is limited. In the current study, we first performed the phylogenetic analysis of Npas4 and determined the content of hydrophobic, flexible and order-disorder promoting amino acids. The protein binding regions, post-translational modifications and crystallization propensity of Npas4 were predicted through different in-silico methods. The three-dimensional model of Npas4 was predicted through LOMETS, SPARKS-X, I-TASSER, RaptorX, MUSTER and Phyre, and the best model was selected on the basis of Ramachandran plot, ProSA, and QMEAN scores. The best model was then subjected to further refinement through ModRefiner. Finally, the interacting partners of Npas4 were identified through the STRING database. The phylogenetic analysis showed the human Npas4 gene to be closely related to that of other primates such as chimpanzees, monkeys and gibbons. The physicochemical properties of Npas4 showed that it is an intrinsically disordered protein with an N-terminal ordered region. The post-translational modification analyses indicated the absence of acetylation and mannosylation sites. Three potential phosphorylation sites (S108, T130 and T136) were found in the PAS A domain whilst a single phosphorylation site (S273) was present in the PAS B domain. The predicted tertiary structure of Npas4 showed that the bHLH domain and PAS domain possess tertiary structures while the rest of the protein exhibited disorder. Protein-protein interaction analysis revealed Npas4 interaction with various proteins which are mainly involved in nuclear trafficking of proteins to the cytoplasm, activity-regulated gene transcription and neurodevelopmental disorders. Moreover, the analysis also highlighted the direct relation to proteins involved in promoting neuronal survival and plasticity and to cAMP responsive element binding proteins. The current study helps in understanding the physicochemical properties and reveals the neuro-modulatory role of Npas4 in crucial pathways involved in neuronal survival and neural signalling homeostasis.
Affiliation(s)
- Ammad Fahim
  - Atta-ur-Rahman School of Applied Biosciences (ASAB), National University of Sciences and Technology (NUST), Islamabad, Pakistan
- Zaira Rehman
  - Atta-ur-Rahman School of Applied Biosciences (ASAB), National University of Sciences and Technology (NUST), Islamabad, Pakistan
- Muhammad Faraz Bhatti
  - Atta-ur-Rahman School of Applied Biosciences (ASAB), National University of Sciences and Technology (NUST), Islamabad, Pakistan
- Amjad Ali
  - Atta-ur-Rahman School of Applied Biosciences (ASAB), National University of Sciences and Technology (NUST), Islamabad, Pakistan
- Nasar Virk
  - Atta-ur-Rahman School of Applied Biosciences (ASAB), National University of Sciences and Technology (NUST), Islamabad, Pakistan
- Amir Rashid
  - Army Medical College, National University of Medical Sciences, Rawalpindi, Pakistan
- Rehan Zafar Paracha
  - Research Center for Modeling and Simulation (RCMS), National University of Sciences & Technology (NUST), Islamabad, Pakistan
|
14
|
Gao J, Wu Z, Hu G, Wang K, Song J, Joachimiak A, Kurgan L. Survey of Predictors of Propensity for Protein Production and Crystallization with Application to Predict Resolution of Crystal Structures. Curr Protein Pept Sci 2018; 19:200-210. [PMID: 28933304 PMCID: PMC7001581 DOI: 10.2174/1389203718666170921114437] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 08/23/2017] [Revised: 09/14/2017] [Accepted: 09/14/2017] [Indexed: 11/22/2022]
Abstract
Selecting proper targets for X-ray crystallography will benefit the biological research community immensely. Several computational models have been proposed to predict the propensity for successful protein production and diffraction-quality crystallization from protein sequences. We reviewed a comprehensive collection of 22 such predictors developed in the last decade. We found that almost all of these models are easily accessible as webservers and/or standalone software, and we demonstrated that some of them are widely used by the research community. We empirically evaluated and compared the predictive performance of seven representative methods. The analysis suggests that these methods produce quite accurate propensities for diffraction-quality crystallization. We also summarized the results of the first study of the relation between these predicted propensities and the resolution of the crystallizable proteins. We found that the propensities predicted by several methods are significantly higher for proteins with high-resolution structures than for those with low-resolution structures. Moreover, we tested a new meta-predictor, MetaXXC, which averages the propensities generated by the three most accurate predictors of diffraction-quality crystallization. MetaXXC generates putative values of resolution that have modest levels of correlation with the experimental resolutions, and it offers the lowest mean absolute error when compared to the seven considered methods. We conclude that protein sequences can be used to predict fairly accurately whether their corresponding protein structures can be solved using X-ray crystallography. Moreover, we also ascertain that sequences can be used to predict reasonably well the resolution of the resulting protein crystals.
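The averaging step and the mean-absolute-error comparison mentioned in this abstract can be illustrated with a minimal Python sketch. This is not the MetaXXC implementation: the predictor names, protein identifiers and scores below are hypothetical, and the abstract does not say how averaged propensities are converted into putative resolution values, so that step is left out.

```python
from statistics import mean


def average_propensity(scores_by_predictor):
    """Average per-protein propensities over all predictors.

    scores_by_predictor maps a predictor name to {protein_id: propensity}.
    Only proteins scored by every predictor are kept.
    """
    per_predictor = list(scores_by_predictor.values())
    if not per_predictor:
        return {}
    shared_ids = set(per_predictor[0]).intersection(*per_predictor[1:])
    return {pid: mean(scores[pid] for scores in per_predictor)
            for pid in sorted(shared_ids)}


def mean_absolute_error(predicted, observed):
    """Mean absolute difference between two equally long sequences."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    return mean(abs(p - o) for p, o in zip(predicted, observed))


# Hypothetical propensities from three unnamed predictors (assumed values).
example_scores = {
    "predictor_a": {"P1": 0.9, "P2": 0.2},
    "predictor_b": {"P1": 0.6, "P2": 0.4},
    "predictor_c": {"P1": 0.6, "P2": 0.3, "P3": 0.8},  # P3 is dropped
}
print(average_propensity(example_scores))
```

Restricting the average to proteins scored by every predictor is a design choice of this sketch, made so that each averaged value combines the same number of inputs.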
Affiliation(s)
- Jianzhao Gao
  - School of Mathematical Sciences and LPMC, Nankai University, Tianjin, People’s Republic of China
- Zhonghua Wu
  - School of Mathematical Sciences and LPMC, Nankai University, Tianjin, People’s Republic of China
- Gang Hu
  - School of Mathematical Sciences and LPMC, Nankai University, Tianjin, People’s Republic of China
- Kui Wang
  - School of Mathematical Sciences and LPMC, Nankai University, Tianjin, People’s Republic of China
- Jiangning Song
  - Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
  - ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, Australia
- Andrzej Joachimiak
  - Midwest Center for Structural Genomics, Argonne, USA
  - Structural Biology Center, Biosciences, Argonne National Laboratory, Argonne, USA
  - Department of Biochemistry and Molecular Biology, University of Chicago, Chicago, USA
- Lukasz Kurgan
  - Department of Computer Science, Virginia Commonwealth University, Richmond, USA
|
15
|
Abstract
Obtaining diffracting quality crystals remains a major challenge in protein structure research. We summarize and compare methods for selecting the best protein targets for crystallization, construct optimization and crystallization condition design. Target selection methods are divided into algorithms predicting the chance of successful progression through all stages of structural determination (from cloning to solving the structure) and those focusing only on the crystallization step. We tried to highlight pros and cons of different approaches examining the following aspects: data size, redundancy and representativeness, overfitting during model construction, and results evaluation. In summary, although in recent years progress was made and several sequence properties were reported to be relevant for crystallization, the successful prediction of protein crystallization behavior and selection of corresponding crystallization conditions continue to challenge structural researchers.
|
16
|
The "Sticky Patch" Model of Crystallization and Modification of Proteins for Enhanced Crystallizability. Methods Mol Biol 2017; 1607:77-115. [PMID: 28573570 DOI: 10.1007/978-1-4939-7000-1_4] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 12/27/2022]
Abstract
Crystallization of macromolecules has long been perceived as a stochastic process, which cannot be predicted or controlled. This is consistent with another popular notion that the interactions of molecules within the crystal, i.e., crystal contacts, are essentially random and devoid of specific physicochemical features. In contrast, functionally relevant surfaces, such as oligomerization interfaces and specific protein-protein interaction sites, are under evolutionary pressures so their amino acid composition, structure, and topology are distinct. However, current theoretical and experimental studies are significantly changing our understanding of the nature of crystallization. The increasingly popular "sticky patch" model, derived from soft matter physics, describes crystallization as a process driven by interactions between select, specific surface patches, with properties thermodynamically favorable for cohesive interactions. Independent support for this model comes from various sources including structural studies and bioinformatics. Proteins that are recalcitrant to crystallization can be modified for enhanced crystallizability through chemical or mutational modification of their surface to effectively engineer "sticky patches" which would drive crystallization. Here, we discuss the current state of knowledge of the relationship between the microscopic properties of the target macromolecule and its crystallizability, focusing on the "sticky patch" model. We discuss state-of-the-art in silico methods that evaluate the propensity of a given target protein to form crystals based on these relationships, with the objective of designing variants with modified molecular surface properties and enhanced crystallization propensity. We illustrate this discussion with specific cases where these approaches made it possible to generate crystals suitable for structural analysis.
|
17
|
Hu J, Han K, Li Y, Yang JY, Shen HB, Yu DJ. TargetCrys: protein crystallization prediction by fusing multi-view features with two-layered SVM. Amino Acids 2016; 48:2533-2547. [DOI: 10.1007/s00726-016-2274-4] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.8] [Received: 07/30/2015] [Accepted: 06/07/2016] [Indexed: 12/12/2022]
|
18
|
Crysalis: an integrated server for computational analysis and design of protein crystallization. Sci Rep 2016; 6:21383. [PMID: 26906024 PMCID: PMC4764925 DOI: 10.1038/srep21383] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Received: 08/12/2015] [Accepted: 01/22/2016] [Indexed: 11/08/2022] Open
Abstract
The failure of multi-step experimental procedures to yield diffraction-quality crystals is a major bottleneck in protein structure determination. Accordingly, several bioinformatics methods have been successfully developed and employed to select crystallizable proteins. Unfortunately, the majority of existing in silico methods only allow the prediction of crystallization propensity, seldom enabling computational design of protein mutants that can be targeted for enhancing protein crystallizability. Here, we present Crysalis, an integrated crystallization analysis tool that builds on support-vector regression (SVR) models to facilitate computational protein crystallization prediction, analysis, and design. More specifically, the functionality of this new tool includes: (1) rapid selection of target crystallizable proteins at the proteome level, (2) identification of site non-optimality for protein crystallization and systematic analysis of all potential single-point mutations that might enhance protein crystallization propensity, and (3) annotation of target protein based on predicted structural properties. We applied the design mode of Crysalis to identify site non-optimality for protein crystallization on a proteome-scale, focusing on proteins currently classified as non-crystallizable. Our results revealed that site non-optimality is based on biases related to residues, predicted structures, physicochemical properties, and sequence loci, which provides in-depth understanding of the features influencing protein crystallization. Crysalis is freely available at http://nmrcen.xmu.edu.cn/crysalis/.
|
19
|
Deller MC, Kong L, Rupp B. Protein stability: a crystallographer's perspective. Acta Crystallogr F Struct Biol Commun 2016; 72:72-95. [PMID: 26841758 PMCID: PMC4741188 DOI: 10.1107/s2053230x15024619] [Citation(s) in RCA: 160] [Impact Index Per Article: 17.8] [Received: 11/27/2015] [Accepted: 12/21/2015] [Indexed: 12/18/2022]
Abstract
Protein stability is a topic of major interest for the biotechnology, pharmaceutical and food industries, in addition to being a daily consideration for academic researchers studying proteins. An understanding of protein stability is essential for optimizing the expression, purification, formulation, storage and structural studies of proteins. In this review, discussion will focus on factors affecting protein stability, on a somewhat practical level, particularly from the view of a protein crystallographer. The differences between protein conformational stability and protein compositional stability will be discussed, along with a brief introduction to key methods useful for analyzing protein stability. Finally, tactics for addressing protein-stability issues during protein expression, purification and crystallization will be discussed.
Affiliation(s)
- Marc C Deller
  - Stanford ChEM-H, Macromolecular Structure Knowledge Center, Stanford University, Shriram Center, 443 Via Ortega, Room 097, MC5082, Stanford, CA 94305-4125, USA
- Leopold Kong
  - Laboratory of Cell and Molecular Biology, National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), National Institutes of Health (NIH), Building 8, Room 1A03, 8 Center Drive, Bethesda, MD 20814, USA
- Bernhard Rupp
  - Department of Forensic Crystallography, k.-k. Hofkristallamt, 91 Audrey Place, Vista, CA 92084, USA
|
20
|
Altan I, Charbonneau P, Snell EH. Computational crystallization. Arch Biochem Biophys 2016; 602:12-20. [PMID: 26792536 DOI: 10.1016/j.abb.2016.01.004] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 12/04/2015] [Revised: 12/22/2015] [Accepted: 01/07/2016] [Indexed: 11/28/2022]
Abstract
Crystallization is a key step in macromolecular structure determination by crystallography. While a robust theoretical treatment of the process is available, due to the complexity of the system, the experimental process is still largely one of trial and error. In this article, efforts in the field are discussed together with a theoretical underpinning using a solubility phase diagram. Prior knowledge has been used to develop tools that computationally predict the crystallization outcome and define mutational approaches that enhance the likelihood of crystallization. For the most part these tools are based on binary outcomes (crystal or no crystal), and the full information contained in an assembly of crystallization screening experiments is lost. The potential of this additional information is illustrated by examples where new biological knowledge can be obtained and where a target can be sub-categorized to predict which class of reagents provides the crystallization driving force. Computational analysis of crystallization requires complete and correctly formatted data. While massive crystallization screening efforts are under way, the data available from many of these studies are sparse. The potential for this data and the steps needed to realize this potential are discussed.
Affiliation(s)
- Irem Altan
  - Department of Chemistry, Duke University, Durham, NC 27708, USA
- Patrick Charbonneau
  - Department of Chemistry, Duke University, Durham, NC 27708, USA; Department of Physics, Duke University, Durham, NC 27708, USA
- Edward H Snell
  - Hauptman-Woodward Medical Research Institute, 700 Ellicott St., NY 14203, USA; Department of Structural Biology, SUNY University of Buffalo, 700 Ellicott St., NY 14203, USA
|
21
|
Yan S, Wu G. Predicting Crystallization Propensity of Proteins from Arabidopsis thaliana. Biol Proced Online 2015; 17:16. [PMID: 26604856 PMCID: PMC4657326 DOI: 10.1186/s12575-015-0029-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 06/11/2015] [Accepted: 11/12/2015] [Indexed: 12/02/2022] Open
Abstract
Background: Many studies have correlated characteristics of amino acids with crystallization propensity, as part of the effort to determine the factors that affect the propensity of protein crystallization. However, these characteristics are constant; that is, the encoded amino acid sequences have the same value for each type of amino acid. To overcome this inflexibility, three dynamic characteristics of amino acids and protein were introduced to analyze the crystallization propensity of proteins. Both logistic regression and neural network models were used to correlate each of two dynamic characteristics with the crystallization propensity of 301 proteins from Arabidopsis thaliana, and their results were compared with those obtained from each of 531 constant amino acid characteristics, which served as the benchmark. Results: The neural network model was more powerful for predicting the crystallization propensity of proteins than the logistic regression model. Compared with the benchmark, the dynamic characteristics of amino acids provided good prediction results for the crystallization propensity, and the distribution probability gave the highest sensitivity. Using 90% accuracy as a cutoff point, the predictable portion of A. thaliana proteins was ranked, and the statistical analysis showed that the larger the predictable portion, the better the prediction. Conclusions: These results demonstrate that dynamic characteristics have a certain relationship with the crystallization propensity, and they could be helpful for the prediction of protein crystallization, which may provide a theoretical basis for assessing certain proteins before conducting experimental crystallization.
Affiliation(s)
- Shaomin Yan
  - State Key Laboratory of Non-food Biomass Enzyme Technology, National Engineering Research Center for Non-food Biorefinery, Guangxi Key Laboratory of Biorefinery, Guangxi Academy of Sciences, 98 Daling Road, Nanning, Guangxi 530007 China
- Guang Wu
  - State Key Laboratory of Non-food Biomass Enzyme Technology, National Engineering Research Center for Non-food Biorefinery, Guangxi Key Laboratory of Biorefinery, Guangxi Academy of Sciences, 98 Daling Road, Nanning, Guangxi 530007 China
|
22
|
Kirkwood J, Hargreaves D, O’Keefe S, Wilson J. Analysis of crystallization data in the Protein Data Bank. Acta Crystallogr F Struct Biol Commun 2015; 71:1228-34. [PMID: 26457511 PMCID: PMC4601584 DOI: 10.1107/s2053230x15014892] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Received: 06/18/2015] [Accepted: 08/08/2015] [Indexed: 11/10/2022] Open
Abstract
The Protein Data Bank (PDB) is the largest available repository of solved protein structures and contains a wealth of information on successful crystallization. Many centres have used their own experimental data to draw conclusions about proteins and the conditions in which they crystallize. Here, data from the PDB were used to reanalyse some of these results. The most successful crystallization reagents were identified, the link between solution pH and the isoelectric point of the protein was investigated and the possibility of predicting whether a protein will crystallize was explored.
Affiliation(s)
- Jobie Kirkwood
  - Department of Chemistry, University of York, York YO10 5DD, England
- David Hargreaves
  - AstraZeneca, Darwin Building, Cambridge Science Park, Cambridge CB4 0WG, England
- Simon O’Keefe
  - Department of Computer Science, University of York, York YO10 5DD, England
- Julie Wilson
  - Department of Chemistry, University of York, York YO10 5DD, England
  - Department of Mathematics, University of York, York YO10 5DD, England
|
23
|
Kirkwood J, Hargreaves D, O'Keefe S, Wilson J. Using isoelectric point to determine the pH for initial protein crystallization trials. Bioinformatics 2015; 31:1444-51. [PMID: 25573921 PMCID: PMC4410668 DOI: 10.1093/bioinformatics/btv011] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.6] [Received: 09/26/2014] [Accepted: 01/05/2015] [Indexed: 11/12/2022]
Abstract
Motivation: The identification of suitable conditions for crystallization is a rate-limiting step in protein structure determination. The pH of an experiment is an important parameter and has the potential to be used in data-mining studies to help reduce the number of crystallization trials required. However, the pH is usually recorded as that of the buffer solution, which can be highly inaccurate. Results: Here, we show that a better estimate of the true pH can be predicted by considering not only the buffer pH but also any other chemicals in the crystallization solution. We use these more accurate pH values to investigate the disputed relationship between the pI of a protein and the pH at which it crystallizes. Availability and implementation: Data used to generate models are available as Supplementary Material. Contact: julie.wilson@york.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jobie Kirkwood
  - Department of Chemistry, University of York, YO10 5DD, UK, Discovery Sciences, Structure & Biophysics, AstraZeneca Darwin Building, Cambridge, Science Park, Milton Road, Cambridge, CB4 0WG, UK, Department of Computer Science, University of York, YO10 5GH, UK and Department of Mathematics, University of York, YO10 5DD, UK
- David Hargreaves
  - Department of Chemistry, University of York, YO10 5DD, UK, Discovery Sciences, Structure & Biophysics, AstraZeneca Darwin Building, Cambridge, Science Park, Milton Road, Cambridge, CB4 0WG, UK, Department of Computer Science, University of York, YO10 5GH, UK and Department of Mathematics, University of York, YO10 5DD, UK
- Simon O'Keefe
  - Department of Chemistry, University of York, YO10 5DD, UK, Discovery Sciences, Structure & Biophysics, AstraZeneca Darwin Building, Cambridge, Science Park, Milton Road, Cambridge, CB4 0WG, UK, Department of Computer Science, University of York, YO10 5GH, UK and Department of Mathematics, University of York, YO10 5DD, UK
- Julie Wilson
  - Department of Chemistry, University of York, YO10 5DD, UK, Discovery Sciences, Structure & Biophysics, AstraZeneca Darwin Building, Cambridge, Science Park, Milton Road, Cambridge, CB4 0WG, UK, Department of Computer Science, University of York, YO10 5GH, UK and Department of Mathematics, University of York, YO10 5DD, UK
|
24
|
Bhardwaj RM, Johnston A, Johnston BF, Florence AJ. A random forest model for predicting the crystallisability of organic molecules. CrystEngComm 2015. [DOI: 10.1039/c4ce02403f] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Indexed: 11/21/2022]
|
25
|
Mizianty MJ, Fan X, Yan J, Chalmers E, Woloschuk C, Joachimiak A, Kurgan L. Covering complete proteomes with X-ray structures: a current snapshot. Acta Crystallogr D Biol Crystallogr 2014; 70:2781-93. [PMID: 25372670 PMCID: PMC4220968 DOI: 10.1107/s1399004714019427] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Received: 04/23/2014] [Accepted: 08/27/2014] [Indexed: 12/23/2022]
Abstract
Structural genomics programs have developed and applied structure-determination pipelines to a wide range of protein targets, facilitating the visualization of macromolecular interactions and the understanding of their molecular and biochemical functions. The fundamental question of whether three-dimensional structures of all proteins and all functional annotations can be determined using X-ray crystallography is investigated. A first-of-its-kind large-scale analysis of crystallization propensity for all proteins encoded in 1953 fully sequenced genomes was performed. It is shown that current X-ray crystallographic knowhow combined with homology modeling can provide structures for 25% of modeling families (protein clusters for which structural models can be obtained through homology modeling), with at least one structural model produced for each Gene Ontology functional annotation. The coverage varies between superkingdoms, with 19% for eukaryotes, 35% for bacteria and 49% for archaea, and with those of viruses following the coverage values of their hosts. It is shown that the crystallization propensities of proteomes from the taxonomic superkingdoms are distinct. The use of knowledge-based target selection is shown to substantially increase the ability to produce X-ray structures. It is demonstrated that the human proteome has one of the highest attainable coverage values among eukaryotes, and GPCR membrane proteins suitable for X-ray structure determination were determined.
Affiliation(s)
- Marcin J. Mizianty
  - Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta T6G 2V4, Canada
- Xiao Fan
  - Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta T6G 2V4, Canada
- Jing Yan
  - Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta T6G 2V4, Canada
- Eric Chalmers
  - Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta T6G 2V4, Canada
- Christopher Woloschuk
  - Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta T6G 2V4, Canada
- Andrzej Joachimiak
  - Midwest Center for Structural Genomics, Argonne National Laboratory, Argonne, IL 60439, USA
- Lukasz Kurgan
  - Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta T6G 2V4, Canada
|
26
|
Wang H, Wang M, Tan H, Li Y, Zhang Z, Song J. PredPPCrys: accurate prediction of sequence cloning, protein production, purification and crystallization propensity from protein sequences using multi-step heterogeneous feature fusion and selection. PLoS One 2014; 9:e105902. [PMID: 25148528 PMCID: PMC4141844 DOI: 10.1371/journal.pone.0105902] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Received: 04/23/2014] [Accepted: 07/25/2014] [Indexed: 01/14/2023] Open
Abstract
X-ray crystallography is the primary approach to solve the three-dimensional structure of a protein. However, a major bottleneck of this method is the failure of multi-step experimental procedures to yield diffraction-quality crystals, including sequence cloning, protein material production, purification, crystallization and ultimately, structural determination. Accordingly, prediction of the propensity of a protein to successfully undergo these experimental procedures based on the protein sequence may help narrow down laborious experimental efforts and facilitate target selection. A number of bioinformatics methods based on protein sequence information have been developed for this purpose. However, our knowledge on the important determinants of propensity for a protein sequence to produce high diffraction-quality crystals remains largely incomplete. In practice, most of the existing methods display poorer performance when evaluated on larger and updated datasets. To address this problem, we constructed an up-to-date dataset as the benchmark, and subsequently developed a new approach termed ‘PredPPCrys’ using the support vector machine (SVM). Using a comprehensive set of multifaceted sequence-derived features in combination with a novel multi-step feature selection strategy, we identified and characterized the relative importance and contribution of each feature type to the prediction performance of five individual experimental steps required for successful crystallization. The resulting optimal candidate features were used as inputs to build the first-level SVM predictor (PredPPCrys I). Next, prediction outputs of PredPPCrys I were used as the input to build second-level SVM classifiers (PredPPCrys II), which led to significantly enhanced prediction performance. Benchmarking experiments indicated that our PredPPCrys method outperforms most existing procedures on both up-to-date and previous datasets. 
In addition, the predicted crystallization targets of currently non-crystallizable proteins were provided as compendium data, which are anticipated to facilitate target selection and design for the worldwide structural genomics consortium. PredPPCrys is freely available at http://www.structbioinfor.org/PredPPCrys.
Affiliation(s)
- Huilin Wang
  - National Engineering Laboratory for Industrial Enzymes and Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China
- Mingjun Wang
  - National Engineering Laboratory for Industrial Enzymes and Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China
- Hao Tan
  - Department of Biochemistry and Molecular Biology, Faculty of Medicine, Monash University, Melbourne, Victoria, Australia
- Yuan Li
  - National Engineering Laboratory for Industrial Enzymes and Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China
- Ziding Zhang
  - State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing, China
- Jiangning Song
  - National Engineering Laboratory for Industrial Enzymes and Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China
  - Department of Biochemistry and Molecular Biology, Faculty of Medicine, Monash University, Melbourne, Victoria, Australia
  - ARC Centre of Excellence in Structural and Functional Microbial Genomics, Monash University, Melbourne, Victoria, Australia
|
27
|
Moon WK, Lo CM, Chen RT, Shen YW, Chang JM, Huang CS, Chen JH, Hsu WW, Chang RF. Tumor detection in automated breast ultrasound images using quantitative tissue clustering. Med Phys 2014; 41:042901. [DOI: 10.1118/1.4869264] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.5] [Indexed: 11/07/2022] Open
|
28
|
Jahandideh S, Jaroszewski L, Godzik A. Improving the chances of successful protein structure determination with a random forest classifier. Acta Crystallogr D Biol Crystallogr 2014; 70:627-35. [PMID: 24598732 PMCID: PMC3949519 DOI: 10.1107/s1399004713032070] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.7] [Received: 06/05/2013] [Accepted: 11/25/2013] [Indexed: 01/29/2023]
Abstract
Obtaining diffraction-quality crystals remains one of the major bottlenecks in structural biology. The ability to predict the chances of crystallization from the amino-acid sequence of the protein can, at least partly, address this problem by allowing a crystallographer to select homologs that are more likely to succeed and/or to modify the sequence of the target to avoid features that are detrimental to successful crystallization. In 2007, the now widely used XtalPred algorithm [Slabinski et al. (2007), Protein Sci. 16, 2472-2482] was developed. XtalPred classifies proteins into five ‘crystallization classes’ based on a simple statistical analysis of the physicochemical features of a protein. Here, towards the same goal, advanced machine-learning methods are applied and, in addition, the predictive potential of additional protein features such as predicted surface ruggedness, hydrophobicity, side-chain entropy of surface residues and amino-acid composition of the predicted protein surface is tested. The new XtalPred-RF (random forest) achieves significant improvement of the prediction of crystallization success over the original XtalPred. To illustrate this, XtalPred-RF was tested by revisiting target selection from 271 Pfam families targeted by the Joint Center for Structural Genomics (JCSG) in PSI-2, and it was estimated that the number of targets entered into the protein-production and crystallization pipeline could have been reduced by 30% without lowering the number of families for which the first structures were solved. The prediction improvement depends on the subset of targets used as a testing set and reaches 100% (i.e. twofold) for the top class of predicted targets.
Affiliation(s)
- Samad Jahandideh
  - Bioinformatics and Systems Biology Program, Sanford-Burnham Medical Research Institute, 10901 North Torrey Pines Road, La Jolla, CA 92307, USA
  - Joint Center for Structural Genomics, http://www.jcsg.org/, USA
- Lukasz Jaroszewski
  - Bioinformatics and Systems Biology Program, Sanford-Burnham Medical Research Institute, 10901 North Torrey Pines Road, La Jolla, CA 92307, USA
  - Joint Center for Structural Genomics, http://www.jcsg.org/, USA
  - Center for Research in Biological Systems (CRBS), University of California, San Diego, La Jolla, California USA
- Adam Godzik
  - Bioinformatics and Systems Biology Program, Sanford-Burnham Medical Research Institute, 10901 North Torrey Pines Road, La Jolla, CA 92307, USA
  - Joint Center for Structural Genomics, http://www.jcsg.org/, USA
  - Center for Research in Biological Systems (CRBS), University of California, San Diego, La Jolla, California USA
|
29
|
Charoenkwan P, Shoombuatong W, Lee HC, Chaijaruwanich J, Huang HL, Ho SY. SCMCRYS: predicting protein crystallization using an ensemble scoring card method with estimating propensity scores of P-collocated amino acid pairs. PLoS One 2013; 8:e72368. [PMID: 24019868 PMCID: PMC3760885 DOI: 10.1371/journal.pone.0072368] [Citation(s) in RCA: 71] [Impact Index Per Article: 5.9] [Received: 04/27/2013] [Accepted: 07/15/2013] [Indexed: 11/19/2022] Open
Abstract
Existing methods for predicting protein crystallization obtain high accuracy using various types of complemented features and complex ensemble classifiers, such as support vector machine (SVM) and Random Forest classifiers. It is desirable to develop a simple and easily interpretable prediction method with informative sequence features to provide insights into protein crystallization. This study proposes an ensemble method, SCMCRYS, to predict protein crystallization, for which each classifier is built by using a scoring card method (SCM) with estimating propensity scores of p-collocated amino acid (AA) pairs (p = 0 for a dipeptide). The SCM classifier determines the crystallization of a sequence according to a weighted-sum score. The weights are the composition of the p-collocated AA pairs, and the propensity scores of these AA pairs are estimated using a statistic with optimization approach. SCMCRYS predicts the crystallization using a simple voting method from a number of SCM classifiers. The experimental results show that the single SCM classifier utilizing dipeptide composition with accuracy of 73.90% is comparable to the best previously-developed SVM-based classifier, SVM_POLY (74.6%), and our proposed SVM-based classifier utilizing the same dipeptide composition (77.55%). The SCMCRYS method with accuracy of 76.1% is comparable to the state-of-the-art ensemble methods PPCpred (76.8%) and RFCRYS (80.0%), which used the SVM and Random Forest classifiers, respectively. This study also investigates mutagenesis analysis based on SCM and the result reveals the hypothesis that the mutagenesis of surface residues Ala and Cys has large and small probabilities of enhancing protein crystallizability considering the estimated scores of crystallizability and solubility, melting point, molecular weight and conformational entropy of amino acids in a generalized condition. 
The propensity scores of amino acids and dipeptides for estimating the protein crystallizability can aid biologists in designing mutation of surface residues to enhance protein crystallizability. The source code of SCMCRYS is available at http://iclab.life.nctu.edu.tw/SCMCRYS/.
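The weighted-sum scoring described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the SCMCRYS source code: the function names, the decision threshold and the toy propensity values are all assumptions made for the example, and the real method estimates its propensity scores with a statistic-plus-optimization procedure that is not reproduced here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def collocated_pair_composition(sequence, p=0):
    """Return the normalized frequency of each p-collocated amino acid pair.

    Two residues form a p-collocated pair when exactly p residues lie
    between them, so p = 0 gives ordinary dipeptides.
    """
    gap = p + 1
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    for i in range(len(sequence) - gap):
        pair = sequence[i] + sequence[i + gap]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return {pair: 0.0 for pair in counts}
    return {pair: n / total for pair, n in counts.items()}


def scm_score(sequence, propensity, p=0):
    """Weighted sum: pair composition (the weights) times propensity scores."""
    composition = collocated_pair_composition(sequence, p)
    return sum(w * propensity.get(pair, 0.0) for pair, w in composition.items())


def scm_classify(sequence, propensity, threshold, p=0):
    """Label a sequence crystallizable when its score exceeds the threshold.

    The threshold is a hypothetical parameter introduced for this sketch.
    """
    return scm_score(sequence, propensity, p) > threshold


# Hypothetical propensity scores, invented for illustration only.
toy_propensity = {"AK": 800.0, "KA": 600.0, "AA": 200.0}
print(scm_score("AKAK", toy_propensity))
```

An ensemble in the spirit of the abstract would run several such classifiers (for example with different values of `p`) and take a simple majority vote over their labels.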
Affiliation(s)
- Phasit Charoenkwan
  - Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Watshara Shoombuatong
  - Department of Computer Science, Bioinformatics Research Laboratory, Chiang Mai University, Chiang Mai, Thailand
- Hua-Chin Lee
  - Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
- Jeerayut Chaijaruwanich
  - Department of Computer Science, Bioinformatics Research Laboratory, Chiang Mai University, Chiang Mai, Thailand
- Hui-Ling Huang
  - Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
  - Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan
  - * E-mail: (HLH); (SYH)
- Shinn-Ying Ho
  - Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
  - Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan
  - * E-mail: (HLH); (SYH)
|
30
|
Yan S, Wu G. Association of combined features of amino acid and protein with crystallization propensity of proteins from Cytophaga Hutchinsonii. Z Kristallogr Cryst Mater 2013. [DOI: 10.1524/zkri.2013.1570] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/24/2022]
Abstract
Various features of amino acids have so far been associated with the crystallization propensity of proteins. Most of these features represent a certain aspect of an individual amino acid, for example, its molecular weight. Meanwhile, a small portion of features, such as protein length, represent a certain aspect of a whole protein and are also associated with crystallization propensity. Clearly, the features of individual amino acids are distinct from the features of a whole protein. It would therefore be more rational to associate crystallization propensity with features that combine those of individual amino acids and those of a whole protein, because, for instance, the features of individual amino acids do not depend on the amino acids' positions in a protein. In this study, each of three combined features of individual amino acids and a whole protein is associated with the crystallization propensity of proteins from Cytophaga Hutchinsonii through logistic regression and a neural network, and each of 535 features of individual amino acids is also associated with the crystallization propensity of these proteins to serve as a benchmark. The results show that the combined features have a good relationship with the crystallization propensity of proteins from Cytophaga Hutchinsonii. This study indicates that the combined features can be used for predicting the crystallization propensity of proteins.
|
31
|
Jahandideh S, Mahdavi A. RFCRYS: sequence-based protein crystallization propensity prediction by means of random forest. J Theor Biol 2012; 306:115-9. [PMID: 22726810 DOI: 10.1016/j.jtbi.2012.04.028] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Received: 11/25/2011] [Revised: 01/27/2012] [Accepted: 04/24/2012] [Indexed: 01/01/2023]
Abstract
Production of high-quality diffracting crystals is a critical step in determining the 3D structure of a protein by X-ray crystallography. Only 2%-10% of crystallization projects result in high-resolution protein structures. Previously, several computational methods for prediction of protein crystallizability were developed. In this work, we introduce RFCRYS, a Random Forest based method to predict crystallizability of proteins. RFCRYS utilizes mono-, di-, and tri-peptide amino acid compositions, frequencies of amino acids in different physicochemical groups, isoelectric point, molecular weight, and length of protein sequences, all derived from the primary sequences, to predict crystallizability by using two different databases. RFCRYS was compared with previous methods, and the results obtained show that our proposed method using this set of features outperforms existing predictors with higher accuracy, MCC, and Specificity. In particular, our method is characterized by a high Specificity of 0.95, which means RFCRYS rarely mispredicts a protein chain to be crystallizable, which consequently would be useful for saving time and resources. In conclusion, RFCRYS provides accurate crystallizability prediction for a protein chain that can be applied to support crystallization projects in achieving a higher success rate towards obtaining diffraction-quality crystals.
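To make the reported metrics concrete, the sketch below computes accuracy, specificity and MCC from binary labels using their standard definitions. The labels are toy values invented for illustration (1 = crystallizable, 0 = non-crystallizable), not RFCRYS results, and the function names are assumptions of this example.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 or 0)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all chains that are labelled correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def specificity(tn, fp):
    """Fraction of non-crystallizable chains correctly predicted as such."""
    return tn / (tn + fp) if (tn + fp) else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels, invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(accuracy(tp, tn, fp, fn), specificity(tn, fp), mcc(tp, tn, fp, fn))
```

A high specificity corresponds to few false positives, which is the property the abstract highlights: few chains wrongly flagged as crystallizable.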
Affiliation(s)
- Samad Jahandideh
  - Department of Biostatistics, Section on Statistical Genetics, University of Alabama at Birmingham, Birmingham, AL, USA.
|
32
|
Yan S, Wu G. Correlating dynamic amino acid properties with success rate of crystallization of proteins from Bacteroides vulgatus. Cryst Res Technol 2012. [DOI: 10.1002/crat.201200007] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/09/2022]
|
33
|
Bray JE. Target selection for structural genomics based on combining fold recognition and crystallisation prediction methods: application to the human proteome. J Struct Funct Genomics 2012; 13:37-46. [PMID: 22354707 DOI: 10.1007/s10969-012-9130-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 12/01/2011] [Accepted: 02/07/2012] [Indexed: 11/29/2022]
Abstract
The objective of this study is to automatically identify regions of the human proteome that are suitable for 3D structure determination by X-ray crystallography and to annotate them according to their likelihood to produce diffraction quality crystals. The results provide a powerful tool for structural genomics laboratories who wish to select human proteins based on the statistical likelihood of crystallisation success. Combining fold recognition and crystallisation prediction algorithms enables the efficient calculation of the crystallisability of the entire human proteome. This novel study estimates that there are approximately 40,000 crystallisable regions in the human proteome. Currently, only 15% of these regions (approx. 6,000 sequences) have been solved to at least 95% sequence identity. The remaining unsolved regions have been categorised into 5 crystallisation classes and an integral membrane protein (IMP) class, based on established structure prediction, crystallisation prediction and transmembrane (TM) helix prediction algorithms. Approximately 750 unsolved regions (2% of the proteome) have been identified as having a PDB fold representative (template) and an 'optimal' likelihood of crystallisation. At the other end of the spectrum, more than 10,500 non-IMP regions with a PDB template are classified as 'very difficult' to crystallise (26%) and almost 2,500 regions (6%) were predicted to contain at least 3 TM helices. The 3D-SPECS (3D Structural Proteomics Explorer with Crystallisation Scores) website contains crystallisation predictions for the entire human proteome and can be found at http://www.bioinformaticsplus.org/3dspecs.
Affiliation(s)
- James E Bray
  - Structural Genomics Consortium, University of Oxford, Old Road Campus Research Building, Roosevelt Drive, Oxford, OX3 7DQ, UK.
|
34
|
The structural plasticity of the human copper chaperone for SOD1: insights from combined size-exclusion chromatographic and solution X-ray scattering studies. Biochem J 2011; 439:39-44. [DOI: 10.1042/bj20110948] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.6] [Indexed: 11/17/2022]
Abstract
The incorporation of copper into biological macromolecules such as SOD1 (Cu,Zn superoxide dismutase) is essential for the viability of most organisms. However, copper is toxic and therefore the intracellular free copper concentration is kept to an absolute minimum. Several proteins, termed metallochaperones, are charged with the responsibility of delivering copper from membrane transporters to its intracellular destination. The CCS (copper chaperone for SOD1) is the major pathway for SOD1 copper loading. We have determined the first solution structure of hCCS (human CCS) by SAXS (small-angle X-ray scattering) in conjunction with SEC (size-exclusion chromatography). The findings of the present study highlight the importance of this combined on-line chromatographic technology with SAXS, which has allowed us to unambiguously separate the hCCS dimer from other oligomeric and non-physiological aggregated states that would otherwise adversely affect measurements performed on bulk solutions. The present study exposes the dynamic molecular conformation of this multi-domain chaperone in solution. The metal-binding domains known to be responsible for the conveyance of copper to SOD1 can be found in positions that would expedite this movement. Domains I and III of a single hCCS monomer are able to interact and can also move into positions that would facilitate initial copper binding and ultimately transfer to SOD1. Conversely, the interpretation of our solution studies is not compatible with an interaction between these domains and their counterparts in an hCCS dimer. Overall, the results of the present study reveal the plasticity of this multi-domain chaperone in solution and are consistent with an indispensable flexibility necessary for executing its dual functions of metal binding and transfer.
|
35
|
Overton IM, Barton GJ. Computational approaches to selecting and optimising targets for structural biology. Methods 2011; 55:3-11. [PMID: 21906678 PMCID: PMC3202631 DOI: 10.1016/j.ymeth.2011.08.014] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 06/27/2011] [Revised: 08/18/2011] [Accepted: 08/22/2011] [Indexed: 11/29/2022] Open
Abstract
Selection of protein targets for study is central to structural biology and may be influenced by numerous factors. A key aim is to maximise returns for effort invested by identifying proteins with the balance of biophysical properties that are conducive to success at all stages (e.g. solubility, crystallisation) in the route towards a high resolution structural model. Selected targets can be optimised through construct design (e.g. to minimise protein disorder), switching to a homologous protein, and selection of experimental methodology (e.g. choice of expression system) to prime for efficient progress through the structural proteomics pipeline. Here we discuss computational techniques in target selection and optimisation, with more detailed focus on tools developed within the Scottish Structural Proteomics Facility (SSPF); namely XANNpred, ParCrys, OB-Score (target selection) and TarO (target optimisation). TarO runs a large number of algorithms, searching for homologues and annotating the pool of possible alternative targets. This pool of putative homologues is presented in a ranked, tabulated format and results are also visualised as an automatically generated and annotated multiple sequence alignment. The target selection algorithms each predict the propensity of a selected protein target to progress through the experimental stages leading to diffracting crystals. This single predictor approach has advantages for target selection, when compared with an approach using two or more predictors that each predict for success at a single experimental stage. The tools described here helped SSPF achieve a high (21%) success rate in progressing cloned targets to diffraction-quality crystals.
Affiliation(s)
- Ian M Overton
  - MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, Western General Hospital, Crewe Road, Edinburgh EH4 2XU, United Kingdom.
|
36
|
Mizianty MJ, Kurgan L. Sequence-based prediction of protein crystallization, purification and production propensity. Bioinformatics 2011; 27:i24-33. [PMID: 21685077 PMCID: PMC3117383 DOI: 10.1093/bioinformatics/btr229] [Citation(s) in RCA: 63] [Impact Index Per Article: 4.5] [Indexed: 11/18/2022] Open
Abstract
MOTIVATION X-ray crystallography-based protein structure determination, which accounts for the majority of solved structures, is characterized by relatively low success rates. One solution is to build tools which support selection of targets that are more likely to crystallize. Several in silico methods that predict propensity of diffraction-quality crystallization from protein chains were developed. We show that the quality of their predictions drops when applied to more recent crystallization trials, which calls for new solutions. We propose a novel approach that alleviates drawbacks of the existing methods by using a recent dataset and improved protocol to annotate progress along the crystallization process, by predicting the success of the entire process and steps which result in the failed attempts, and by utilizing a compact and comprehensive set of sequence-derived inputs to generate accurate predictions. RESULTS The proposed PPCpred (predictor of protein Production, Purification and Crystallization) predicts the propensity for production of diffraction-quality crystals, production of crystals, purification and production of the protein material. PPCpred utilizes a comprehensive set of inputs based on energy and hydrophobicity indices, composition of certain amino acid types, predicted disorder, secondary structure and solvent accessibility, and content of certain buried and exposed residues. Our method significantly outperforms alignment-based predictions and several modern crystallization propensity predictors. Receiver operating characteristic (ROC) curves show that PPCpred is particularly useful for users who desire high true positive (TP) rates, i.e. low rate of mispredictions for solvable chains. Our model reveals several intuitive factors that influence the success of individual steps and the entire crystallization process, including the content of Cys, buried His and Ser, hydrophobic/hydrophilic segments and the number of predicted disordered segments.
AVAILABILITY http://biomine.ece.ualberta.ca/PPCpred/. CONTACT lkurgan@ece.ualberta.ca.
Affiliation(s)
- Marcin J Mizianty
  - Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada
|
37
|
Trilisky E, Gillespie R, Osslund TD, Vunnum S. Crystallization and liquid-liquid phase separation of monoclonal antibodies and Fc-fusion proteins: Screening results. Biotechnol Prog 2011; 27:1054-67. [DOI: 10.1002/btpr.621] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.9] [Received: 11/12/2010] [Revised: 02/16/2011] [Indexed: 11/09/2022]
|
38
|
Rao HB, Zhu F, Yang GB, Li ZR, Chen YZ. Update of PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res 2011; 39:W385-90. [PMID: 21609959 PMCID: PMC3125735 DOI: 10.1093/nar/gkr284] [Citation(s) in RCA: 105] [Impact Index Per Article: 7.5] [Indexed: 12/04/2022] Open
Abstract
Sequence-derived structural and physicochemical features have been extensively used for analyzing and predicting structural, functional, expression and interaction profiles of proteins and peptides. PROFEAT has been developed as a web server for computing commonly used features of proteins and peptides from amino acid sequence. To facilitate more extensive studies of protein and peptides, numerous improvements and updates have been made to PROFEAT. We added new functions for computing descriptors of protein–protein and protein–small molecule interactions, segment descriptors for local properties of protein sequences, topological descriptors for peptide sequences and small molecule structures. We also added new feature groups for proteins and peptides (pseudo-amino acid composition, amphiphilic pseudo-amino acid composition, total amino acid properties and atomic-level topological descriptors) as well as for small molecules (atomic-level topological descriptors). Overall, PROFEAT computes 11 feature groups of descriptors for proteins and peptides, and a feature group of more than 400 descriptors for small molecules plus the derived features for protein–protein and protein–small molecule interactions. Our computational algorithms have been extensively tested and used in a number of published works for predicting proteins of specific structural or functional classes, protein–protein interactions, peptides of specific functions and quantitative structure activity relationships of small molecules. PROFEAT is accessible free of charge at http://bidd.cz3.nus.edu.sg/cgi-bin/prof/protein/profnew.cgi.
Affiliation(s)
- H B Rao
  - College of Chemistry, Sichuan University, Chengdu, 610064, PR China
|
39
|
Overton IM, van Niekerk CAJ, Barton GJ. XANNpred: neural nets that predict the propensity of a protein to yield diffraction-quality crystals. Proteins 2011; 79:1027-33. [PMID: 21246630 PMCID: PMC3084997 DOI: 10.1002/prot.22914] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.4] [Received: 07/28/2010] [Revised: 09/22/2010] [Accepted: 10/07/2010] [Indexed: 11/08/2022]
Abstract
Production of diffracting crystals is a critical step in determining the three-dimensional structure of a protein by X-ray crystallography. Computational techniques to rank proteins by their propensity to yield diffraction-quality crystals can improve efficiency in obtaining structural data by guiding both protein selection and construct design. XANNpred comprises a pair of artificial neural networks that each predict the propensity of a selected protein sequence to produce diffraction-quality crystals by current structural biology techniques. Blind tests show XANNpred has accuracy and Matthews correlation values ranging from 75% to 81% and 0.50 to 0.63 respectively; values of area under the receiver operator characteristic (ROC) curve range from 0.81 to 0.88. On blind test data XANNpred outperforms the other available algorithms XtalPred, PXS, OB-Score, and ParCrys. XANNpred also guides construct design by presenting graphs of predicted propensity for diffraction-quality crystals against residue sequence position. The XANNpred-SG algorithm is likely to be most useful to target selection in structural genomics consortia, while the XANNpred-PDB algorithm is more suited to the general structural biology community. XANNpred predictions that include sliding window graphs are freely available from http://www.compbio.dundee.ac.uk/xannpred.
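The idea of plotting a score against residue position with a sliding window can be illustrated with the short sketch below. It is not XANNpred: the scoring function here is a deliberately trivial placeholder (fraction of charged residues), standing in for the neural-network output, and the window length and function names are assumptions of this example.

```python
def sliding_window_scores(sequence, score_fn, window=5):
    """Score every full window and report it at the window's centre position.

    Positions are 1-based residue numbers. Returns an empty list when the
    window does not fit inside the sequence.
    """
    if window < 1 or window > len(sequence):
        return []
    half = window // 2
    return [
        (start + half + 1, score_fn(sequence[start:start + window]))
        for start in range(len(sequence) - window + 1)
    ]


def toy_score(segment):
    """Placeholder scorer: fraction of charged residues (D, E, K, R).

    A real predictor would replace this with a trained model's output.
    """
    return sum(residue in "DEKR" for residue in segment) / len(segment)


profile = sliding_window_scores("MKDEAAGLLV", toy_score, window=5)
for position, score in profile:
    print(position, round(score, 2))
```

Plotting `score` against `position` gives the kind of per-residue profile that can suggest where a construct boundary might be placed.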
Affiliation(s)
- Ian M Overton
  - School of Life Sciences Research, College of Life Sciences, University of Dundee, Dundee, UK
|
40
|
Oke M, Carter LG, Johnson KA, Liu H, McMahon SA, Yan X, Kerou M, Weikart ND, Kadi N, Sheikh MA, Schmelz S, Dorward M, Zawadzki M, Cozens C, Falconer H, Powers H, Overton IM, van Niekerk CAJ, Peng X, Patel P, Garrett RA, Prangishvili D, Botting CH, Coote PJ, Dryden DTF, Barton GJ, Schwarz-Linek U, Challis GL, Taylor GL, White MF, Naismith JH. The Scottish Structural Proteomics Facility: targets, methods and outputs. J Struct Funct Genomics 2010; 11:167-80. [PMID: 20419351 PMCID: PMC2883930 DOI: 10.1007/s10969-010-9090-y] [Citation(s) in RCA: 100] [Impact Index Per Article: 6.7] [Received: 03/09/2010] [Accepted: 04/06/2010] [Indexed: 12/19/2022]
Abstract
The Scottish Structural Proteomics Facility was funded to develop a laboratory scale approach to high throughput structure determination. The effort was successful in that over 40 structures were determined. These structures and the methods harnessed to obtain them are reported here. This report reflects on the value of automation but also on the continued requirement for a high degree of scientific and technical expertise. The efficiency of the process poses challenges to the current paradigm of structural analysis and publication. In the 5 year period we published ten peer-reviewed papers reporting structural data arising from the pipeline. Nevertheless, the number of structures solved exceeded our ability to analyse and publish each new finding. By reporting the experimental details and depositing the structures we hope to maximize the impact of the project by allowing others to follow up the relevant biology.
Affiliation(s)
- Muse Oke
  - Biomedical Sciences Research Complex, University of St Andrews, St Andrews, KY16 9ST UK
- Lester G. Carter
  - Biomedical Sciences Research Complex, University of St Andrews, St Andrews, KY16 9ST UK
  - Present Address: Stanford Synchrotron Radiation Light Source, 2575 Sand Hill Road, MS 69, Menlo Park, CA 94025 USA
- Kenneth A. Johnson
  - Biomedical Sciences Research Complex, University of St Andrews, St Andrews, KY16 9ST UK
  - Present Address: The Norwegian Structural Biology Centre, University of Tromsø, 9037 Tromsø, Norway
- Huanting Liu
  - Biomedical Sciences Research Complex, University of St Andrews, St Andrews, KY16 9ST UK
- Stephen A. McMahon
  - Biomedical Sciences Research Complex, University of St Andrews, St Andrews, KY16 9ST UK
- Xuan Yan
  - Biomedical Sciences Research Complex, University of St Andrews, St Andrews, KY16 9ST UK
- Melina Kerou
  - Biomedical Sciences Research Complex, University of St Andrews, St Andrews, KY16 9ST UK
- Nadine D. Weikart
  - Biomedical Sciences Research Complex, University of St Andrews, St Andrews, KY16 9ST UK
  - Present Address: Faculty of Chemistry, Technische Universität Dortmund, Otto-Hahn-Str. 6, 44227 Dortmund, Germany
- Nadia Kadi
  - Department of Chemistry, University of Warwick, Coventry, CV4 7AL UK
  - Present Address: Institute of Cancer Research, 15 Cotswold Road, Belmont, Sutton, Surrey, SM2 5NG UK
- Md. Arif Sheikh
  - Biomedical Sciences Research Complex, University of St Andrews, St Andrews, KY16 9ST UK
- Stefan Schmelz
  - Biomedical Sciences Research Complex, University of St Andrews, St Andrews, KY16 9ST UK
- Mark Dorward
  - Biomedical Sciences Research Complex, University of St Andrews, St Andrews, KY16 9ST UK
  - Present Address: Division of Signal Transduction Therapy, College of Life Sciences, University of Dundee, Dundee, DD1 5EH Scotland, UK
- Michal Zawadzki
  - Biomedical Sciences Research Complex, University of St Andrews, St Andrews, KY16 9ST UK
  - Present Address: Syngenta Ltd, Jealott’s Hill International Research Centre, Bracknell, Berkshire, RG42 6EY UK
- Christopher Cozens
  - Biomedical Sciences Research Complex, University of St Andrews, St Andrews, KY16 9ST UK
  - Present Address: Medical Research Council Laboratory of Molecular Biology, Hills Road, Cambridge, CB2 0QH UK
- Helen Falconer
  - Biomedical Sciences Research Complex, University of St Andrews, St Andrews, KY16 9ST UK
  - Present Address: Institute of Structural and Molecular Biology, Edinburgh University, Kings Buildings, Edinburgh, EH9 3JR UK
- Helen Powers
  - Biomedical Sciences Research Complex, University of St Andrews, St Andrews, KY16 9ST UK
- Ian M. Overton
  - Division of Biological Chemistry and Drug Discovery, College of Life Sciences, University of Dundee, Dundee, DD1 5EH Scotland, UK
  - Present Address: MRC Human Genetics Unit, Crewe Road South, Edinburgh, EH4 2XU UK
- C. A. Johannes van Niekerk
  - Division of Biological Chemistry and Drug Discovery, College of Life Sciences, University of Dundee, Dundee, DD1 5EH Scotland, UK
- Xu Peng
  - Department of Biology, Archaea Centre, University of Copenhagen, Ole Maaløes Vej 5, 2200, Copenhagen N, Denmark
- Prakash Patel
  - Department of Chemistry, University of Warwick, Coventry, CV4 7AL UK
- Roger A. Garrett
  - Department of Biology, Archaea Centre, University of Copenhagen, Ole Maaløes Vej 5, 2200, Copenhagen N, Denmark
- Catherine H. Botting
  - Biomedical Sciences Research Complex, University of St Andrews, St Andrews, KY16 9ST UK
- Peter J. Coote
  - Biomedical Sciences Research Complex, University of St Andrews, St Andrews, KY16 9ST UK
- David T. F. Dryden
  - EaStChem School of Chemistry, University of Edinburgh, The King’s Buildings, Edinburgh, EH9 3JJ UK
- Geoffrey J. Barton
  - Division of Biological Chemistry and Drug Discovery, College of Life Sciences, University of Dundee, Dundee, DD1 5EH Scotland, UK
- Ulrich Schwarz-Linek
  - Biomedical Sciences Research Complex, University of St Andrews, St Andrews, KY16 9ST UK
- Garry L. Taylor
  - Biomedical Sciences Research Complex, University of St Andrews, St Andrews, KY16 9ST UK
- Malcolm F. White
  - Biomedical Sciences Research Complex, University of St Andrews, St Andrews, KY16 9ST UK
- James H. Naismith
  - Biomedical Sciences Research Complex, University of St Andrews, St Andrews, KY16 9ST UK
|
41
|
Zucker FH, Stewart C, dela Rosa J, Kim J, Zhang L, Xiao L, Ross J, Napuli AJ, Mueller N, Castaneda LJ, Nakazawa Hewitt SR, Arakaki TL, Larson ET, Subramanian E, Verlinde CLMJ, Fan E, Buckner FS, Van Voorhis WC, Merritt EA, Hol WGJ. Prediction of protein crystallization outcome using a hybrid method. J Struct Biol 2010; 171:64-73. [PMID: 20347992 DOI: 10.1016/j.jsb.2010.03.016] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Received: 12/11/2009] [Revised: 03/18/2010] [Accepted: 03/23/2010] [Indexed: 10/19/2022]
Abstract
The great power of protein crystallography to reveal biological structure is often limited by the tremendous effort required to produce suitable crystals. A hybrid crystal growth predictive model is presented that combines both experimental and sequence-derived data from target proteins, including novel variables derived from physico-chemical characterization such as $R_{30}$, the ratio between a protein's DSF intensity at 30°C and at $T_{m}$. This hybrid model is shown to be more powerful than sequence-based prediction alone, and more likely to be useful for prioritizing and directing the efforts of structural genomics and individual structural biology laboratories.
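Written out as a formula, the ratio described in this abstract would read as follows. The symbol $I_{\mathrm{DSF}}(T)$ for the DSF intensity measured at temperature $T$ is notation assumed here for illustration; the abstract itself gives only the verbal definition.

$$
R_{30} = \frac{I_{\mathrm{DSF}}(30\,^{\circ}\mathrm{C})}{I_{\mathrm{DSF}}(T_{m})}
$$

Under this reading, a protein whose DSF intensity at 30°C is 200 units and whose intensity at $T_{m}$ is 1000 units would have $R_{30} = 0.2$ (hypothetical numbers, chosen only to show the arithmetic).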
Affiliation(s)
- Frank H Zucker
  - Medical Structural Genomics of Pathogenic Protozoa, School of Medicine, University of Washington, Seattle, WA 98195-7742, United States
|
42
|
Babnigg G, Joachimiak A. Predicting protein crystallization propensity from protein sequence. J Struct Funct Genomics 2010; 11:71-80. [PMID: 20177794 DOI: 10.1007/s10969-010-9080-0] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.5] [Received: 11/25/2009] [Accepted: 02/05/2010] [Indexed: 10/19/2022]
Abstract
The high-throughput structure determination pipelines developed by structural genomics programs offer a unique opportunity for data mining. One important question is how protein properties derived from a primary sequence correlate with the protein's propensity to yield X-ray quality crystals (crystallizability) and 3D X-ray structures. A set of protein properties was computed for over 1,300 proteins that expressed well but were insoluble, and for approximately 720 unique proteins that resulted in X-ray structures. The correlation of the protein's iso-electric point and grand average hydropathy (GRAVY) with crystallizability was analyzed for full length and domain constructs of protein targets. In a second step, several additional properties that can be calculated from the protein sequence were added and evaluated. Using statistical analyses we have identified a set of the attributes correlating with a protein's propensity to crystallize and implemented a Support Vector Machine (SVM) classifier based on these. We have created applications to analyze and provide optimal boundary information for query sequences and to visualize the data. These tools are available via the web site http://bioinformatics.anl.gov/cgi-bin/tools/pdpredictor.
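Of the two sequence properties named here, GRAVY is simple enough to show in a few lines: it is the mean Kyte-Doolittle hydropathy value over all residues. The sketch below is a generic implementation of that standard definition, not the authors' tool; the function name and the choice to skip non-standard residue codes are assumptions of this example.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue.

    Non-standard residue codes are skipped (an assumption of this sketch).
    """
    values = [KYTE_DOOLITTLE[r] for r in sequence.upper() if r in KYTE_DOOLITTLE]
    if not values:
        raise ValueError("sequence contains no standard amino acids")
    return sum(values) / len(values)


# Arbitrary toy sequence, used only to show the call.
print(round(gravy("MKTAYIAKQR"), 3))
```

Positive values indicate a more hydrophobic sequence overall and negative values a more hydrophilic one; a classifier such as the SVM mentioned above could take this value as one input feature alongside the isoelectric point.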
Affiliation(s)
- György Babnigg
- Midwest Center for Structural Genomics, Biosciences Division, Argonne National Laboratory, 9700 S Cass Ave., Argonne, IL 60439, USA.
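
For readers unfamiliar with GRAVY, one of the sequence-derived properties analyzed in the abstract above, the following Python sketch computes it as the mean Kyte-Doolittle hydropathy over all residues. This is the standard definition of the score, not the authors' code; the function name `gravy` and the error handling are choices made for this example.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathy: sum of residue hydropathies / sequence length."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"unsupported residues: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)
```

Positive values indicate a more hydrophobic sequence on average and negative values a more hydrophilic one. A value such as `gravy(construct)` could serve as one input column for a classifier of the kind the abstract describes, alongside the isoelectric point.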
|
43
|
Meta prediction of protein crystallization propensity. Biochem Biophys Res Commun 2009; 390:10-5. [DOI: 10.1016/j.bbrc.2009.09.036] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.6] [Received: 09/04/2009] [Accepted: 09/10/2009] [Indexed: 11/23/2022]
|
44
|
Kurgan L, Razib AA, Aghakhani S, Dick S, Mizianty M, Jahandideh S. CRYSTALP2: sequence-based protein crystallization propensity prediction. BMC Struct Biol 2009; 9:50. [PMID: 19646256 PMCID: PMC2731098 DOI: 10.1186/1472-6807-9-50] [Citation(s) in RCA: 58] [Impact Index Per Article: 3.6] [Received: 01/05/2009] [Accepted: 07/31/2009] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Current protocols yield crystals for <30% of known proteins, indicating that automatically identifying crystallizable proteins may improve high-throughput structural genomics efforts. We introduce CRYSTALP2, a kernel-based method that predicts the propensity of a given protein sequence to produce diffraction-quality crystals. This method utilizes the composition and collocation of amino acids, isoelectric point, and hydrophobicity, as estimated from the primary sequence, to generate predictions. CRYSTALP2 extends its predecessor, CRYSTALP, by enabling predictions for sequences of unrestricted size and provides improved prediction quality. RESULTS: A significant majority of the collocations used by CRYSTALP2 include residues with high conformational entropy, or low entropy and high potential to mediate crystal contacts; notably, such residues are utilized by surface entropy reduction methods. We show that the collocations provide complementary information to the hydrophobicity and isoelectric point. Tests on four datasets show that CRYSTALP2 outperforms several existing sequence-based predictors (CRYSTALP, OB-score, and SECRET). CRYSTALP2's accuracy, MCC, and AROC range between 69.3 and 77.5%, 0.39 and 0.55, and 0.72 and 0.79, respectively. Our predictions are similar in quality and are complementary to the predictions of the most recent ParCrys and XtalPred methods. Our results also suggest that, as work in protein crystallization continues (thereby enlarging the population of proteins with known crystallization propensities), the prediction quality of the CRYSTALP2 method should increase. The prediction model and the datasets used in this contribution can be downloaded from http://biomine.ece.ualberta.ca/CRYSTALP2/CRYSTALP2.html. CONCLUSION: CRYSTALP2 provides relatively accurate crystallization propensity predictions for a given protein chain that either outperform or complement the existing approaches. The proposed method can be used to support current efforts towards improving the success rate in obtaining diffraction-quality crystals.
Affiliation(s)
- Lukasz Kurgan
- Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada.
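
The abstract above names amino acid composition and collocation as inputs to CRYSTALP2 without listing the exact features. The Python sketch below shows one plausible reading, under the assumption that a collocation is a pair of residues separated by a fixed number of intervening positions; the function names and the `gap` parameter are hypothetical and not taken from the CRYSTALP2 distribution.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Fraction of the sequence made up by each of the 20 standard amino acids."""
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}


def collocation(seq, gap):
    """Relative frequency of residue pairs (a, b) with `gap` residues between them.

    gap=0 gives ordinary dipeptides; gap=1 pairs residues i and i+2, and so on.
    Only pairs that occur in the sequence appear in the result.
    """
    span = gap + 1
    n_pairs = len(seq) - span
    if n_pairs <= 0:
        return {}
    counts = Counter((seq[i], seq[i + span]) for i in range(n_pairs))
    return {pair: count / n_pairs for pair, count in counts.items()}
```

Because both functions normalize by sequence length, the resulting values are comparable across sequences of different sizes, which is in keeping with the abstract's point that predictions are not restricted by sequence length.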
|
45
|
Overton IM, van Niekerk CAJ, Carter LG, Dawson A, Martin DMA, Cameron S, McMahon SA, White MF, Hunter WN, Naismith JH, Barton GJ. TarO: a target optimisation system for structural biology. Nucleic Acids Res 2008; 36:W190-6. [PMID: 18385152 PMCID: PMC2447720 DOI: 10.1093/nar/gkn141] [Citation(s) in RCA: 68] [Impact Index Per Article: 4.0] [Indexed: 01/08/2023] Open
Abstract
TarO (http://www.compbio.dundee.ac.uk/taro) offers a single point of reference for key bioinformatics analyses relevant to selecting proteins or domains for study by structural biology techniques. The protein sequence is analysed by 17 algorithms and compared to 8 databases. TarO gathers putative homologues, including orthologues, and then obtains predictions of properties for these sequences, including crystallisation propensity, protein disorder and post-translational modifications. Analyses are run on a high-performance computing cluster, and the results are integrated, stored in a database and accessed through a web-based user interface. Output is in tabulated format and in the form of an annotated multiple sequence alignment (MSA) that may be edited interactively in the program Jalview. TarO also simplifies the gathering of additional annotations via the Distributed Annotation System, both from the MSA in Jalview and through links to Dasty2. Routes to other information gateways are included, for example to relevant pages from UniProt, COG and the Conserved Domains Database. Open access to TarO is available from a guest account, with private accounts for academic use available on request. Future development of TarO will include further analysis steps and integration with the Protein Information Management System (PIMS), a sister project in the BBSRC ‘Structural Proteomics of Rational Targets’ initiative.
Affiliation(s)
- Ian M Overton
- School of Life Sciences Research, University of Dundee, Dow Street, Dundee, UK
|