1
Jahn LR, Marquet C, Heinzinger M, Rost B. Protein embeddings predict binding residues in disordered regions. Sci Rep 2024; 14:13566. [PMID: 38866950 PMCID: PMC11169622 DOI: 10.1038/s41598-024-64211-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/06/2024] [Accepted: 06/06/2024] [Indexed: 06/14/2024] [Open Access]
Abstract
The identification of protein binding residues helps to understand their biological processes as protein function is often defined through ligand binding, such as to other proteins, small molecules, ions, or nucleotides. Methods predicting binding residues often err for intrinsically disordered proteins or regions (IDPs/IDPRs), often also referred to as molecular recognition features (MoRFs). Here, we presented a novel machine learning (ML) model trained to specifically predict binding regions in IDPRs. The proposed model, IDBindT5, leveraged embeddings from the protein language model (pLM) ProtT5 to reach a balanced accuracy of 57.2 ± 3.6% (95% confidence interval). Assessed on the same data set, this did not differ at the 95% CI from the state-of-the-art (SOTA) methods ANCHOR2 and DeepDISOBind that rely on expert-crafted features and evolutionary information from multiple sequence alignments (MSAs). Assessed on other data, methods such as SPOT-MoRF reached higher MCCs. IDBindT5's SOTA predictions are much faster than other methods, easily enabling full-proteome analyses. Our findings emphasize the potential of pLMs as a promising approach for exploring and predicting features of disordered proteins. The model and a comprehensive manual are publicly available at https://github.com/jahnl/binding_in_disorder .
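The two metrics named in this abstract, balanced accuracy and the Matthews correlation coefficient (MCC), can be illustrated with a minimal sketch. It assumes per-residue binary labels (1 = binding, 0 = non-binding); the function names and the toy labels are hypothetical and are not taken from the IDBindT5 code base.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary per-residue labels (1 = binding)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity and specificity; robust to class imbalance."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return 0.5 * (sensitivity + specificity)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0.0 if the denominator is 0."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy example (made-up labels, for illustration only).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 1]
counts = confusion_counts(y_true, y_pred)
print(balanced_accuracy(*counts), mcc(*counts))
```

Balanced accuracy averages the per-class recalls, which is why it is often preferred when binding residues are a small minority of all residues.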
Affiliation(s)
- Laura R Jahn
  - School of Computation, Information, and Technology (CIT), Department of Informatics, Bioinformatics and Computational Biology, TUM (Technical University of Munich), 85748, Garching/Munich, Germany
- Céline Marquet
  - School of Computation, Information, and Technology (CIT), Department of Informatics, Bioinformatics and Computational Biology, TUM (Technical University of Munich), 85748, Garching/Munich, Germany.
- Michael Heinzinger
  - School of Computation, Information, and Technology (CIT), Department of Informatics, Bioinformatics and Computational Biology, TUM (Technical University of Munich), 85748, Garching/Munich, Germany
- Burkhard Rost
  - School of Computation, Information, and Technology (CIT), Department of Informatics, Bioinformatics and Computational Biology, TUM (Technical University of Munich), 85748, Garching/Munich, Germany
  - Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748, Garching/Munich, Germany
  - TUM School of Life Sciences Weihenstephan (TUM-WZW), Alte Akademie 8, Freising, Germany

2
Roca-Martinez J, Lazar T, Gavalda-Garcia J, Bickel D, Pancsa R, Dixit B, Tzavella K, Ramasamy P, Sanchez-Fornaris M, Grau I, Vranken WF. Challenges in describing the conformation and dynamics of proteins with ambiguous behavior. Front Mol Biosci 2022; 9:959956. [PMID: 35992270 PMCID: PMC9382080 DOI: 10.3389/fmolb.2022.959956] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 06/02/2022] [Accepted: 06/27/2022] [Indexed: 11/13/2022] [Open Access]
Abstract
Traditionally, our understanding of how proteins operate and how evolution shapes them is based on two main data sources: the overall protein fold and the protein amino acid sequence. However, a significant part of the proteome shows highly dynamic and/or structurally ambiguous behavior, which cannot be correctly represented by the traditional fixed set of static coordinates. Representing such protein behaviors remains challenging and necessarily involves a complex interpretation of conformational states, including probabilistic descriptions. Relating protein dynamics and multiple conformations to their function as well as their physiological context (e.g., post-translational modifications and subcellular localization), therefore, remains elusive for much of the proteome, with studies to investigate the effect of protein dynamics relying heavily on computational models. We here investigate the possibility of delineating three classes of protein conformational behavior: order, disorder, and ambiguity. These definitions are explored based on three different datasets, using interpretable machine learning from a set of features, from AlphaFold2 to sequence-based predictions, to understand the overlap and differences between these datasets. This forms the basis for a discussion on the current limitations in describing the behavior of dynamic and ambiguous proteins.
Affiliation(s)
- Joel Roca-Martinez
  - Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
  - Interuniversity Institute of Bioinformatics in Brussels, VUB/ULB, Brussels, Belgium
- Tamas Lazar
  - Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
  - VIB-VUB Center for Structural Biology, Brussels, Belgium
- Jose Gavalda-Garcia
  - Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
  - Interuniversity Institute of Bioinformatics in Brussels, VUB/ULB, Brussels, Belgium
- David Bickel
  - Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
  - Interuniversity Institute of Bioinformatics in Brussels, VUB/ULB, Brussels, Belgium
- Rita Pancsa
  - Research Centre for Natural Sciences, Institute of Enzymology, Budapest, Hungary
- Bhawna Dixit
  - Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
  - Interuniversity Institute of Bioinformatics in Brussels, VUB/ULB, Brussels, Belgium
  - IBiTech-Biommeda, Universiteit Gent, Gent, Belgium
- Konstantina Tzavella
  - Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
  - Interuniversity Institute of Bioinformatics in Brussels, VUB/ULB, Brussels, Belgium
- Pathmanaban Ramasamy
  - Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
  - Interuniversity Institute of Bioinformatics in Brussels, VUB/ULB, Brussels, Belgium
  - VIB-UGent Center for Medical Biotechnology, Universiteit Gent, Gent, Belgium
- Maite Sanchez-Fornaris
  - Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
  - Interuniversity Institute of Bioinformatics in Brussels, VUB/ULB, Brussels, Belgium
  - Department of Computer Sciences, University of Camagüey, Camagüey, Cuba
- Isel Grau
  - Information Systems, Eindhoven University of Technology, Eindhoven, Netherlands
- Wim F. Vranken
  - Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
  - Interuniversity Institute of Bioinformatics in Brussels, VUB/ULB, Brussels, Belgium

3
Fang C, Moriwaki Y, Li C, Shimizu K. MoRFPred_en: Sequence-based prediction of MoRFs using an ensemble learning strategy. J Bioinform Comput Biol 2020; 17:1940015. [PMID: 32019410 DOI: 10.1142/s0219720019400158] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 11/18/2022]
Abstract
Molecular recognition features (MoRFs) usually act as "hub" sites in the interaction networks of intrinsically disordered proteins (IDPs). Because an increasing number of serious diseases have been found to be associated with disordered proteins, identifying MoRFs has become increasingly important. In this study, we propose an ensemble learning strategy, named MoRFPred_en, to predict MoRFs from protein sequences. This approach combines four submodels that utilize different sequence-derived features for the prediction, including a multichannel one-dimensional convolutional neural network (CNN_1D multichannel) based model, two deep two-dimensional convolutional neural network (DCNN_2D) based models, and a support vector machine (SVM) based model. When compared with other methods on the same datasets, the MoRFPred_en approach produced better results than existing state-of-the-art MoRF prediction methods, achieving an AUC of 0.762 on the VALIDATION419 dataset, 0.795 on the TEST45 dataset, and 0.776 on the TEST49 dataset. Availability: http://vivace.bi.a.u-tokyo.ac.jp:8008/fang/MoRFPred_en.php.
Affiliation(s)
- Chun Fang
  - Department of Computer Science and Engineering, Shandong University of Technology, Shandong 255049, P. R. China
- Yoshitaka Moriwaki
  - Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo 113-8657, Japan
- Caihong Li
  - Department of Computer Science and Engineering, Shandong University of Technology, Shandong 255049, P. R. China
- Kentaro Shimizu
  - Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo 113-8657, Japan

4
Current pivotal strategies leading a difficult target protein to a sample suitable for crystallographic analysis. Biochem Soc Trans 2020; 48:1661-1673. [PMID: 32677661 DOI: 10.1042/bst20200106] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 05/16/2020] [Revised: 06/26/2020] [Accepted: 06/30/2020] [Indexed: 12/15/2022]
Abstract
Crystallographic structural analysis is an essential method for the determination of protein structure. However, crystallization of a protein of interest is the most difficult process in the analysis. The process is often hampered during the sample preparation, including expression and purification. Even after a sample has been purified, not all candidate proteins crystallize. In this mini-review, the current methodologies used to overcome obstacles encountered during protein crystallization are sorted. Specifically, the strategy for an effective crystallization is compared with a pipeline where various expression hosts and constructs, purification and crystallization conditions, and crystallization chaperones as target-specific binder proteins are assessed by a precrystallization screening. These methodologies are also developed continuously to improve the process. The described methods are useful for sample preparation in crystallographic analysis and other structure determination techniques, such as cryo-electron microscopy.

5
Abstract
Entropy should directly reflect the extent of disorder in proteins. By clustering structurally related proteins and studying the multiple-sequence-alignment of the sequences of these clusters, we were able to link between sequence, structure, and disorder information. We introduced several parameters as measures of fluctuations at a given MSA site and used these as representative of the sequence and structure entropy at that site. In general, we found a tendency for negative correlations between disorder and structure, and significant positive correlations between disorder and the fluctuations in the system. We also found evidence for residue-type conservation for those residues proximate to potentially disordered sites. Mutation at the disorder site itself appear to be allowed. In addition, we found positive correlation for disorder and accessible surface area, validating that disordered residues occur in exposed regions of proteins. Finally, we also found that fluctuations in the dihedral angles at the original mutated residue and disorder are positively correlated while dihedral angle fluctuations in spatially proximal residues are negatively correlated with disorder. Our results seem to indicate permissible variability in the disordered site, but greater rigidity in the parts of the protein with which the disordered site interacts. This is another indication that disordered residues are involved in protein function.
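As a rough illustration of a per-site variability measure over a multiple sequence alignment, the sketch below computes the Shannon entropy of each alignment column. This is one common choice and is only an assumption here: the abstract does not specify which parameters the authors used, and the function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column, ignore_gaps=True):
    """Shannon entropy (bits) of the residue distribution in one MSA column."""
    residues = [c for c in column if not (ignore_gaps and c == "-")]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def msa_entropies(msa):
    """Per-column entropies for an alignment given as equal-length strings."""
    if len({len(seq) for seq in msa}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    # zip(*msa) transposes the alignment so that each tuple is one column.
    return [column_entropy(col) for col in zip(*msa)]


# Toy alignment (made up for illustration): 4 sequences, 4 columns.
toy_msa = ["MKTA", "MRTA", "MKSA", "MQ-A"]
entropies = msa_entropies(toy_msa)
print(entropies)
```

A fully conserved column has entropy 0, while a column with many different residue types scores higher; gap characters are skipped by default so that they do not count as a residue type.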

6
Fang C, Moriwaki Y, Tian A, Li C, Shimizu K. Identifying short disorder-to-order binding regions in disordered proteins with a deep convolutional neural network method. J Bioinform Comput Biol 2019; 17:1950004. [PMID: 30866736 DOI: 10.1142/s0219720019500045] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Indexed: 11/18/2022]
Abstract
Molecular recognition features (MoRFs) are key functional regions of intrinsically disordered proteins (IDPs), which play important roles in the molecular interaction network of cells and are implicated in many serious human diseases. Identifying MoRFs is essential for both functional studies of IDPs and drug design. This study adopts the cutting-edge machine learning method of artificial intelligence to develop a powerful model for improving MoRFs prediction. We proposed a method, named as en_DCNNMoRF (ensemble deep convolutional neural network-based MoRF predictor). It combines the outcomes of two independent deep convolutional neural network (DCNN) classifiers that take advantage of different features. The first, DCNNMoRF1, employs position-specific scoring matrix (PSSM) and 22 types of amino acid-related factors to describe protein sequences. The second, DCNNMoRF2, employs PSSM and 13 types of amino acid indexes to describe protein sequences. For both single classifiers, DCNN with a novel two-dimensional attention mechanism was adopted, and an average strategy was added to further process the output probabilities of each DCNN model. Finally, en_DCNNMoRF combined the two models by averaging their final scores. When compared with other well-known tools applied to the same datasets, the accuracy of the novel proposed method was comparable with that of state-of-the-art methods. The related web server can be accessed freely via http://vivace.bi.a.u-tokyo.ac.jp:8008/fang/en_MoRFs.php .
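The final combination step described in this abstract, averaging the per-residue scores of two classifiers, can be sketched as follows. The sliding-window smoothing is an assumption standing in for the unspecified "average strategy"; the function names, the window size, and the example scores are hypothetical and do not come from the en_DCNNMoRF implementation.

```python
def smooth(scores, window=3):
    """Sliding-window mean over per-residue scores (assumed post-processing)."""
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        # Windows are truncated at the sequence ends.
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed


def ensemble_average(scores_a, scores_b):
    """Average two per-residue score lists position by position."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must score the same number of residues")
    return [(a + b) / 2 for a, b in zip(scores_a, scores_b)]


# Made-up scores from two hypothetical models for a 5-residue sequence.
model1 = [0.1, 0.7, 0.9, 0.6, 0.2]
model2 = [0.3, 0.5, 0.7, 0.8, 0.0]
final_scores = ensemble_average(smooth(model1), smooth(model2))
print(final_scores)
```

Averaging keeps the combined score in the same range as the inputs, so a single decision threshold can still be applied to the ensemble output.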
Affiliation(s)
- Chun Fang
  - Department of Computer Science and Engineering, Shandong University of Technology, Shandong 255049, P. R. China
- Yoshitaka Moriwaki
  - Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo 113-8657, Japan
- Aikui Tian
  - Department of Computer Science and Engineering, Shandong University of Technology, Shandong 255049, P. R. China
- Caihong Li
  - Department of Computer Science and Engineering, Shandong University of Technology, Shandong 255049, P. R. China
- Kentaro Shimizu
  - Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo 113-8657, Japan

7
Necci M, Piovesan D, Dosztányi Z, Tompa P, Tosatto SCE. A comprehensive assessment of long intrinsic protein disorder from the DisProt database. Bioinformatics 2017; 34:445-452. [DOI: 10.1093/bioinformatics/btx590] [Citation(s) in RCA: 44] [Impact Index Per Article: 5.5] [Received: 05/11/2017] [Accepted: 09/15/2017] [Indexed: 12/30/2022] [Open Access]
Affiliation(s)
- Marco Necci
  - Department of Biomedical Sciences, University of Padua, Padova, Italy
- Damiano Piovesan
  - Department of Biomedical Sciences, University of Padua, Padova, Italy
- Zsuzsanna Dosztányi
  - Agricultural Sciences, University of Udine, Udine, Italy
  - MTA-ELTE Lendület Bioinformatics Research Group, Department of Biochemistry, Eötvös Loránd University, Budapest, Hungary
- Peter Tompa
  - Fondazione Edmund Mach, S. Michele all'Adige, Italy
  - Institute of Enzymology, Research Centre for Natural Sciences, Hungarian Academy of Sciences, Budapest, Hungary
  - Structural Biology Brussels, Vrije Universiteit Brussel (VUB), and Center for Structural Biology (CSB), Flanders Institute for Biotechnology (VIB), Brussels, Belgium
- Silvio C E Tosatto
  - Department of Biomedical Sciences, University of Padua, Padova, Italy
  - CNR Institute of Neuroscience, Padova, Italy

8
Pozo PN, Cook JG. Regulation and Function of Cdt1; A Key Factor in Cell Proliferation and Genome Stability. Genes (Basel) 2016; 8:genes8010002. [PMID: 28025526 PMCID: PMC5294997 DOI: 10.3390/genes8010002] [Citation(s) in RCA: 65] [Impact Index Per Article: 7.2] [Received: 11/06/2016] [Revised: 12/13/2016] [Accepted: 12/14/2016] [Indexed: 12/30/2022] [Open Access]
Abstract
Successful cell proliferation requires efficient and precise genome duplication followed by accurate chromosome segregation. The Cdc10-dependent transcript 1 protein (Cdt1) is required for the first step in DNA replication, and in human cells Cdt1 is also required during mitosis. Tight cell cycle controls over Cdt1 abundance and activity are critical to normal development and genome stability. We review here recent advances in elucidating Cdt1 molecular functions in both origin licensing and kinetochore–microtubule attachment, and we describe the current understanding of human Cdt1 regulation.
Affiliation(s)
- Pedro N Pozo
  - Curriculum in Genetics and Molecular Biology, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.
- Jeanette Gowen Cook
  - Curriculum in Genetics and Molecular Biology, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.
  - Department of Biochemistry and Biophysics, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.