1. Xiao YX, Lee SY, Aguilera-Uribe M, Samson R, Au A, Khanna Y, Liu Z, Cheng R, Aulakh K, Wei J, Farias AG, Reilly T, Birkadze S, Habsid A, Brown KR, Chan K, Mero P, Huang JQ, Billmann M, Rahman M, Myers C, Andrews BJ, Youn JY, Yip CM, Rotin D, Derry WB, Forman-Kay JD, Moses AM, Pritišanac I, Gingras AC, Moffat J. The TSC22D, WNK, and NRBP gene families exhibit functional buffering and evolved with Metazoa for cell volume regulation. Cell Rep 2024; 43:114417. [PMID: 38980795 DOI: 10.1016/j.celrep.2024.114417] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2024] [Revised: 05/08/2024] [Accepted: 06/13/2024] [Indexed: 07/11/2024] Open
Abstract
The ability to sense and respond to osmotic fluctuations is critical for the maintenance of cellular integrity. We used gene co-essentiality analysis to identify an unappreciated relationship between TSC22D2, WNK1, and NRBP1 in regulating cell volume homeostasis. All of these genes have paralogs and are functionally buffered for osmo-sensing and cell volume control. Within seconds of hyperosmotic stress, TSC22D, WNK, and NRBP family members physically associate into biomolecular condensates, a process that is dependent on intrinsically disordered regions (IDRs). A close examination of these protein families across metazoans revealed that TSC22D genes evolved alongside a domain in NRBPs that specifically binds to TSC22D proteins, which we have termed NbrT (NRBP binding region with TSC22D), and this co-evolution is accompanied by rapid IDR length expansion in WNK-family kinases. Our study reveals that TSC22D, WNK, and NRBP genes evolved in metazoans to co-regulate rapid cell volume changes in response to osmolarity.
Affiliation(s)
- Yu-Xi Xiao: Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada; Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- Seon Yong Lee: Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada
- Magali Aguilera-Uribe: Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada; Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- Reuben Samson: Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada; The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Sinai Health, Toronto, ON, Canada
- Aaron Au: Institute for Biomedical Engineering, University of Toronto, Toronto, ON, Canada; Department of Cell and Systems Biology, University of Toronto, Toronto, ON, Canada; Donnelly Centre, University of Toronto, Toronto, ON, Canada
- Yukti Khanna: Otto-Loewi Research Center, Division of Medicinal Chemistry, Medical University of Graz, Neue Stiftingtalstraße 6, 8010, Graz, Austria
- Zetao Liu: Program in Cell Biology, The Hospital for Sick Children, Toronto, ON, Canada; Department of Biochemistry, University of Toronto, Toronto, ON, Canada
- Ran Cheng: Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada; Program in Developmental and Stem Cell Biology, The Hospital for Sick Children, Toronto, ON, Canada
- Kamaldeep Aulakh: Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada
- Jiarun Wei: Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada
- Adrian Granda Farias: Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada; Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- Taylor Reilly: Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada; Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- Saba Birkadze: Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada; Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- Andrea Habsid: Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada
- Kevin R Brown: Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada
- Katherine Chan: Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada
- Patricia Mero: Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada
- Jie Qi Huang: Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada; Program in Molecular Medicine, The Hospital for Sick Children, Toronto, ON, Canada
- Maximilian Billmann: Institute of Human Genetics, School of Medicine and University Hospital Bonn, University of Bonn, 53127 Bonn, Germany
- Mahfuzur Rahman: Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, USA
- Chad Myers: Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, USA
- Brenda J Andrews: Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada; Donnelly Centre, University of Toronto, Toronto, ON, Canada
- Ji-Young Youn: Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada; Program in Molecular Medicine, The Hospital for Sick Children, Toronto, ON, Canada
- Christopher M Yip: Institute for Biomedical Engineering, University of Toronto, Toronto, ON, Canada; Donnelly Centre, University of Toronto, Toronto, ON, Canada
- Daniela Rotin: Program in Cell Biology, The Hospital for Sick Children, Toronto, ON, Canada; Department of Biochemistry, University of Toronto, Toronto, ON, Canada
- W Brent Derry: Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada; Program in Developmental and Stem Cell Biology, The Hospital for Sick Children, Toronto, ON, Canada
- Julie D Forman-Kay: Department of Biochemistry, University of Toronto, Toronto, ON, Canada; Program in Molecular Medicine, The Hospital for Sick Children, Toronto, ON, Canada
- Alan M Moses: Department of Cell and Systems Biology, University of Toronto, Toronto, ON, Canada
- Iva Pritišanac: Otto-Loewi Research Center, Division of Medicinal Chemistry, Medical University of Graz, Neue Stiftingtalstraße 6, 8010, Graz, Austria
- Anne-Claude Gingras: Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada; The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Sinai Health, Toronto, ON, Canada
- Jason Moffat: Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada; Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada; Institute for Biomedical Engineering, University of Toronto, Toronto, ON, Canada
2. Ding M, Chen K, Yang Y, Zhao H. Prioritizing genomic variants pathogenicity via DNA, RNA, and protein-level features based on extreme gradient boosting. Hum Genet 2024:10.1007/s00439-024-02667-0. [PMID: 38575818 DOI: 10.1007/s00439-024-02667-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2023] [Accepted: 03/05/2024] [Indexed: 04/06/2024]
Abstract
Genetic diseases are mostly associated with genetic variants, including missense, synonymous, nonsense, and copy number variants. Previous studies indicate that these different kinds of variants affect phenotypes in various ways. It remains essential but challenging to understand the functional consequences of these genetic variants, especially the noncoding ones, due to the lack of corresponding annotations. Many computational methods have been proposed to identify risk variants, but most of them use only curated DNA-level and protein-level annotations to predict the pathogenicity of the variants, and others are restricted to missense variants exclusively. In this study, we have curated DNA-, RNA-, and protein-level features to discriminate disease-causing variants in both coding and noncoding regions. The features of protein sequences and protein structures are shown to be essential for analyzing missense variants in coding regions, while the features related to RNA splicing and RBP binding are significant for variants in noncoding regions and synonymous variants in coding regions. Through the integration of these features, we have formulated the Multi-level feature Genomic Variants Predictor (ML-GVP) using the gradient boosting tree. The method has been trained on more than 400,000 variants in the Sherloc training set from the 6th Critical Assessment of Genome Interpretation, with superior performance. The method is one of the two best-performing predictors on the blind test in the Sherloc assessment, and its performance is further confirmed on another independent test dataset of de novo variants.
Affiliation(s)
- Maolin Ding: School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, 510000, China
- Ken Chen: School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, 510000, China
- Yuedong Yang: School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, 510000, China; Key Laboratory of Machine Intelligence and Advanced Computing (Sun Yat-Sen University), Ministry of Education, Guangzhou, China
- Huiying Zhao: Sun Yat-Sen Memorial Hospital, Sun Yat-Sen University, Guangzhou, 510000, China
3. Antonini V, Mileo A, Roantree M. Engineering Features from Raw Sensor Data to Analyse Player Movements during Competition. Sensors (Basel, Switzerland) 2024; 24:1308. [PMID: 38400466 PMCID: PMC10893073 DOI: 10.3390/s24041308] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2024] [Revised: 02/09/2024] [Accepted: 02/16/2024] [Indexed: 02/25/2024]
Abstract
Research in field sports often involves analysis of running performance profiles of players during competitive games with individual, per-position, and time-related descriptive statistics. Data are acquired through wearable technologies, which generally capture simple data points, which in the case of many team-based sports are times, latitudes, and longitudes. While the data capture is simple and in relatively high volumes, the raw data are unsuited to any form of analysis or machine learning functions. The main goal of this research is to develop a multistep feature engineering framework that delivers the transformation of sequential data into feature sets more suited to machine learning applications.
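As an illustration of turning raw (time, latitude, longitude) samples into a simple feature, the sketch below derives per-interval distance and speed using the haversine formula. It is a minimal example under stated assumptions, not the framework described in the abstract: the function names `haversine_m` and `speed_features`, the sample layout, and the choice of haversine distance are all hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres (assumed constant)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def speed_features(samples):
    """Derive distance and speed for each pair of consecutive samples.

    samples: list of (t_seconds, lat_deg, lon_deg) tuples sorted by time
    (a hypothetical layout for the raw sensor stream).
    """
    feats = []
    for (t0, la0, lo0), (t1, la1, lo1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            # Skip duplicate or out-of-order timestamps.
            continue
        d = haversine_m(la0, lo0, la1, lo1)
        feats.append({"t": t1, "distance_m": d, "speed_mps": d / dt})
    return feats
```

Further features such as acceleration or time spent in speed zones could be derived from the same per-interval records.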
Affiliation(s)
- Valerio Antonini: School of Computing, Dublin City University, Dublin 9, D09 V209 Dublin, Ireland
- Alessandra Mileo: School of Computing, Dublin City University, Dublin 9, D09 V209 Dublin, Ireland; Insight Centre for Data Analytics, School of Computing, Dublin City University, Dublin 9, D09 V209 Dublin, Ireland
- Mark Roantree: School of Computing, Dublin City University, Dublin 9, D09 V209 Dublin, Ireland; Insight Centre for Data Analytics, School of Computing, Dublin City University, Dublin 9, D09 V209 Dublin, Ireland
4. Tesei G, Trolle AI, Jonsson N, Betz J, Knudsen FE, Pesce F, Johansson KE, Lindorff-Larsen K. Conformational ensembles of the human intrinsically disordered proteome. Nature 2024; 626:897-904. [PMID: 38297118 DOI: 10.1038/s41586-023-07004-5] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 05/12/2023] [Accepted: 12/19/2023] [Indexed: 02/02/2024]
Abstract
Intrinsically disordered proteins and regions (collectively, IDRs) are pervasive across proteomes in all kingdoms of life, help to shape biological functions and are involved in numerous diseases. IDRs populate a diverse set of transiently formed structures and defy conventional sequence-structure-function relationships. Developments in protein science have made it possible to predict the three-dimensional structures of folded proteins at the proteome scale. By contrast, there is a lack of knowledge about the conformational properties of IDRs, partly because the sequences of disordered proteins are poorly conserved and also because only a few of these proteins have been characterized experimentally. The inability to predict structural properties of IDRs across the proteome has limited our understanding of the functional roles of IDRs and how evolution shapes them. As a supplement to previous structural studies of individual IDRs, we developed an efficient molecular model to generate conformational ensembles of IDRs and thereby to predict their conformational properties from sequences. Here we use this model to simulate nearly all of the IDRs in the human proteome. Examining conformational ensembles of 28,058 IDRs, we show how chain compaction is correlated with cellular function and localization. We provide insights into how sequence features relate to chain compaction and, using a machine-learning model trained on our simulation data, show the conservation of conformational properties across orthologues. Our results recapitulate observations from previous studies of individual protein systems and exemplify how to link, at the proteome scale, conformational ensembles with cellular function and localization, amino acid sequence, evolutionary conservation and disease variants. Our freely available database of conformational properties will encourage further experimental investigation and enable the generation of hypotheses about the biological roles and evolution of IDRs.
Affiliation(s)
- Giulio Tesei: Structural Biology and NMR Laboratory, Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Anna Ida Trolle: Structural Biology and NMR Laboratory, Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Nicolas Jonsson: Structural Biology and NMR Laboratory, Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Johannes Betz: Structural Biology and NMR Laboratory, Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Frederik E Knudsen: Structural Biology and NMR Laboratory, Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Francesco Pesce: Structural Biology and NMR Laboratory, Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Kristoffer E Johansson: Structural Biology and NMR Laboratory, Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Kresten Lindorff-Larsen: Structural Biology and NMR Laboratory, Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
5. Pang Y, Liu B. DisoFLAG: accurate prediction of protein intrinsic disorder and its functions using graph-based interaction protein language model. BMC Biol 2024; 22:3. [PMID: 38166858 PMCID: PMC10762911 DOI: 10.1186/s12915-023-01803-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/25/2023] [Accepted: 12/15/2023] [Indexed: 01/05/2024] Open
Abstract
Intrinsically disordered proteins and regions (IDPs/IDRs) are functionally important proteins and regions that exist as highly dynamic conformations under natural physiological conditions. IDPs/IDRs exhibit a broad range of molecular functions, and their functions involve binding interactions with partners and remaining native structural flexibility. The rapid increase in the number of proteins in sequence databases and the diversity of disordered functions challenge existing computational methods for predicting protein intrinsic disorder and disordered functions. A disordered region interacts with different partners to perform multiple functions, and these disordered functions exhibit different dependencies and correlations. In this study, we introduce DisoFLAG, a computational method that leverages a graph-based interaction protein language model (GiPLM) for jointly predicting disorder and its multiple potential functions. GiPLM integrates protein semantic information based on pre-trained protein language models into graph-based interaction units to enhance the correlation of the semantic representation of multiple disordered functions. The DisoFLAG predictor takes amino acid sequences as the only inputs and provides predictions of intrinsic disorder and six disordered functions for proteins, including protein-binding, DNA-binding, RNA-binding, ion-binding, lipid-binding, and flexible linker. We evaluated the predictive performance of DisoFLAG following the Critical Assessment of protein Intrinsic Disorder (CAID) experiments, and the results demonstrated that DisoFLAG offers accurate and comprehensive predictions of disordered functions, extending the current coverage of computationally predicted disordered function categories. The standalone package and web server of DisoFLAG have been established to provide accurate prediction tools for intrinsic disorders and their associated functions.
Affiliation(s)
- Yihe Pang: School of Computer Science and Technology, Beijing Institute of Technology, No. 5, South Zhongguancun Street, Beijing, Haidian District, 100081, China
- Bin Liu: School of Computer Science and Technology, Beijing Institute of Technology, No. 5, South Zhongguancun Street, Beijing, Haidian District, 100081, China; Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, No. 5, South Zhongguancun Street, Beijing, Haidian District, 100081, China
6. Kurgan L, Hu G, Wang K, Ghadermarzi S, Zhao B, Malhis N, Erdős G, Gsponer J, Uversky VN, Dosztányi Z. Tutorial: a guide for the selection of fast and accurate computational tools for the prediction of intrinsic disorder in proteins. Nat Protoc 2023; 18:3157-3172. [PMID: 37740110 DOI: 10.1038/s41596-023-00876-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 12/13/2022] [Accepted: 06/21/2023] [Indexed: 09/24/2023]
Abstract
Intrinsic disorder is instrumental for a wide range of protein functions, and its analysis, using computational predictions from primary structures, complements secondary and tertiary structure-based approaches. In this Tutorial, we provide an overview and comparison of 23 publicly available computational tools with complementary parameters useful for intrinsic disorder prediction, partly relying on results from the Critical Assessment of protein Intrinsic Disorder prediction experiment. We consider factors such as accuracy, runtime, availability and the need for functional insights. The selected tools are available as web servers and downloadable programs, offer state-of-the-art predictions and can be used in a high-throughput manner. We provide examples and instructions for the selected tools to illustrate practical aspects related to the submission, collection and interpretation of predictions, as well as the timing and their limitations. We highlight two predictors for intrinsically disordered proteins, flDPnn as accurate and fast and IUPred as very fast and moderately accurate, while suggesting ANCHOR2 and MoRFchibi as two of the best-performing predictors for intrinsically disordered region binding. We link these tools to additional resources, including databases of predictions and web servers that integrate multiple predictive methods. Altogether, this Tutorial provides a hands-on guide to comparatively evaluating multiple predictors, submitting and collecting their own predictions, and reading and interpreting results. It is suitable for experimentalists and computational biologists interested in accurately and conveniently identifying intrinsic disorder, facilitating the functional characterization of the rapidly growing collections of protein sequences.
Affiliation(s)
- Lukasz Kurgan: Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
- Gang Hu: School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin, China
- Kui Wang: School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin, China
- Sina Ghadermarzi: Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
- Bi Zhao: Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
- Nawar Malhis: Michael Smith Laboratories, University of British Columbia, Vancouver, British Columbia, Canada
- Gábor Erdős: MTA-ELTE Momentum Bioinformatics Research Group, Department of Biochemistry, Eötvös Loránd University, Budapest, Hungary
- Jörg Gsponer: Michael Smith Laboratories, University of British Columbia, Vancouver, British Columbia, Canada
- Vladimir N Uversky: Department of Molecular Medicine, Morsani College of Medicine, University of South Florida, Tampa, FL, USA; Byrd Alzheimer's Center and Research Institute, Morsani College of Medicine, University of South Florida, Tampa, FL, USA
- Zsuzsanna Dosztányi: MTA-ELTE Momentum Bioinformatics Research Group, Department of Biochemistry, Eötvös Loránd University, Budapest, Hungary
7. Pang Y, Liu B. IDP-LM: Prediction of protein intrinsic disorder and disorder functions based on language models. PLoS Comput Biol 2023; 19:e1011657. [PMID: 37992088 DOI: 10.1371/journal.pcbi.1011657] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2022] [Revised: 12/06/2023] [Accepted: 11/03/2023] [Indexed: 11/24/2023] Open
Abstract
Intrinsically disordered proteins (IDPs) and regions (IDRs) are a class of functionally important proteins and regions that lack stable three-dimensional structures under native physiological conditions. They participate in critical biological processes and thus are associated with the pathogenesis of many severe human diseases. Identifying IDPs/IDRs and their functions will be helpful for a comprehensive understanding of protein structures and functions, and will inform studies of rational drug design. Over the past decades, the exponential growth in the number of proteins with sequence information has widened the gap between uncharacterized and annotated disordered sequences. Protein language models have recently demonstrated their powerful abilities to capture complex structural and functional information from the enormous quantity of unlabelled protein sequences, providing opportunities to apply protein language models to uncover intrinsic disorder and its biological properties from amino acid sequences. In this study, we proposed a computational predictor called IDP-LM for predicting intrinsic disorder and disorder functions by leveraging pre-trained protein language models. IDP-LM takes the embeddings extracted from three pre-trained protein language models as the exclusive inputs, including ProtBERT, ProtT5 and a disorder-specific language model (IDP-BERT). The ablation analysis showed that IDP-BERT provided fine-grained feature representations of disorder, and that the combination of the three language models is the key to the performance improvement of IDP-LM. The evaluation results on independent test datasets demonstrated that IDP-LM provided high-quality prediction results for intrinsic disorder and four common disordered functions.
Affiliation(s)
- Yihe Pang: School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
- Bin Liu: School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China; Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
8. Alderson TR, Pritišanac I, Kolarić Đ, Moses AM, Forman-Kay JD. Systematic identification of conditionally folded intrinsically disordered regions by AlphaFold2. Proc Natl Acad Sci U S A 2023; 120:e2304302120. [PMID: 37878721 PMCID: PMC10622901 DOI: 10.1073/pnas.2304302120] [Citation(s) in RCA: 12] [Impact Index Per Article: 12.0] [Received: 03/22/2023] [Accepted: 08/30/2023] [Indexed: 10/27/2023] Open
Abstract
The AlphaFold Protein Structure Database contains predicted structures for millions of proteins. For the majority of human proteins that contain intrinsically disordered regions (IDRs), which do not adopt a stable structure, it is generally assumed that these regions have low AlphaFold2 confidence scores that reflect low-confidence structural predictions. Here, we show that AlphaFold2 assigns confident structures to nearly 15% of human IDRs. By comparison to experimental NMR data for a subset of IDRs that are known to conditionally fold (i.e., upon binding or under other specific conditions), we find that AlphaFold2 often predicts the structure of the conditionally folded state. Based on databases of IDRs that are known to conditionally fold, we estimate that AlphaFold2 can identify conditionally folding IDRs at a precision as high as 88% at a 10% false positive rate, which is remarkable considering that conditionally folded IDR structures were minimally represented in its training data. We find that human disease mutations are nearly fivefold enriched in conditionally folded IDRs over IDRs in general and that up to 80% of IDRs in prokaryotes are predicted to conditionally fold, compared to less than 20% of eukaryotic IDRs. These results indicate that a large majority of IDRs in the proteomes of human and other eukaryotes function in the absence of conditional folding, but the regions that do acquire folds are more sensitive to mutations. We emphasize that the AlphaFold2 predictions do not reveal functionally relevant structural plasticity within IDRs and cannot offer realistic ensemble representations of conditionally folded IDRs.
Affiliation(s)
- T. Reid Alderson: Department of Biochemistry, University of Toronto, Toronto, ON M5S 1A8, Canada; Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A8, Canada
- Iva Pritišanac: Department of Cell and Systems Biology, University of Toronto, Toronto, ON M5S 35G, Canada; Molecular Medicine Program, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada; Department of Molecular Biology and Biochemistry, Gottfried Schatz Research Center for Cell Signaling, Metabolism and Aging, Medical University of Graz, Graz 8010, Austria
- Đesika Kolarić: Department of Molecular Biology and Biochemistry, Gottfried Schatz Research Center for Cell Signaling, Metabolism and Aging, Medical University of Graz, Graz 8010, Austria
- Alan M. Moses: Department of Cell and Systems Biology, University of Toronto, Toronto, ON M5S 35G, Canada
- Julie D. Forman-Kay: Department of Biochemistry, University of Toronto, Toronto, ON M5S 1A8, Canada; Molecular Medicine Program, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada
9. Tang YJ, Yan K, Zhang X, Tian Y, Liu B. Protein intrinsically disordered region prediction by combining neural architecture search and multi-objective genetic algorithm. BMC Biol 2023; 21:188. [PMID: 37674132 PMCID: PMC10483879 DOI: 10.1186/s12915-023-01672-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/19/2023] [Accepted: 07/31/2023] [Indexed: 09/08/2023] Open
Abstract
BACKGROUND: Intrinsically disordered regions (IDRs) are widely distributed in proteins and related to many important biological functions. Accurately identifying IDRs is of great significance for protein structure and function analysis. Because long disordered regions (LDRs) and short disordered regions (SDRs) have different characteristics, existing predictors fail to achieve stable performance on datasets with different ratios of LDRs to SDRs. There are two main reasons. First, existing predictors rely on network structures designed from their developers' experience, such as convolutional neural networks (CNNs) to extract features of neighboring residues and long short-term memory (LSTM) networks to extract long-distance dependencies between residues, but these networks cannot capture the hidden length-dependent features of residues. Second, many deep learning algorithms have been proposed, but the complementarity of the existing predictors has not been fully explored and used. RESULTS: In this study, the neural architecture search (NAS) algorithm was employed to automatically construct the network structures so as to capture the hidden features in protein sequences. To stably predict both LDRs and SDRs, the model constructed by NAS was combined with length-dependent models that capture the unique features of SDRs or LDRs and with general models that capture the features common to LDRs and SDRs. A new predictor called IDP-Fusion was proposed. CONCLUSIONS: Experimental results showed that IDP-Fusion can achieve more stable performance than other existing predictors on independent test sets with different ratios of SDRs to LDRs.
Affiliation(s)
- Yi-Jun Tang: School of Computer Science and Technology, Beijing Institute of Technology, Haidian District, No. 5, South Zhongguancun Street, Beijing, 100081, China
- Ke Yan: School of Computer Science and Technology, Beijing Institute of Technology, Haidian District, No. 5, South Zhongguancun Street, Beijing, 100081, China
- Xingyi Zhang: School of Artificial Intelligence, Anhui University, Hefei, 230601, China
- Ye Tian: Institutes of Physical Science and Information Technology, Anhui University, Hefei, 230601, China
- Bin Liu: School of Computer Science and Technology, Beijing Institute of Technology, Haidian District, No. 5, South Zhongguancun Street, Beijing, 100081, China; Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, 100081, China
10. Zhao B, Ghadermarzi S, Kurgan L. Comparative evaluation of AlphaFold2 and disorder predictors for prediction of intrinsic disorder, disorder content and fully disordered proteins. Comput Struct Biotechnol J 2023; 21:3248-3258. [PMID: 38213902 PMCID: PMC10782001 DOI: 10.1016/j.csbj.2023.06.001] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 02/28/2023] [Revised: 05/31/2023] [Accepted: 06/01/2023] [Indexed: 01/13/2024] Open
Abstract
We expand studies of AlphaFold2 (AF2) in the context of intrinsic disorder prediction by comparing it against a broad selection of 20 accurate, popular and recently released disorder predictors. We use a 25% larger benchmark dataset with 646 proteins and cover protein-level predictions of disorder content and fully disordered proteins. AF2-based disorder predictions secure a relatively high Area Under the receiver operating characteristic Curve (AUC) of 0.77 and are statistically outperformed by several modern disorder predictors that secure AUCs around 0.8 with a median runtime of about 20 s compared to 1200 s for AF2. Moreover, AF2 provides modestly accurate predictions of fully disordered proteins (F1 = 0.59 vs. 0.91 for the best disorder predictor) and disorder content (mean absolute error of 0.21 vs. 0.15). AF2 also generates statistically more accurate disorder predictions for about 20% of proteins, which have relatively short sequences and a few disordered regions that tend to be located at the sequence termini, and which lack disordered protein-binding regions. Interestingly, AF2 and the most accurate disorder predictors rely on deep neural networks, suggesting that these models are useful for protein structure and disorder predictions.
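The protein-level metrics named in this abstract (disorder content, mean absolute error, and F1 for fully disordered proteins) can be illustrated with a short sketch. This is an assumption-laden example rather than the authors' evaluation code: the function names are hypothetical, per-residue labels are taken to be binary (1 = disordered), and the `threshold` used to call a protein "fully disordered" is an illustrative parameter, not a value from the paper.

```python
def disorder_content(labels):
    """Fraction of residues labelled disordered (1) in one protein."""
    return sum(labels) / len(labels)


def mean_absolute_error(true_vals, pred_vals):
    """Mean absolute difference between paired true and predicted values."""
    return sum(abs(t - p) for t, p in zip(true_vals, pred_vals)) / len(true_vals)


def f1_score(true_flags, pred_flags):
    """F1 for binary flags; returns 0.0 when there are no true positives."""
    pairs = list(zip(true_flags, pred_flags))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def protein_level_metrics(true_labels, pred_labels, threshold=1.0):
    """Compute content MAE and fully-disordered F1 over a set of proteins.

    threshold is a hypothetical cut-off on disorder content above which a
    protein is treated as fully disordered.
    """
    true_content = [disorder_content(p) for p in true_labels]
    pred_content = [disorder_content(p) for p in pred_labels]
    mae = mean_absolute_error(true_content, pred_content)
    f1 = f1_score([c >= threshold for c in true_content],
                  [c >= threshold for c in pred_content])
    return mae, f1
```

Each protein is passed as a list of 0/1 residue labels, and the two outer lists must be in the same protein order.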
Affiliation(s)
- Bi Zhao
- Genomics program, College of Public Health, University of South Florida, Tampa, FL, United States
- Sina Ghadermarzi
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, United States
- Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, United States
|
11
|
Redl I, Fisicaro C, Dutton O, Hoffmann F, Henderson L, Owens BJ, Heberling M, Paci E, Tamiola K. ADOPT: intrinsic protein disorder prediction through deep bidirectional transformers. NAR Genom Bioinform 2023; 5:lqad041. [PMID: 37138579 PMCID: PMC10150328 DOI: 10.1093/nargab/lqad041] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 08/10/2022] [Revised: 02/07/2023] [Accepted: 04/17/2023] [Indexed: 05/05/2023] Open
Abstract
Intrinsically disordered proteins (IDPs) are important for a broad range of biological functions and are involved in many diseases. An understanding of intrinsic disorder is key to developing compounds that target IDPs. Experimental characterization of IDPs is hindered by the very fact that they are highly dynamic. Computational methods that predict disorder from the amino acid sequence have been proposed. Here, we present ADOPT (Attention DisOrder PredicTor), a new predictor of protein disorder. ADOPT is composed of a self-supervised encoder and a supervised disorder predictor. The former is based on a deep bidirectional transformer, which extracts dense residue-level representations from Facebook's Evolutionary Scale Modeling library. The latter uses a database of nuclear magnetic resonance chemical shifts, constructed to ensure balanced amounts of disordered and ordered residues, as a training and a test dataset for protein disorder. ADOPT predicts whether a protein or a specific region is disordered with better performance than the best existing predictors and faster than most other proposed methods (a few seconds per sequence). We identify the features that are relevant for the prediction performance and show that good performance can already be gained with <100 features. ADOPT is available as a stand-alone package at https://github.com/PeptoneLtd/ADOPT and as a web server at https://adopt.peptone.io/.
Affiliation(s)
- Istvan Redl
- Peptone Ltd, 370 Grays Inn Road, London WC1X 8BB, UK
- Oliver Dutton
- Peptone Ltd, 370 Grays Inn Road, London WC1X 8BB, UK
- Falk Hoffmann
- Peptone Ltd, 370 Grays Inn Road, London WC1X 8BB, UK
- Emanuele Paci
- Peptone Ltd, 370 Grays Inn Road, London WC1X 8BB, UK
- Department of Physics and Astronomy ‘Augusto Righi’, University of Bologna, 40127 Bologna, Italy
- Kamil Tamiola
- To whom correspondence should be addressed. Tel: +41 79 609 7333;
|
12
|
Abstract
There are over 100 computational predictors of intrinsic disorder. These methods predict amino acid-level propensities for disorder directly from protein sequences. The propensities can be used to annotate putative disordered residues and regions. This unit provides a practical and holistic introduction to the sequence-based intrinsic disorder prediction. We define intrinsic disorder, explain the format of computational prediction of disorder, and identify and describe several accurate predictors. We also introduce recently released databases of intrinsic disorder predictions and use an illustrative example to provide insights into how predictions should be interpreted and combined. Lastly, we summarize key experimental methods that can be used to validate computational predictions. © 2023 Wiley Periodicals LLC.
Affiliation(s)
- Vladimir N Uversky
- Department of Molecular Medicine and USF Health Byrd Alzheimer's Research Institute, Morsani College of Medicine, University of South Florida, Tampa, Florida
- Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, Virginia
|
13
|
Ashour DJ, Durney CH, Planelles-Herrero VJ, Stevens TJ, Feng JJ, Röper K. Zasp52 strengthens whole embryo tissue integrity through supracellular actomyosin networks. Development 2023; 150:dev201238. [PMID: 36897564 PMCID: PMC10112930 DOI: 10.1242/dev.201238] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 09/09/2022] [Accepted: 02/28/2023] [Indexed: 03/11/2023]
Abstract
During morphogenesis, large-scale changes of tissue primordia are coordinated across an embryo. In Drosophila, several tissue primordia and embryonic regions are bordered or encircled by supracellular actomyosin cables, junctional actomyosin enrichments networked between many neighbouring cells. We show that the single Drosophila Alp/Enigma-family protein Zasp52, which is most prominently found in Z-discs of muscles, is a component of many supracellular actomyosin structures during embryogenesis, including the ventral midline and the boundary of the salivary gland placode. We reveal that Zasp52 contains within its central coiled-coil region a type of actin-binding motif usually found in CapZbeta proteins, and this domain displays actin-binding activity. Using endogenously-tagged lines, we identify that Zasp52 interacts with junctional components, including APC2, Polychaetoid and Sidekick, and actomyosin regulators. Analysis of zasp52 mutant embryos reveals that the severity of the embryonic defects observed scales inversely with the amount of functional protein left. Large tissue deformations occur where actomyosin cables are found during embryogenesis, and in vivo and in silico analyses suggest a model whereby supracellular Zasp52-containing cables aid to insulate morphogenetic changes from one another.
Affiliation(s)
- Dina J. Ashour
- MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge CB2 0QH, UK
- Clinton H. Durney
- Department of Mathematics, University of British Columbia, Vancouver, V6T 1Z2, Canada
- Tim J. Stevens
- MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge CB2 0QH, UK
- James J. Feng
- Department of Mathematics, University of British Columbia, Vancouver, V6T 1Z2, Canada
- Department of Chemical and Biological Engineering, University of British Columbia, Vancouver, V6T 1Z3, Canada
- Katja Röper
- MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge CB2 0QH, UK
|
14
|
Mohammed Alsumaidaee YA, Yaw CT, Koh SP, Tiong SK, Chen CP, Yusaf T, Abdalla AN, Ali K, Raj AA. Detection of Corona Faults in Switchgear by Using 1D-CNN, LSTM, and 1D-CNN-LSTM Methods. Sensors (Basel) 2023; 23:3108. [PMID: 36991819 PMCID: PMC10059847 DOI: 10.3390/s23063108] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 01/30/2023] [Revised: 02/28/2023] [Accepted: 03/01/2023] [Indexed: 06/19/2023]
Abstract
The damaging effects of corona faults have made them a major concern in metal-clad switchgear, requiring extreme caution during operation. Corona faults are also the primary cause of flashovers in medium-voltage metal-clad electrical equipment. The root cause of this issue is an electrical breakdown of the air due to electrical stress and poor air quality within the switchgear. Without proper preventative measures, a flashover can occur, resulting in serious harm to workers and equipment. As a result, detecting corona faults in switchgear and preventing electrical stress buildup in switches is critical. Recent years have seen the successful use of Deep Learning (DL) applications for corona and non-corona detection, owing to their autonomous feature learning capability. This paper systematically analyzes three deep learning techniques, namely 1D-CNN, LSTM, and 1D-CNN-LSTM hybrid models, to identify the most effective model for detecting corona faults. The hybrid 1D-CNN-LSTM model is deemed the best due to its high accuracy in both the time and frequency domains. This model analyzes the sound waves generated in switchgear to detect faults. The study examines model performance in both the time and frequency domains. In the time domain analysis (TDA), 1D-CNN achieved success rates of 98%, 98.4%, and 93.9%, while LSTM obtained success rates of 97.3%, 98.4%, and 92.4%. The most suitable model, the 1D-CNN-LSTM, achieved success rates of 99.3%, 98.4%, and 98.4% in differentiating corona and non-corona cases during training, validation, and testing. In the frequency domain analysis (FDA), 1D-CNN achieved success rates of 100%, 95.8%, and 95.8%, while LSTM obtained success rates of 100%, 100%, and 100%. The 1D-CNN-LSTM model achieved a 100%, 100%, and 100% success rate during training, validation, and testing. Hence, the developed algorithms achieved high performance in identifying corona faults in switchgear, particularly the 1D-CNN-LSTM model due to its accuracy in detecting corona faults in both the time and frequency domains.
Affiliation(s)
- Yaseen Ahmed Mohammed Alsumaidaee
- College of Graduate Studies (COGS), Universiti Tenaga Nasional (The Energy University), Jalan Ikram-Uniten, Kajang 43000, Selangor, Malaysia
- Chong Tak Yaw
- Institute of Sustainable Energy, Universiti Tenaga Nasional (The Energy University), Jalan Ikram-Uniten, Kajang 43000, Selangor, Malaysia
- Siaw Paw Koh
- Institute of Sustainable Energy, Universiti Tenaga Nasional (The Energy University), Jalan Ikram-Uniten, Kajang 43000, Selangor, Malaysia
- Department Electrical and Electronics Engineering, Universiti Tenaga Nasional (The Energy University), Jalan Ikram-Uniten, Kajang 43000, Selangor, Malaysia
- Sieh Kiong Tiong
- Institute of Sustainable Energy, Universiti Tenaga Nasional (The Energy University), Jalan Ikram-Uniten, Kajang 43000, Selangor, Malaysia
- Department Electrical and Electronics Engineering, Universiti Tenaga Nasional (The Energy University), Jalan Ikram-Uniten, Kajang 43000, Selangor, Malaysia
- Chai Phing Chen
- Department Electrical and Electronics Engineering, Universiti Tenaga Nasional (The Energy University), Jalan Ikram-Uniten, Kajang 43000, Selangor, Malaysia
- Talal Yusaf
- School of Engineering and Technology, Central Queensland University, Brisbane, QLD 4009, Australia
- Ahmed N Abdalla
- Faculty of Electronic Information Engineering, Huaiyin Institute of Technology, Huai’an 223003, China
- Kharudin Ali
- Faculty of Electrical and Automation Engineering Technology, UC TATI, Teluk Kalong, Kemaman 24000, Terengganu, Malaysia
- Avinash Ashwin Raj
- Tenaga National Berhard Research Sdn. Bhd., No. 1, Kawasan Institusi Penyelidikan, Jln Ayer Hitam, Kajang 43000, Selangor, Malaysia
|
15
|
Han B, Ren C, Wang W, Li J, Gong X. Computational Prediction of Protein Intrinsically Disordered Region Related Interactions and Functions. Genes (Basel) 2023; 14:432. [PMID: 36833360 PMCID: PMC9956190 DOI: 10.3390/genes14020432] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2022] [Revised: 02/02/2023] [Accepted: 02/05/2023] [Indexed: 02/11/2023] Open
Abstract
Intrinsically Disordered Proteins (IDPs) and Regions (IDRs) exist widely. Although without well-defined structures, they participate in many important biological processes. In addition, they are also widely related to human diseases and have become potential targets in drug discovery. However, there is a big gap between the experimental annotations related to IDPs/IDRs and their actual number. In recent decades, the computational methods related to IDPs/IDRs have been developed vigorously, including predicting IDPs/IDRs, the binding modes of IDPs/IDRs, the binding sites of IDPs/IDRs, and the molecular functions of IDPs/IDRs according to different tasks. In view of the correlation between these predictors, we have reviewed these prediction methods uniformly for the first time, summarized their computational methods and predictive performance, and discussed some problems and perspectives.
Affiliation(s)
- Bingqing Han
- Mathematical Intelligence Application Lab, Institute for Mathematical Sciences, Renmin University of China, Beijing 100872, China
- Chongjiao Ren
- Mathematical Intelligence Application Lab, Institute for Mathematical Sciences, Renmin University of China, Beijing 100872, China
- Wenda Wang
- Mathematical Intelligence Application Lab, Institute for Mathematical Sciences, Renmin University of China, Beijing 100872, China
- Jiashan Li
- Mathematical Intelligence Application Lab, Institute for Mathematical Sciences, Renmin University of China, Beijing 100872, China
- Xinqi Gong
- Mathematical Intelligence Application Lab, Institute for Mathematical Sciences, Renmin University of China, Beijing 100872, China
- Beijing Academy of Intelligence, Beijing 100083, China
|
16
|
Anbo H, Ota M, Fukuchi S. Computational Methods to Predict Intrinsically Disordered Regions and Functional Regions in Them. Methods Mol Biol 2023; 2627:231-245. [PMID: 36959451 DOI: 10.1007/978-1-0716-2974-1_13] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/25/2023]
Abstract
Intrinsically disordered regions (IDRs) are protein regions that do not adopt fixed tertiary structures. Since these regions lack ordered three-dimensional structures, they should be excluded from the target portions of homology modeling. IDRs can be predicted from the amino acid sequences, because their amino acid compositions are different from that of the structured domains. This chapter provides a review of the prediction methods of IDRs and a case study of IDR prediction.
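The observation that IDRs differ in amino acid composition from structured domains can be illustrated with a sliding-window composition score. The sketch below is a toy illustration only and is not one of the methods reviewed in the chapter: the set of disorder-promoting residues, the window size, the threshold, and the function names are all assumptions chosen for demonstration.

```python
from typing import List

# Assumption: an illustrative set of residues often described as
# disorder-promoting (Ala, Glu, Gly, Lys, Pro, Gln, Arg, Ser).
DISORDER_PROMOTING = set("AEGKPQRS")


def window_scores(sequence: str, window: int = 7) -> List[float]:
    """Per-residue fraction of disorder-promoting residues in a centred window.

    Windows are truncated at the sequence termini, so the fraction is always
    taken over the residues actually present in the window.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    seq = sequence.upper()
    half = window // 2
    scores = []
    for i in range(len(seq)):
        segment = seq[max(0, i - half): min(len(seq), i + half + 1)]
        hits = sum(1 for aa in segment if aa in DISORDER_PROMOTING)
        scores.append(hits / len(segment))
    return scores


def predict_disorder(sequence: str, window: int = 7, threshold: float = 0.6) -> List[int]:
    """Binary per-residue call: 1 if the window score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in window_scores(sequence, window)]


if __name__ == "__main__":
    # Hypothetical sequence: a Pro-rich stretch followed by a Trp-rich stretch.
    toy = "PPPPPWWWWW"
    print(predict_disorder(toy, window=3))
```

Published predictors use far richer features and trained models; this sketch shows only the composition signal.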
Affiliation(s)
- Hiroto Anbo
- Faculty of Engineering, Maebashi Institute of Technology, Maebashi, Japan
- Motonori Ota
- Graduate School of Information Sciences, Nagoya University, Nagoya, Japan
- Satoshi Fukuchi
- Faculty of Engineering, Maebashi Institute of Technology, Maebashi, Japan
|
17
|
Chen Y, Cattoglio C, Dailey GM, Zhu Q, Tjian R, Darzacq X. Mechanisms governing target search and binding dynamics of hypoxia-inducible factors. eLife 2022; 11:e75064. [PMID: 36322456 PMCID: PMC9681212 DOI: 10.7554/elife.75064] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 10/30/2021] [Accepted: 11/01/2022] [Indexed: 11/07/2022] Open
Abstract
Transcription factors (TFs) are classically attributed a modular construction, containing well-structured sequence-specific DNA-binding domains (DBDs) paired with disordered activation domains (ADs) responsible for protein-protein interactions targeting co-factors or the core transcription initiation machinery. However, this simple division of labor model struggles to explain why TFs with identical DNA-binding sequence specificity determined in vitro exhibit distinct binding profiles in vivo. The family of hypoxia-inducible factors (HIFs) offer a stark example: aberrantly expressed in several cancer types, HIF-1α and HIF-2α subunit isoforms recognize the same DNA motif in vitro - the hypoxia response element (HRE) - but only share a subset of their target genes in vivo, while eliciting contrasting effects on cancer development and progression under certain circumstances. To probe the mechanisms mediating isoform-specific gene regulation, we used live-cell single particle tracking (SPT) to investigate HIF nuclear dynamics and how they change upon genetic perturbation or drug treatment. We found that HIF-α subunits and their dimerization partner HIF-1β exhibit distinct diffusion and binding characteristics that are exquisitely sensitive to concentration and subunit stoichiometry. Using domain-swap variants, mutations, and a HIF-2α specific inhibitor, we found that although the DBD and dimerization domains are important, another main determinant of chromatin binding and diffusion behavior is the AD-containing intrinsically disordered region (IDR). Using Cut&Run and RNA-seq as orthogonal genomic approaches, we also confirmed IDR-dependent binding and activation of a specific subset of HIF target genes. These findings reveal a previously unappreciated role of IDRs in regulating the TF search and binding process that contribute to functional target site selectivity on chromatin.
Affiliation(s)
- Yu Chen
- Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, United States
- Howard Hughes Medical Institute, University of California, Berkeley, Berkeley, United States
- Li Ka Shing Center for Biomedical & Health Sciences, University of California, Berkeley, Berkeley, United States
- Claudia Cattoglio
- Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, United States
- Howard Hughes Medical Institute, University of California, Berkeley, Berkeley, United States
- Li Ka Shing Center for Biomedical & Health Sciences, University of California, Berkeley, Berkeley, United States
- Gina M Dailey
- Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, United States
- Li Ka Shing Center for Biomedical & Health Sciences, University of California, Berkeley, Berkeley, United States
- Qiulin Zhu
- Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, United States
- Li Ka Shing Center for Biomedical & Health Sciences, University of California, Berkeley, Berkeley, United States
- Robert Tjian
- Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, United States
- Howard Hughes Medical Institute, University of California, Berkeley, Berkeley, United States
- Li Ka Shing Center for Biomedical & Health Sciences, University of California, Berkeley, Berkeley, United States
- Xavier Darzacq
- Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, United States
- Li Ka Shing Center for Biomedical & Health Sciences, University of California, Berkeley, Berkeley, United States
|
18
|
Fang M, He Y, Du Z, Uversky VN. DeepCLD: An Efficient Sequence-Based Predictor of Intrinsically Disordered Proteins. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:3154-3159. [PMID: 34727037 DOI: 10.1109/tcbb.2021.3124273] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/13/2023]
Abstract
Intrinsic disorder is common in proteins, plays important roles in protein functionality, and is commonly associated with various human diseases. To have an accurate tool for the annotation of intrinsic disorder in proteins, this paper proposes a novel algorithm, DeepCLD, for sequence-based prediction of intrinsically disordered proteins. This algorithm uses the amino acid position-specific scoring matrix (PSSM) to capture the intrinsic variability characteristic of sequence patterns, ResNet to preserve feature space structure, and bidirectional CudnnLSTM as the recurrent layer to further improve the efficiency. Furthermore, DeepCLD also uses an attention mechanism to address the vanishing-gradient problem in deep networks. Comparative analyses show that DeepCLD has faster training speed and higher prediction accuracy than comparable methods.
|
19
|
Kamal M, Tokmakjian L, Knox J, Mastrangelo P, Ji J, Cai H, Wojciechowski JW, Hughes MP, Takács K, Chu X, Pei J, Grolmusz V, Kotulska M, Forman-Kay JD, Roy PJ. A spatiotemporal reconstruction of the C. elegans pharyngeal cuticle reveals a structure rich in phase-separating proteins. eLife 2022; 11:e79396. [PMID: 36259463 PMCID: PMC9629831 DOI: 10.7554/elife.79396] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 04/11/2022] [Accepted: 10/11/2022] [Indexed: 11/19/2022] Open
Abstract
How the cuticles of the roughly 4.5 million species of ecdysozoan animals are constructed is not well understood. Here, we systematically mine gene expression datasets to uncover the spatiotemporal blueprint for how the chitin-based pharyngeal cuticle of the nematode Caenorhabditis elegans is built. We demonstrate that the blueprint correctly predicts expression patterns and functional relevance to cuticle development. We find that as larvae prepare to molt, catabolic enzymes are upregulated and the genes that encode chitin synthase, chitin cross-linkers, and homologs of amyloid regulators subsequently peak in expression. Forty-eight percent of the gene products secreted during the molt are predicted to be intrinsically disordered proteins (IDPs), many of which belong to four distinct families whose transcripts are expressed in overlapping waves. These include the IDPAs, IDPBs, and IDPCs, which are introduced for the first time here. All four families have sequence properties that drive phase separation and we demonstrate phase separation for one exemplar in vitro. This systematic analysis represents the first blueprint for cuticle construction and highlights the massive contribution that phase-separating materials make to the structure.
Affiliation(s)
- Muntasir Kamal
- Department of Molecular Genetics, University of Toronto, Toronto, Canada
- The Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Canada
- Levon Tokmakjian
- The Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Canada
- Department of Pharmacology and Toxicology, University of Toronto, Toronto, Canada
- Jessica Knox
- Department of Molecular Genetics, University of Toronto, Toronto, Canada
- The Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Canada
- Peter Mastrangelo
- Department of Molecular Genetics, University of Toronto, Toronto, Canada
- The Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Canada
- Jingxiu Ji
- Department of Molecular Genetics, University of Toronto, Toronto, Canada
- The Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Canada
- Hao Cai
- Molecular Medicine Program, The Hospital for Sick Children, Toronto, Canada
- Jakub W Wojciechowski
- Wroclaw University of Science and Technology, Faculty of Fundamental Problems of Technology, Department of Biomedical Engineering, Wroclaw, Poland
- Michael P Hughes
- Department of Cell and Molecular Biology, St. Jude Children’s Research Hospital, Memphis, United States
- Kristóf Takács
- PIT Bioinformatics Group, Institute of Mathematics, Eötvös University, Budapest, Hungary
- Xiaoquan Chu
- Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
- Jianfeng Pei
- Department of Computer Science and Technology, Tsinghua University, Beijing, China
- Vince Grolmusz
- PIT Bioinformatics Group, Institute of Mathematics, Eötvös University, Budapest, Hungary
- Malgorzata Kotulska
- Wroclaw University of Science and Technology, Faculty of Fundamental Problems of Technology, Department of Biomedical Engineering, Wroclaw, Poland
- Julie Deborah Forman-Kay
- Molecular Medicine Program, The Hospital for Sick Children, Toronto, Canada
- Department of Biochemistry, University of Toronto, Toronto, Canada
- Peter J Roy
- Department of Molecular Genetics, University of Toronto, Toronto, Canada
- The Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Canada
- Department of Pharmacology and Toxicology, University of Toronto, Toronto, Canada
|
20
|
Ilzhöfer D, Heinzinger M, Rost B. SETH predicts nuances of residue disorder from protein embeddings. Front Bioinform 2022; 2:1019597. [PMID: 36304335 PMCID: PMC9580958 DOI: 10.3389/fbinf.2022.1019597] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 08/15/2022] [Accepted: 09/20/2022] [Indexed: 11/07/2022] Open
Abstract
Predictions for millions of protein three-dimensional structures are only a few clicks away since the release of AlphaFold2 results for UniProt. However, many proteins have so-called intrinsically disordered regions (IDRs) that do not adopt unique structures in isolation. These IDRs are associated with several diseases, including Alzheimer’s Disease. We showed that three recent disorder measures of AlphaFold2 predictions (pLDDT, “experimentally resolved” prediction and “relative solvent accessibility”) correlated to some extent with IDRs. However, expert methods predict IDRs more reliably by combining complex machine learning models with expert-crafted input features and evolutionary information from multiple sequence alignments (MSAs). MSAs are not always available, especially for IDRs, and are computationally expensive to generate, limiting the scalability of the associated tools. Here, we present the novel method SETH that predicts residue disorder from embeddings generated by the protein Language Model ProtT5, which explicitly only uses single sequences as input. Thereby, our method, relying on a relatively shallow convolutional neural network, outperformed much more complex solutions while being much faster, allowing predictions for the human proteome to be created in about 1 hour on a consumer-grade PC with one NVIDIA GeForce RTX 3060. Trained on a continuous disorder scale (CheZOD scores), our method captured subtle variations in disorder, thereby providing important information beyond the binary classification of most methods. High performance paired with speed revealed that SETH’s nuanced disorder predictions for entire proteomes capture aspects of the evolution of organisms. Additionally, SETH could also be used to filter out regions or proteins with probable low-quality AlphaFold2 3D structures to prioritize running the compute-intensive predictions for large data sets. SETH is freely publicly available at: https://github.com/Rostlab/SETH.
Affiliation(s)
- Dagmar Ilzhöfer
- Faculty of Informatics, TUM (Technical University of Munich), Munich, Germany
- Michael Heinzinger
- Faculty of Informatics, TUM (Technical University of Munich), Munich, Germany
- Center of Doctoral Studies in Informatics and Its Applications (CeDoSIA), TUM Graduate School, Garching, Germany
- *Correspondence: Michael Heinzinger,
- Burkhard Rost
- Faculty of Informatics, TUM (Technical University of Munich), Munich, Germany
- Institute for Advanced Study (TUM-IAS), TUM (Technical University of Munich), Garching, Germany
- TUM School of Life Sciences Weihenstephan (WZW), TUM (Technical University of Munich), Freising, Germany
|
21
|
Chen R, Li X, Yang Y, Song X, Wang C, Qiao D. Prediction of protein-protein interaction sites in intrinsically disordered proteins. Front Mol Biosci 2022; 9:985022. [PMID: 36250006 PMCID: PMC9567019 DOI: 10.3389/fmolb.2022.985022] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 07/03/2022] [Accepted: 07/27/2022] [Indexed: 11/25/2022] Open
Abstract
Intrinsically disordered proteins (IDPs) participate in many biological processes by interacting with other proteins, including the regulation of transcription, translation, and the cell cycle. With the increasing amount of disorder sequence data available, it is thus crucial to identify the IDP binding sites for functional annotation of these proteins. Over the decades, many computational approaches have been developed to predict protein-protein binding sites of IDP (IDP-PPIS) based on protein sequence information. Moreover, there are new IDP-PPIS predictors developed every year with the rapid development of artificial intelligence. It is thus necessary to provide an up-to-date overview of these methods in this field. In this paper, we collected 30 representative predictors published recently and summarized the databases, features and algorithms. We described the procedure how the features were generated based on public data and used for the prediction of IDP-PPIS, along with the methods to generate the feature representations. All the predictors were divided into three categories: scoring functions, machine learning-based prediction, and consensus approaches. For each category, we described the details of algorithms and their performances. Hopefully, our manuscript will not only provide a full picture of the status quo of IDP binding prediction, but also a guide for selecting different methods. More importantly, it will shed light on the inspirations for future development trends and principles.
Affiliation(s)
- Ranran Chen
- Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
- National Institute of Health Data Science of China, Shandong University, Jinan, China
- Xinlu Li
- Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
- National Institute of Health Data Science of China, Shandong University, Jinan, China
- Yaqing Yang
- Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
- National Institute of Health Data Science of China, Shandong University, Jinan, China
- Xixi Song
- Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
- National Institute of Health Data Science of China, Shandong University, Jinan, China
- Cheng Wang
- Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
- National Institute of Health Data Science of China, Shandong University, Jinan, China
- *Correspondence: Cheng Wang, ; Dongdong Qiao,
- Dongdong Qiao
- Shandong Mental Health Center, Shandong University, Jinan, China
- *Correspondence: Cheng Wang, ; Dongdong Qiao,
|
22
|
Protein Function Analysis through Machine Learning. Biomolecules 2022; 12:biom12091246. [PMID: 36139085 PMCID: PMC9496392 DOI: 10.3390/biom12091246] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 07/16/2022] [Revised: 08/22/2022] [Accepted: 08/31/2022] [Indexed: 11/16/2022] Open
Abstract
Machine learning (ML) has been an important arsenal in computational biology used to elucidate protein function for decades. With the recent burgeoning of novel ML methods and applications, new ML approaches have been incorporated into many areas of computational biology dealing with protein function. We examine how ML has been integrated into a wide range of computational models to improve prediction accuracy and gain a better understanding of protein function. The applications discussed are protein structure prediction, protein engineering using sequence modifications to achieve stability and druggability characteristics, molecular docking in terms of protein–ligand binding, including allosteric effects, protein–protein interactions and protein-centric drug discovery. To quantify the mechanisms underlying protein function, a holistic approach that takes structure, flexibility, stability, and dynamics into account is required, as these aspects become inseparable through their interdependence. Another key component of protein function is conformational dynamics, which often manifest as protein kinetics. Computational methods that use ML to generate representative conformational ensembles and quantify differences in conformational ensembles important for function are included in this review. Future opportunities are highlighted for each of these topics.
|
23
|
Hong Y, Song J, Ko J, Lee J, Shin WH. S-Pred: protein structural property prediction using MSA transformer. Sci Rep 2022; 12:13891. [PMID: 35974061 PMCID: PMC9381718 DOI: 10.1038/s41598-022-18205-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2022] [Accepted: 08/08/2022] [Indexed: 11/10/2022] Open
Abstract
Predicting the local structural features of a protein from its amino acid sequence aids function prediction and assists in three-dimensional structural modeling. As the sequence-structure gap increases, prediction methods have been developed to bridge this gap. Additionally, as the size of the structural database and computing power increase, the performance of these methods has also significantly improved. Herein, we present a powerful new tool called S-Pred, which can predict eight-state secondary structures (SS8), accessible surface areas (ASAs), and intrinsically disordered regions (IDRs) from a given sequence. For feature prediction, S-Pred uses a multiple sequence alignment (MSA) of a query sequence as input. The MSA input is converted to features by the MSA Transformer, which is a protein language model that uses an attention mechanism. A long short-term memory (LSTM) network was employed to produce the final prediction. The performance of S-Pred was evaluated on several test sets, and the program consistently provided accurate predictions. The accuracy of the SS8 prediction was approximately 76%, and the Pearson’s correlation between the experimental and predicted ASAs was 0.84. Additionally, IDRs could be predicted with an F1-score of 0.514. The program is freely available at https://github.com/arontier/S_Pred_Paper and https://ad3.io as code and as a web server, respectively.
Affiliation(s)
- Yiyu Hong
- Arontier Co., Seoul, 06735, Republic of Korea
| | - Jinung Song
- Arontier Co., Seoul, 06735, Republic of Korea
| | - Junsu Ko
- Arontier Co., Seoul, 06735, Republic of Korea
| | - Juyong Lee
- Arontier Co., Seoul, 06735, Republic of Korea; Division of Chemistry and Biochemistry, Department of Chemistry, Kangwon National University, Chuncheon, 24341, Republic of Korea
| | - Woong-Hee Shin
- Arontier Co., Seoul, 06735, Republic of Korea; Department of Chemistry Education, Sunchon National University, Suncheon, 57922, Republic of Korea; Department of Advanced Components and Materials Engineering, Sunchon National University, Suncheon, 57922, Republic of Korea
| |
|
24
|
Tomaž Š, Gruden K, Coll A. TGA transcription factors-Structural characteristics as basis for functional variability. Front Plant Sci 2022; 13:935819. [PMID: 35958211 PMCID: PMC9360754 DOI: 10.3389/fpls.2022.935819] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 05/04/2022] [Accepted: 07/04/2022] [Indexed: 06/15/2023]
Abstract
TGA transcription factors are essential regulators of various cellular processes, their activity connected to different hormonal pathways, interacting proteins and regulatory elements. Belonging to the basic region leucine zipper (bZIP) family, TGAs operate by binding to their target DNA sequence as dimers through a conserved bZIP domain. Despite sharing the core DNA-binding sequence, the TGA paralogues exert somewhat different DNA-binding preferences. Sequence variability of their N- and C-terminal protein parts indicates their importance in defining TGA functional specificity through interactions with diverse proteins, affecting their DNA-binding properties. In this review, we provide a short and concise summary on plant TGA transcription factors from a structural point of view, including the relation of their structural characteristics to their functional roles in transcription regulation.
Affiliation(s)
- Špela Tomaž
- Department of Biotechnology and Systems Biology, National Institute of Biology, Ljubljana, Slovenia
- Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
| | - Kristina Gruden
- Department of Biotechnology and Systems Biology, National Institute of Biology, Ljubljana, Slovenia
| | - Anna Coll
- Department of Biotechnology and Systems Biology, National Institute of Biology, Ljubljana, Slovenia
| |
|
25
|
Compositional Bias of Intrinsically Disordered Proteins and Regions and Their Predictions. Biomolecules 2022; 12:biom12070888. [PMID: 35883444 PMCID: PMC9313023 DOI: 10.3390/biom12070888] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 05/27/2022] [Revised: 06/10/2022] [Accepted: 06/10/2022] [Indexed: 11/17/2022] Open
Abstract
Intrinsically disordered regions (IDRs) carry out many cellular functions and vary in length and placement in protein sequences. This diversity leads to variations in the underlying compositional biases, which were demonstrated for the short vs. long IDRs. We analyze compositional biases across four classes of disorder: fully disordered proteins; short IDRs; long IDRs; and binding IDRs. We identify three distinct biases: for the fully disordered proteins, the short IDRs and the long and binding IDRs combined. We also investigate compositional bias for putative disorder produced by leading disorder predictors and find that it is similar to the bias of the native disorder. Interestingly, the accuracy of disorder predictions across different methods is correlated with the correctness of the compositional bias of their predictions highlighting the importance of the compositional bias. The predictive quality is relatively low for the disorder classes with compositional bias that is the most different from the “generic” disorder bias, while being much higher for the classes with the most similar bias. We discover that different predictors perform best across different classes of disorder. This suggests that no single predictor is universally best and motivates the development of new architectures that combine models that target specific disorder classes.
|
26
|
AlphaFold2: A Role for Disordered Protein/Region Prediction? Int J Mol Sci 2022; 23:ijms23094591. [PMID: 35562983 PMCID: PMC9104326 DOI: 10.3390/ijms23094591] [Citation(s) in RCA: 57] [Impact Index Per Article: 28.5] [Received: 03/27/2022] [Revised: 04/18/2022] [Accepted: 04/19/2022] [Indexed: 01/27/2023] Open
Abstract
The development of AlphaFold2 marked a paradigm-shift in the structural biology community. Herein, we assess the ability of AlphaFold2 to predict disordered regions against traditional sequence-based disorder predictors. We find that AlphaFold2 performs well at discriminating disordered regions, but also note that the disorder predictor one constructs from an AlphaFold2 structure determines accuracy. In particular, a naïve, but non-trivial assumption that residues assigned to helices, strands, and H-bond stabilized turns are likely ordered and all other residues are disordered results in a dramatic overestimation in disorder; conversely, the predicted local distance difference test (pLDDT) provides an excellent measure of residue-wise disorder. Furthermore, by employing molecular dynamics (MD) simulations, we note an interesting relationship between the pLDDT and secondary structure, that may explain our observations and suggests a broader application of the pLDDT for characterizing the local dynamics of intrinsically disordered proteins and regions (IDPs/IDRs).
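As a rough illustration of using pLDDT as a residue-wise disorder measure, the sketch below groups consecutive low-pLDDT residues into candidate disordered regions. The function name `disordered_regions`, the default cutoff of 50, and the `min_len` filter are assumptions made for this example; the abstract does not state a threshold, and this is not the authors' procedure.

```python
def disordered_regions(plddt, cutoff=50.0, min_len=1):
    """Return 1-based inclusive (start, end) runs of residues with pLDDT < cutoff.

    `cutoff` and `min_len` are illustrative assumptions, not values from the paper.
    """
    regions = []
    start = None  # 1-based start of the run currently being extended, if any
    for i, score in enumerate(plddt, start=1):
        if score < cutoff:
            if start is None:
                start = i
        elif start is not None:
            # The run ended at residue i - 1; keep it if it is long enough.
            if i - start >= min_len:
                regions.append((start, i - 1))
            start = None
    # Close a run that extends to the last residue.
    if start is not None and len(plddt) - start + 1 >= min_len:
        regions.append((start, len(plddt)))
    return regions


if __name__ == "__main__":
    scores = [30, 40, 90, 95, 45, 92, 20, 25, 35]  # made-up per-residue pLDDT values
    print(disordered_regions(scores))
    print(disordered_regions(scores, min_len=2))
```

In practice the per-residue pLDDT values would be read from the B-factor column of an AlphaFold2 model file; the list above is made up for demonstration.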
|
27
|
An in-frame deletion mutation in the degron tail of auxin coreceptor IAA2 confers resistance to the herbicide 2,4-D in Sisymbrium orientale. Proc Natl Acad Sci U S A 2022; 119:2105819119. [PMID: 35217601 PMCID: PMC8892348 DOI: 10.1073/pnas.2105819119] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Accepted: 12/09/2021] [Indexed: 12/13/2022] Open
Abstract
Synthetic auxin herbicides intersect basic plant developmental biology and applied weed management. We investigated resistance to 2,4-D in the Australian weed Sisymbrium orientale (Indian hedge mustard). We identified a mechanism involving an in-frame 27-bp deletion in the degron tail of auxin coreceptor IAA2, one member of the gene family of Aux/IAA auxin co-receptors. We show that this deletion in IAA2 is a gain-of-function mutation that confers synthetic auxin resistance. This field-evolved mechanism of resistance to synthetic auxin herbicides confirms previous biochemical studies showing the role of the Aux/IAA degron tail in regulating Aux/IAA protein degradation upon auxin perception. The deletion mutation could be generated in crops using gene-editing approaches for cross-resistance to multiple synthetic auxin herbicides.
The natural auxin indole-3-acetic acid (IAA) is a key regulator of many aspects of plant growth and development. Synthetic auxin herbicides such as 2,4-D mimic the effects of IAA by inducing strong auxinic-signaling responses in plants. To determine the mechanism of 2,4-D resistance in a Sisymbrium orientale (Indian hedge mustard) weed population, we performed a transcriptome analysis of 2,4-D-resistant (R) and -susceptible (S) genotypes that revealed an in-frame 27-nucleotide deletion removing nine amino acids in the degron tail (DT) of the auxin coreceptor Aux/IAA2 (SoIAA2). The deletion allele cosegregated with 2,4-D resistance in recombinant inbred lines. Further, this deletion was also detected in several 2,4-D-resistant field populations of this species. Arabidopsis transgenic lines expressing the SoIAA2 mutant allele were resistant to 2,4-D and dicamba. The IAA2-DT deletion reduced binding to TIR1 in vitro with both natural and synthetic auxins, causing reduced association and increased dissociation rates. This mechanism of synthetic auxin herbicide resistance assigns an in planta function to the DT region of this Aux/IAA coreceptor for its role in synthetic auxin binding kinetics and reveals a potential biotechnological approach to produce synthetic auxin-resistant crops using gene-editing.
|
28
|
Zhao J, Wang Z. Identifying Intrinsically Disordered Protein Regions through a Deep Neural Network with Three Novel Sequence Features. Life (Basel) 2022; 12:life12030345. [PMID: 35330096 PMCID: PMC8950681 DOI: 10.3390/life12030345] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2022] [Revised: 02/22/2022] [Accepted: 02/23/2022] [Indexed: 11/26/2022] Open
Abstract
The fast, reliable, and accurate identification of IDPRs is essential, as in recent years it has come to be recognized more and more that IDPRs have a wide impact on many important physiological processes, such as molecular recognition and molecular assembly, the regulation of transcription and translation, protein phosphorylation, cellular signal transduction, etc. For the sake of cost-effectiveness, it is imperative to develop computational approaches for identifying IDPRs. In this study, a deep neural structure where a variant VGG19 is situated between two MLP networks is developed for identifying IDPRs. Furthermore, for the first time, three novel sequence features—i.e., persistent entropy and the probabilities associated with two and three consecutive amino acids of the protein sequence—are introduced for identifying IDPRs. The simulation results show that our neural structure either performs considerably better than other known methods or, when relying on a much smaller training set, attains a similar performance. Our deep neural structure, which exploits the VGG19 structure, is effective for identifying IDPRs. Furthermore, three novel sequence features—i.e., the persistent entropy and the probabilities associated with two and three consecutive amino acids of the protein sequence—could be used as valuable sequence features in the further development of identifying IDPRs.
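To make the "probabilities associated with two and three consecutive amino acids" more concrete, the sketch below computes empirical k-mer frequencies over a single sequence. This is only one plausible reading: the abstract does not define how the probabilities are estimated, so the function `kmer_probabilities` and its normalisation are assumptions for illustration, and persistent entropy is not covered here.

```python
from collections import Counter


def kmer_probabilities(sequence, k):
    """Empirical probability of each run of k consecutive residues in `sequence`.

    Illustrative only: the paper's exact feature definition may differ.
    """
    if k < 1 or len(sequence) < k:
        raise ValueError("k must be between 1 and the sequence length")
    # Slide a window of width k over the sequence (overlapping k-mers).
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    total = len(kmers)
    return {kmer: count / total for kmer, count in Counter(kmers).items()}


if __name__ == "__main__":
    seq = "AAAG"  # toy sequence, not real data
    print(kmer_probabilities(seq, 2))  # dipeptide probabilities
    print(kmer_probabilities(seq, 3))  # tripeptide probabilities
```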
|
29
|
Pakhrin SC, Aoki-Kinoshita KF, Caragea D, KC DB. DeepNGlyPred: A Deep Neural Network-Based Approach for Human N-Linked Glycosylation Site Prediction. Molecules 2021; 26:molecules26237314. [PMID: 34885895 PMCID: PMC8658957 DOI: 10.3390/molecules26237314] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 11/01/2021] [Revised: 11/22/2021] [Accepted: 11/26/2021] [Indexed: 12/21/2022] Open
Abstract
Protein N-linked glycosylation is a post-translational modification that plays an important role in a myriad of biological processes. Computational prediction approaches serve as complementary methods for the characterization of glycosylation sites. Most of the existing predictors for N-linked glycosylation utilize the information that the glycosylation site occurs at the N-X-[S/T] sequon, where X is any amino acid except proline. Not all N-X-[S/T] sequons are glycosylated, thus the N-X-[S/T] sequon is a necessary but not sufficient determinant for protein glycosylation. In that regard, computational prediction of N-linked glycosylation sites confined to N-X-[S/T] sequons is an important problem. Here, we report DeepNGlyPred a deep learning-based approach that encodes the positive and negative sequences in the human proteome dataset (extracted from N-GlycositeAtlas) using sequence-based features (gapped-dipeptide), predicted structural features, and evolutionary information. DeepNGlyPred produces SN, SP, MCC, and ACC of 88.62%, 73.92%, 0.60, and 79.41%, respectively on N-GlyDE independent test set, which is better than the compared approaches. These results demonstrate that DeepNGlyPred is a robust computational technique to predict N-Linked glycosylation sites confined to N-X-[S/T] sequon. DeepNGlyPred will be a useful resource for the glycobiology community.
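The N-X-[S/T] sequon rule described above (X is any residue except proline) can be sketched as a regular-expression scan. This only enumerates candidate sites, which the abstract notes are necessary but not sufficient for glycosylation; it is not DeepNGlyPred itself, and the helper name `find_sequons` is hypothetical.

```python
import re

# Asn, then any residue except Pro, then Ser or Thr.
# The zero-width lookahead lets overlapping sequons (e.g. "NNSS") all be reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence):
    """Return 1-based positions of the Asn in every N-X-[S/T] sequon."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence.upper())]


if __name__ == "__main__":
    # Toy peptide: N2-G-T matches, N6-P-S is excluded (X is Pro),
    # and N9-N-S / N10-S-S overlap but are both reported.
    print(find_sequons("MNGTANPSNNSS"))
```

A predictor restricted to sequons would then score only the positions returned by such a scan.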
Affiliation(s)
- Subash C. Pakhrin
- School of Computing, Wichita State University, 1845 Fairmount St., Wichita, KS 67260, USA
| | | | - Doina Caragea
- Department of Computer Science, Kansas State University, Manhattan, KS 66506, USA
| | - Dukka B. KC
- Department of Computer Science, Michigan Technological University, Houghton, MI 49931, USA
- Correspondence: Tel.: +1-906-487-1657
| |
|
30
|
Emenecker RJ, Griffith D, Holehouse AS. Metapredict: a fast, accurate, and easy-to-use predictor of consensus disorder and structure. Biophys J 2021; 120:4312-4319. [PMID: 34480923 PMCID: PMC8553642 DOI: 10.1016/j.bpj.2021.08.039] [Citation(s) in RCA: 77] [Impact Index Per Article: 25.7] [Received: 05/27/2021] [Revised: 08/08/2021] [Accepted: 08/30/2021] [Indexed: 01/02/2023] Open
Abstract
Intrinsically disordered proteins and protein regions make up a substantial fraction of many proteomes in which they play a wide variety of essential roles. A critical first step in understanding the role of disordered protein regions in biological function is to identify those disordered regions correctly. Computational methods for disorder prediction have emerged as a core set of tools to guide experiments, interpret results, and develop hypotheses. Given the multiple different predictors available, consensus scores have emerged as a popular approach to mitigate biases or limitations of any single method. Consensus scores integrate the outcome of multiple independent disorder predictors and provide a per-residue value that reflects the number of tools that predict a residue to be disordered. Although consensus scores help mitigate the inherent problems of using any single disorder predictor, they are computationally expensive to generate. They also necessitate the installation of multiple different software tools, which can be prohibitively difficult. To address this challenge, we developed a deep-learning-based predictor of consensus disorder scores. Our predictor, metapredict, utilizes a bidirectional recurrent neural network trained on the consensus disorder scores from 12 proteomes. By benchmarking metapredict using two orthogonal approaches, we found that metapredict is among the most accurate disorder predictors currently available. Metapredict is also remarkably fast, enabling proteome-scale disorder prediction in minutes. Importantly, metapredict is fully open source and is distributed as a Python package, a collection of command-line tools, and a web server, maximizing the potential practical utility of the predictor. We believe metapredict offers a convenient, accessible, accurate, and high-performance predictor for single proteins and proteomes alike.
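The consensus-score idea described in this abstract (a per-residue value reflecting how many tools call a residue disordered) can be sketched as follows. This illustrates the target quantity only, not metapredict's neural network; the function name `consensus_disorder` and the choice to normalise the vote count to a fraction are assumptions for this example.

```python
def consensus_disorder(calls):
    """Per-residue fraction of predictors that call the residue disordered.

    `calls` holds one list per predictor; each entry is 1 (disordered) or 0 (ordered).
    Normalising to a fraction is an illustrative choice, not taken from the paper.
    """
    if not calls:
        raise ValueError("need at least one predictor")
    length = len(calls[0])
    if any(len(c) != length for c in calls):
        raise ValueError("all predictors must cover the same residues")
    n_predictors = len(calls)
    # zip(*calls) walks the predictions residue by residue.
    return [sum(votes) / n_predictors for votes in zip(*calls)]


if __name__ == "__main__":
    # Four hypothetical predictors scoring a four-residue peptide.
    votes = [
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [1, 1, 1, 0],
        [1, 1, 0, 0],
    ]
    print(consensus_disorder(votes))
```

Generating such scores requires running every underlying predictor, which is the cost the abstract says a single trained network is meant to avoid.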
Affiliation(s)
- Ryan J Emenecker
- Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, Missouri; Center for Science and Engineering Living Systems (CSELS), St. Louis, Missouri; Center for Engineering Mechanobiology, Washington University, St. Louis, Missouri
| | - Daniel Griffith
- Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, Missouri; Center for Science and Engineering Living Systems (CSELS), St. Louis, Missouri
| | - Alex S Holehouse
- Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, Missouri; Center for Science and Engineering Living Systems (CSELS), St. Louis, Missouri
| |
|
31
|
Emenecker RJ, Griffith D, Holehouse AS. Metapredict: a fast, accurate, and easy-to-use predictor of consensus disorder and structure. Biophys J 2021; 120:4312-4319. [PMID: 34480923 DOI: 10.1101/2021.05.30.446349] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 05/27/2021] [Revised: 08/08/2021] [Accepted: 08/30/2021] [Indexed: 05/28/2023] Open
Abstract
Intrinsically disordered proteins and protein regions make up a substantial fraction of many proteomes in which they play a wide variety of essential roles. A critical first step in understanding the role of disordered protein regions in biological function is to identify those disordered regions correctly. Computational methods for disorder prediction have emerged as a core set of tools to guide experiments, interpret results, and develop hypotheses. Given the multiple different predictors available, consensus scores have emerged as a popular approach to mitigate biases or limitations of any single method. Consensus scores integrate the outcome of multiple independent disorder predictors and provide a per-residue value that reflects the number of tools that predict a residue to be disordered. Although consensus scores help mitigate the inherent problems of using any single disorder predictor, they are computationally expensive to generate. They also necessitate the installation of multiple different software tools, which can be prohibitively difficult. To address this challenge, we developed a deep-learning-based predictor of consensus disorder scores. Our predictor, metapredict, utilizes a bidirectional recurrent neural network trained on the consensus disorder scores from 12 proteomes. By benchmarking metapredict using two orthogonal approaches, we found that metapredict is among the most accurate disorder predictors currently available. Metapredict is also remarkably fast, enabling proteome-scale disorder prediction in minutes. Importantly, metapredict is fully open source and is distributed as a Python package, a collection of command-line tools, and a web server, maximizing the potential practical utility of the predictor. We believe metapredict offers a convenient, accessible, accurate, and high-performance predictor for single proteins and proteomes alike.
Affiliation(s)
- Ryan J Emenecker
- Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, Missouri; Center for Science and Engineering Living Systems (CSELS), St. Louis, Missouri; Center for Engineering Mechanobiology, Washington University, St. Louis, Missouri
| | - Daniel Griffith
- Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, Missouri; Center for Science and Engineering Living Systems (CSELS), St. Louis, Missouri
| | - Alex S Holehouse
- Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, Missouri; Center for Science and Engineering Living Systems (CSELS), St. Louis, Missouri
| |
|
32
|
Griffith D, Holehouse AS. PARROT is a flexible recurrent neural network framework for analysis of large protein datasets. eLife 2021; 10:e70576. [PMID: 34533455 PMCID: PMC8448528 DOI: 10.7554/elife.70576] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 05/21/2021] [Accepted: 09/06/2021] [Indexed: 11/29/2022] Open
Abstract
The rise of high-throughput experiments has transformed how scientists approach biological questions. The ubiquity of large-scale assays that can test thousands of samples in a day has necessitated the development of new computational approaches to interpret this data. Among these tools, machine learning approaches are increasingly being utilized due to their ability to infer complex nonlinear patterns from high-dimensional data. Despite their effectiveness, machine learning (and in particular deep learning) approaches are not always accessible or easy to implement for those with limited computational expertise. Here we present PARROT, a general framework for training and applying deep learning-based predictors on large protein datasets. Using an internal recurrent neural network architecture, PARROT is capable of tackling both classification and regression tasks while only requiring raw protein sequences as input. We showcase the potential uses of PARROT on three diverse machine learning tasks: predicting phosphorylation sites, predicting transcriptional activation function of peptides generated by high-throughput reporter assays, and predicting the fibrillization propensity of amyloid beta with data generated by deep mutational scanning. Through these examples, we demonstrate that PARROT is easy to use, performs comparably to state-of-the-art computational tools, and is applicable for a wide array of biological problems.
Affiliation(s)
- Daniel Griffith
- Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St Louis, United States
- Center for Science and Engineering Living Systems, Washington University, St Louis, United States
| | - Alex S Holehouse
- Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St Louis, United States
- Center for Science and Engineering Living Systems, Washington University, St Louis, United States
| |
|
33
|
Chen Z, Zhao P, Li C, Li F, Xiang D, Chen YZ, Akutsu T, Daly RJ, Webb GI, Zhao Q, Kurgan L, Song J. iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization. Nucleic Acids Res 2021; 49:e60. [PMID: 33660783 PMCID: PMC8191785 DOI: 10.1093/nar/gkab122] [Citation(s) in RCA: 107] [Impact Index Per Article: 35.7] [Received: 10/26/2020] [Revised: 02/05/2021] [Accepted: 02/25/2021] [Indexed: 12/14/2022] Open
Abstract
Sequence-based analysis and prediction are fundamental bioinformatic tasks that facilitate understanding of the sequence(-structure)-function paradigm for DNAs, RNAs and proteins. Rapid accumulation of sequences requires equally pervasive development of new predictive models, which depends on the availability of effective tools that support these efforts. We introduce iLearnPlus, the first machine-learning platform with graphical- and web-based interfaces for the construction of machine-learning pipelines for analysis and predictions using nucleic acid and protein sequences. iLearnPlus provides a comprehensive set of algorithms and automates sequence-based feature extraction and analysis, construction and deployment of models, assessment of predictive performance, statistical analysis, and data visualization; all without programming. iLearnPlus includes a wide range of feature sets which encode information from the input sequences and over twenty machine-learning algorithms that cover several deep-learning approaches, outnumbering the current solutions by a wide margin. Our solution caters to experienced bioinformaticians, given the broad range of options, and biologists with no programming background, given the point-and-click interface and easy-to-follow design process. We showcase iLearnPlus with two case studies concerning prediction of long noncoding RNAs (lncRNAs) from RNA transcripts and prediction of crotonylation sites in protein chains. iLearnPlus is an open-source platform available at https://github.com/Superzchen/iLearnPlus/ with the webserver at http://ilearnplus.erc.monash.edu/.
Affiliation(s)
- Zhen Chen
- Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou 450046, China
| | - Pei Zhao
- State Key Laboratory of Cotton Biology, Institute of Cotton Research of Chinese Academy of Agricultural Sciences (CAAS), Anyang 455000, China
| | - Chen Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
| | - Fuyi Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia; Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, Victoria 3000, Australia
| | - Dongxu Xiang
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| | - Yong-Zi Chen
- Laboratory of Tumor Cell Biology, Key Laboratory of Cancer Prevention and Therapy, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin 300060, China
| | - Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0011, Japan
| | - Roger J Daly
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
| | - Geoffrey I Webb
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| | - Quanzhi Zhao
- Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou 450046, China; Key Laboratory of Rice Biology in Henan Province, Henan Agricultural University, Zhengzhou 450046, China
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| |
|
34
|
Suh D, Lee JW, Choi S, Lee Y. Recent Applications of Deep Learning Methods on Evolution- and Contact-Based Protein Structure Prediction. Int J Mol Sci 2021; 22:6032. [PMID: 34199677 PMCID: PMC8199773 DOI: 10.3390/ijms22116032] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 05/16/2021] [Revised: 05/29/2021] [Accepted: 05/29/2021] [Indexed: 01/23/2023] Open
Abstract
Recent advances in deep learning methods have influenced many aspects of scientific research, including the study of proteins. The prediction of proteins' 3D structural components is now heavily dependent on machine learning techniques that interpret how protein sequences and their homology govern the inter-residue contacts and structural organization. In particular, methods employing deep neural networks have had a significant impact on the recent CASP13 and CASP14 competitions. Here, we explore the recent applications of deep learning methods in the protein structure prediction area. We also look at the potential opportunities for deep learning methods to identify as-yet-undiscovered protein structures and functions and to help guide the study of drug-target interactions. Although significant problems still need to be addressed, we expect these techniques in the near future to play crucial roles in protein structural bioinformatics as well as in drug discovery.
Affiliation(s)
- Donghyuk Suh
- Global AI Drug Discovery Center, School of Pharmaceutical Sciences, College of Pharmacy and Graduate, Ewha Womans University, Seoul 03760, Korea
| | - Jai Woo Lee
- Global AI Drug Discovery Center, School of Pharmaceutical Sciences, College of Pharmacy and Graduate, Ewha Womans University, Seoul 03760, Korea
| | - Sun Choi
- Global AI Drug Discovery Center, School of Pharmaceutical Sciences, College of Pharmacy and Graduate, Ewha Womans University, Seoul 03760, Korea
| | - Yoonji Lee
- College of Pharmacy, Chung-Ang University, Seoul 06974, Korea
| |
|
35
|
Coates HW, Capell-Hattam IM, Brown AJ. The mammalian cholesterol synthesis enzyme squalene monooxygenase is proteasomally truncated to a constitutively active form. J Biol Chem 2021; 296:100731. [PMID: 33933449 PMCID: PMC8166775 DOI: 10.1016/j.jbc.2021.100731] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 10/16/2020] [Revised: 04/24/2021] [Accepted: 04/28/2021] [Indexed: 02/06/2023] Open
Abstract
Squalene monooxygenase (SM, also known as squalene epoxidase) is a rate-limiting enzyme of cholesterol synthesis that converts squalene to monooxidosqualene and is oncogenic in numerous cancer types. SM is subject to feedback regulation via cholesterol-induced proteasomal degradation, which depends on its lipid-sensing N-terminal regulatory domain. We previously identified an endogenous truncated form of SM with a similar abundance to full-length SM, but whether this truncated form is functional or subject to the same regulatory mechanisms as full-length SM is not known. Here, we show that truncated SM differs from full-length SM in two major ways: it is cholesterol resistant and adopts a peripheral rather than integral association with the endoplasmic reticulum membrane. However, truncated SM retains full SM activity and is therefore constitutively active. Truncation of SM occurs during its endoplasmic reticulum–associated degradation and requires the proteasome, which partially degrades the SM N-terminus and disrupts cholesterol-sensing elements within the regulatory domain. Furthermore, truncation relies on a ubiquitin signal that is distinct from that required for cholesterol-induced degradation. Using mutagenesis, we demonstrate that partial proteasomal degradation of SM depends on both an intrinsically disordered region near the truncation site and the stability of the adjacent catalytic domain, which escapes degradation. These findings uncover an additional layer of complexity in the post-translational regulation of cholesterol synthesis and establish SM as the first eukaryotic enzyme found to undergo proteasomal truncation.
Affiliation(s)
- Hudson W Coates
- School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Sydney, NSW, Australia
| | | | - Andrew J Brown
- School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Sydney, NSW, Australia.
| |
|
36
|
Katuwawala A, Ghadermarzi S, Hu G, Wu Z, Kurgan L. QUARTERplus: Accurate disorder predictions integrated with interpretable residue-level quality assessment scores. Comput Struct Biotechnol J 2021; 19:2597-2606. [PMID: 34025946 PMCID: PMC8122155 DOI: 10.1016/j.csbj.2021.04.066] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/09/2021] [Revised: 04/24/2021] [Accepted: 04/24/2021] [Indexed: 12/13/2022] Open
Abstract
A recent advance in the disorder prediction field is the development of quality assessment (QA) scores. QA scores complement the propensities produced by the disorder predictors by identifying regions where these predictions are more likely to be correct. We develop, empirically test and release a new QA tool, QUARTERplus, that addresses several key drawbacks of the current QA method, QUARTER. QUARTERplus is the first solution that utilizes QA scores and the associated input disorder predictions to produce very accurate disorder predictions with the help of a modern deep learning meta-model. The deep neural network utilizes the QA scores to identify and fix the regions where the original/input disorder predictions are poor. More importantly, the accurate QUARTERplus predictions are accompanied by easy-to-interpret residue-level QA scores that reliably quantify their residue-level predictive quality. We provide these interpretable QA scores for QUARTERplus and 10 other popular disorder predictors. Empirical tests on a large and independent (low similarity) test dataset show that QUARTERplus predictions secure AUC = 0.93 and are statistically more accurate than the results of twelve state-of-the-art disorder predictors. We also demonstrate that the new QA scores produced by QUARTERplus are highly correlated with the actual predictive quality and that they can be effectively used to identify regions of correct disorder predictions. This feature empowers the users to easily identify which parts of the predictions generated by the modern disorder predictors are more trustworthy. QUARTERplus is available as a convenient webserver at http://biomine.cs.vcu.edu/servers/QUARTERplus/.
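For readers unfamiliar with the AUC figure quoted in this abstract, the following is a minimal sketch of how an area under the ROC curve can be computed from per-residue disorder propensities. It is not taken from the QUARTERplus paper: the function name `roc_auc`, the toy labels and the toy scores are illustrative assumptions, and the pairwise (Mann-Whitney) formulation is used only because it is short.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    labels: 1 for disordered residues, 0 for ordered residues (assumed encoding).
    scores: predicted disorder propensities, higher meaning more disordered.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Count how often a disordered residue outranks an ordered one; ties count half.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values, not data from the paper).
example_labels = [1, 1, 0, 0]
example_scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(example_labels, example_scores))
```

The quadratic pairwise loop is adequate for small illustrations; large benchmark sets would normally use a rank-based or library implementation.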
Affiliation(s)
- Akila Katuwawala
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| | - Sina Ghadermarzi
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| | - Gang Hu
- School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin 300071, China
| | - Zhonghua Wu
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| |
|
37
|
Ehlén Å, Sessa G, Zinn-Justin S, Carreira A. The phospho-dependent role of BRCA2 on the maintenance of chromosome integrity. Cell Cycle 2021; 20:731-741. [PMID: 33691600 PMCID: PMC8098065 DOI: 10.1080/15384101.2021.1892994] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/13/2020] [Revised: 01/21/2021] [Accepted: 02/16/2021] [Indexed: 12/18/2022] Open
Abstract
Chromosomal instability is a hallmark of cancer. The tumor suppressor protein BRCA2 performs an important role in the maintenance of genome integrity, particularly in interphase; as a mediator of the homologous recombination DNA repair pathway, it participates in the repair of DNA double-strand breaks, inter-strand crosslinks and replicative DNA lesions. BRCA2 also protects stalled replication forks from aberrant degradation. Defects in these functions lead to structural chromosomal aberrations. BRCA2 is a large protein containing highly disordered regions that are heavily phosphorylated, particularly in mitosis. The functions of these modifications are being elucidated and reveal emerging activities in chromosome alignment, chromosome segregation and abscission during cell division. Defects in these activities result in numerical chromosomal aberrations. In addition to BRCA2, other factors of the DNA damage response (DDR) participate in mitosis in close association with cell cycle kinases and phosphatases, suggesting that the genome-integrity maintenance functions of these factors extend beyond DNA repair. Here we will discuss the regulation of BRCA2 functions through phosphorylation by cell cycle kinases, particularly in mitosis, and illustrate with some examples how BRCA2 and other DDR proteins partially rewire their interactions, essentially via phosphorylation, to fulfill mitosis-specific functions that ensure chromosome stability.
Affiliation(s)
- Åsa Ehlén
- Institut Curie, PSL University, CNRS, UMR3348, Orsay, France
- Paris-Saclay University CNRS, UMR3348, Orsay, France
| | - Gaetana Sessa
- Institut Curie, PSL University, CNRS, UMR3348, Orsay, France
- Paris-Saclay University CNRS, UMR3348, Orsay, France
| | - Sophie Zinn-Justin
- Institute for Integrative Biology of the Cell (I2BC), CEA, CNRS, Univ Paris-Sud, Université Paris-Saclay, Gif-sur-Yvette Cedex, France
| | - Aura Carreira
- Institut Curie, PSL University, CNRS, UMR3348, Orsay, France
- Paris-Saclay University CNRS, UMR3348, Orsay, France
| |
|
38
|
Identification of Intrinsically Disordered Protein Regions Based on Deep Neural Network-VGG16. ALGORITHMS 2021. [DOI: 10.3390/a14040107] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 01/12/2023]
Abstract
The accurate identification of intrinsically disordered proteins or protein regions is of great importance, as they are involved in critical biological processes and related to various human diseases. In this paper, we develop a deep neural network that is based on the well-known VGG16. Our deep neural network is trained using 1450 proteins from the dataset DIS1616, and the trained neural network is tested on the remaining 166 proteins. Our trained neural network is also tested on the blind test sets R80 and MXD494 to further demonstrate the performance of our model. The MCC value of our trained deep neural network is 0.5132 on the test set DIS166, 0.5270 on the blind test set R80 and 0.4577 on the blind test set MXD494. All of these MCC values exceed the corresponding values of existing prediction methods.
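The MCC values reported in this abstract are Matthews correlation coefficients. As a minimal sketch of how that metric is defined for a binary ordered/disordered labelling, the Python below computes it from confusion counts. The helper names `confusion_counts` and `matthews_corrcoef`, the 1 = disordered encoding and the toy labels are assumptions for illustration, not code or data from the paper.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = disordered, 0 = ordered)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy per-residue labels (hypothetical, not from DIS1616, R80 or MXD494).
true_labels = [1, 1, 0, 0]
pred_labels = [1, 0, 0, 0]
print(matthews_corrcoef(*confusion_counts(true_labels, pred_labels)))
```

MCC ranges from -1 to 1 and stays informative when ordered residues greatly outnumber disordered ones, which is one reason disorder predictors are often compared on it.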
|
39
|
In Silico Analysis of Huntingtin Homologs in Lower Eukaryotes. Int J Mol Sci 2021; 22:ijms22063214. [PMID: 33809947 PMCID: PMC8004120 DOI: 10.3390/ijms22063214] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/08/2021] [Revised: 03/09/2021] [Accepted: 03/17/2021] [Indexed: 12/11/2022] Open
Abstract
Huntington’s disease (HD) is a rare neurodegenerative and autosomal dominant disorder. HD is caused by a mutation in the gene coding for huntingtin (Htt). The result is the production of a mutant Htt with an abnormally long polyglutamine repeat that leads to pathological Htt aggregates. Although the structure of human Htt has been determined, albeit at low resolution, its functions and how they are performed are largely unknown. Moreover, there is little information on the structure and function of Htt in other organisms. The comparison of Htt homologs can help to understand whether there is a functional conservation of domains in the evolution of Htt in eukaryotes. In this work, through a computational approach, Htt homologs from lower eukaryotes have been analysed, identifying ordered domains and modelling their structure. Based on the structural models, a putative function for most of the domains has been predicted. A putative C. elegans Htt-like protein has also been analysed following the same approach. The results obtained support the notion that this protein is an orthologue of human Htt.
|
40
|
Zhou JB, Xiong Y, An K, Ye ZQ, Wu YD. IDRMutPred: predicting disease-associated germline nonsynonymous single nucleotide variants (nsSNVs) in intrinsically disordered regions. Bioinformatics 2021; 36:4977-4983. [PMID: 32756939 PMCID: PMC7755418 DOI: 10.1093/bioinformatics/btaa618] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/18/2019] [Revised: 06/28/2020] [Accepted: 07/01/2020] [Indexed: 01/09/2023] Open
Abstract
Motivation: Despite the lack of folded structure, intrinsically disordered regions (IDRs) of proteins play versatile roles in various biological processes, and many nonsynonymous single nucleotide variants (nsSNVs) in IDRs are associated with human diseases. The continuous accumulation of nsSNVs resulting from the wide application of NGS has driven the development of disease-association prediction methods for decades. However, their performance on nsSNVs in IDRs remains inferior, possibly due to the domination of nsSNVs from structured regions in training data. Therefore, it is highly desirable to build a disease-association predictor specifically for nsSNVs in IDRs with better performance. Results: We present IDRMutPred, a machine learning-based tool specifically for predicting disease-associated germline nsSNVs in IDRs. Based on 17 selected optimal features that are extracted from sequence alignments, protein annotations, hydrophobicity indices and disorder scores, IDRMutPred was trained using three ensemble learning algorithms on a training dataset containing only IDR nsSNVs. The evaluation on the two testing datasets shows that all three prediction models significantly outperform 17 other popular general predictors, achieving ACC between 0.856 and 0.868 and MCC between 0.713 and 0.737. IDRMutPred will prioritize disease-associated IDR germline nsSNVs more reliably than general predictors. Availability and implementation: The software is freely available at http://www.wdspdb.com/IDRMutPred. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jing-Bo Zhou
- Lab of Computational Chemistry and Drug Design, State Key Laboratory of Chemical Oncogenomics, Peking University Shenzhen Graduate School, Shenzhen 518055, China
| | - Yao Xiong
- Lab of Computational Chemistry and Drug Design, State Key Laboratory of Chemical Oncogenomics, Peking University Shenzhen Graduate School, Shenzhen 518055, China
| | - Ke An
- Lab of Computational Chemistry and Drug Design, State Key Laboratory of Chemical Oncogenomics, Peking University Shenzhen Graduate School, Shenzhen 518055, China
| | - Zhi-Qiang Ye
- Lab of Computational Chemistry and Drug Design, State Key Laboratory of Chemical Oncogenomics, Peking University Shenzhen Graduate School, Shenzhen 518055, China.,Shenzhen Bay Laboratory, Shenzhen 518055, China
| | - Yun-Dong Wu
- Lab of Computational Chemistry and Drug Design, State Key Laboratory of Chemical Oncogenomics, Peking University Shenzhen Graduate School, Shenzhen 518055, China.,Shenzhen Bay Laboratory, Shenzhen 518055, China.,College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, China
| |
|
41
|
Pei J, Grishin NV. The DBSAV Database: Predicting Deleteriousness of Single Amino Acid Variations in the Human Proteome. J Mol Biol 2021; 433:166915. [PMID: 33676930 DOI: 10.1016/j.jmb.2021.166915] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 10/01/2020] [Revised: 02/28/2021] [Accepted: 03/01/2021] [Indexed: 12/22/2022]
Abstract
Deleterious single amino acid variation (SAV) is one of the leading causes of human diseases. Evaluating the functional impact of SAVs is crucial for diagnosis of genetic disorders. We previously developed a deep convolutional neural network predictor, DeepSAV, to evaluate the deleterious effects of SAVs on protein function based on various sequence, structural, and functional properties. DeepSAV scores of rare SAVs observed in the human population are aggregated into a gene-level score called GTS (Gene Tolerance of rare SAVs) that reflects a gene's tolerance to deleterious missense mutations and serves as a useful tool to study gene-disease associations. In this study, we aim to enhance the performance of DeepSAV by using expanded datasets of pathogenic and benign variants, more features, and neural network optimization. We found that multiple sequence alignments built from vertebrate-level orthologs yield better prediction results compared to those built from mammalian-level orthologs. For multiple sequence alignments built from BLAST searches, optimal performance was achieved with a sequence identity cutoff of 50% to remove distant homologs. The new version of DeepSAV exhibits the best performance among standalone predictors of deleterious effects of SAVs. We developed the DBSAV database (http://prodata.swmed.edu/DBSAV) that reports GTS scores of human genes and DeepSAV scores of SAVs in the human proteome, including pathogenic and benign SAVs, population-level SAVs, and all possible SAVs by single nucleotide variations. This database serves as a useful resource for research of human SAVs and their relationships with protein functions and human diseases.
Affiliation(s)
- Jimin Pei
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA
| | - Nick V Grishin
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA; Departments of Biophysics and Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA.
| |
|
42
|
Fan X, Wang H, Zhao Y, Li Y, Tsui KL. An Adaptive Weight Learning-Based Multitask Deep Network for Continuous Blood Pressure Estimation Using Electrocardiogram Signals. SENSORS 2021; 21:s21051595. [PMID: 33668778 PMCID: PMC7956522 DOI: 10.3390/s21051595] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 01/08/2021] [Revised: 01/27/2021] [Accepted: 02/07/2021] [Indexed: 11/16/2022]
Abstract
Estimating blood pressure via combined analysis of electrocardiogram and photoplethysmography signals has attracted growing interest for continuous monitoring of patients’ health conditions. However, most wearable/portable monitoring devices acquire only one kind of physiological signal owing to considerations of energy cost, device weight and size, etc. In this study, a novel adaptive weight learning-based multitask deep learning framework based on single-lead electrocardiogram signals is proposed for continuous blood pressure estimation. Specifically, the proposed method utilizes a 2-layer bidirectional long short-term memory network as the sharing layer, followed by three identical architectures of 2-layer fully connected networks for task-specific blood pressure estimation. To learn the importance of task-specific losses automatically, an adaptive weight learning scheme based on the trend of validation loss is proposed. Extensive experimental results on the Physionet Multiparameter Intelligent Monitoring in Intensive Care (MIMIC) II waveform database demonstrate that the proposed method using electrocardiogram signals obtains estimation performance of 0.12±10.83 mmHg, 0.13±5.90 mmHg, and 0.08±6.47 mmHg for systolic blood pressure, diastolic blood pressure, and mean arterial pressure, respectively. It can meet the requirements of the British Hypertension Society standard and US Association of Advancement of Medical Instrumentation standard with a considerable margin. Combined with a wearable/portable electrocardiogram device, the proposed model can be deployed to a healthcare system to provide a long-term continuous blood pressure monitoring service, which would help to reduce the incidence of malignant complications of hypertension.
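Figures of the form "0.12±10.83 mmHg" in this abstract read most naturally as a mean estimation error plus or minus its standard deviation, although the abstract does not spell this out. Under that assumption, the sketch below shows how such a summary could be computed. The function name `error_summary`, the use of the sample standard deviation and the toy readings are all illustrative choices, not details from the paper.

```python
import statistics


def error_summary(estimated, reference):
    """Return (mean error, sample standard deviation of error) in the input units.

    estimated: model outputs, e.g. systolic pressure in mmHg.
    reference: ground-truth values measured for the same time points.
    """
    if len(estimated) != len(reference):
        raise ValueError("estimated and reference must have the same length")
    errors = [e - r for e, r in zip(estimated, reference)]
    # Assumption: the sample (n - 1) standard deviation is used here.
    return statistics.mean(errors), statistics.stdev(errors)


# Toy systolic readings in mmHg (hypothetical values).
mean_err, sd_err = error_summary([121, 118, 130, 125], [120, 120, 128, 126])
print(f"{mean_err:.2f} ± {sd_err:.2f} mmHg")
```

A mean error near zero with a large spread, as in the systolic figure quoted above, indicates little systematic bias but substantial beat-to-beat scatter.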
Affiliation(s)
- Xiaomao Fan
- School of Computer Science, South China Normal University, Guangzhou 510631, China;
| | - Hailiang Wang
- School of Design, Hong Kong Polytechnic University, Hong Kong, China;
| | - Yang Zhao
- School of Data Science, City University of Hong Kong, Hong Kong, China;
- Correspondence: or
| | - Ye Li
- Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China;
| | - Kwok Leung Tsui
- School of Data Science, City University of Hong Kong, Hong Kong, China;
| |
|
43
|
Synergistic role of nucleotides and lipids for the self-assembly of Shs1 septin oligomers. Biochem J 2021; 477:2697-2714. [PMID: 32726433 DOI: 10.1042/bcj20200199] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 03/09/2020] [Revised: 07/07/2020] [Accepted: 07/09/2020] [Indexed: 12/25/2022]
Abstract
Budding yeast septins are essential for cell division and polarity. Septins assemble as palindromic linear octameric complexes. The function and ultra-structural organization of septins are finely governed by their molecular polymorphism. In particular, in budding yeast, the end subunit can stand either as Shs1 or Cdc11. We have dissected, here, for the first time, the behavior of the Shs1 protomer bound to membranes at nanometer resolution, in complex with the other septins. Using electron microscopy, we have shown that on membranes, Shs1 protomers self-assemble into rings, bundles, filaments or two-dimensional gauzes. Using a set of specific mutants we have demonstrated a synergistic role of both nucleotides and lipids for the organization and oligomerization of budding yeast septins. Besides, cryo-electron tomography assays show that vesicles are deformed by the interaction between Shs1 oligomers and lipids. The Shs1-Shs1 interface is stabilized by the presence of phosphoinositides, allowing the visualization of micrometric long filaments formed by Shs1 protomers. In addition, molecular modeling experiments have revealed a potential molecular mechanism regarding the selectivity of septin subunits for phosphoinositide lipids.
|
44
|
Bian Y, Xie XQ. Generative chemistry: drug discovery with deep learning generative models. J Mol Model 2021; 27:71. [PMID: 33543405 PMCID: PMC10984615 DOI: 10.1007/s00894-021-04674-8] [Citation(s) in RCA: 38] [Impact Index Per Article: 12.7] [Received: 11/13/2020] [Accepted: 01/13/2021] [Indexed: 12/15/2022]
Abstract
The de novo design of molecular structures using deep learning generative models offers an encouraging solution to drug discovery in the face of the continuously increasing cost of new drug development. From the generation of original texts, images, and videos to the design of novel molecular structures from scratch, the creativity of deep learning generative models shows the heights machine intelligence can achieve. The purpose of this paper is to review the latest advances in generative chemistry, which relies on generative modeling to expedite the drug discovery process. This review starts with a brief history of artificial intelligence in drug discovery to outline this emerging paradigm. Commonly used chemical databases, molecular representations, and tools in cheminformatics and machine learning are covered as the infrastructure for generative chemistry. Detailed discussions focus on the use of cutting-edge generative architectures, including recurrent neural networks, variational autoencoders, adversarial autoencoders, and generative adversarial networks, for compound generation. Challenges and future perspectives follow.
Affiliation(s)
- Yuemin Bian
- Department of Pharmaceutical Sciences and Computational Chemical Genomics Screening Center, School of Pharmacy, University of Pittsburgh, Pittsburgh, PA, 15261, USA
- NIH National Center of Excellence for Computational Drug Abuse Research, University of Pittsburgh, Pittsburgh, PA, 15261, USA
| | - Xiang-Qun Xie
- Department of Pharmaceutical Sciences and Computational Chemical Genomics Screening Center, School of Pharmacy, University of Pittsburgh, Pittsburgh, PA, 15261, USA.
- NIH National Center of Excellence for Computational Drug Abuse Research, University of Pittsburgh, Pittsburgh, PA, 15261, USA.
- Drug Discovery Institute, University of Pittsburgh, 335 Sutherland Drive, 206 Salk Pavilion, Pittsburgh, PA, 15261, USA.
- Departments of Computational Biology and Structural Biology, School of Medicine, University of Pittsburgh, PA, 15261, Pittsburgh, USA.
| |
|
45
|
Fluorescent thermal shift-based method for detection of NF-κB binding to double-stranded DNA. Sci Rep 2021; 11:2331. [PMID: 33504856 PMCID: PMC7840993 DOI: 10.1038/s41598-021-81743-1] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 06/03/2020] [Accepted: 01/07/2021] [Indexed: 12/18/2022] Open
Abstract
The nuclear factor kappa B (NF-κB) family of dimeric transcription factors regulates a wide range of genes by binding to their specific DNA regulatory sequences. NF-κB is an important therapeutic target linked to a number of cancers as well as autoimmune and inflammatory diseases. Therefore, effective high-throughput methods for the detection of NF-κB DNA binding are essential for studying its transcriptional activity and for inhibitory drug screening. We describe here a novel fluorescence-based assay for quantitative detection of κB consensus double-stranded (ds) DNA binding by measuring the thermal stability of the NF-κB proteins. Specifically, DNA binding proficient NF-κB probes, consisting of the N-terminal p65/RelA (aa 1-306) and p50 (aa 1-367) regions, were designed using bioinformatic analysis of protein hydrophobicity, folding and sequence similarities. By measuring the SYPRO Orange fluorescence during thermal denaturation of the probes, we detected and quantified a shift in the melting temperatures (ΔTm) of p65/RelA and p50 produced by the dsDNA binding. The increase in Tm was proportional to the concentration of dsDNA with apparent dissociation constants (KD) of 2.228 × 10⁻⁶ M and 0.794 × 10⁻⁶ M, respectively. The use of withaferin A (WFA), dimethyl fumarate (DMF) and p-xyleneselenocyanate (p-XSC) verified the suitability of this assay for measuring dose-dependent antagonistic effects on DNA binding. In addition, the assay can be used to analyse the direct binding of inhibitors and their effects on structural stability of the protein probe. This may facilitate the identification and rational design of new drug candidates interfering with NF-κB functions.
|
46
|
Abstract
Many virus-encoded proteins have intrinsically disordered regions that lack a stable, folded three-dimensional structure. These disordered proteins often play important functional roles in virus replication, such as down-regulating host defense mechanisms. With the widespread availability of next-generation sequencing, the number of new virus genomes with predicted open reading frames is rapidly outpacing our capacity for directly characterizing protein structures through crystallography. Hence, computational methods for structural prediction play an important role. A large number of predictors focus on the problem of classifying residues into ordered and disordered regions, and these methods tend to be validated on a diverse training set of proteins from eukaryotes, prokaryotes, and viruses. In this study, we investigate whether some predictors outperform others in the context of virus proteins and compare our findings with data from non-viral proteins. We evaluate the prediction accuracy of 21 methods, many of which are only available as web applications, on a curated set of 126 proteins encoded by viruses. Furthermore, we apply a random forest classifier to these predictor outputs. Based on cross-validation experiments, this ensemble approach confers a substantial improvement in accuracy, e.g., a mean 36 per cent gain in Matthews correlation coefficient. Lastly, we apply the random forest predictor to severe acute respiratory syndrome coronavirus 2 ORF6, an accessory gene that encodes a short (61 AA) and moderately disordered protein that inhibits the host innate immune response. We show that disorder prediction methods perform differently for viral and non-viral proteins, and that an ensemble approach can yield more robust and accurate predictions.
Affiliation(s)
- Gal Almog
- Department of Pathology & Laboratory Medicine, Western University, Dental Sciences Building, Rm. 4044 London, Ontario, Canada, N6A 5C1
| | - Abayomi S Olabode
- Department of Pathology & Laboratory Medicine, Western University, Dental Sciences Building, Rm. 4044 London, Ontario, Canada, N6A 5C1
| | - Art F Y Poon
- Department of Pathology & Laboratory Medicine, Western University, Dental Sciences Building, Rm. 4044 London, Ontario, Canada, N6A 5C1.,Department of Applied Mathematics, Western University, Middlesex College Room 255, 1151 Richmond Street London, Ontario, Canada, N6A 5B7.,Department of Microbiology & Immunology, Western University, 1151 Richmond Street London, Ontario, Canada, N6A 3K
| |
|
47
|
Xu G, Ren T, Chen Y, Che W. A One-Dimensional CNN-LSTM Model for Epileptic Seizure Recognition Using EEG Signal Analysis. Front Neurosci 2021; 14:578126. [PMID: 33390878 PMCID: PMC7772824 DOI: 10.3389/fnins.2020.578126] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Received: 06/30/2020] [Accepted: 11/10/2020] [Indexed: 11/13/2022] Open
Abstract
Frequent epileptic seizures cause damage to the human brain, resulting in memory impairment, mental decline, and so on. Therefore, it is important to detect epileptic seizures and provide medical treatment in a timely manner. Currently, medical experts recognize epileptic seizure activity through the visual inspection of electroencephalographic (EEG) signal recordings of patients based on their experience, which takes much time and effort. In view of this, this paper proposes a one-dimensional convolutional neural network-long short-term memory (1D CNN-LSTM) model for automatic recognition of epileptic seizures through EEG signal analysis. Firstly, the raw EEG signal data are pre-processed and normalized. Then, a 1D convolutional neural network (CNN) is designed to effectively extract the features of the normalized EEG sequence data. The extracted features are then processed by the LSTM layers in order to further extract the temporal features. After that, the output features are fed into several fully connected layers for final epileptic seizure recognition. The performance of the proposed 1D CNN-LSTM model is verified on the public UCI epileptic seizure recognition data set. Experimental results show that the proposed method achieves high recognition accuracies of 99.39% and 82.00% on the binary and five-class epileptic seizure recognition tasks, respectively. Comparisons with traditional machine learning methods, including k-nearest neighbors, support vector machines, and decision trees, and with other deep learning methods, including a standard deep neural network and a CNN, further verify the superiority of the proposed method.
Affiliation(s)
- Gaowei Xu
- Department of Cardiology, Shanghai Tenth People's Hospital, Tongji University School of Medicine, Shanghai, China
| | - Tianhe Ren
- School of Informatics, Xiamen University, Xiamen, China
| | - Yu Chen
- Department of Dermatology & STD, Nantong First People's Hospital, Nantong, China
| | - Wenliang Che
- Department of Cardiology, Shanghai Tenth People's Hospital, Tongji University School of Medicine, Shanghai, China
| |
|
48
|
Anbo H, Amagai H, Fukuchi S. NeProc predicts binding segments in intrinsically disordered regions without learning binding region sequences. Biophys Physicobiol 2020; 17:147-154. [PMID: 33304713 PMCID: PMC7692026 DOI: 10.2142/biophysico.bsj-2020026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2020] [Accepted: 10/29/2020] [Indexed: 12/01/2022] Open
Abstract
Intrinsically disordered proteins are those proteins with intrinsically disordered regions. One of the unique characteristics of intrinsically disordered proteins is the existence of functional segments in intrinsically disordered regions. These segments are involved in binding to partner molecules, such as protein and DNA, and play important roles in signaling pathways and/or transcriptional regulation. Although there are databases that gather information on such disordered binding regions, data remain limited. Therefore, it is desirable to develop programs to predict the disordered binding regions without using data for the binding regions. We developed a program, NeProc, to predict the disordered binding regions, which can be regarded as intrinsically disordered regions with a structural propensity. We only used data for the structural domains and intrinsically disordered regions to detect such regions. NeProc accepts a query amino acid sequence converted into a position-specific score matrix and uses two neural networks that employ different window sizes: a neural network of short windows and a neural network of long windows. The performance of NeProc was comparable to that of existing programs for disordered binding region prediction. This result presents the possibility of overcoming the shortage of disordered binding region data in the development of prediction programs for these binding regions. NeProc is available at http://flab.neproc.org/neproc/index.html.
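The two-window design described in this abstract can be pictured with a small sketch of how per-residue input windows might be cut from a position-specific score matrix. This is not NeProc's code: the function name `sliding_windows`, the zero-padding at the sequence ends, the window sizes 3 and 5, and the two-column toy matrix are all hypothetical choices made for illustration (a real PSSM has 20 columns per residue, and the abstract does not state the window sizes).

```python
def sliding_windows(rows, window, pad):
    """Return one window of `window` consecutive rows centred on each residue.

    rows:   per-residue score vectors (one PSSM row per residue).
    window: odd window length, so that each window has a central residue.
    pad:    row used to fill positions that fall off either end of the sequence.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    padded = [pad] * half + list(rows) + [pad] * half
    return [padded[i:i + window] for i in range(len(rows))]


# Toy matrix with 4 residues and 2 score columns (hypothetical values).
toy_pssm = [[1, 0], [2, 1], [3, 0], [4, 2]]
zero_row = [0, 0]

short_inputs = sliding_windows(toy_pssm, 3, zero_row)  # input to a short-window network
long_inputs = sliding_windows(toy_pssm, 5, zero_row)   # input to a long-window network
print(len(short_inputs), len(short_inputs[0]), len(long_inputs[0]))
```

Each residue yields one short and one long window, so two networks with different receptive fields can score the same position and their outputs can then be compared or combined.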
Affiliation(s)
- Hiroto Anbo
- Department of Life Science and Informatics, Faculty of Engineering, Maebashi Institute of Technology, Maebashi, Gunma 371-0816, Japan
| | - Hiroki Amagai
- Department of Life Science and Informatics, Faculty of Engineering, Maebashi Institute of Technology, Maebashi, Gunma 371-0816, Japan
| | - Satoshi Fukuchi
- Department of Life Science and Informatics, Faculty of Engineering, Maebashi Institute of Technology, Maebashi, Gunma 371-0816, Japan
| |
|
49
|
Katuwawala A, Kurgan L. Comparative Assessment of Intrinsic Disorder Predictions with a Focus on Protein and Nucleic Acid-Binding Proteins. Biomolecules 2020; 10:E1636. [PMID: 33291838 PMCID: PMC7762010 DOI: 10.3390/biom10121636] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 10/17/2020] [Revised: 11/26/2020] [Accepted: 12/03/2020] [Indexed: 01/18/2023] Open
Abstract
With over 60 disorder predictors, users need help navigating the predictor selection task. We review 28 surveys of disorder predictors, showing that only 11 include assessment of predictive performance. We identify and address a few drawbacks of these past surveys. To this end, we release a novel benchmark dataset with reduced similarity to the training sets of the considered predictors. We use this dataset to perform a first-of-its-kind comparative analysis that targets two large functional families of disordered proteins that interact with proteins and with nucleic acids. We show that limiting sequence similarity between the benchmark and the training datasets has a substantial impact on predictive performance. We also demonstrate that predictive quality is sensitive to the use of the well-annotated order and inclusion of the fully structured proteins in the benchmark datasets, both of which should be considered in future assessments. We identify three predictors that provide favorable results using the new benchmark set. While we find that VSL2B offers the most accurate and robust results overall, ESpritz-DisProt and SPOT-Disorder perform particularly well for disordered proteins. Moreover, we find that predictions for the disordered protein-binding proteins suffer low predictive quality compared to generic disordered proteins and the disordered nucleic acids-binding proteins. This can be explained by the high disorder content of the disordered protein-binding proteins, which makes it difficult for the current methods to accurately identify ordered regions in these proteins. This finding motivates the development of a new generation of methods that would target these difficult-to-predict disordered proteins. We also discuss resources that support users in collecting and identifying high-quality disorder predictions.
Affiliation(s)
- Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
|
50
|
Izumi H, Nafie LA, Dukor RK. SSSCPreds: Deep Neural Network-Based Software for the Prediction of Conformational Variability and Application to SARS-CoV-2. ACS Omega 2020; 5:30556-30567. [PMID: 33283104 PMCID: PMC7687297 DOI: 10.1021/acsomega.0c04472] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/11/2020] [Accepted: 11/05/2020] [Indexed: 05/05/2023]
Abstract
Amino acid mutations that improve protein stability and rigidity can accompany increases in binding affinity. Therefore, conserved amino acids located on a protein surface may be successfully targeted by antibodies. The quantitative deep mutational scanning approach is an excellent technique to understand viral evolution, and the obtained data can be utilized to develop a vaccine. However, the application of the approach to all of the proteins in general is difficult in terms of cost. To address this need, we report the construction of a deep neural network-based program for sequence-based prediction of supersecondary structure codes (SSSCs), called SSSCPrediction (SSSCPred). Further, to predict conformational flexibility or rigidity in proteins, a comparison program called SSSCPreds that consists of three deep neural network-based prediction systems (SSSCPred, SSSCPred100, and SSSCPred200) has also been developed. Here, our algorithms show the degree of flexibility for the receptor-binding motif of SARS-CoV-2 spike protein and the rigidity of the unique motif (SSSC: SSSHSSHHHH) at the S2 subunit, with values independent of the X-ray and Cryo-EM structures. The fact that the sequence flexibility/rigidity map of SARS-CoV-2 RBD resembles the sequence-to-phenotype maps of ACE2-binding affinity and expression, which were experimentally obtained by deep mutational scanning, suggests that the identical SSSC sequences among the ones predicted by three deep neural network-based systems correlate well with the sequences with both lower ACE2-binding affinity and lower expression. The combined analysis of predicted and observed SSSCs with keyword-tagged datasets would be helpful in understanding the structural correlation to the examined system.
Affiliation(s)
- Hiroshi Izumi
- National Institute of Advanced Industrial Science and Technology (AIST), AIST Tsukuba West, 16-1 Onogawa, Tsukuba, Ibaraki 305-8569, Japan
- Laurence A. Nafie
- Department of Chemistry, Syracuse University, Syracuse, New York 13244-4100, United States
- BioTools Inc., 17546 SR 710 (Bee Line Hwy), Jupiter, Florida 33458, United States
- Rina K. Dukor
- BioTools Inc., 17546 SR 710 (Bee Line Hwy), Jupiter, Florida 33458, United States
|