1
Caredda F, Pagnani A. Direct coupling analysis and the attention mechanism. BMC Bioinformatics 2025; 26:41. [PMID: 39915710; PMCID: PMC11804077; DOI: 10.1186/s12859-025-06062-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/05/2024] [Accepted: 01/22/2025] [Indexed: 02/09/2025] Open
Abstract
Proteins are involved in nearly all cellular functions, encompassing roles in transport, signaling, enzymatic activity, and more. Their functionalities crucially depend on their complex three-dimensional arrangement. For this reason, being able to predict their structure from the amino acid sequence has been and still is a phenomenal computational challenge that the introduction of AlphaFold solved with unprecedented accuracy. However, the inherent complexity of AlphaFold's architectures makes it challenging to understand the rules that ultimately shape the protein's predicted structure. This study investigates a single-layer unsupervised model based on the attention mechanism. More precisely, we explore a Direct Coupling Analysis (DCA) method that mimics the attention mechanism of several popular Transformer architectures, such as AlphaFold itself. The model's parameters, notably fewer than those in standard DCA-based algorithms, can be directly used for extracting structural determinants such as the contact map of the protein family under study. Additionally, the functional form of the energy function of the model enables us to deploy a multi-family learning strategy, allowing us to effectively integrate information across multiple protein families, whereas standard DCA algorithms are typically limited to single protein families. Finally, we implemented a generative version of the model using an autoregressive architecture, capable of efficiently generating new proteins in silico.
Affiliation(s)
- Francesco Caredda
  - DISAT, Politecnico di Torino, Corso Duca degli Abruzzi, I-10129, Torino, Italy
- Andrea Pagnani
  - DISAT, Politecnico di Torino, Corso Duca degli Abruzzi, I-10129, Torino, Italy
  - Italian Institute for Genomic Medicine, IRCCS Candiolo, SP-142, I-10060, Candiolo, Italy
  - INFN, Sezione di Torino, Via Pietro Giuria, I-10125, Torino, Italy
2
Lupo U, Sgarbossa D, Milighetti M, Bitbol AF. DiffPaSS-high-performance differentiable pairing of protein sequences using soft scores. Bioinformatics 2024; 41:btae738. [PMID: 39672677; PMCID: PMC11676329; DOI: 10.1093/bioinformatics/btae738] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2024] [Revised: 12/05/2024] [Accepted: 12/11/2024] [Indexed: 12/15/2024] Open
Abstract
MOTIVATION: Identifying interacting partners from two sets of protein sequences has important applications in computational biology. Interacting partners share similarities across species due to their common evolutionary history, and feature correlations in amino acid usage due to the need to maintain complementary interaction interfaces. Thus, the problem of finding interacting pairs can be formulated as searching for a pairing of sequences that maximizes a sequence similarity or a coevolution score. Several methods have been developed to address this problem, applying different approximate optimization methods to different scores.
RESULTS: We introduce Differentiable Pairing using Soft Scores (DiffPaSS), a differentiable framework for flexible, fast, and hyperparameter-free optimization for pairing interacting biological sequences, which can be applied to a wide variety of scores. We apply it to a benchmark prokaryotic dataset, using mutual information and neighbor graph alignment scores. DiffPaSS outperforms existing algorithms for optimizing the same scores. We demonstrate the usefulness of our paired alignments for the prediction of protein complex structure. DiffPaSS does not require sequences to be aligned, and we also apply it to nonaligned sequences from T-cell receptors.
AVAILABILITY AND IMPLEMENTATION: A PyTorch implementation and installable Python package are available at https://github.com/Bitbol-Lab/DiffPaSS.
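The problem formulation in this abstract (searching for the pairing of sequences that maximizes a score) can be illustrated with a small sketch. This is not the DiffPaSS algorithm: it is an exhaustive search over permutations, and it assumes, as a simplification, that the score of a pairing is the sum of independent pairwise scores. The function name `best_pairing` and the score matrix are hypothetical.

```python
from itertools import permutations


def best_pairing(score):
    """Exhaustively find the one-to-one pairing with the highest total score.

    score[i][j] is a hypothetical score for pairing sequence i of the first
    set with sequence j of the second set. Assumes an additive score, which
    is a simplification; only feasible for very small sets (n! pairings).
    """
    n = len(score)

    def total(perm):
        return sum(score[i][perm[i]] for i in range(n))

    best = max(permutations(range(n)), key=total)
    return list(best), total(best)


# Toy example: three sequences in each set, made-up scores.
toy_scores = [
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.1],
    [0.1, 0.3, 0.7],
]
pairing, value = best_pairing(toy_scores)
print(pairing, value)
```

Because the number of candidate pairings grows factorially, exhaustive search is only usable on toy inputs, which is why approximate optimization methods such as those the abstract mentions are needed in practice.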
Affiliation(s)
- Umberto Lupo
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne CH-1015, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne CH-1015, Switzerland
- Damiano Sgarbossa
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne CH-1015, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne CH-1015, Switzerland
- Martina Milighetti
  - Division of Infection and Immunity, University College London, London WC1E 6BT, United Kingdom
  - Cancer Institute, University College London, London WC1E 6DD, United Kingdom
- Anne-Florence Bitbol
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne CH-1015, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne CH-1015, Switzerland
3
Zhang C, Wang Q, Li Y, Teng A, Hu G, Wuyun Q, Zheng W. The Historical Evolution and Significance of Multiple Sequence Alignment in Molecular Structure and Function Prediction. Biomolecules 2024; 14:1531. [PMID: 39766238; PMCID: PMC11673352; DOI: 10.3390/biom14121531] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2024] [Revised: 11/24/2024] [Accepted: 11/27/2024] [Indexed: 01/11/2025] Open
Abstract
Multiple sequence alignment (MSA) has evolved into a fundamental tool in the biological sciences, playing a pivotal role in predicting molecular structures and functions. With broad applications in protein and nucleic acid modeling, MSAs continue to underpin advancements across a range of disciplines. MSAs are not only foundational for traditional sequence comparison techniques but also increasingly important in the context of artificial intelligence (AI)-driven advancements. Recent breakthroughs in AI, particularly in protein and nucleic acid structure prediction, rely heavily on the accuracy and efficiency of MSAs to enhance remote homology detection and guide spatial restraints. This review traces the historical evolution of MSA, highlighting its significance in molecular structure and function prediction. We cover the methodologies used for protein monomers, protein complexes, and RNA, while also exploring emerging AI-based alternatives, such as protein language models, as complementary or replacement approaches to traditional MSAs in application tasks. By discussing the strengths, limitations, and applications of these methods, this review aims to provide researchers with valuable insights into MSA's evolving role, equipping them to make informed decisions in structural prediction research.
Affiliation(s)
- Chenyue Zhang
  - NITFID, School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin 300071, China
- Qinxin Wang
  - Suzhou New & High-Tech Innovation Service Center, Suzhou 215011, China
- Yiyang Li
  - NITFID, School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin 300071, China
- Anqi Teng
  - Bioscience and Biomedical Engineering Thrust, Systems Hub, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou 511453, China
- Gang Hu
  - NITFID, School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin 300071, China
- Qiqige Wuyun
  - Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
- Wei Zheng
  - NITFID, School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin 300071, China
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
4
Dietler N, Abbara A, Choudhury S, Bitbol AF. Impact of phylogeny on the inference of functional sectors from protein sequence data. PLoS Comput Biol 2024; 20:e1012091. [PMID: 39312591; PMCID: PMC11449291; DOI: 10.1371/journal.pcbi.1012091] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2024] [Revised: 10/03/2024] [Accepted: 09/10/2024] [Indexed: 09/25/2024] Open
Abstract
Statistical analysis of multiple sequence alignments of homologous proteins has revealed groups of coevolving amino acids called sectors. These groups of amino-acid sites feature collective correlations in their amino-acid usage, and they are associated with functional properties. Modeling showed that nonlinear selection on an additive functional trait of a protein is generically expected to give rise to a functional sector. These modeling results motivated a principled method, called ICOD, which is designed to identify functional sectors, as well as mutational effects, from sequence data. However, a challenge for all methods aiming to identify sectors from multiple sequence alignments is that correlations in amino-acid usage can also arise from the mere fact that homologous sequences share common ancestry, i.e. from phylogeny. Here, we generate controlled synthetic data from a minimal model comprising both phylogeny and functional sectors. We use this data to dissect the impact of phylogeny on sector identification and on mutational effect inference by different methods. We find that ICOD is most robust to phylogeny, but that conservation is also quite robust. Next, we consider natural multiple sequence alignments of protein families for which deep mutational scan experimental data is available. We show that in this natural data, conservation and ICOD best identify sites with strong functional roles, in agreement with our results on synthetic data. Importantly, these two methods have different premises, since they respectively focus on conservation and on correlations. Thus, their joint use can reveal complementary information.
Affiliation(s)
- Nicola Dietler
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Alia Abbara
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Subham Choudhury
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Anne-Florence Bitbol
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
5
Lupo U, Sgarbossa D, Bitbol AF. Pairing interacting protein sequences using masked language modeling. Proc Natl Acad Sci U S A 2024; 121:e2311887121. [PMID: 38913900; PMCID: PMC11228504; DOI: 10.1073/pnas.2311887121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/12/2023] [Accepted: 12/18/2023] [Indexed: 06/26/2024] Open
Abstract
Predicting which proteins interact together from amino acid sequences is an important task. We develop a method to pair interacting protein sequences which leverages the power of protein language models trained on multiple sequence alignments (MSAs), such as MSA Transformer and the EvoFormer module of AlphaFold. We formulate the problem of pairing interacting partners among the paralogs of two protein families in a differentiable way. We introduce a method called Differentiable Pairing using Alignment-based Language Models (DiffPALM) that solves it by exploiting the ability of MSA Transformer to fill in masked amino acids in multiple sequence alignments using the surrounding context. MSA Transformer encodes coevolution between functionally or structurally coupled amino acids within protein chains. It also captures inter-chain coevolution, despite being trained on single-chain data. Relying on MSA Transformer without fine-tuning, DiffPALM outperforms existing coevolution-based pairing methods on difficult benchmarks of shallow multiple sequence alignments extracted from ubiquitous prokaryotic protein datasets. It also outperforms an alternative method based on a state-of-the-art protein language model trained on single sequences. Paired alignments of interacting protein sequences are a crucial ingredient of supervised deep learning methods to predict the three-dimensional structure of protein complexes. Starting from sequences paired by DiffPALM substantially improves the structure prediction of some eukaryotic protein complexes by AlphaFold-Multimer. It also achieves performance competitive with that of orthology-based pairing.
Affiliation(s)
- Umberto Lupo
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne CH-1015, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne CH-1015, Switzerland
- Damiano Sgarbossa
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne CH-1015, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne CH-1015, Switzerland
- Anne-Florence Bitbol
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne CH-1015, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne CH-1015, Switzerland
6
Cocco S, Posani L, Monasson R. Functional effects of mutations in proteins can be predicted and interpreted by guided selection of sequence covariation information. Proc Natl Acad Sci U S A 2024; 121:e2312335121. [PMID: 38889151; PMCID: PMC11214004; DOI: 10.1073/pnas.2312335121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2023] [Accepted: 04/21/2024] [Indexed: 06/20/2024] Open
Abstract
Predicting the effects of one or more mutations on the in vivo or in vitro properties of a wild-type protein is a major computational challenge, due to the presence of epistasis, that is, of interactions between amino acids in the sequence. We introduce a computationally efficient procedure to build minimal epistatic models to predict mutational effects by combining evolutionary (homologous sequence) data with limited mutational-scan data. Mutagenesis measurements guide the selection of links in a sparse graphical model, while the parameters on the nodes and the edges are inferred from sequence data. We show, on 10 mutational scans, that our pipeline performs comparably to state-of-the-art deep networks trained on much more data, while requiring far fewer parameters and hence being more interpretable. In particular, the identified interactions adapt to the wild-type protein and to the fitness or biochemical property experimentally measured, mostly focus on key functional sites, and are not necessarily related to structural contacts. Therefore, our method is able to extract information relevant for one mutational experiment from homologous sequence data reflecting the multitude of structural and functional constraints acting on proteins throughout evolution.
Affiliation(s)
- Simona Cocco
  - Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005 Paris, France
- Lorenzo Posani
  - Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005 Paris, France
- Rémi Monasson
  - Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005 Paris, France
7
Zhao H, Petrey D, Murray D, Honig B. ZEPPI: Proteome-scale sequence-based evaluation of protein-protein interaction models. Proc Natl Acad Sci U S A 2024; 121:e2400260121. [PMID: 38743624; PMCID: PMC11127014; DOI: 10.1073/pnas.2400260121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Accepted: 04/18/2024] [Indexed: 05/16/2024] Open
Abstract
We introduce ZEPPI (Z-score Evaluation of Protein-Protein Interfaces), a framework to evaluate structural models of a complex based on sequence coevolution and conservation involving residues in protein-protein interfaces. The ZEPPI score is calculated by comparing metrics for an interface to those obtained from randomly chosen residues. Since contacting residues are defined by the structural model, this obviates the need to account for indirect interactions. Further, although ZEPPI relies on species-paired multiple sequence alignments, its focus on interfacial residues allows it to leverage quite shallow alignments. ZEPPI can be implemented on a proteome-wide scale and is applied here to millions of structural models of dimeric complexes in the Escherichia coli and human interactomes found in the PrePPI database. PrePPI's scoring function is based primarily on the evaluation of protein-protein interfaces, and ZEPPI adds a new feature to this analysis through the incorporation of evolutionary information. ZEPPI performance is evaluated through applications to experimentally determined complexes and to decoys from the CASP-CAPRI experiment. As we discuss, the standard CAPRI scores used to evaluate docking models are based on model quality and not on the ability to give yes/no answers as to whether two proteins interact. ZEPPI is able to detect weak signals from PPI models that the CAPRI scores define as incorrect and, similarly, to identify potential PPIs defined as low confidence by the current PrePPI scoring function. A number of examples that illustrate how the combination of PrePPI and ZEPPI can yield functional hypotheses are provided.
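The idea of scoring an interface against randomly chosen residues can be sketched as a plain Z-score computation. This is a minimal illustration under stated assumptions, not the ZEPPI implementation: it assumes one precomputed number per residue (for example a conservation value), whereas the published method works with coevolution and conservation metrics on interfaces. The function name `interface_z_score`, its parameters, and the toy values are all hypothetical.

```python
import random
import statistics


def interface_z_score(scores, interface, n_samples=1000, seed=0):
    """Z-score of the mean score over `interface` residues, relative to the
    means of randomly chosen residue sets of the same size.

    scores: dict mapping residue id -> hypothetical per-residue metric.
    interface: list of residue ids defined by the structural model.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    residues = list(scores)
    k = len(interface)
    observed = statistics.fmean(scores[r] for r in interface)
    # Background distribution: mean metric over random residue sets.
    background = [
        statistics.fmean(scores[r] for r in rng.sample(residues, k))
        for _ in range(n_samples)
    ]
    mu = statistics.fmean(background)
    sigma = statistics.pstdev(background)
    if sigma == 0:
        raise ValueError("background has zero variance; Z-score undefined")
    return (observed - mu) / sigma


# Toy example: 50 residues, the first five carry a strong made-up signal.
toy = {i: (0.9 if i < 5 else 0.1) for i in range(50)}
print(interface_z_score(toy, [0, 1, 2, 3, 4]))       # strongly positive
print(interface_z_score(toy, [10, 11, 12, 13, 14]))  # below the background mean
```

A large positive value indicates that the proposed interface scores higher than comparable random residue sets, which is the sense in which such a score can flag plausible interfaces even when the underlying alignments are shallow.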
Affiliation(s)
- Haiqing Zhao
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032
- Donald Petrey
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032
- Diana Murray
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032
- Barry Honig
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032
  - Department of Biochemistry and Molecular Biophysics, Columbia University Irving Medical Center, New York, NY 10032
  - Department of Medicine, Columbia University, New York, NY 10032
  - Zuckerman Institute, Columbia University, New York, NY 10027
8
Fang T, Szklarczyk D, Hachilif R, von Mering C. Enhancing coevolutionary signals in protein-protein interaction prediction through clade-wise alignment integration. Sci Rep 2024; 14:6009. [PMID: 38472223; PMCID: PMC10933411; DOI: 10.1038/s41598-024-55655-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2023] [Accepted: 02/26/2024] [Indexed: 03/14/2024] Open
Abstract
Protein-protein interactions (PPIs) play essential roles in most biological processes. The binding interfaces between interacting proteins impose evolutionary constraints that have successfully been employed to predict PPIs from multiple sequence alignments (MSAs). To construct MSAs, critical choices have to be made: how to ensure the reliable identification of orthologs, and how to optimally balance the need for large alignments versus sufficient alignment quality. Here, we propose a divide-and-conquer strategy for MSA generation: instead of building a single, large alignment for each protein, multiple distinct alignments are constructed under distinct clades in the tree of life. Coevolutionary signals are searched separately within these clades, and are only subsequently integrated using machine learning techniques. We find that this strategy markedly improves overall prediction performance, concomitant with better alignment quality. Using the popular DCA algorithm to systematically search pairs of such alignments, a genome-wide all-against-all interaction scan in a bacterial genome is demonstrated. Given the recent successes of AlphaFold in predicting direct PPIs at atomic detail, a discover-and-refine approach is proposed: our method could provide a fast and accurate strategy for pre-screening the entire genome, submitting to AlphaFold only promising interaction candidates, thus reducing false positives as well as computation time.
Affiliation(s)
- Tao Fang
  - Department of Molecular Life Sciences, University of Zurich, 8057, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Damian Szklarczyk
  - Department of Molecular Life Sciences, University of Zurich, 8057, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Radja Hachilif
  - Department of Molecular Life Sciences, University of Zurich, 8057, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Christian von Mering
  - Department of Molecular Life Sciences, University of Zurich, 8057, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
9
Yan Z, Wang J. Evolution shapes interaction patterns for epistasis and specific protein binding in a two-component signaling system. Commun Chem 2024; 7:13. [PMID: 38233668; PMCID: PMC10794238; DOI: 10.1038/s42004-024-01098-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2023] [Accepted: 01/05/2024] [Indexed: 01/19/2024] Open
Abstract
The elegant design of protein sequence/structure/function relationships arises from the interaction patterns between amino acid positions. A central question is how evolutionary forces shape the interaction patterns that encode long-range epistasis and binding specificity. Here, we combined family-wide evolutionary analysis of natural homologous sequences and structure-oriented evolution simulation for a two-component signaling (TCS) system. The magnitude-frequency relationship of coupling conservation between positions manifests a power-law-like distribution, and the positions with high coupling conservation are sparse but concentrated on the binding surfaces and hydrophobic core. The structure-specific interaction pattern involves further optimization of local frustrations at or near the binding surface to adapt to the binding partner. The construction of family-wide conserved interaction patterns and structure-specific ones demonstrates that binding specificity is modulated by both direct intermolecular interactions and long-range epistasis across the binding complex. Evolution sculpts the interaction patterns via sequence variations at both family-wide and structure-specific levels for the TCS system.
Affiliation(s)
- Zhiqiang Yan
  - Center for Theoretical Interdisciplinary Sciences, Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou, Zhejiang, 325001, PR China
- Jin Wang
  - Department of Chemistry and Physics, State University of New York at Stony Brook, Stony Brook, NY, 11790, USA
10
Simpkin AJ, Caballero I, McNicholas S, Stevenson K, Jiménez E, Sánchez Rodríguez F, Fando M, Uski V, Ballard C, Chojnowski G, Lebedev A, Krissinel E, Usón I, Rigden DJ, Keegan RM. Predicted models and CCP4. Acta Crystallogr D Struct Biol 2023; 79:806-819. [PMID: 37594303; PMCID: PMC10478639; DOI: 10.1107/s2059798323006289] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/12/2023] [Accepted: 07/19/2023] [Indexed: 08/19/2023] Open
Abstract
In late 2020, the results of CASP14, the 14th event in a series of competitions to assess the latest developments in computational protein structure-prediction methodology, revealed the giant leap forward that had been made by Google's DeepMind in tackling the prediction problem. Their predictions marked the first instance of a competitor achieving a global distance test score better than 90 across all categories of difficulty. This achievement represents both a challenge and an opportunity for the field of experimental structural biology. For structure determination by macromolecular X-ray crystallography, access to highly accurate structure predictions is of great benefit, particularly when it comes to solving the phase problem. Here, details of new utilities and enhanced applications in the CCP4 suite, designed to allow users to exploit predicted models in determining macromolecular structures from X-ray diffraction data, are presented. The focus is mainly on applications that can be used to solve the phase problem through molecular replacement.
Affiliation(s)
- Adam J. Simpkin
  - Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, United Kingdom
- Iracema Caballero
  - Crystallographic Methods, Institute of Molecular Biology of Barcelona (IBMB–CSIC), Barcelona, Spain
- Stuart McNicholas
  - York Structural Biology Laboratory, Department of Chemistry, The University of York, York YO10 5DD, United Kingdom
- Kyle Stevenson
  - UKRI–STFC, Rutherford Appleton Laboratory, Research Complex at Harwell, Didcot OX11 0FA, United Kingdom
- Elisabet Jiménez
  - Crystallographic Methods, Institute of Molecular Biology of Barcelona (IBMB–CSIC), Barcelona, Spain
- Filomeno Sánchez Rodríguez
  - Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, United Kingdom
  - York Structural Biology Laboratory, Department of Chemistry, The University of York, York YO10 5DD, United Kingdom
- Maria Fando
  - UKRI–STFC, Rutherford Appleton Laboratory, Research Complex at Harwell, Didcot OX11 0FA, United Kingdom
- Ville Uski
  - UKRI–STFC, Rutherford Appleton Laboratory, Research Complex at Harwell, Didcot OX11 0FA, United Kingdom
- Charles Ballard
  - UKRI–STFC, Rutherford Appleton Laboratory, Research Complex at Harwell, Didcot OX11 0FA, United Kingdom
- Grzegorz Chojnowski
  - European Molecular Biology Laboratory, Hamburg Unit, Notkestrasse 85, 22607 Hamburg, Germany
- Andrey Lebedev
  - UKRI–STFC, Rutherford Appleton Laboratory, Research Complex at Harwell, Didcot OX11 0FA, United Kingdom
- Eugene Krissinel
  - UKRI–STFC, Rutherford Appleton Laboratory, Research Complex at Harwell, Didcot OX11 0FA, United Kingdom
- Isabel Usón
  - Crystallographic Methods, Institute of Molecular Biology of Barcelona (IBMB–CSIC), Barcelona, Spain
  - ICREA, Institució Catalana de Recerca i Estudis Avançats, Passeig Lluís Companys 23, 08003 Barcelona, Spain
- Daniel J. Rigden
  - Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, United Kingdom
- Ronan M. Keegan
  - Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, United Kingdom
  - UKRI–STFC, Rutherford Appleton Laboratory, Research Complex at Harwell, Didcot OX11 0FA, United Kingdom
11
Echeverria I, Braberg H, Krogan NJ, Sali A. Integrative structure determination of histones H3 and H4 using genetic interactions. FEBS J 2023; 290:2565-2575. [PMID: 35298864; PMCID: PMC9481981; DOI: 10.1111/febs.16435] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2021] [Revised: 02/11/2022] [Accepted: 03/15/2022] [Indexed: 11/28/2022]
Abstract
Integrative structure modeling is increasingly used for determining the architectures of biological assemblies, especially those that are structurally heterogeneous. Recently, we reported on how to convert in vivo genetic interaction measurements into spatial restraints for structural modeling: first, phenotypic profiles are generated for each point mutation and thousands of gene deletions or environmental perturbations. Next, the phenotypic profile similarities are converted into distance restraints on the pairs of mutated residues. We illustrate the approach by determining the structure of the histone H3-H4 complex. The method is implemented in our open-source IMP program, expanding the structural biology toolbox by allowing structural characterization based on in vivo data without the need to purify the target system. We compare genetic interaction measurements to other sources of structural information, such as residue coevolution and deep-learning structure prediction of complex subunits. We also suggest that determining genetic interactions could benefit from new technologies, such as CRISPR-Cas9 approaches to gene editing, especially for mammalian cells. Finally, we highlight the opportunity for using genetic interactions to determine recalcitrant biomolecular structures, such as those of disordered proteins, transient protein assemblies, and host-pathogen protein complexes.
Affiliation(s)
- Ignacia Echeverria
  - Department of Cellular and Molecular Pharmacology, University of California, San Francisco, San Francisco, CA 94158, USA
  - Quantitative Biosciences Institute, University of California, San Francisco, San Francisco, CA 94158, USA
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA 94158, USA
- Hannes Braberg
  - Department of Cellular and Molecular Pharmacology, University of California, San Francisco, San Francisco, CA 94158, USA
  - Quantitative Biosciences Institute, University of California, San Francisco, San Francisco, CA 94158, USA
- Nevan J. Krogan
  - Department of Cellular and Molecular Pharmacology, University of California, San Francisco, San Francisco, CA 94158, USA
  - Quantitative Biosciences Institute, University of California, San Francisco, San Francisco, CA 94158, USA
  - Gladstone Institute of Data Science and Biotechnology, J. David Gladstone Institutes, San Francisco, CA 94158, USA
  - Department of Microbiology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Andrej Sali
  - Quantitative Biosciences Institute, University of California, San Francisco, San Francisco, CA 94158, USA
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA 94158, USA
  - Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, CA 94158, USA
12
Ziegler C, Martin J, Sinner C, Morcos F. Latent generative landscapes as maps of functional diversity in protein sequence space. Nat Commun 2023; 14:2222. [PMID: 37076519; PMCID: PMC10113739; DOI: 10.1038/s41467-023-37958-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 08/19/2022] [Accepted: 04/05/2023] [Indexed: 04/21/2023] Open
Abstract
Variational autoencoders are unsupervised learning models with generative capabilities, when applied to protein data, they classify sequences by phylogeny and generate de novo sequences which preserve statistical properties of protein composition. While previous studies focus on clustering and generative features, here, we evaluate the underlying latent manifold in which sequence information is embedded. To investigate properties of the latent manifold, we utilize direct coupling analysis and a Potts Hamiltonian model to construct a latent generative landscape. We showcase how this landscape captures phylogenetic groupings, functional and fitness properties of several systems including Globins, β-lactamases, ion channels, and transcription factors. We provide support on how the landscape helps us understand the effects of sequence variability observed in experimental data and provides insights on directed and natural protein evolution. We propose that combining generative properties and functional predictive power of variational autoencoders and coevolutionary analysis could be beneficial in applications for protein engineering and design.
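For readers unfamiliar with the Potts Hamiltonian named in this abstract, the following is a minimal sketch of how such an energy is typically evaluated for one sequence. It is not the authors' implementation. It assumes the common sign convention H(s) = -Σ_i h_i(s_i) - Σ_{i<j} J_ij(s_i, s_j), with residues encoded as integer indices; the names `potts_energy`, `fields`, and `couplings` are hypothetical.

```python
def potts_energy(seq, fields, couplings):
    """Potts energy of one integer-encoded sequence.

    seq       -- list of state indices, one per site
    fields    -- fields[i][a]: local field for state a at site i
    couplings -- couplings[i][j][a][b]: coupling between state a at
                 site i and state b at site j (only i < j is read)
    """
    energy = 0.0
    # Single-site terms.
    for i, a in enumerate(seq):
        energy -= fields[i][a]
    # Pairwise terms, each unordered pair counted once.
    for i in range(len(seq)):
        for j in range(i + 1, len(seq)):
            energy -= couplings[i][j][seq[i]][seq[j]]
    return energy


# Toy model: 3 sites, 2 states, one non-zero field and one non-zero coupling.
L, q = 3, 2
fields = [[0.0] * q for _ in range(L)]
couplings = [[[[0.0] * q for _ in range(q)] for _ in range(L)] for _ in range(L)]
fields[2][0] = 0.5
couplings[0][1][1][1] = 2.0

print(potts_energy([1, 1, 0], fields, couplings))
```

Under this convention, lower energy corresponds to sequences the model considers more favorable; other sign conventions appear in the literature.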
Affiliation(s)
- Cheyenne Ziegler
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, 75080, USA
- Jonathan Martin
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, 75080, USA
- Claude Sinner
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, 75080, USA
- Faruck Morcos
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, 75080, USA
  - Department of Bioengineering, University of Texas at Dallas, Richardson, TX, 75080, USA
  - Center for Systems Biology, University of Texas at Dallas, Richardson, TX, 75080, USA

13
Gandarilla-Pérez CA, Pinilla S, Bitbol AF, Weigt M. Combining phylogeny and coevolution improves the inference of interaction partners among paralogous proteins. PLoS Comput Biol 2023; 19:e1011010. [PMID: 36996234 PMCID: PMC10089317 DOI: 10.1371/journal.pcbi.1011010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2022] [Revised: 04/11/2023] [Accepted: 03/08/2023] [Indexed: 04/01/2023] Open
Abstract
Predicting protein-protein interactions from sequences is an important goal of computational biology. Various sources of information can be used to this end. Starting from the sequences of two interacting protein families, one can use phylogeny or residue coevolution to infer which paralogs are specific interaction partners within each species. We show that these two signals can be combined to improve the performance of the inference of interaction partners among paralogs. For this, we first align the sequence-similarity graphs of the two families through simulated annealing, yielding a robust partial pairing. We next use this partial pairing to seed a coevolution-based iterative pairing algorithm. This combined method improves performance over either separate method. The improvement obtained is striking in the difficult cases where the average number of paralogs per species is large or where the total number of sequences is modest.
Affiliation(s)
- Carlos A Gandarilla-Pérez
  - Facultad de Física, Universidad de la Habana, San Lázaro y L, Vedado, Habana, Cuba
  - Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative (LCQB, UMR 7238), Paris, France
- Sergio Pinilla
  - Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative (LCQB, UMR 7238), Paris, France
  - Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire Jean Perrin (UMR 8237), Paris, France
- Anne-Florence Bitbol
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Martin Weigt
  - Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative (LCQB, UMR 7238), Paris, France

14
Huang Y, Wuchty S, Zhou Y, Zhang Z. SGPPI: structure-aware prediction of protein-protein interactions in rigorous conditions with graph convolutional network. Brief Bioinform 2023; 24:6995378. [PMID: 36682013 DOI: 10.1093/bib/bbad020] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 09/09/2022] [Revised: 11/17/2022] [Accepted: 01/05/2023] [Indexed: 01/23/2023] Open
Abstract
While deep learning (DL)-based models have emerged as powerful approaches to predict protein-protein interactions (PPIs), the reliance on explicit similarity measures (e.g. sequence similarity and network neighborhood) to known interacting proteins makes these methods ineffective in dealing with novel proteins. The advent of AlphaFold2 presents a significant opportunity and also a challenge to predict PPIs in a straightforward way based on monomer structures while controlling bias from protein sequences. In this work, we established Structure and Graph-based Predictions of Protein Interactions (SGPPI), a structure-based DL framework for predicting PPIs, using the graph convolutional network. In particular, SGPPI focused on protein patches on the protein-protein binding interfaces and extracted the structural, geometric and evolutionary features from the residue contact map to predict PPIs. We demonstrated that our model outperforms traditional machine learning methods and state-of-the-art DL-based methods using non-representation-bias benchmark datasets. Moreover, our model trained on a human dataset can be reliably transferred to predict yeast PPIs, indicating that SGPPI can capture converging structural features of protein interactions across various species. The implementation of SGPPI is available at https://github.com/emerson106/SGPPI.
Affiliation(s)
- Yan Huang
  - State Key Laboratory of Livestock and Poultry Biotechnology Breeding, College of Biological Sciences, China Agricultural University, Beijing 100193, China
  - Department of Biomedical Informatics, Ministry of Education Key Laboratory of Molecular Cardiovascular Sciences, Center for Non-Coding RNA Medicine, School of Basic Medical Sciences, Peking University, Beijing 100191, China
- Stefan Wuchty
  - Department of Computer Science, University of Miami, Coral Gables, FL 33146, USA
  - Department of Biology, University of Miami, Coral Gables, FL 33146, USA
  - Sylvester Comprehensive Cancer Center, University of Miami, Miami, FL 33136, USA
  - Institute of Data Science and Computing, University of Miami, Coral Gables, FL 33146, USA
- Yuan Zhou
  - Department of Biomedical Informatics, Ministry of Education Key Laboratory of Molecular Cardiovascular Sciences, Center for Non-Coding RNA Medicine, School of Basic Medical Sciences, Peking University, Beijing 100191, China
- Ziding Zhang
  - State Key Laboratory of Livestock and Poultry Biotechnology Breeding, College of Biological Sciences, China Agricultural University, Beijing 100193, China

15
Kleeorin Y, Russ WP, Rivoire O, Ranganathan R. Undersampling and the inference of coevolution in proteins. Cell Syst 2023; 14:210-219.e7. [PMID: 36693377 PMCID: PMC10911952 DOI: 10.1016/j.cels.2022.12.013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2021] [Revised: 01/02/2022] [Accepted: 12/23/2022] [Indexed: 01/24/2023]
Abstract
Protein structure, function, and evolution depend on local and collective epistatic interactions between amino acids. A powerful approach to defining these interactions is to construct models of couplings between amino acids that reproduce the empirical statistics (frequencies and correlations) observed in sequences comprising a protein family. The top couplings are then interpreted. Here, we show that as currently implemented, this inference unequally represents epistatic interactions, a problem that fundamentally arises from limited sampling of sequences in the context of distinct scales at which epistasis occurs in proteins. We show that these issues explain the ability of current approaches to predict tertiary contacts between amino acids and the inability to obviously expose larger networks of functionally relevant, collectively evolving residues called sectors. This work provides a necessary foundation for more deeply understanding and improving evolution-based models of proteins.
Affiliation(s)
- Yaakov Kleeorin
  - Center for Physics of Evolving Systems, Department of Biochemistry & Molecular Biology, University of Chicago, Chicago, IL 60637, USA
- William P Russ
  - Green Center for Systems Biology, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA
- Olivier Rivoire
  - Center for Interdisciplinary Research in Biology (CIRB), Collège de France, CNRS, INSERM, PSL Research University, 75005 Paris, France
- Rama Ranganathan
  - Center for Physics of Evolving Systems, Department of Biochemistry & Molecular Biology, University of Chicago, Chicago, IL 60637, USA
  - The Pritzker School for Molecular Engineering, University of Chicago, Chicago, IL 60637, USA

16
Malinverni D, Babu MM. Data-driven design of orthogonal protein-protein interactions. Sci Signal 2023; 16:eabm4484. [PMID: 36853962 DOI: 10.1126/scisignal.abm4484] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/02/2023]
Abstract
Engineering protein-protein interactions to generate new functions presents a challenge with great potential for many applications, ranging from therapeutics to synthetic biology. To avoid unwanted cross-talk with preexisting protein interaction networks in a cell, the specificity and selectivity of newly engineered proteins must be controlled. Here, we developed a computational strategy that mimics gene duplication and the divergence of preexisting interacting protein pairs to design new interactions. We used the bacterial PhoQ-PhoP two-component system as a model system to demonstrate the feasibility of this strategy and validated the approach with known experimental results. The designed protein pairs are predicted to exclusively interact with each other and to be insulated from potential cross-talk with their native partners. Thus, our approach enables exploration of uncharted regions of the protein sequence space and the design of new interacting protein pairs.
Affiliation(s)
- Duccio Malinverni
  - MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge CB2 0QH, UK
  - Department of Structural Biology and Center of Excellence for Data Driven Discovery, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
- M Madan Babu
  - MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge CB2 0QH, UK
  - Department of Structural Biology and Center of Excellence for Data Driven Discovery, St. Jude Children's Research Hospital, Memphis, TN 38105, USA

17
Sgarbossa D, Lupo U, Bitbol AF. Generative power of a protein language model trained on multiple sequence alignments. eLife 2023; 12:e79854. [PMID: 36734516 PMCID: PMC10038667 DOI: 10.7554/elife.79854] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 04/28/2022] [Accepted: 02/02/2023] [Indexed: 02/04/2023] Open
Abstract
Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families and learn constraints associated with protein structure and function. They thus open the possibility for generating novel sequences belonging to protein families. Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates to this end. We propose and test an iterative method that directly employs the masked language modeling objective to generate sequences using MSA Transformer. We demonstrate that the resulting sequences score as well as natural sequences, for homology, coevolution, and structure-based measures. For large protein families, our synthetic sequences have similar or better properties compared to sequences generated by Potts models, including experimentally validated ones. Moreover, for small protein families, our generation method based on MSA Transformer outperforms Potts models. Our method also more accurately reproduces the higher-order statistics and the distribution of sequences in sequence space of natural data than Potts models. MSA Transformer is thus a strong candidate for protein sequence generation and protein design.
Affiliation(s)
- Damiano Sgarbossa
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Umberto Lupo
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Anne-Florence Bitbol
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland

18
Dietler N, Lupo U, Bitbol AF. Impact of phylogeny on structural contact inference from protein sequence data. J R Soc Interface 2023; 20:20220707. [PMID: 36751926 PMCID: PMC9905998 DOI: 10.1098/rsif.2022.0707] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2022] [Accepted: 01/09/2023] [Indexed: 02/09/2023] Open
Abstract
Local and global inference methods have been developed to infer structural contacts from multiple sequence alignments of homologous proteins. They rely on correlations in amino acid usage at contacting sites. Because homologous proteins share a common ancestry, their sequences also feature phylogenetic correlations, which can impair contact inference. We investigate this effect by generating controlled synthetic data from a minimal model where the importance of contacts and of phylogeny can be tuned. We demonstrate that global inference methods, specifically Potts models, are more resilient to phylogenetic correlations than local methods, based on covariance or mutual information. This holds whether or not phylogenetic corrections are used, and may explain the success of global methods. We analyse the roles of selection strength and of phylogenetic relatedness. We show that sites that mutate early in the phylogeny yield false positive contacts. We consider natural data and realistic synthetic data, and our findings generalize to these cases. Our results highlight the impact of phylogeny on contact prediction from protein sequences and illustrate the interplay between the rich structure of biological data and inference.
Affiliation(s)
- Nicola Dietler
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Umberto Lupo
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Anne-Florence Bitbol
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland

19
Ward KM, Pickett BD, Ebbert MTW, Kauwe JSK, Miller JB. Web-Based Protein Interactions Calculator Identifies Likely Proteome Coevolution with Alzheimer’s Disease-Associated Proteins. Genes (Basel) 2022; 13:genes13081346. [PMID: 36011253 PMCID: PMC9407263 DOI: 10.3390/genes13081346] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/09/2022] [Revised: 07/22/2022] [Accepted: 07/23/2022] [Indexed: 11/19/2022] Open
Abstract
Protein–protein functional interactions arise from either transitory or permanent biomolecular associations and often lead to the coevolution of the interacting residues. Although mutual information has traditionally been used to identify coevolving residues within the same protein, its application between coevolving proteins remains largely uncharacterized. Therefore, we developed the Protein Interactions Calculator (PIC) to efficiently identify coevolving residues between two protein sequences using mutual information. We verified the algorithm using 2102 known human protein interactions and 233 known bacterial protein interactions, with 1975 and 252 non-interacting protein controls, respectively. The average PIC score for known human protein interactions was 4.5 times higher than for non-interacting proteins (p = 1.03 × 10^−108) and 1.94 times higher in bacteria (p = 1.22 × 10^−35). We then used the PIC scores to determine the probability that two proteins interact. Using those probabilities, we paired 37 Alzheimer’s disease-associated proteins with 8608 other proteins and determined the likelihood that each pair interacts, which we report through a web interface. The PIC had significantly higher sensitivity and residue-specific resolution not available in other algorithms. Therefore, we propose that the PIC can be used to prioritize potential protein interactions, which can lead to a better understanding of biological processes and additional therapeutic targets belonging to protein interaction groups.
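As an illustration of the mutual-information measure this abstract relies on, here is a minimal sketch that computes MI (in bits) between two aligned columns, one from each protein. It is not the PIC algorithm: the abstract does not give PIC's scoring details, so the function name `mutual_information` and the plug-in frequency estimate without any correction are assumptions made for this example.

```python
import math
from collections import Counter


def mutual_information(col_a, col_b):
    """Plug-in mutual information (bits) between two alignment columns.

    col_a, col_b -- equal-length sequences of residue symbols, where
                    position k in both columns comes from the same species.
    """
    if len(col_a) != len(col_b):
        raise ValueError("columns must have the same length")
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        # p_ab / (p_a * p_b) simplifies to c * n / (count_a * count_b).
        mi += p_ab * math.log2(c * n / (count_a[a] * count_b[b]))
    return mi


# Perfectly covarying columns versus statistically independent columns.
print(mutual_information("AAGG", "CCTT"))
print(mutual_information("AAGG", "CTCT"))
```

With small alignments the plug-in estimate is biased upward, so practical tools usually apply some form of correction or background normalization before interpreting scores.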
Affiliation(s)
- Katrisa M. Ward
  - Department of Biology, Brigham Young University, Provo, UT 84602, USA
- Brandon D. Pickett
  - Department of Biology, Brigham Young University, Provo, UT 84602, USA
- Mark T. W. Ebbert
  - Sanders-Brown Center on Aging, University of Kentucky, Lexington, KY 40536, USA
  - Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky, Lexington, KY 40506, USA
  - Department of Neuroscience, University of Kentucky, Lexington, KY 40506, USA
- John S. K. Kauwe
  - Department of Biology, Brigham Young University, Provo, UT 84602, USA
- Justin B. Miller
  - Sanders-Brown Center on Aging, University of Kentucky, Lexington, KY 40536, USA
  - Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky, Lexington, KY 40506, USA
  - Department of Pathology and Laboratory Medicine, University of Kentucky, Lexington, KY 40506, USA
  - Correspondence: Tel.: +1-859-562-0333

20
Si Y, Yan C. Protein complex structure prediction powered by multiple sequence alignments of interologs from multiple taxonomic ranks and AlphaFold2. Brief Bioinform 2022; 23:6596987. [PMID: 35649388 DOI: 10.1093/bib/bbac208] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/25/2022] [Revised: 04/17/2022] [Accepted: 05/05/2022] [Indexed: 12/19/2022] Open
Abstract
AlphaFold2 can predict protein complex structures as long as a multiple sequence alignment (MSA) of the interologs of the target protein-protein interaction (PPI) can be provided. In this study, a simplified phylogeny-based approach was applied to generate the MSA of interologs, which was then used as the input to AlphaFold2 for protein complex structure prediction. In this protocol, extensively benchmarked on a nonredundant PPI dataset including 107 bacterial PPIs and 442 eukaryotic PPIs, we show that complex structures of 79.5% of the bacterial PPIs and 49.8% of the eukaryotic PPIs can be successfully predicted, a significantly better performance than that obtained with MSAs of interologs prepared by two existing approaches. Considering that PPIs may not be conserved in species separated by long evolutionary distances, we further restricted interologs in the MSA to different taxonomic ranks of the species of the target PPI in protein complex structure prediction. We found that the success rates can be increased to 87.9% for the bacterial PPIs and 56.3% for the eukaryotic PPIs if interologs in the MSA are restricted to a specific taxonomic rank of the species of each target PPI. Finally, we show that the optimal taxonomic ranks for protein complex structure prediction can be selected with the application of the predicted template modeling (TM) scores of the output models.
Affiliation(s)
- Yunda Si
  - School of Physics, Huazhong University of Science and Technology, China
- Chengfei Yan
  - School of Physics, Huazhong University of Science and Technology, China

21
Braberg H, Echeverria I, Kaake RM, Sali A, Krogan NJ. From systems to structure - using genetic data to model protein structures. Nat Rev Genet 2022; 23:342-354. [PMID: 35013567 PMCID: PMC8744059 DOI: 10.1038/s41576-021-00441-w] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Accepted: 11/25/2021] [Indexed: 12/11/2022]
Abstract
Understanding the effects of genetic variation is a fundamental problem in biology that requires methods to analyse both physical and functional consequences of sequence changes at systems-wide and mechanistic scales. To achieve a systems view, protein interaction networks map which proteins physically interact, while genetic interaction networks inform on the phenotypic consequences of perturbing these protein interactions. Until recently, understanding the molecular mechanisms that underlie these interactions often required biophysical methods to determine the structures of the proteins involved. The past decade has seen the emergence of new approaches based on coevolution, deep mutational scanning and genome-scale genetic or chemical-genetic interaction mapping that enable modelling of the structures of individual proteins or protein complexes. Here, we review the emerging use of large-scale genetic datasets and deep learning approaches to model protein structures and their interactions, and discuss the integration of structural data from different sources.
Affiliation(s)
- Hannes Braberg
  - Department of Cellular and Molecular Pharmacology, University of California, San Francisco, San Francisco, CA, USA
  - Quantitative Biosciences Institute, University of California, San Francisco, San Francisco, CA, USA
- Ignacia Echeverria
  - Department of Cellular and Molecular Pharmacology, University of California, San Francisco, San Francisco, CA, USA
  - Quantitative Biosciences Institute, University of California, San Francisco, San Francisco, CA, USA
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, USA
- Robyn M Kaake
  - Department of Cellular and Molecular Pharmacology, University of California, San Francisco, San Francisco, CA, USA
  - Quantitative Biosciences Institute, University of California, San Francisco, San Francisco, CA, USA
  - Gladstone Institutes, San Francisco, CA, USA
- Andrej Sali
  - Quantitative Biosciences Institute, University of California, San Francisco, San Francisco, CA, USA
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, USA
  - Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, CA, USA
- Nevan J Krogan
  - Department of Cellular and Molecular Pharmacology, University of California, San Francisco, San Francisco, CA, USA
  - Quantitative Biosciences Institute, University of California, San Francisco, San Francisco, CA, USA
  - Gladstone Institutes, San Francisco, CA, USA
  - Department of Microbiology, Icahn School of Medicine at Mount Sinai, New York, NY, USA

22
Gerardos A, Dietler N, Bitbol AF. Correlations from structure and phylogeny combine constructively in the inference of protein partners from sequences. PLoS Comput Biol 2022; 18:e1010147. [PMID: 35576238 PMCID: PMC9135348 DOI: 10.1371/journal.pcbi.1010147] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/22/2021] [Revised: 05/26/2022] [Accepted: 04/27/2022] [Indexed: 11/19/2022] Open
Abstract
Inferring protein-protein interactions from sequences is an important task in computational biology. Recent methods based on Direct Coupling Analysis (DCA) or Mutual Information (MI) make it possible to find interaction partners among paralogs of two protein families. Does successful inference mainly rely on correlations from structural contacts or from phylogeny, or both? Do these two types of signal combine constructively or hinder each other? To address these questions, we generate and analyze synthetic data produced using a minimal model that allows us to control the amounts of structural constraints and phylogeny. We show that correlations from these two sources combine constructively to increase the performance of partner inference by DCA or MI. Furthermore, signal from phylogeny can rescue partner inference when signal from contacts becomes less informative, including in the realistic case where inter-protein contacts are restricted to a small subset of sites. We also demonstrate that DCA-inferred couplings between non-contact pairs of sites improve partner inference in the presence of strong phylogeny, while deteriorating it otherwise. Moreover, restricting to non-contact pairs of sites preserves inference performance in the presence of strong phylogeny. In a natural data set, as well as in realistic synthetic data based on it, we find that non-contact pairs of sites contribute positively to partner inference performance, and that restricting to them preserves performance, evidencing an important role of phylogeny.
Affiliation(s)
- Andonis Gerardos
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Nicola Dietler
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Anne-Florence Bitbol
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland

23
Chi H, Zhou Q, Tutol JN, Phelps SM, Lee J, Kapadia P, Morcos F, Dodani SC. Coupling a Live Cell Directed Evolution Assay with Coevolutionary Landscapes to Engineer an Improved Fluorescent Rhodopsin Chloride Sensor. ACS Synth Biol 2022; 11:1627-1638. [PMID: 35389621 PMCID: PMC9184236 DOI: 10.1021/acssynbio.2c00033] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/29/2022]
Abstract
Our understanding of chloride in biology has been accelerated through the application of fluorescent protein-based sensors in living cells. These sensors can be generated and diversified to have a range of properties using laboratory-guided evolution. Recently, we established that the fluorescent proton-pumping rhodopsin wtGR from Gloeobacter violaceus can be converted into a fluorescent sensor for chloride. To unlock this non-natural function, a single point mutation at the Schiff counterion position (D121V) was introduced into wtGR fused to cyan fluorescent protein (CFP) resulting in GR1-CFP. Here, we have integrated coevolutionary analysis with directed evolution to understand how the rhodopsin sequence space can be explored and engineered to improve this starting point. We first show how evolutionary couplings are predictive of functional sites in the rhodopsin family and how a fitness metric based on a sequence can be used to quantify the known proton-pumping activities of GR-CFP variants. Then, we couple this ability to predict potential functional outcomes with a screening and selection assay in live Escherichia coli to reduce the mutational search space of five residues along the proton-pumping pathway in GR1-CFP. This iterative selection process results in GR2-CFP with four additional mutations: E132K, A84K, T125C, and V245I. Finally, bulk and single fluorescence measurements in live E. coli reveal that GR2-CFP is a reversible, ratiometric fluorescent sensor for extracellular chloride with an improved dynamic range. We anticipate that our framework will be applicable to other systems, providing a more efficient methodology to engineer fluorescent protein-based sensors with desired properties.

24
Hernández DG, Sober SJ, Nemenman I. Unsupervised Bayesian Ising Approximation for decoding neural activity and other biological dictionaries. eLife 2022; 11:e68192. [PMID: 35315769 PMCID: PMC8989415 DOI: 10.7554/elife.68192] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 03/09/2021] [Accepted: 03/19/2022] [Indexed: 11/13/2022] Open
Abstract
The problem of deciphering how low-level patterns (action potentials in the brain, amino acids in a protein, etc.) drive high-level biological features (sensorimotor behavior, enzymatic function) represents the central challenge of quantitative biology. The lack of general methods for doing so from the size of datasets that can be collected experimentally severely limits our understanding of the biological world. For example, in neuroscience, some sensory and motor codes have been shown to consist of precisely timed multi-spike patterns. However, the combinatorial complexity of such pattern codes has precluded development of methods for their comprehensive analysis. Thus, just as it is hard to predict a protein's function based on its sequence, we still do not understand how to accurately predict an organism's behavior based on neural activity. Here, we introduce the unsupervised Bayesian Ising Approximation (uBIA) for solving this class of problems. We demonstrate its utility in an application to neural data, detecting precisely timed spike patterns that code for specific motor behaviors in a songbird vocal system. In data recorded during singing from neurons in a vocal control region, our method detects such codewords with an arbitrary number of spikes, does so from small data sets, and accounts for dependencies in occurrences of codewords. Detecting such comprehensive motor control dictionaries can improve our understanding of skilled motor control and the neural bases of sensorimotor learning in animals. To further illustrate the utility of uBIA, we used it to identify the distinct sets of activity patterns that encode vocal motor exploration versus typical song production. Crucially, our method can be used not only for analysis of neural systems, but also for understanding the structure of correlations in other biological and nonbiological datasets.
Affiliation(s)
- Damián G Hernández
  - Department of Medical Physics, Centro Atómico Bariloche and Instituto Balseiro, Bariloche, Argentina
  - Department of Physics, Emory University, Atlanta, United States
- Samuel J Sober
  - Department of Biology, Emory University, Atlanta, United States
- Ilya Nemenman
  - Department of Physics, Emory University, Atlanta, United States
  - Department of Biology, Emory University, Atlanta, United States
  - Initiative in Theory and Modeling of Living Systems, Atlanta, United States
25
Extracting phylogenetic dimensions of coevolution reveals hidden functional signals. Sci Rep 2022; 12:820. [PMID: 35039514 PMCID: PMC8764114 DOI: 10.1038/s41598-021-04260-1] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 09/02/2021] [Accepted: 12/17/2021] [Indexed: 11/08/2022]
Abstract
Despite the structural and functional information contained in the statistical coupling between pairs of residues in a protein, coevolution associated with function is often obscured by artifactual signals such as genetic drift, which shapes a protein's phylogenetic history and gives rise to concurrent variation between protein sequences that is not driven by selection for function. Here, we introduce a background model for phylogenetic contributions of statistical coupling that separates the coevolution signal due to inter-clade and intra-clade sequence comparisons and demonstrate that coevolution can be measured on multiple phylogenetic timescales within a single protein. Our method, nested coevolution (NC), can be applied as an extension to any coevolution metric. We use NC to demonstrate that poorly conserved residues can nonetheless have important roles in protein function. Moreover, NC improved the structural-contact predictions of several coevolution-based methods, particularly in subsampled alignments with fewer sequences. NC also lowered the noise in detecting functional sectors of collectively coevolving residues. Sectors of coevolving residues identified after application of NC were more spatially compact and phylogenetically distinct from the rest of the protein, and strongly enriched for mutations that disrupt protein activity. Thus, our conceptualization of the phylogenetic separation of coevolution provides the potential to further elucidate relationships among protein evolution, function, and genetic diseases.
26
Aamir M, Karmakar P, Singh VK, Kashyap SP, Pandey S, Singh BK, Singh PM, Singh J. A novel insight into transcriptional and epigenetic regulation underlying sex expression and flower development in melon (Cucumis melo L.). Physiologia Plantarum 2021; 173:1729-1764. [PMID: 33547804 DOI: 10.1111/ppl.13357] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/10/2020] [Revised: 01/29/2021] [Accepted: 02/01/2021] [Indexed: 06/12/2023]
Abstract
Melon (Cucumis melo L.) is an important cucurbit and is considered a model plant for studying sex determination. The four most common sexual morphotypes in melon are monoecious (A-G-M), gynoecious (--ggM-), andromonoecious (A-G-mm), and hermaphrodite (--ggmm). Sex expression in melons is complex, as the genes and associated networks that govern it are not fully explored. Recently, RNA-seq transcriptomic profiling and ChIP-qPCR analysis, integrated with gene ontology annotation and Kyoto Encyclopedia of Genes and Genomes pathways, predicted differentially expressed genes, including sex-specific ACS and ACO genes, to be involved in regulating sex expression, phytohormonal cross-talk, signal transduction, and secondary metabolism in melons. Transcriptional control through genetic interactions among ACS7, ACS11, and WIP1 in an epistatic or hypostatic manner, together with the epigenetic recruitment of H3K9ac and H3K27me3, determines overall sex expression. Alignment of protein sequences for establishing phylogenetic evolution, motif comparison, and protein-protein interaction supported structural conservation, while the presence of conserved hydrophilic and charged residues across the diverged evolutionary groups predicted functional conservation of the ACS protein. The presence of putative cis-binding elements or DNA motifs, and their comparison with the DAP-seq-based cistrome and epicistrome of Arabidopsis, unraveled strong ancestry of melons with Arabidopsis. Motif comparison analysis also characterized putative genes and transcription factors involved in ethylene biosynthesis, signal transduction, and hormonal cross-talk related to sex expression. Overall, we have comprehensively reviewed research findings for a deeper insight into transcriptional and epigenetic regulation of sex expression and flower development in melons.
Affiliation(s)
- Mohd Aamir
  - Division of Crop Improvement, ICAR-Indian Institute of Vegetable Research (ICAR-IIVR), Varanasi, India
- Pradip Karmakar
  - Division of Crop Improvement, ICAR-Indian Institute of Vegetable Research (ICAR-IIVR), Varanasi, India
- Vinay Kumar Singh
  - Centre for Bioinformatics, School of Biotechnology, Institute of Science, Banaras Hindu University, Varanasi, India
- Sarvesh Pratap Kashyap
  - Division of Crop Improvement, ICAR-Indian Institute of Vegetable Research (ICAR-IIVR), Varanasi, India
- Sudhakar Pandey
  - Division of Crop Improvement, ICAR-Indian Institute of Vegetable Research (ICAR-IIVR), Varanasi, India
- Binod Kumar Singh
  - Division of Crop Improvement, ICAR-Indian Institute of Vegetable Research (ICAR-IIVR), Varanasi, India
- Prabhakar Mohan Singh
  - Division of Crop Improvement, ICAR-Indian Institute of Vegetable Research (ICAR-IIVR), Varanasi, India
- Jagdish Singh
  - Division of Crop Improvement, ICAR-Indian Institute of Vegetable Research (ICAR-IIVR), Varanasi, India
27
Malbranke C, Bikard D, Cocco S, Monasson R. Improving sequence-based modeling of protein families using secondary-structure quality assessment. Bioinformatics 2021; 37:4083-4090. [PMID: 34117879 PMCID: PMC9502231 DOI: 10.1093/bioinformatics/btab442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2021] [Revised: 06/03/2021] [Accepted: 06/16/2021] [Indexed: 12/03/2022]
Abstract
MOTIVATION: Modeling of protein family sequence distributions from homologous sequence data has recently received considerable attention, in particular for structure and function prediction, as well as for protein design. Direct coupling analysis, a method to infer effective pairwise interactions between residues, was shown to capture important structural constraints and to successfully generate functional protein sequences. Building on this and other graphical models, we introduce a new framework to assess the quality of the secondary structures of the generated sequences with respect to reference structures for the family. RESULTS: We introduce two scoring functions, called Dot Product and Pattern Matching, that characterize the likelihood that the secondary structure of a protein sequence matches a reference structure. We test these scores on published experimental protein mutagenesis and design datasets, and show improvement in the detection of nonfunctional sequences. We also show that use of these scores helps reject nonfunctional sequences generated by graphical models (Restricted Boltzmann Machines) learned from homologous sequence alignments. AVAILABILITY AND IMPLEMENTATION: Data and code are available at https://github.com/CyrilMa/ssqa. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Cyril Malbranke
  - Laboratory of Physics of the Ecole Normale Superieure, PSL Research, CNRS UMR 8023, Sorbonne Université, Université de Paris, Paris, France
  - Synthetic Biology, Microbiology Department, Institut Pasteur, Paris, France
- David Bikard
  - Synthetic Biology, Microbiology Department, Institut Pasteur, Paris, France
- Simona Cocco
  - Laboratory of Physics of the Ecole Normale Superieure, PSL Research, CNRS UMR 8023, Sorbonne Université, Université de Paris, Paris, France
- Rémi Monasson
  - Laboratory of Physics of the Ecole Normale Superieure, PSL Research, CNRS UMR 8023, Sorbonne Université, Université de Paris, Paris, France
28
Xie Z, Xu J. Deep graph learning of inter-protein contacts. Bioinformatics 2021; 38:947-953. [PMID: 34755837 PMCID: PMC8796373 DOI: 10.1093/bioinformatics/btab761] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 08/13/2021] [Revised: 10/06/2021] [Accepted: 11/04/2021] [Indexed: 02/03/2023]
Abstract
MOTIVATION: Inter-protein (interfacial) contact prediction is very useful for in silico structural characterization of protein-protein interactions. Although deep learning has been applied to this problem, its accuracy is not as good as for intra-protein contact prediction. RESULTS: We propose a new deep learning method, GLINTER (Graph Learning of INTER-protein contacts), for interfacial contact prediction of dimers, leveraging a rotationally invariant representation of protein tertiary structures and a pretrained language model of multiple sequence alignments. Tested on the 13th and 14th CASP-CAPRI datasets, the average top L/10 precision achieved by GLINTER is 54% on the homodimers and 52% on all the dimers, much higher than the 30% obtained by the latest deep learning method DeepHomo on the homodimers and the 15% obtained by BIPSPI on all the dimers. Our experiments show that GLINTER-predicted contacts help improve selection of docking decoys. AVAILABILITY AND IMPLEMENTATION: The software is available at https://github.com/zw2x/glinter. The datasets are available at https://github.com/zw2x/glinter/data. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
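The "top L/10 precision" quoted in this abstract is a standard contact-prediction metric: the fraction of the k highest-scoring residue pairs that are true contacts, with k = L/10. The following is a minimal sketch of that metric, not GLINTER's own evaluation code. The function name `top_k_precision`, the toy matrices, and the choice of L are illustrative assumptions; the abstract does not say how L is defined for a dimer.

```python
import numpy as np

def top_k_precision(scores, contacts, k):
    """Fraction of the k highest-scoring residue pairs that are true contacts.

    scores   : 2D array of predicted contact scores (rows: residues of chain A,
               columns: residues of chain B); hypothetical input layout.
    contacts : 2D boolean array of the same shape, True where a contact exists.
    k        : number of top predictions to evaluate, e.g. L // 10.
    """
    scores = np.asarray(scores, dtype=float)
    contacts = np.asarray(contacts, dtype=bool)
    if scores.shape != contacts.shape:
        raise ValueError("scores and contacts must have the same shape")
    k = max(1, min(int(k), scores.size))
    # Sort pairs by decreasing score; a stable sort keeps ties deterministic.
    order = np.argsort(-scores.ravel(), kind="stable")[:k]
    return float(contacts.ravel()[order].mean())

# Toy example with made-up numbers: the two best-scoring pairs are (0, 0) and
# (1, 1); only the first is a true contact.
toy_scores = np.array([[0.9, 0.1, 0.3],
                       [0.2, 0.8, 0.05]])
toy_contacts = np.array([[True, False, False],
                         [False, False, True]])
toy_precision = top_k_precision(toy_scores, toy_contacts, k=2)
```

In practice k would be derived from a sequence length, for example `k = L // 10`, once a convention for L has been fixed.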
Affiliation(s)
- Ziwei Xie
  - Toyota Technological Institute at Chicago, Chicago, IL 60637, USA
- Jinbo Xu
  - To whom correspondence should be addressed.
29
Mehrabiani KM, Cheng RR, Onuchic JN. Expanding Direct Coupling Analysis to Identify Heterodimeric Interfaces from Limited Protein Sequence Data. J Phys Chem B 2021; 125:11408-11417. [PMID: 34618469 DOI: 10.1021/acs.jpcb.1c07145] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/28/2022]
Abstract
Direct coupling analysis (DCA) is a global statistical approach that uses information encoded in protein sequence data to predict spatial contacts in a three-dimensional structure of a folded protein. DCA has been widely used to predict the monomeric fold at amino acid resolution and to identify biologically relevant interaction sites within a folded protein. Going beyond single proteins, DCA has also been used to identify spatial contacts that stabilize the interaction in protein complex formation. However, extracting this higher order information necessary to predict dimer contacts presents a significant challenge. A DCA evolutionary signal is much stronger at the single protein level (intraprotein contacts) than at the protein-protein interface (interprotein contacts). Therefore, if DCA-derived information is to be used to predict the structure of these complexes, there is a need to identify statistically significant DCA predictions. We propose a simple Z-score measure that can filter good predictions despite noisy, limited data. This new methodology not only improves our prediction ability but also provides a quantitative measure for the validity of the prediction.
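The abstract does not give the definition of its Z-score measure, so the sketch below only illustrates the general idea under a stated assumption: each inter-protein pair score is standardized against the mean and standard deviation of all pair scores, and pairs above a threshold are kept. The function name `zscore_filter`, the threshold, and the toy scores are hypothetical and should not be read as the authors' method.

```python
import numpy as np

def zscore_filter(scores, threshold=3.0):
    """Standardize pair scores and return (z, indices of pairs with z >= threshold).

    Assumption: the bulk of the score distribution acts as the background,
    so the global mean and standard deviation are used for standardization.
    """
    scores = np.asarray(scores, dtype=float)
    sigma = scores.std()
    if sigma == 0.0:
        # All scores identical: nothing stands out from the background.
        return np.zeros_like(scores), np.array([], dtype=int)
    z = (scores - scores.mean()) / sigma
    keep = np.flatnonzero(z >= threshold)
    return z, keep

# Toy example with made-up scores: nine background pairs and one outlier.
toy_pair_scores = [0.1] * 9 + [1.0]
toy_z, toy_keep = zscore_filter(toy_pair_scores, threshold=2.0)
```

A robust variant could replace the mean and standard deviation with the median and median absolute deviation, which is less sensitive to the outliers one is trying to detect.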
Affiliation(s)
- Kareem M Mehrabiani
  - Center for Theoretical Biological Physics, Rice University, Houston, Texas 77005, United States
  - Systems, Synthetic, and Physical Biology, Rice University, Houston, Texas 77005, United States
- Ryan R Cheng
  - Center for Theoretical Biological Physics, Rice University, Houston, Texas 77005, United States
- José N Onuchic
  - Center for Theoretical Biological Physics, Rice University, Houston, Texas 77005, United States
  - Systems, Synthetic, and Physical Biology, Rice University, Houston, Texas 77005, United States
  - Department of Physics & Astronomy, Rice University, Houston, Texas 77005, United States
  - Department of Chemistry, Rice University, Houston, Texas 77005, United States
  - Department of Biosciences, Rice University, Houston, Texas 77005, United States
30
Jiang XL, Dimas RP, Chan CTY, Morcos F. Coevolutionary methods enable robust design of modular repressors by reestablishing intra-protein interactions. Nat Commun 2021; 12:5592. [PMID: 34552074 PMCID: PMC8458406 DOI: 10.1038/s41467-021-25851-6] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 05/12/2021] [Accepted: 09/03/2021] [Indexed: 11/23/2022]
Abstract
Genetic sensors with unique combinations of DNA recognition and allosteric response can be created by hybridizing DNA-binding modules (DBMs) and ligand-binding modules (LBMs) from distinct transcriptional repressors. This module swapping approach is limited by incompatibility between DBMs and LBMs from different proteins, due to the loss of critical module-module interactions after hybridization. We determine a design strategy for restoring key interactions between DBMs and LBMs by using a computational model informed by coevolutionary traits in the LacI family. This model predicts the influence of proposed mutations on protein structure and function, quantifying the feasibility of each mutation for rescuing hybrid repressors. We accurately predict which hybrid repressors can be rescued by mutating residues to reinstall relevant module-module interactions. Experimental results confirm that dynamic ranges of gene expression induction were improved significantly in these mutants. This approach enhances the molecular and mechanistic understanding of LacI family proteins, and advances the ability to design modular genetic parts.
Affiliation(s)
- Xian-Li Jiang
  - Department of Biological Sciences, The University of Texas at Dallas, Dallas, TX, USA
  - Department of Bioinformatics and Computational Biology, The University of Texas M.D. Anderson Cancer Center, Houston, TX, USA
- Rey P Dimas
  - Department of Biology, The University of Texas at Tyler, Tyler, TX, USA
- Clement T Y Chan
  - Department of Biomedical Engineering, University of North Texas, Denton, TX, USA
  - BioDiscovery Institute, University of North Texas, Denton, TX, USA
- Faruck Morcos
  - Department of Biological Sciences, The University of Texas at Dallas, Dallas, TX, USA
  - Department of Bioengineering, The University of Texas at Dallas, Dallas, TX, USA
  - Center for Systems Biology, The University of Texas at Dallas, Dallas, TX, USA
31
Yan Y, Huang SY. Accurate prediction of inter-protein residue-residue contacts for homo-oligomeric protein complexes. Brief Bioinform 2021; 22:bbab038. [PMID: 33693482 PMCID: PMC8425427 DOI: 10.1093/bib/bbab038] [Citation(s) in RCA: 39] [Impact Index Per Article: 9.8] [Received: 11/14/2020] [Revised: 01/09/2021] [Indexed: 12/14/2022]
Abstract
Protein-protein interactions play a fundamental role in all cellular processes. Therefore, determining the structure of protein-protein complexes is crucial to understanding their molecular mechanisms and developing drugs that target protein-protein interactions. Recently, deep learning has led to a breakthrough in intra-protein contact prediction, achieving unusually high accuracy in recent Critical Assessment of protein Structure Prediction (CASP) challenges. However, due to the limited number of known homologous protein-protein interactions and the challenge of generating joint multiple sequence alignments of two interacting proteins, advances in inter-protein contact prediction remain limited. Here, we propose a deep learning model, named DeepHomo, to predict inter-protein residue-residue contacts across homo-oligomeric protein interfaces. Unlike previous deep learning approaches, we integrated the intra-protein distance map and inter-protein docking pattern, in addition to evolutionary coupling, sequence conservation, and physico-chemical information of monomers. DeepHomo was extensively tested on both experimentally determined structures and realistic CASP-Critical Assessment of Predicted Interaction (CAPRI) targets. DeepHomo achieved a high precision of >60% for the top predicted contact and outperformed state-of-the-art direct-coupling analysis and machine learning-based approaches. Integrating predicted inter-chain contacts into protein-protein docking significantly improved the docking accuracy on the benchmark dataset of realistic homo-dimeric targets from CASP-CAPRI experiments. DeepHomo is available at http://huanglab.phys.hust.edu.cn/DeepHomo/.
Affiliation(s)
- Yumeng Yan
  - School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei 430074, PR China
- Sheng-You Huang
  - School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei 430074, PR China
32
Škrbić T, Maritan A, Giacometti A, Rose GD, Banavar JR. Building blocks of protein structures: Physics meets biology. Phys Rev E 2021; 104:014402. [PMID: 34412233 DOI: 10.1103/physreve.104.014402] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 01/21/2021] [Accepted: 03/22/2021] [Indexed: 12/11/2022]
Abstract
The native state structures of globular proteins are stable and well packed, indicating that self-interactions are favored over protein-solvent interactions under folding conditions. We use this as a guiding principle to derive the geometry of the building blocks of protein structures (α helices and strands assembled into β sheets) with no adjustable parameters, no amino acid sequence information, and no chemistry. There is an almost perfect fit between the dictates of mathematics and physics and the rules of quantum chemistry. Protein evolution is facilitated by sequence-independent platforms, which can elaborate sequence-dependent functional diversity. Our work highlights the vital role of discreteness in life and may have implications for the creation of artificial life and on the nature of life elsewhere in the cosmos.
Affiliation(s)
- Tatjana Škrbić
  - Department of Physics and Institute for Fundamental Science, University of Oregon, Eugene, Oregon 97403, USA
  - Dipartimento di Scienze Molecolari e Nanosistemi, Università Ca' Foscari Venezia, Campus Scientifico, Edificio Alfa, via Torino 155, 30170 Venezia Mestre, Italy
- Amos Maritan
  - Dipartimento di Fisica e Astronomia, Università di Padova and INFN, via Marzolo 8, 35131 Padova, Italy
- Achille Giacometti
  - Dipartimento di Scienze Molecolari e Nanosistemi, Università Ca' Foscari Venezia, Campus Scientifico, Edificio Alfa, via Torino 155, 30170 Venezia Mestre, Italy
  - European Center for Living Technologies (ECLT), Ca' Bottacin, Dorsoduro 3911, Calle Crosera, 30123 Venezia, Italy
- George D Rose
  - T. C. Jenkins Department of Biophysics, Johns Hopkins University, 3400 North Charles Street, Baltimore, Maryland 21218-2683, USA
- Jayanth R Banavar
  - Department of Physics and Institute for Fundamental Science, University of Oregon, Eugene, Oregon 97403, USA
33
Hu Y, Yu L, Fan H, Huang G, Wu Q, Nie Y, Liu S, Yan L, Wei F. Genomic Signatures of Coevolution between Nonmodel Mammals and Parasitic Roundworms. Mol Biol Evol 2021; 38:531-544. [PMID: 32960966 PMCID: PMC7826172 DOI: 10.1093/molbev/msaa243] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 12/16/2022]
Abstract
Antagonistic coevolution between host and parasite drives species evolution. However, most studies focus only on adaptation to parasitism and do not explore the coevolution mechanisms from the perspective of both host and parasite. Here, through the de novo sequencing and assembly of the genomes of giant panda roundworm, red panda roundworm, and lion roundworm parasitic on tiger, we investigated the genomic mechanisms of coevolution between nonmodel mammals and their parasitic roundworms and those of roundworm parasitism in general. The genome-wide phylogeny revealed that these parasitic roundworms have not phylogenetically coevolved with their hosts. The CTSZ and prolyl 4-hydroxylase subunit beta (P4HB) immunoregulatory proteins played a central role in protein interaction between mammals and parasitic roundworms. The gene tree comparison identified seven pairs of interacting proteins with consistent phylogenetic topology, suggesting their coevolution during host–parasite interaction. These coevolutionary proteins were particularly relevant to immune response. In addition, we found that the roundworms of both pandas exhibited higher proportions of metallopeptidase genes, and some positively selected genes were highly related to the fast development of their larvae. Our findings provide novel insights into the genetic mechanisms of coevolution between nonmodel mammals and parasites and offer valuable genomic resources for scientific ascariasis prevention in both pandas.
Affiliation(s)
- Yibo Hu
  - CAS Key Laboratory of Animal Ecology and Conservation Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
  - Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, China
- Lijun Yu
  - CAS Key Laboratory of Animal Ecology and Conservation Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Huizhong Fan
  - CAS Key Laboratory of Animal Ecology and Conservation Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- Guangping Huang
  - CAS Key Laboratory of Animal Ecology and Conservation Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- Qi Wu
  - CAS Key Laboratory of Animal Ecology and Conservation Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- Yonggang Nie
  - CAS Key Laboratory of Animal Ecology and Conservation Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
  - Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, China
- Shuai Liu
  - CAS Key Laboratory of Animal Ecology and Conservation Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- Li Yan
  - CAS Key Laboratory of Animal Ecology and Conservation Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- Fuwen Wei
  - CAS Key Laboratory of Animal Ecology and Conservation Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
  - Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, China
34
Ortet P, Fochesato S, Bitbol AF, Whitworth DE, Lalaouna D, Santaella C, Heulin T, Achouak W, Barakat M. Evolutionary history expands the range of signaling interactions in hybrid multikinase networks. Sci Rep 2021; 11:11763. [PMID: 34083699 PMCID: PMC8175716 DOI: 10.1038/s41598-021-91260-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/06/2020] [Accepted: 05/19/2021] [Indexed: 12/02/2022]
Abstract
Two-component systems (TCSs) are ubiquitous signaling pathways, typically comprising a sensory histidine kinase (HK) and a response regulator, which communicate via intermolecular kinase-to-receiver domain phosphotransfer. Hybrid HKs constitute non-canonical TCS signaling pathways, with transmitter and receiver domains within a single protein communicating via intramolecular phosphotransfer. Here, we report how evolutionary relationships between hybrid HKs can be used as predictors of potential intermolecular and intramolecular interactions (‘phylogenetic promiscuity’). We used domain-swap gene chimeras to investigate the specificity of phosphotransfer within hybrid HKs of the GacS–GacA multikinase network of Pseudomonas brassicacearum. The receiver domain of GacS was replaced with those from nine donor hybrid HKs. Three chimeras with receivers from other hybrid HKs functioned correctly, as shown by complementation of a gacS mutant, which was dependent on strains having a functional gacA. Formation of functional chimeras was predictable on the basis of evolutionary heritage, which raises the possibility that HKs sharing a common ancestor with GacS might remain components of the contemporary GacS network. The results also demonstrate that understanding the evolutionary heritage of signaling domains in sophisticated networks allows their rational rewiring by simple domain transplantation, with implications for the creation of designer networks and inference of functional interactions.
Affiliation(s)
- Philippe Ortet
  - Aix Marseille Univ, CEA, CNRS, BIAM, LEMIRE, 13108, Saint Paul-Lez-Durance, France
- Sylvain Fochesato
  - Aix Marseille Univ, CEA, CNRS, BIAM, LEMIRE, 13108, Saint Paul-Lez-Durance, France
- Anne-Florence Bitbol
  - CNRS, Institut de Biologie Paris-Seine, Laboratoire Jean Perrin (UMR8237), Sorbonne Université, 75005, Paris, France
  - Institute of Bioengineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), 1015, Lausanne, Switzerland
- David E Whitworth
  - Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Ceredigion, SY23 3DD, UK
- David Lalaouna
  - Aix Marseille Univ, CEA, CNRS, BIAM, LEMIRE, 13108, Saint Paul-Lez-Durance, France
  - CNRS, ARN UPR 9002, Université de Strasbourg, 67000, Strasbourg, France
- Catherine Santaella
  - Aix Marseille Univ, CEA, CNRS, BIAM, LEMIRE, 13108, Saint Paul-Lez-Durance, France
- Thierry Heulin
  - Aix Marseille Univ, CEA, CNRS, BIAM, LEMIRE, 13108, Saint Paul-Lez-Durance, France
- Wafa Achouak
  - Aix Marseille Univ, CEA, CNRS, BIAM, LEMIRE, 13108, Saint Paul-Lez-Durance, France
- Mohamed Barakat
  - Aix Marseille Univ, CEA, CNRS, BIAM, LEMIRE, 13108, Saint Paul-Lez-Durance, France
35
Isacchini G, Sethna Z, Elhanati Y, Nourmohammad A, Walczak AM, Mora T. Generative models of T-cell receptor sequences. Phys Rev E 2021; 101:062414. [PMID: 32688532 DOI: 10.1103/physreve.101.062414] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 03/13/2020] [Accepted: 05/14/2020] [Indexed: 01/16/2023]
Abstract
T-cell receptors (TCR) are key proteins of the adaptive immune system, generated randomly in each individual, whose diversity underlies our ability to recognize infections and malignancies. Modeling the distribution of TCR sequences is of key importance for immunology and medical applications. Here, we compare two inference methods trained on high-throughput sequencing data: a knowledge-guided approach, which accounts for the details of sequence generation, supplemented by a physics-inspired model of selection; and a knowledge-free variational autoencoder based on deep artificial neural networks. We show that the knowledge-guided model outperforms the deep network approach at predicting TCR probabilities, while being more interpretable, at a lower computational cost.
Affiliation(s)
- Giulio Isacchini
  - Max Planck Institute for Dynamics and Self-organization, Am Faßberg 17, 37077 Göttingen, Germany
  - Laboratoire de Physique de l'École Normale Supérieure (PSL University), CNRS, Sorbonne Université, and Université de Paris, 75005 Paris, France
- Zachary Sethna
  - Memorial Sloan Kettering Cancer Center, New York, New York 10065, USA
- Yuval Elhanati
  - Memorial Sloan Kettering Cancer Center, New York, New York 10065, USA
- Armita Nourmohammad
  - Max Planck Institute for Dynamics and Self-organization, Am Faßberg 17, 37077 Göttingen, Germany
  - Department of Physics, University of Washington, 3910 15th Avenue Northeast, Seattle, Washington 98195, USA
  - Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, Seattle, Washington 98109, USA
- Aleksandra M Walczak
  - Laboratoire de Physique de l'École Normale Supérieure (PSL University), CNRS, Sorbonne Université, and Université de Paris, 75005 Paris, France
- Thierry Mora
  - Laboratoire de Physique de l'École Normale Supérieure (PSL University), CNRS, Sorbonne Université, and Université de Paris, 75005 Paris, France
36
Elofsson A. Toward Characterising the Cellular 3D-Proteome. Frontiers in Bioinformatics 2021; 1:598878. [PMID: 36353353 PMCID: PMC9638702 DOI: 10.3389/fbinf.2021.598878] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 08/25/2020] [Accepted: 01/27/2021] [Indexed: 11/30/2022]
37
Trivial and nontrivial error sources account for misidentification of protein partners in mutual information approaches. Sci Rep 2021; 11:6902. [PMID: 33767294 PMCID: PMC7994710 DOI: 10.1038/s41598-021-86455-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2020] [Accepted: 03/15/2021] [Indexed: 12/01/2022]
Abstract
The problem of finding the correct set of partners for a given pair of interacting protein families based on multi-sequence alignments (MSAs) has received great attention over the years. Recently, the native contacts of two interacting proteins were shown to store the strongest mutual information (MI) signal to discriminate MSA concatenations with the largest fraction of correct pairings. Although that signal might be of practical relevance in the search for an effective heuristic to solve the problem, the number of MSA concatenations with near-native MI is large, imposing severe limitations. Here, a Genetic Algorithm that explores possible MSA concatenations according to an MI maximization criterion is shown to find degenerate solutions with two error sources, arising from mismatches among (i) similar and (ii) non-similar sequences. If mistakes made among similar sequences are disregarded, type-(i) solutions are found to resolve correct pairings at best true positive (TP) rates of 70%, far above the corresponding estimates for type-(ii) solutions. A machine learning classification algorithm helps to show further that differences between optimized solutions based on TP rates are not artificial and may have biological meaning associated with the three-dimensional distribution of the MI signal. Type-(i) solutions may therefore correspond to reliable results for predictive purposes, found here to be more likely obtained via MI maximization across protein systems having a minimum critical number of amino acid contacts on their interaction surfaces (N > 200).
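The mutual information (MI) signal that this abstract maximizes is built from per-column-pair terms. As a minimal illustration, the sketch below computes the plug-in MI between one column of each MSA for a given row pairing, using the standard definition I = Σ p(a,b) log[p(a,b) / (p(a) p(b))]. The function name `column_mutual_information` and the toy columns are assumptions made for illustration; the paper's Genetic Algorithm and any sequence weighting or pseudocounts it may use are not reproduced here.

```python
import math
from collections import Counter

def column_mutual_information(col_x, col_y):
    """Plug-in mutual information (in nats) between two aligned MSA columns.

    col_x, col_y : equal-length sequences of residue symbols; position i of
                   both columns is assumed to come from the same paired row.
    """
    n = len(col_x)
    if n == 0 or n != len(col_y):
        raise ValueError("columns must be non-empty and of equal length")
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        p_ab = c / n
        mi += p_ab * math.log(p_ab / ((count_x[a] / n) * (count_y[b] / n)))
    return mi

# Toy columns: a perfectly covarying pairing versus a shuffled pairing.
mi_covarying = column_mutual_information("AACC", "GGTT")
mi_shuffled = column_mutual_information("AACC", "GTGT")
```

With the perfectly covarying toy pairing the MI equals log 2, and with the shuffled pairing it drops to zero, which is the kind of contrast a pairing search based on MI maximization exploits.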
|
38
|
Green AG, Elhabashy H, Brock KP, Maddamsetti R, Kohlbacher O, Marks DS. Large-scale discovery of protein interactions at residue resolution using co-evolution calculated from genomic sequences. Nat Commun 2021; 12:1396. [PMID: 33654096 PMCID: PMC7925567 DOI: 10.1038/s41467-021-21636-z] [Citation(s) in RCA: 63] [Impact Index Per Article: 15.8] [Received: 01/22/2020] [Accepted: 01/27/2021] [Indexed: 12/28/2022]
Abstract
Increasing numbers of protein interactions have been identified in high-throughput experiments, but only a small proportion have solved structures. Recently, sequence coevolution-based approaches have led to a breakthrough in predicting monomer protein structures and protein interaction interfaces. Here, we address the challenges of large-scale interaction prediction at residue resolution with a fast alignment concatenation method and a probabilistic score for the interaction of residues. Importantly, this method (EVcomplex2) is able to assess the likelihood of a protein interaction, as we show here applied to large-scale experimental datasets where the pairwise interactions are unknown. We predict 504 interactions de novo in the E. coli membrane proteome, including 243 that are newly discovered. While EVcomplex2 does not require available structures, coevolving residue pairs can be used to produce structural models of protein interactions, as done here for membrane complexes including the Flagellar Hook-Filament Junction and the Tol/Pal complex. Our understanding of the residue-level details of protein interactions remains incomplete. Here, the authors show sequence coevolution can be used to infer interacting proteins with residue-level details, including predicting 467 interactions de novo in the Escherichia coli cell envelope proteome.
Affiliation(s)
- Anna G Green
  - Department of Systems Biology, Harvard Medical School, Boston, MA, 02115, USA
- Hadeer Elhabashy
  - Biomolecular Interactions, Max Planck Institute for Developmental Biology, 72076, Tübingen, Germany
  - Institute for Bioinformatics and Medical Informatics, University of Tübingen, Sand 14, 72076, Tübingen, Germany
  - Department of Computer Science, University of Tübingen, WSI/ZBIT, Sand 14, 72076, Tübingen, Germany
- Kelly P Brock
  - Department of Systems Biology, Harvard Medical School, Boston, MA, 02115, USA
- Rohan Maddamsetti
  - Department of Systems Biology, Harvard Medical School, Boston, MA, 02115, USA
- Oliver Kohlbacher
  - Biomolecular Interactions, Max Planck Institute for Developmental Biology, 72076, Tübingen, Germany
  - Institute for Bioinformatics and Medical Informatics, University of Tübingen, Sand 14, 72076, Tübingen, Germany
  - Department of Computer Science, University of Tübingen, WSI/ZBIT, Sand 14, 72076, Tübingen, Germany
  - Quantitative Biology Center, University of Tübingen, Auf der Morgenstelle 8, 72076, Tübingen, Germany
  - Institute for Translational Bioinformatics, University Hospital Tübingen, Sand 14, 72076, Tübingen, Germany
- Debora S Marks
  - Institute for Bioinformatics and Medical Informatics, University of Tübingen, Sand 14, 72076, Tübingen, Germany
  - Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA
|
39
|
Raimondi D, Simm J, Arany A, Moreau Y. A novel method for data fusion over Entity-Relation graphs and its application to protein-protein interaction prediction. Bioinformatics 2021; 37:2275-2281. [PMID: 33560405 DOI: 10.1093/bioinformatics/btab092] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 10/26/2020] [Revised: 01/14/2021] [Accepted: 02/04/2021] [Indexed: 11/13/2022]
Abstract
MOTIVATION Modern Bioinformatics is facing increasingly complex problems to solve, and we are indeed rapidly approaching an era in which the ability to seamlessly integrate heterogeneous sources of information will be crucial for the scientific progress. Here we present a novel non-linear data fusion framework that generalizes the conventional Matrix Factorization paradigm allowing inference over arbitrary Entity-Relation graphs, and we applied it to the prediction of Protein-Protein Interactions (PPIs). Improving our knowledge of Protein Protein Interaction (PPI) networks at the proteome scale is indeed crucial to understand protein function, physiological and disease states and cell life in general. RESULTS We devised three data-fusion based models for the proteome-level prediction of PPIs, and we show that our method outperforms state of the art approaches on common benchmarks. Moreover, we investigate its predictions on newly published PPIs, showing that this new data has a clear shift in its underlying distributions and we thus train and test our models on this extended dataset. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jaak Simm
  - ESAT-STADIUS, KU Leuven, 3001 Leuven, Belgium
- Adam Arany
  - ESAT-STADIUS, KU Leuven, 3001 Leuven, Belgium
- Yves Moreau
  - ESAT-STADIUS, KU Leuven, 3001 Leuven, Belgium
|
40
|
Zhang YH, Zeng T, Chen L, Huang T, Cai YD. Determining protein-protein functional associations by functional rules based on gene ontology and KEGG pathway. BIOCHIMICA ET BIOPHYSICA ACTA-PROTEINS AND PROTEOMICS 2021; 1869:140621. [PMID: 33561576 DOI: 10.1016/j.bbapap.2021.140621] [Citation(s) in RCA: 55] [Impact Index Per Article: 13.8] [Received: 08/27/2020] [Revised: 02/01/2021] [Accepted: 02/02/2021] [Indexed: 12/26/2022]
Abstract
Protein-protein interactions (PPIs) describe the direct physical contact of two proteins that usually results in specific biological functions or regulatory processes. The characterization and study of PPIs through the investigation of their pattern and principle have remained a question in biological studies. Various experimental and computational methods have been used for PPI studies, but most of them are based on the sequence similarity with current validated PPI participators or cellular localization patterns. Most methods ignore the fact that PPIs are defined by their specific biological functions. In this study, we constructed a novel rule-based computational method using gene ontology and KEGG pathway annotation of PPI participators that correspond to the complicated biological effects of PPIs. Our newly presented computational method identified a group of biological functions that are tightly associated with PPIs and provided a new function-based tool for PPI studies in a rule manner.
Affiliation(s)
- Yu-Hang Zhang
  - School of Life Sciences, Shanghai University, Shanghai 200444, China
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Tao Zeng
  - CAS Key Laboratory of Computational Biology, Bio-Med Big Data Center, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- Lei Chen
  - College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Tao Huang
  - Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai 200031, China
- Yu-Dong Cai
  - School of Life Sciences, Shanghai University, Shanghai 200444, China
|
41
|
Salmanian S, Pezeshk H, Sadeghi M. Inter-protein residue covariation information unravels physically interacting protein dimers. BMC Bioinformatics 2020; 21:584. [PMID: 33334319 PMCID: PMC7745481 DOI: 10.1186/s12859-020-03930-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 08/12/2020] [Accepted: 12/09/2020] [Indexed: 01/04/2023]
Abstract
BACKGROUND Predicting physical interaction between proteins is one of the greatest challenges in computational biology. There is a considerable variety of protein interactions and a huge number of protein sequences and synthetic peptides with unknown interacting counterparts. Most co-evolutionary methods discover a combination of physical interplays and functional associations. However, there are only a handful of approaches which specifically infer physical interactions. Hybrid co-evolutionary methods exploit inter-protein residue coevolution to unravel specific physical interacting proteins. In this study, we introduce a hybrid co-evolutionary-based approach to predict physical interplays between pairs of protein families, starting from protein sequences only. RESULTS In the present analysis, pairs of multiple sequence alignments are constructed for each dimer and the covariation between residues in those pairs is calculated by CCMpred (Contacts from Correlated Mutations predicted) and three mutual information based approaches for ten accessible surface area threshold groups. Then, whole residue couplings between proteins of each dimer are unified into a single Frobenius norm value. Norms of residue contact matrices of all dimers in different accessible surface area thresholds are fed into a support vector machine as single or multiple feature models. The results of training the classifiers by single features show no apparent differences in accuracy among the methods for different accessible surface area thresholds. Nevertheless, the mutual information product and context likelihood of relatedness procedures may have roughly higher and lower overall performance, respectively, than the other two methods for different accessible surface area cut-offs. The results also demonstrate that training the support vector machine with multiple norm features for several accessible surface area thresholds leads to a considerable improvement of prediction performance. In this context, CCMpred roughly achieves an overall better performance than mutual information based approaches. The best accuracy, sensitivity, specificity, precision and negative predictive value for that method are 0.98, 1, 0.962, 0.96, and 0.962, respectively. CONCLUSIONS In this paper, by feeding norm values of protein dimers into support vector machines in different accessible surface area thresholds, we demonstrate that even a small number of proteins in pairs of multiple alignments could allow one to accurately discriminate between positive and negative dimers.
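The step in which all inter-protein residue couplings of a dimer are collapsed into one Frobenius norm can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name `interprotein_frobenius_norm`, the nested-list matrix layout, and the toy scores are hypothetical, and the accessible surface area filtering and the support vector machine described in the abstract are not reproduced.

```python
import math


def interprotein_frobenius_norm(couplings, len_a):
    """Frobenius norm of the inter-protein block of a coupling matrix.

    couplings is a square matrix (nested lists) of residue-residue covariation
    scores for the concatenated alignment of proteins A and B. Rows and
    columns 0..len_a-1 belong to A, the remaining ones to B. Only pairs with
    one residue in A and one in B contribute.
    """
    total = 0.0
    for i in range(len_a):                         # residues of protein A
        for j in range(len_a, len(couplings)):     # residues of protein B
            total += couplings[i][j] ** 2
    return math.sqrt(total)


# Hypothetical 3x3 score matrix: one residue in A, two residues in B.
toy_scores = [
    [0.0, 3.0, 4.0],
    [3.0, 0.0, 1.0],
    [4.0, 1.0, 0.0],
]
feature = interprotein_frobenius_norm(toy_scores, len_a=1)  # sqrt(9 + 16) = 5.0
```

A single number of this kind per dimer (or one per surface-area threshold) is what would then serve as a feature for a classifier.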
Affiliation(s)
- Sara Salmanian
  - Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
- Hamid Pezeshk
  - School of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, Tehran, Iran
  - Present Address: Department of Mathematics and Statistics, Concordia University, Montreal, Canada
  - School of Biological Sciences, Institute for Research in Fundamental Sciences, Tehran, Iran
- Mehdi Sadeghi
  - National Institute of Genetic Engineering and Biotechnology, Tehran, Iran
|
42
|
Muntoni AP, Pagnani A, Weigt M, Zamponi F. Aligning biological sequences by exploiting residue conservation and coevolution. Phys Rev E 2020; 102:062409. [PMID: 33465950 DOI: 10.1103/physreve.102.062409] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 06/02/2020] [Accepted: 11/12/2020] [Indexed: 11/07/2022]
Abstract
Sequences of nucleotides (for DNA and RNA) or amino acids (for proteins) are central objects in biology. Among the most important computational problems is that of sequence alignment, i.e., arranging sequences from different organisms in such a way to identify similar regions, to detect evolutionary relationships between sequences, and to predict biomolecular structure and function. This is typically addressed through profile models, which capture position specificities like conservation in sequences but assume an independent evolution of different positions. Over recent years, it has been well established that coevolution of different amino-acid positions is essential for maintaining three-dimensional structure and function. Modeling approaches based on inverse statistical physics can catch the coevolution signal in sequence ensembles, and they are now widely used in predicting protein structure, protein-protein interactions, and mutational landscapes. Here, we present DCAlign, an efficient alignment algorithm based on an approximate message-passing strategy, which is able to overcome the limitations of profile models, to include coevolution among positions in a general way, and to be therefore universally applicable to protein- and RNA-sequence alignment without the need of using complementary structural information. The potential of DCAlign is carefully explored using well-controlled simulated data, as well as real protein and RNA sequences.
Affiliation(s)
- Anna Paola Muntoni
  - Department of Applied Science and Technology (DISAT), Politecnico di Torino, Corso Duca degli Abruzzi 24, I-10129 Torino, Italy
  - Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005 Paris, France
  - Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative LCQB, F-75005 Paris, France
- Andrea Pagnani
  - Department of Applied Science and Technology (DISAT), Politecnico di Torino, Corso Duca degli Abruzzi 24, I-10129 Torino, Italy
  - Italian Institute for Genomic Medicine, IRCCS Candiolo, SP-142, I-10060 Candiolo (TO), Italy
  - INFN, Sezione di Torino, Via Giuria 1, I-10125 Torino, Italy
- Martin Weigt
  - Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative LCQB, F-75005 Paris, France
- Francesco Zamponi
  - Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005 Paris, France
|
43
|
Geisler M, Hegedűs T. A twist in the ABC: regulation of ABC transporter trafficking and transport by FK506-binding proteins. FEBS Lett 2020; 594:3986-4000. [PMID: 33125703 DOI: 10.1002/1873-3468.13983] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 07/01/2020] [Revised: 10/02/2020] [Accepted: 10/15/2020] [Indexed: 01/07/2023]
Abstract
Post-transcriptional regulation of ATP-binding cassette (ABC) proteins has been so far shown to encompass protein phosphorylation, maturation, and ubiquitination. Yet, recent accumulating evidence implicates FK506-binding proteins (FKBPs), a type of peptidylprolyl cis-trans isomerase (PPIase) proteins, in ABC transporter regulation. In this perspective article, we summarize current knowledge on ABC transporter regulation by FKBPs, which seems to be conserved over kingdoms and ABC subfamilies. We uncover striking functional similarities but also differences between regulatory FKBP-ABC modules in plants and mammals. We dissect a PPIase- and HSP90-dependent and independent impact of FKBPs on ABC biogenesis and transport activity. We propose and discuss a putative new mode of transient ABC transporter regulation by cis-trans isomerization of X-prolyl bonds.
Affiliation(s)
- Markus Geisler
  - Department of Biology, University of Fribourg, Switzerland
- Tamás Hegedűs
  - Department of Biophysics and Radiation Biology, Semmelweis University, Budapest, Hungary
|
44
|
Russ WP, Figliuzzi M, Stocker C, Barrat-Charlaix P, Socolich M, Kast P, Hilvert D, Monasson R, Cocco S, Weigt M, Ranganathan R. An evolution-based model for designing chorismate mutase enzymes. Science 2020; 369:440-445. [PMID: 32703877 DOI: 10.1126/science.aba3304] [Citation(s) in RCA: 137] [Impact Index Per Article: 27.4] [Received: 11/24/2019] [Accepted: 05/13/2020] [Indexed: 02/02/2023]
Abstract
The rational design of enzymes is an important goal for both fundamental and practical reasons. Here, we describe a process to learn the constraints for specifying proteins purely from evolutionary sequence data, design and build libraries of synthetic genes, and test them for activity in vivo using a quantitative complementation assay. For chorismate mutase, a key enzyme in the biosynthesis of aromatic amino acids, we demonstrate the design of natural-like catalytic function with substantial sequence diversity. Further optimization focuses the generative model toward function in a specific genomic context. The data show that sequence-based statistical models suffice to specify proteins and provide access to an enormous space of functional sequences. This result provides a foundation for a general process for evolution-based design of artificial proteins.
Affiliation(s)
- William P Russ
  - University of Texas Southwestern Medical Center, Dallas, TX, USA
- Matteo Figliuzzi
  - Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Laboratoire de Biologie Computationnelle and Quantitative, Paris, France
- Pierre Barrat-Charlaix
  - Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Laboratoire de Biologie Computationnelle and Quantitative, Paris, France
  - Biozentrum, University of Basel, Basel, Switzerland
- Michael Socolich
  - Center for Physics of Evolving Systems, Biochemistry and Molecular Biology and the Pritzker School for Molecular Engineering, University of Chicago, Chicago, IL, USA
- Peter Kast
  - Laboratory of Organic Chemistry, ETH Zurich, Switzerland
- Donald Hilvert
  - Laboratory of Organic Chemistry, ETH Zurich, Switzerland
- Remi Monasson
  - Laboratoire de Physique de l'Ecole Normale Supérieure, PSL and CNRS, Paris, France
- Simona Cocco
  - Laboratoire de Physique de l'Ecole Normale Supérieure, PSL and CNRS, Paris, France
- Martin Weigt
  - Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Laboratoire de Biologie Computationnelle and Quantitative, Paris, France
- Rama Ranganathan
  - Center for Physics of Evolving Systems, Biochemistry and Molecular Biology and the Pritzker School for Molecular Engineering, University of Chicago, Chicago, IL, USA
|
45
|
Correa Marrero M, Immink RGH, de Ridder D, van Dijk ADJ. Improved inference of intermolecular contacts through protein-protein interaction prediction using coevolutionary analysis. Bioinformatics 2020; 35:2036-2042. [PMID: 30398547 DOI: 10.1093/bioinformatics/bty924] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 07/04/2018] [Revised: 10/11/2018] [Accepted: 11/05/2018] [Indexed: 01/09/2023]
Abstract
MOTIVATION Predicting residue-residue contacts between interacting proteins is an important problem in bioinformatics. The growing wealth of sequence data can be used to infer these contacts through correlated mutation analysis on multiple sequence alignments of interacting homologs of the proteins of interest. This requires correct identification of pairs of interacting proteins for many species, in order to avoid introducing noise (i.e. non-interacting sequences) in the analysis that will decrease predictive performance. RESULTS We have designed Ouroboros, a novel algorithm to reduce such noise in intermolecular contact prediction. Our method iterates between weighting proteins according to how likely they are to interact based on the correlated mutations signal, and predicting correlated mutations based on the weighted sequence alignment. We show that this approach accurately discriminates between protein interaction versus non-interaction and simultaneously improves the prediction of intermolecular contact residues compared to a naive application of correlated mutation analysis. This requires no training labels concerning interactions or contacts. Furthermore, the method relaxes the assumption of one-to-one interaction of previous approaches, allowing for the study of many-to-many interactions. AVAILABILITY AND IMPLEMENTATION Source code and test data are available at www.bif.wur.nl/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Richard G H Immink
  - Laboratory of Molecular Biology, Department of Plant Sciences
  - Bioscience, Wageningen Plant Research
- Aalt D J van Dijk
  - Bioinformatics Group, Department of Plant Sciences
  - Bioscience, Wageningen Plant Research
  - Biometris, Department of Plant Sciences, Wageningen University & Research, Wageningen PB, The Netherlands
|
46
|
Andreani J, Quignot C, Guerois R. Structural prediction of protein interactions and docking using conservation and coevolution. WILEY INTERDISCIPLINARY REVIEWS-COMPUTATIONAL MOLECULAR SCIENCE 2020. [DOI: 10.1002/wcms.1470] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 12/12/2022]
Affiliation(s)
- Jessica Andreani
  - Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), Gif-sur-Yvette, France
- Chloé Quignot
  - Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), Gif-sur-Yvette, France
- Raphael Guerois
  - Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), Gif-sur-Yvette, France
|
47
|
Gandarilla-Pérez CA, Mergny P, Weigt M, Bitbol AF. Statistical physics of interacting proteins: Impact of dataset size and quality assessed in synthetic sequences. Phys Rev E 2020; 101:032413. [PMID: 32290011 DOI: 10.1103/physreve.101.032413] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 12/23/2019] [Accepted: 03/04/2020] [Indexed: 11/07/2022]
Abstract
Identifying protein-protein interactions is crucial for a systems-level understanding of the cell. Recently, algorithms based on inverse statistical physics, e.g., direct coupling analysis (DCA), have made it possible to use evolutionarily related sequences to address two conceptually related inference tasks: finding pairs of interacting proteins and identifying pairs of residues which form contacts between interacting proteins. Here we address two underlying questions: How are the performances of both inference tasks related? How does performance depend on dataset size and quality? To this end, we formalize both tasks using Ising models defined over stochastic block models, with individual blocks representing single proteins and interblock couplings representing protein-protein interactions; controlled synthetic sequence data are generated by Monte Carlo simulations. We show that DCA is able to address both inference tasks accurately when sufficiently large training sets of known interaction partners are available and that an iterative pairing algorithm makes predictions possible even without a training set. Noise in the training data deteriorates performance. In both tasks we find a quadratic scaling relating dataset quality and size that is consistent with noise adding in square-root fashion and signal adding linearly when increasing the dataset. This implies that it is generally good to incorporate more data even if their quality is imperfect, thereby shedding light on the empirically observed performance of DCA applied to natural protein sequences.
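The quadratic scaling stated in this abstract follows from a simple back-of-the-envelope argument, sketched below. The model is an assumption made for illustration and is not taken from the paper's equations: with N training pairs of which a fraction q is correctly matched, the signal is taken to grow like q·N and the noise like √N, so the signal-to-noise ratio is q·√N. Holding that ratio fixed gives N proportional to 1/q². The function name `required_pairs` and the parameter `target_snr` are hypothetical.

```python
def required_pairs(correct_fraction, target_snr):
    """Dataset size N at which q * sqrt(N) reaches target_snr.

    Assumes signal ~ q * N and noise ~ sqrt(N), so SNR = q * sqrt(N) and
    N = (target_snr / q) ** 2. Units are arbitrary; only the scaling matters.
    """
    if not 0.0 < correct_fraction <= 1.0:
        raise ValueError("correct_fraction must lie in (0, 1]")
    return (target_snr / correct_fraction) ** 2


clean = required_pairs(1.0, 10.0)   # 100.0 pairs when every pair is correct
noisy = required_pairs(0.5, 10.0)   # 400.0 pairs when half the pairs are correct
```

Halving the quality quadruples the amount of data needed for the same signal-to-noise ratio, which is one way to read the conclusion that adding imperfect data is still generally worthwhile.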
Affiliation(s)
- Carlos A Gandarilla-Pérez
  - Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative (LCQB, UMR 7238), F-75005 Paris, France
  - Facultad de Física, Universidad de la Habana, San Lázaro y L, Vedado, Habana 4, CP-10400, Cuba
- Pierre Mergny
  - Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative (LCQB, UMR 7238), F-75005 Paris, France
  - Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire Jean Perrin (LJP, UMR 8237), F-75005 Paris, France
- Martin Weigt
  - Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative (LCQB, UMR 7238), F-75005 Paris, France
- Anne-Florence Bitbol
  - Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire Jean Perrin (LJP, UMR 8237), F-75005 Paris, France
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland
|
48
|
Epistatic contributions promote the unification of incompatible models of neutral molecular evolution. Proc Natl Acad Sci U S A 2020; 117:5873-5882. [PMID: 32123092 PMCID: PMC7084075 DOI: 10.1073/pnas.1913071117] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Indexed: 12/27/2022]
Abstract
Mathematical models of evolution help us understand mechanisms driving protein-sequence change. Previous models recapitulate a disjoint subset of statistical features of natural sequences. We present a neutral evolution model that unifies features including extreme variance of the molecular clock’s tick rate and the observation of an evolutionary Stokes shift, an irreversible effect of mutations in the fitness landscape during sequence evolution. We show that interactions between amino acid sites, which inform our fitness metric, are required to observe these features. These interactions are inferred by using direct coupling analysis, which has been successfully utilized to predict protein structures, dynamics, and complexes from coevolutionary information. We anticipate our model will have applications in phylogenetics, ancestral reconstruction of sequences, and protein design. We introduce a model of amino acid sequence evolution that accounts for the statistical behavior of real sequences induced by epistatic interactions. We base the model dynamics on parameters derived from multiple sequence alignments analyzed by using direct coupling analysis methodology. Known statistical properties such as overdispersion, heterotachy, and gamma-distributed rate-across-sites are shown to be emergent properties of this model while being consistent with neutral evolution theory, thereby unifying observations from previously disjointed evolutionary models of sequences. The relationship between site restriction and heterotachy is characterized by tracking the effective alphabet dynamics of sites. We also observe an evolutionary Stokes shift in the fitness of sequences that have undergone evolution under our simulation. By analyzing the structural information of some proteins, we corroborate that the strongest Stokes shifts derive from sites that physically interact in networks near biochemically important regions. 
Perspectives on the implementation of our model in the context of the molecular clock are discussed.
|
49
|
Barreto CAV, Baptista SJ, Preto AJ, Matos-Filipe P, Mourão J, Melo R, Moreira I. Prediction and targeting of GPCR oligomer interfaces. PROGRESS IN MOLECULAR BIOLOGY AND TRANSLATIONAL SCIENCE 2020; 169:105-149. [PMID: 31952684 DOI: 10.1016/bs.pmbts.2019.11.007] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 12/18/2022]
Abstract
GPCR oligomerization has emerged as a hot topic in the GPCR field in recent years. Receptors that are part of these oligomers can influence each other's function, although it is not yet entirely understood how these interactions work. The existence of such a highly complex network of interactions between GPCRs generates the possibility of alternative targets for new therapeutic approaches. However, challenges still exist in the characterization of these complexes, especially at the interface level. Different experimental approaches, such as FRET or BRET, are usually combined to study GPCR oligomer interactions. Computational methods have been applied as a useful tool for retrieving information from GPCR sequences and the few X-ray-resolved oligomeric structures that are accessible, as well as for predicting new and trustworthy GPCR oligomeric interfaces. Machine-learning (ML) approaches have recently helped with some hindrances of other methods. By joining and evaluating multiple structure-, sequence- and co-evolution-based features on the same algorithm, it is possible to dilute the issues of particular structures and residues that arise from the experimental methodology into all-encompassing algorithms capable of accurately predicting GPCR-GPCR interfaces. All these methods, used as a single or a combined approach, provide useful information about GPCR oligomerization and its role in GPCR function and dynamics. Altogether, we present experimental, computational and machine-learning methods used to study oligomer interfaces, as well as strategies that have been used to target these dynamic complexes.
Affiliation(s)
- Carlos A V Barreto
  - Center for Neuroscience and Cell Biology, University of Coimbra, Coimbra, Portugal
- Salete J Baptista
  - Center for Neuroscience and Cell Biology, University of Coimbra, Coimbra, Portugal
  - Centro de Ciências e Tecnologias Nucleares, Instituto Superior Técnico, Universidade de Lisboa, CTN, LRS, Portugal
- António José Preto
  - Center for Neuroscience and Cell Biology, University of Coimbra, Coimbra, Portugal
- Pedro Matos-Filipe
  - Center for Neuroscience and Cell Biology, University of Coimbra, Coimbra, Portugal
- Joana Mourão
  - Center for Neuroscience and Cell Biology, University of Coimbra, Coimbra, Portugal
  - Institute for Interdisciplinary Research, University of Coimbra, Coimbra, Portugal
- Rita Melo
  - Center for Neuroscience and Cell Biology, University of Coimbra, Coimbra, Portugal
  - Centro de Ciências e Tecnologias Nucleares, Instituto Superior Técnico, Universidade de Lisboa, CTN, LRS, Portugal
- Irina Moreira
  - Center for Neuroscience and Cell Biology, University of Coimbra, Coimbra, Portugal
  - Science and Technology Faculty, University of Coimbra, Coimbra, Portugal
|
50
|
Gueudré T, Baldassi C, Pagnani A, Weigt M. Predicting Interacting Protein Pairs by Coevolutionary Paralog Matching. Methods Mol Biol 2020; 2074:57-65. [PMID: 31583630 DOI: 10.1007/978-1-4939-9873-9_5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/10/2023]
Abstract
Even if we know that two families of homologous proteins interact, we do not necessarily know which specific proteins interact inside each species. The reason is that most families contain paralogs, i.e., more than one homologous sequence per species. We have developed a tool to predict interacting paralogs between the two protein families, which is based on the idea of inter-protein coevolution: our algorithm matches those members of the two protein families that belong to the same species and collectively maximize the detectable coevolutionary signal. It is applicable even in cases where simpler methods, based, e.g., on genomic co-localization of genes coding for interacting proteins, or orthology-based methods fail. In this method paper, we present an efficient implementation of this idea based on freely available software.
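The matching idea in this abstract, pairing paralogs within one species so that the total coevolutionary signal is maximal, can be illustrated with a toy exhaustive search. Everything here is an assumption made for illustration: the function name `best_matching`, the score matrix, and the brute-force strategy are hypothetical, and the cited tool uses its own, more efficient procedure across many species rather than this enumeration.

```python
from itertools import permutations


def best_matching(score):
    """One-to-one pairing of paralogs within a single species.

    score[i][j] is a hypothetical coevolution score for pairing paralog i of
    family A with paralog j of family B. The matrix must be square. Returns
    (pairing, total), where pairing[i] is the partner chosen for paralog i.
    Exhaustive search: only practical for a handful of paralogs.
    """
    n = len(score)
    best_perm, best_total = None, float("-inf")
    for perm in permutations(range(n)):
        total = sum(score[i][perm[i]] for i in range(n))
        if total > best_total:
            best_perm, best_total = perm, total
    return best_perm, best_total


# Hypothetical integer scores for three paralogs per family in one species.
toy_score = [
    [1, 9, 2],
    [8, 3, 1],
    [2, 1, 7],
]
pairing, total = best_matching(toy_score)   # pairing (1, 0, 2), total 24
```

The number of candidate pairings grows factorially with the number of paralogs per species, which is why a practical implementation would replace the enumeration with an assignment solver or an iterative scheme.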
Affiliation(s)
- Carlo Baldassi
  - Bocconi Institute for Data Science and Analytics, Bocconi University, Milan, Italy
  - INFN, Sezione di Torino, Torino, Italy
- Andrea Pagnani
  - Italian Institute for Genomic Medicine, Turin, Italy
  - INFN, Sezione di Torino, Torino, Italy
  - Dipartimento di Scienza Applicata e Tecnologia, Politecnico di Torino, Torino, Italy
- Martin Weigt
  - Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative-LCQB, Paris, France
|