1
Dietler N, Abbara A, Choudhury S, Bitbol AF. Impact of phylogeny on the inference of functional sectors from protein sequence data. PLoS Comput Biol 2024; 20:e1012091. [PMID: 39312591 PMCID: PMC11449291 DOI: 10.1371/journal.pcbi.1012091] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2024] [Revised: 10/03/2024] [Accepted: 09/10/2024] [Indexed: 09/25/2024] Open
Abstract
Statistical analysis of multiple sequence alignments of homologous proteins has revealed groups of coevolving amino acids called sectors. These groups of amino-acid sites feature collective correlations in their amino-acid usage, and they are associated to functional properties. Modeling showed that nonlinear selection on an additive functional trait of a protein is generically expected to give rise to a functional sector. These modeling results motivated a principled method, called ICOD, which is designed to identify functional sectors, as well as mutational effects, from sequence data. However, a challenge for all methods aiming to identify sectors from multiple sequence alignments is that correlations in amino-acid usage can also arise from the mere fact that homologous sequences share common ancestry, i.e. from phylogeny. Here, we generate controlled synthetic data from a minimal model comprising both phylogeny and functional sectors. We use this data to dissect the impact of phylogeny on sector identification and on mutational effect inference by different methods. We find that ICOD is most robust to phylogeny, but that conservation is also quite robust. Next, we consider natural multiple sequence alignments of protein families for which deep mutational scan experimental data is available. We show that in this natural data, conservation and ICOD best identify sites with strong functional roles, in agreement with our results on synthetic data. Importantly, these two methods have different premises, since they respectively focus on conservation and on correlations. Thus, their joint use can reveal complementary information.
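As a rough illustration of what a per-site conservation score can look like, the sketch below computes, for each column of a toy alignment, the maximum possible entropy minus the observed Shannon entropy. This is only one common definition and is an assumption here; the abstract does not specify the score used in the paper, and this sketch is not ICOD. The function name `column_conservation`, the alphabet size of 21 (20 amino acids plus gap), and the toy sequences are all hypothetical.

```python
import math
from collections import Counter


def column_conservation(msa, alphabet_size=21):
    """Return log2(alphabet_size) minus the Shannon entropy of each column.

    Assumption: conservation is measured as an entropy deficit in bits.
    Other definitions (for example, divergence from background
    frequencies) exist and may be what a given study uses.
    """
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all sequences must have the same aligned length")
    n = len(msa)
    scores = []
    for i in range(length):
        # Empirical amino-acid counts in column i.
        counts = Counter(seq[i] for seq in msa)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        scores.append(math.log2(alphabet_size) - entropy)
    return scores


# Hypothetical toy alignment: columns 0, 1 and 3 are fully conserved,
# column 2 takes four different states with equal frequency.
toy_msa = ["ACDA", "ACEA", "ACFA", "ACGA"]
scores = column_conservation(toy_msa)
```

Fully conserved columns get the maximal score, while the variable column loses exactly two bits because it has four equally frequent states. Real alignments would typically also need sequence reweighting and pseudocounts, which are omitted here.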
Affiliation(s)
- Nicola Dietler
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Alia Abbara
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Subham Choudhury
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Anne-Florence Bitbol
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
2
Mishra SK, Priya P, Rai GP, Haque R, Shanker A. Coevolution based immunoinformatics approach considering variability of epitopes to combat different strains: A case study using spike protein of SARS-CoV-2. Comput Biol Med 2023; 163:107233. [PMID: 37422941 DOI: 10.1016/j.compbiomed.2023.107233] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2022] [Revised: 06/03/2023] [Accepted: 07/01/2023] [Indexed: 07/11/2023]
Abstract
In the recent past several vaccines were developed to combat the COVID-19 disease. Unfortunately, the protective efficacy of the current vaccines has been reduced due to the high mutation rate in SARS-CoV-2. Here, we successfully implemented a coevolution based immunoinformatics approach to design an epitope-based peptide vaccine considering variability in spike protein of SARS-CoV-2. The spike glycoprotein was investigated for B- and T-cell epitope prediction. Identified T-cell epitopes were mapped on previously reported coevolving amino acids in the spike protein to introduce mutation. The non-mutated and mutated vaccine components were constructed by selecting epitopes showing overlapping with the predicted B-cell epitopes and highest antigenicity. Selected epitopes were linked with the help of a linker to construct a single vaccine component. Non-mutated and mutated vaccine component sequences were modelled and validated. The in-silico expression level of the vaccine constructs (non-mutated and mutated) in E. coli K12 shows promising results. The molecular docking analysis of vaccine components with toll-like receptor 5 (TLR5) demonstrated strong binding affinity. The time series calculations including root mean square deviation (RMSD), radius of gyration (RGYR), and energy of the system over 100 ns trajectory obtained from all atom molecular dynamics simulation showed stability of the system. The combined coevolutionary and immunoinformatics approach used in this study will certainly help to design an effective peptide vaccine that may work against different strains of SARS-CoV-2. Moreover, the strategy used in this study can be implemented on other pathogens.
Affiliation(s)
- Saurav Kumar Mishra
  - Department of Bioinformatics, Central University of South Bihar, Gaya, Bihar, India
- Prerna Priya
  - Department of Botany, Purnea Mahila College, Purnia, Bihar, India
- Gyan Prakash Rai
  - Department of Bioinformatics, Central University of South Bihar, Gaya, Bihar, India
- Rizwanul Haque
  - Department of Biotechnology, Central University of South Bihar, Gaya, Bihar, India
- Asheesh Shanker
  - Department of Bioinformatics, Central University of South Bihar, Gaya, Bihar, India
3
Sgarbossa D, Lupo U, Bitbol AF. Generative power of a protein language model trained on multiple sequence alignments. eLife 2023; 12:e79854. [PMID: 36734516 PMCID: PMC10038667 DOI: 10.7554/elife.79854] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 04/28/2022] [Accepted: 02/02/2023] [Indexed: 02/04/2023] Open
Abstract
Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families and learn constraints associated to protein structure and function. They thus open the possibility for generating novel sequences belonging to protein families. Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates to this end. We propose and test an iterative method that directly employs the masked language modeling objective to generate sequences using MSA Transformer. We demonstrate that the resulting sequences score as well as natural sequences, for homology, coevolution, and structure-based measures. For large protein families, our synthetic sequences have similar or better properties compared to sequences generated by Potts models, including experimentally validated ones. Moreover, for small protein families, our generation method based on MSA Transformer outperforms Potts models. Our method also more accurately reproduces the higher-order statistics and the distribution of sequences in sequence space of natural data than Potts models. MSA Transformer is thus a strong candidate for protein sequence generation and protein design.
Affiliation(s)
- Damiano Sgarbossa
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Umberto Lupo
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Anne-Florence Bitbol
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
4
Dietler N, Lupo U, Bitbol AF. Impact of phylogeny on structural contact inference from protein sequence data. J R Soc Interface 2023; 20:20220707. [PMID: 36751926 PMCID: PMC9905998 DOI: 10.1098/rsif.2022.0707] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2022] [Accepted: 01/09/2023] [Indexed: 02/09/2023] Open
Abstract
Local and global inference methods have been developed to infer structural contacts from multiple sequence alignments of homologous proteins. They rely on correlations in amino acid usage at contacting sites. Because homologous proteins share a common ancestry, their sequences also feature phylogenetic correlations, which can impair contact inference. We investigate this effect by generating controlled synthetic data from a minimal model where the importance of contacts and of phylogeny can be tuned. We demonstrate that global inference methods, specifically Potts models, are more resilient to phylogenetic correlations than local methods, based on covariance or mutual information. This holds whether or not phylogenetic corrections are used, and may explain the success of global methods. We analyse the roles of selection strength and of phylogenetic relatedness. We show that sites that mutate early in the phylogeny yield false positive contacts. We consider natural data and realistic synthetic data, and our findings generalize to these cases. Our results highlight the impact of phylogeny on contact prediction from protein sequences and illustrate the interplay between the rich structure of biological data and inference.
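To make the notion of a local method concrete, here is a minimal sketch of mutual information between two alignment columns, computed from empirical frequencies. It is an illustration under stated assumptions, not the pipeline of the paper: it uses no pseudocounts, no sequence reweighting, and no phylogenetic correction, and the function name `mutual_information` and the toy alignments are hypothetical.

```python
import math
from collections import Counter


def mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    Assumption: plain empirical frequencies, with none of the corrections
    that practical contact-prediction pipelines usually apply.
    """
    n = len(msa)
    f_i = Counter(seq[i] for seq in msa)            # single-site counts, column i
    f_j = Counter(seq[j] for seq in msa)            # single-site counts, column j
    f_ij = Counter((seq[i], seq[j]) for seq in msa)  # joint counts
    mi = 0.0
    for (a, b), c in f_ij.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return mi


# Hypothetical toy alignments with two columns each.
coupled = ["AK", "AK", "DE", "DE"]      # column 1 is determined by column 0
independent = ["AK", "AE", "DK", "DE"]  # every combination appears once

mi_coupled = mutual_information(coupled, 0, 1)
mi_independent = mutual_information(independent, 0, 1)
```

The coupled toy alignment gives one bit of mutual information and the independent one gives zero. In real data, shared ancestry alone can produce nonzero values of such pairwise statistics, which is the confounding effect the abstract discusses.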
Affiliation(s)
- Nicola Dietler
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Umberto Lupo
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Anne-Florence Bitbol
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
5
Young BD, Cook ME, Costabile BK, Samanta R, Zhuang X, Sevdalis SE, Varney KM, Mancia F, Matysiak S, Lattman E, Weber DJ. Binding and Functional Folding (BFF): A Physiological Framework for Studying Biomolecular Interactions and Allostery. J Mol Biol 2022; 434:167872. [PMID: 36354074 PMCID: PMC10871162 DOI: 10.1016/j.jmb.2022.167872] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2022] [Revised: 09/20/2022] [Accepted: 10/24/2022] [Indexed: 11/06/2022]
Abstract
EF-hand Ca²⁺-binding proteins (CBPs), such as S100 proteins (S100s) and calmodulin (CaM), are signaling proteins that undergo conformational changes upon increasing intracellular Ca²⁺. Upon binding Ca²⁺, S100 proteins and CaM interact with protein targets and induce important biological responses. The Ca²⁺-binding affinity of CaM and most S100s in the absence of target is weak (ᶜᵃK_D > 1 μM). However, upon effector protein binding, the Ca²⁺ affinity of these proteins increases via heterotropic allostery (ᶜᵃK_D < 1 μM). Because of the high number and micromolar concentrations of EF-hand CBPs in a cell, at any given time, allostery is required physiologically, allowing for (i) proper Ca²⁺ homeostasis and (ii) strict maintenance of Ca²⁺-signaling within a narrow dynamic range of free Ca²⁺ ion concentrations, [Ca²⁺]free. In this review, mechanisms of allostery are coalesced into an empirical "binding and functional folding (BFF)" physiological framework. At the molecular level, folding (F), binding and folding (BF), and BFF events include all atoms in the biomolecular complex under study. The BFF framework is introduced with two straightforward BFF types for proteins (type 1, concerted; type 2, stepwise) and considers how homologous and nonhomologous amino acid residues of CBPs and their effector protein(s) evolved to provide allosteric tightening of Ca²⁺ and simultaneously determine how specific and relatively promiscuous CBP-target complexes form as both are needed for proper cellular function.
Affiliation(s)
- Brianna D Young
  - The Center for Biomolecular Therapeutics (CBT), Department of Biochemistry and Molecular Biology, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Mary E Cook
  - The Center for Biomolecular Therapeutics (CBT), Department of Biochemistry and Molecular Biology, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Brianna K Costabile
  - Department of Physiology and Cellular Biophysics, Columbia University, New York, NY 10032, USA
- Riya Samanta
  - Biophysics Graduate Program, University of Maryland, College Park, MD 20742, USA; Fischell Department of Bioengineering, University of Maryland, College Park, MD 20742, USA
- Xinhao Zhuang
  - The Center for Biomolecular Therapeutics (CBT), Department of Biochemistry and Molecular Biology, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Spiridon E Sevdalis
  - The Center for Biomolecular Therapeutics (CBT), Department of Biochemistry and Molecular Biology, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Kristen M Varney
  - The Center for Biomolecular Therapeutics (CBT), Department of Biochemistry and Molecular Biology, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Filippo Mancia
  - Department of Physiology and Cellular Biophysics, Columbia University, New York, NY 10032, USA
- Silvina Matysiak
  - Biophysics Graduate Program, University of Maryland, College Park, MD 20742, USA; Fischell Department of Bioengineering, University of Maryland, College Park, MD 20742, USA
- Eaton Lattman
  - The Center for Biomolecular Therapeutics (CBT), Department of Biochemistry and Molecular Biology, University of Maryland School of Medicine, Baltimore, MD 21201, USA; Department of Physics, Arizona State University, Tempe, AZ 85287, USA
- David J Weber
  - The Center for Biomolecular Therapeutics (CBT), Department of Biochemistry and Molecular Biology, University of Maryland School of Medicine, Baltimore, MD 21201, USA; The Institute of Bioscience and Biotechnology Research (IBBR), Rockville, MD 20850, USA
6
Lupo U, Sgarbossa D, Bitbol AF. Protein language models trained on multiple sequence alignments learn phylogenetic relationships. Nat Commun 2022; 13:6298. [PMID: 36273003 PMCID: PMC9588007 DOI: 10.1038/s41467-022-34032-y] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 04/08/2022] [Accepted: 10/07/2022] [Indexed: 12/25/2022] Open
Abstract
Self-supervised neural language models with attention have recently been applied to biological sequence data, advancing structure, function and mutational effect prediction. Some protein language models, including MSA Transformer and AlphaFold's EvoFormer, take multiple sequence alignments (MSAs) of evolutionarily related proteins as inputs. Simple combinations of MSA Transformer's row attentions have led to state-of-the-art unsupervised structural contact prediction. We demonstrate that similarly simple, and universal, combinations of MSA Transformer's column attentions strongly correlate with Hamming distances between sequences in MSAs. Therefore, MSA-based language models encode detailed phylogenetic relationships. We further show that these models can separate coevolutionary signals encoding functional and structural constraints from phylogenetic correlations reflecting historical contingency. To assess this, we generate synthetic MSAs, either without or with phylogeny, from Potts models trained on natural MSAs. We find that unsupervised contact prediction is substantially more resilient to phylogenetic noise when using MSA Transformer versus inferred Potts models.
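The Hamming distances mentioned above are simple to compute directly from an alignment. The sketch below builds the pairwise matrix for a toy MSA; it only illustrates the quantity that the column attentions are reported to correlate with and does not involve MSA Transformer itself. The function names, the choice to normalize by alignment length, and the toy sequences are assumptions made for illustration.

```python
def hamming_distance(a, b):
    """Number of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same aligned length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_hamming(msa, normalize=True):
    """Symmetric matrix of Hamming distances between all sequences in an MSA.

    Assumption: with normalize=True, distances are divided by the
    alignment length so that they lie between 0 and 1.
    """
    n = len(msa)
    length = len(msa[0])
    matrix = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for s in range(r + 1, n):
            d = hamming_distance(msa[r], msa[s])
            value = d / length if normalize else float(d)
            matrix[r][s] = value
            matrix[s][r] = value
    return matrix


# Hypothetical toy alignment of three sequences of length 4.
toy_msa = ["ACDE", "ACDF", "GCDF"]
distances = pairwise_hamming(toy_msa)
```

In this toy case the first and third sequences differ at two of four positions, and each differs from the second at one position, so the matrix reflects a simple chain-like relatedness pattern.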
Affiliation(s)
- Umberto Lupo
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), CH-1015, Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, CH-1015, Lausanne, Switzerland
- Damiano Sgarbossa
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), CH-1015, Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, CH-1015, Lausanne, Switzerland
- Anne-Florence Bitbol
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), CH-1015, Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, CH-1015, Lausanne, Switzerland
7
Wu C, Guo D. Computational Docking Reveals Co-Evolution of C4 Carbon Delivery Enzymes in Diverse Plants. Int J Mol Sci 2022; 23:12688. [PMID: 36293547 PMCID: PMC9604239 DOI: 10.3390/ijms232012688] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/05/2022] [Revised: 10/14/2022] [Accepted: 10/19/2022] [Indexed: 11/16/2022] Open
Abstract
Proteins are modular functionalities regulating multiple cellular activities in prokaryotes and eukaryotes. As a consequence of higher plants adapting to arid and thermal conditions, C4 photosynthesis is the carbon fixation process involving multi-enzymes working in a coordinated fashion. However, how these enzymes interact with each other and whether they co-evolve in parallel to maintain interactions in different plants remain elusive to date. Here, we report our findings on the global protein co-evolution relationship and local dynamics of co-varying site shifts in key C4 photosynthetic enzymes. We found that in most of the selected key C4 photosynthetic enzymes, global pairwise co-evolution events exist to form functional couplings. Besides, protein-protein interactions between these enzymes may suggest their unknown functionalities in the carbon delivery process. For PEPC and PPCK regulation pairs, pocket formation at the interactive interface is not necessary for their function. This feature is distinct from another well-known regulation pair in C4 photosynthesis, namely, PPDK and PPDK-RP, where the pockets are necessary. Our findings facilitate the discovery of novel protein regulation types and contribute to expanding our knowledge about C4 photosynthesis.
Affiliation(s)
- Dianjing Guo
  - State Key Laboratory of Agrobiotechnology, School of Life Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
8
Gerardos A, Dietler N, Bitbol AF. Correlations from structure and phylogeny combine constructively in the inference of protein partners from sequences. PLoS Comput Biol 2022; 18:e1010147. [PMID: 35576238 PMCID: PMC9135348 DOI: 10.1371/journal.pcbi.1010147] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/22/2021] [Revised: 05/26/2022] [Accepted: 04/27/2022] [Indexed: 11/19/2022] Open
Abstract
Inferring protein-protein interactions from sequences is an important task in computational biology. Recent methods based on Direct Coupling Analysis (DCA) or Mutual Information (MI) allow to find interaction partners among paralogs of two protein families. Does successful inference mainly rely on correlations from structural contacts or from phylogeny, or both? Do these two types of signal combine constructively or hinder each other? To address these questions, we generate and analyze synthetic data produced using a minimal model that allows us to control the amounts of structural constraints and phylogeny. We show that correlations from these two sources combine constructively to increase the performance of partner inference by DCA or MI. Furthermore, signal from phylogeny can rescue partner inference when signal from contacts becomes less informative, including in the realistic case where inter-protein contacts are restricted to a small subset of sites. We also demonstrate that DCA-inferred couplings between non-contact pairs of sites improve partner inference in the presence of strong phylogeny, while deteriorating it otherwise. Moreover, restricting to non-contact pairs of sites preserves inference performance in the presence of strong phylogeny. In a natural data set, as well as in realistic synthetic data based on it, we find that non-contact pairs of sites contribute positively to partner inference performance, and that restricting to them preserves performance, evidencing an important role of phylogeny.
Affiliation(s)
- Andonis Gerardos
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Nicola Dietler
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Anne-Florence Bitbol
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland