1
Bochtler M. How the technologies behind self-driving cars, social networks, ChatGPT, and DALL-E2 are changing structural biology. Bioessays 2025; 47:e2400155. [PMID: 39404756 DOI: 10.1002/bies.202400155] [Received: 06/30/2024] [Revised: 09/08/2024] [Accepted: 09/26/2024] [Indexed: 12/22/2024]
Abstract
The performance of deep Neural Networks (NNs) in the text (ChatGPT) and image (DALL-E2) domains has attracted worldwide attention. Convolutional NNs (CNNs), Large Language Models (LLMs), Denoising Diffusion Probabilistic Models (DDPMs)/Noise Conditional Score Networks (NCSNs), and Graph NNs (GNNs) have impacted computer vision, language editing and translation, automated conversation, image generation, and social network management. Proteins can be viewed as texts written with the alphabet of amino acids, as images, or as graphs of interacting residues. Each of these perspectives suggests the use of tools from a different area of deep learning for protein structural biology. Here, I review how CNNs, LLMs, DDPMs/NCSNs, and GNNs have led to major advances in protein structure prediction, inverse folding, protein design, and small molecule design. This review is primarily intended as a deep learning primer for practicing experimental structural biologists. However, extensive references to the deep learning literature should also make it relevant to readers who have a background in machine learning, physics or statistics, and an interest in protein structural biology.
Affiliation(s)
- Matthias Bochtler
- International Institute of Molecular and Cell Biology in Warsaw, Warsaw, Poland
- Institute of Biochemistry and Biophysics, Warsaw, Poland
2
Zeng HL, Yang CL, Jing B, Barton J, Aurell E. Two fitness inference schemes compared using allele frequencies from 1068 391 sequences sampled in the UK during the COVID-19 pandemic. Phys Biol 2024; 22:016003. [PMID: 39536448 DOI: 10.1088/1478-3975/ad9213] [Received: 06/11/2024] [Accepted: 11/13/2024] [Indexed: 11/16/2024]
Abstract
Throughout the course of the SARS-CoV-2 pandemic, genetic variation has contributed to the spread and persistence of the virus. For example, various mutations have allowed SARS-CoV-2 to escape antibody neutralization or to bind more strongly to the receptors that it uses to enter human cells. Here, we compared two methods that estimate the fitness effects of viral mutations using the abundant sequence data gathered over the course of the pandemic. Both approaches are grounded in population genetics theory but with different assumptions. One approach, tQLE, features an epistatic fitness landscape and assumes that alleles are nearly in linkage equilibrium. Another approach, MPL, assumes a simple, additive fitness landscape, but allows for any level of correlation between alleles. We characterized differences in the distributions of fitness values inferred by each approach and in the ranks of fitness values that they assign to sequences across time. We find that in a large fraction of weeks the two methods are in good agreement as to their top-ranked sequences, i.e. as to which sequences observed that week are most fit. We also find that agreement between the ranking of sequences varies with genetic unimodality in the population in a given week.
Affiliation(s)
- Hong-Li Zeng
- School of Science, Nanjing University of Posts and Telecommunications, Key Laboratory of Radio and Micro-Nano Electronics of Jiangsu Province, Nanjing 210023, People's Republic of China
- Cheng-Long Yang
- School of Science, Nanjing University of Posts and Telecommunications, Key Laboratory of Radio and Micro-Nano Electronics of Jiangsu Province, Nanjing 210023, People's Republic of China
- Bo Jing
- School of Science, Nanjing University of Posts and Telecommunications, Key Laboratory of Radio and Micro-Nano Electronics of Jiangsu Province, Nanjing 210023, People's Republic of China
- John Barton
- Department of Computational & Systems Biology, University of Pittsburgh School of Medicine, Pittsburgh, PA 15260, United States of America
- Erik Aurell
- Department of Computational Science and Technology, AlbaNova University Center, SE-106 91 Stockholm, Sweden
3
Shimagaki KS, Barton JP. Efficient epistasis inference via higher-order covariance matrix factorization. bioRxiv 2024:2024.10.14.618287. [PMID: 39464126 PMCID: PMC11507688 DOI: 10.1101/2024.10.14.618287] [Indexed: 10/29/2024]
Abstract
Epistasis can profoundly influence evolutionary dynamics. Temporal genetic data, consisting of sequences sampled repeatedly from a population over time, provides a unique resource to understand how epistasis shapes evolution. However, detecting epistatic interactions from sequence data is technically challenging. Existing methods for identifying epistasis are computationally demanding, limiting their applicability to real-world data. Here, we present a novel computational method for inferring epistasis that significantly reduces computational costs without sacrificing accuracy. We validated our approach in simulations and applied it to study HIV-1 evolution over multiple years in a data set of 16 individuals. There we observed a strong excess of negative epistatic interactions between beneficial mutations, especially mutations involved in immune escape. Our method is general and could be used to characterize epistasis in other large data sets.
Affiliation(s)
- Kai S. Shimagaki
- Department of Computational and Systems Biology, University of Pittsburgh School of Medicine, USA
- Department of Physics and Astronomy, University of Pittsburgh, USA
- John P. Barton
- Department of Computational and Systems Biology, University of Pittsburgh School of Medicine, USA
- Department of Physics and Astronomy, University of Pittsburgh, USA
4
Dietler N, Abbara A, Choudhury S, Bitbol AF. Impact of phylogeny on the inference of functional sectors from protein sequence data. PLoS Comput Biol 2024; 20:e1012091. [PMID: 39312591 PMCID: PMC11449291 DOI: 10.1371/journal.pcbi.1012091] [Received: 04/19/2024] [Revised: 10/03/2024] [Accepted: 09/10/2024] [Indexed: 09/25/2024] Open
Abstract
Statistical analysis of multiple sequence alignments of homologous proteins has revealed groups of coevolving amino acids called sectors. These groups of amino-acid sites feature collective correlations in their amino-acid usage, and they are associated to functional properties. Modeling showed that nonlinear selection on an additive functional trait of a protein is generically expected to give rise to a functional sector. These modeling results motivated a principled method, called ICOD, which is designed to identify functional sectors, as well as mutational effects, from sequence data. However, a challenge for all methods aiming to identify sectors from multiple sequence alignments is that correlations in amino-acid usage can also arise from the mere fact that homologous sequences share common ancestry, i.e. from phylogeny. Here, we generate controlled synthetic data from a minimal model comprising both phylogeny and functional sectors. We use this data to dissect the impact of phylogeny on sector identification and on mutational effect inference by different methods. We find that ICOD is most robust to phylogeny, but that conservation is also quite robust. Next, we consider natural multiple sequence alignments of protein families for which deep mutational scan experimental data is available. We show that in this natural data, conservation and ICOD best identify sites with strong functional roles, in agreement with our results on synthetic data. Importantly, these two methods have different premises, since they respectively focus on conservation and on correlations. Thus, their joint use can reveal complementary information.
Affiliation(s)
- Nicola Dietler
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Alia Abbara
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Subham Choudhury
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Anne-Florence Bitbol
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
5
Sgarbossa D, Lupo U, Bitbol AF. Generative power of a protein language model trained on multiple sequence alignments. eLife 2023; 12:e79854. [PMID: 36734516 PMCID: PMC10038667 DOI: 10.7554/elife.79854] [Received: 04/28/2022] [Accepted: 02/02/2023] [Indexed: 02/04/2023] Open
Abstract
Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families and learn constraints associated to protein structure and function. They thus open the possibility for generating novel sequences belonging to protein families. Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates to this end. We propose and test an iterative method that directly employs the masked language modeling objective to generate sequences using MSA Transformer. We demonstrate that the resulting sequences score as well as natural sequences, for homology, coevolution, and structure-based measures. For large protein families, our synthetic sequences have similar or better properties compared to sequences generated by Potts models, including experimentally validated ones. Moreover, for small protein families, our generation method based on MSA Transformer outperforms Potts models. Our method also more accurately reproduces the higher-order statistics and the distribution of sequences in sequence space of natural data than Potts models. MSA Transformer is thus a strong candidate for protein sequence generation and protein design.
Affiliation(s)
- Damiano Sgarbossa
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Umberto Lupo
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Anne-Florence Bitbol
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
6
Dietler N, Lupo U, Bitbol AF. Impact of phylogeny on structural contact inference from protein sequence data. J R Soc Interface 2023; 20:20220707. [PMID: 36751926 PMCID: PMC9905998 DOI: 10.1098/rsif.2022.0707] [Received: 09/26/2022] [Accepted: 01/09/2023] [Indexed: 02/09/2023] Open
Abstract
Local and global inference methods have been developed to infer structural contacts from multiple sequence alignments of homologous proteins. They rely on correlations in amino acid usage at contacting sites. Because homologous proteins share a common ancestry, their sequences also feature phylogenetic correlations, which can impair contact inference. We investigate this effect by generating controlled synthetic data from a minimal model where the importance of contacts and of phylogeny can be tuned. We demonstrate that global inference methods, specifically Potts models, are more resilient to phylogenetic correlations than local methods, based on covariance or mutual information. This holds whether or not phylogenetic corrections are used, and may explain the success of global methods. We analyse the roles of selection strength and of phylogenetic relatedness. We show that sites that mutate early in the phylogeny yield false positive contacts. We consider natural data and realistic synthetic data, and our findings generalize to these cases. Our results highlight the impact of phylogeny on contact prediction from protein sequences and illustrate the interplay between the rich structure of biological data and inference.
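To illustrate the kind of "local" statistic this abstract refers to, the mutual information between two alignment columns can be estimated from empirical frequencies. The sketch below is not the authors' pipeline: the function names and the toy alignment are hypothetical, and it applies no pseudocounts, sequence reweighting, or phylogenetic corrections.

```python
from collections import Counter
from math import log


def column(msa, i):
    """Return the residues found at position i of every aligned sequence."""
    return [seq[i] for seq in msa]


def mutual_information(msa, i, j):
    """Empirical mutual information (in nats) between columns i and j.

    Uses plain frequency counts; real analyses typically add pseudocounts
    and sequence weights, which are omitted here (assumption of this sketch).
    """
    n = len(msa)
    col_i, col_j = column(msa, i), column(msa, j)
    f_i, f_j = Counter(col_i), Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), n_ab in f_ij.items():
        p_ab = n_ab / n
        mi += p_ab * log(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return mi


# Hypothetical toy alignment: columns 0 and 1 covary perfectly,
# column 2 is fully conserved, column 3 varies independently of column 0.
toy_msa = ["ACDA", "ACDC", "GTDA", "GTDC"]

mi_coupled = mutual_information(toy_msa, 0, 1)
mi_independent = mutual_information(toy_msa, 0, 3)
mi_conserved = mutual_information(toy_msa, 0, 2)
```

In this toy case the perfectly covarying pair scores log 2, while the independent and conserved pairs score zero. A pairwise score of this kind cannot tell whether covariation comes from a contact or from shared ancestry, which is the confound the abstract examines.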
Affiliation(s)
- Nicola Dietler
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Umberto Lupo
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Anne-Florence Bitbol
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
7
Lupo U, Sgarbossa D, Bitbol AF. Protein language models trained on multiple sequence alignments learn phylogenetic relationships. Nat Commun 2022; 13:6298. [PMID: 36273003 PMCID: PMC9588007 DOI: 10.1038/s41467-022-34032-y] [Received: 04/08/2022] [Accepted: 10/07/2022] [Indexed: 12/25/2022] Open
Abstract
Self-supervised neural language models with attention have recently been applied to biological sequence data, advancing structure, function and mutational effect prediction. Some protein language models, including MSA Transformer and AlphaFold's EvoFormer, take multiple sequence alignments (MSAs) of evolutionarily related proteins as inputs. Simple combinations of MSA Transformer's row attentions have led to state-of-the-art unsupervised structural contact prediction. We demonstrate that similarly simple, and universal, combinations of MSA Transformer's column attentions strongly correlate with Hamming distances between sequences in MSAs. Therefore, MSA-based language models encode detailed phylogenetic relationships. We further show that these models can separate coevolutionary signals encoding functional and structural constraints from phylogenetic correlations reflecting historical contingency. To assess this, we generate synthetic MSAs, either without or with phylogeny, from Potts models trained on natural MSAs. We find that unsupervised contact prediction is substantially more resilient to phylogenetic noise when using MSA Transformer versus inferred Potts models.
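The quantity that the column attentions are reported to track, the Hamming distance between aligned sequences, is simple to compute. The following minimal sketch uses hypothetical function names and a made-up toy alignment. It computes only the pairwise distance matrix and does not involve MSA Transformer or its attention maps.

```python
def hamming(a, b):
    """Number of aligned positions at which sequences a and b differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def hamming_matrix(msa):
    """Symmetric matrix of pairwise Hamming distances for an alignment."""
    return [[hamming(a, b) for b in msa] for a in msa]


# Hypothetical toy alignment of three sequences of length four.
toy_msa = ["ACDA", "ACDC", "GTDA"]
distances = hamming_matrix(toy_msa)
```

A matrix like `distances` is a crude proxy for phylogenetic relatedness: closely related sequences differ at few positions.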
Affiliation(s)
- Umberto Lupo
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), CH-1015, Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, CH-1015, Lausanne, Switzerland
- Damiano Sgarbossa
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), CH-1015, Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, CH-1015, Lausanne, Switzerland
- Anne-Florence Bitbol
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), CH-1015, Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, CH-1015, Lausanne, Switzerland
8
Zeng HL, Liu Y, Dichio V, Aurell E. Temporal epistasis inference from more than 3 500 000 SARS-CoV-2 genomic sequences. Phys Rev E 2022; 106:044409. [PMID: 36397507 DOI: 10.1103/physreve.106.044409] [Received: 02/24/2022] [Accepted: 09/19/2022] [Indexed: 06/16/2023]
Abstract
We use direct coupling analysis (DCA) to determine epistatic interactions between loci of variability of the SARS-CoV-2 virus, segmenting genomes by month of sampling. We use full-length, high-quality genomes from the GISAID repository up to October 2021 for a total of over 3 500 000 genomes. We find that DCA terms are more stable over time than correlations but nevertheless change over time as mutations disappear from the global population or reach fixation. Correlations are enriched for phylogenetic effects, and in particular statistical dependencies at short genomic distances, while DCA brings out links at longer genomic distance. We discuss the validity of a DCA analysis under these conditions in terms of a transient quasilinkage equilibrium state. We identify putative epistatic interaction mutations involving loci in spike.
Affiliation(s)
- Hong-Li Zeng
- School of Science, Nanjing University of Posts and Telecommunications, New Energy Technology Engineering Laboratory of Jiangsu Province, Nanjing 210023, China
- Yue Liu
- School of Science, Nanjing University of Posts and Telecommunications, New Energy Technology Engineering Laboratory of Jiangsu Province, Nanjing 210023, China
- Vito Dichio
- Inria Paris, Aramis Project Team, Paris 75013, France
- Institut du Cerveau, ICM, Inserm U 1127, CNRS UMR 7225, Sorbonne Université, Paris, France
- Erik Aurell
- Department of Computational Science and Technology, AlbaNova University Center, SE-106 91 Stockholm, Sweden
9
Gerardos A, Dietler N, Bitbol AF. Correlations from structure and phylogeny combine constructively in the inference of protein partners from sequences. PLoS Comput Biol 2022; 18:e1010147. [PMID: 35576238 PMCID: PMC9135348 DOI: 10.1371/journal.pcbi.1010147] [Received: 11/22/2021] [Revised: 05/26/2022] [Accepted: 04/27/2022] [Indexed: 11/19/2022] Open
Abstract
Inferring protein-protein interactions from sequences is an important task in computational biology. Recent methods based on Direct Coupling Analysis (DCA) or Mutual Information (MI) allow to find interaction partners among paralogs of two protein families. Does successful inference mainly rely on correlations from structural contacts or from phylogeny, or both? Do these two types of signal combine constructively or hinder each other? To address these questions, we generate and analyze synthetic data produced using a minimal model that allows us to control the amounts of structural constraints and phylogeny. We show that correlations from these two sources combine constructively to increase the performance of partner inference by DCA or MI. Furthermore, signal from phylogeny can rescue partner inference when signal from contacts becomes less informative, including in the realistic case where inter-protein contacts are restricted to a small subset of sites. We also demonstrate that DCA-inferred couplings between non-contact pairs of sites improve partner inference in the presence of strong phylogeny, while deteriorating it otherwise. Moreover, restricting to non-contact pairs of sites preserves inference performance in the presence of strong phylogeny. In a natural data set, as well as in realistic synthetic data based on it, we find that non-contact pairs of sites contribute positively to partner inference performance, and that restricting to them preserves performance, evidencing an important role of phylogeny.
Affiliation(s)
- Andonis Gerardos
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Nicola Dietler
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Anne-Florence Bitbol
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
10
Extracting phylogenetic dimensions of coevolution reveals hidden functional signals. Sci Rep 2022; 12:820. [PMID: 35039514 PMCID: PMC8764114 DOI: 10.1038/s41598-021-04260-1] [Received: 09/02/2021] [Accepted: 12/17/2021] [Indexed: 11/08/2022] Open
Abstract
Despite the structural and functional information contained in the statistical coupling between pairs of residues in a protein, coevolution associated with function is often obscured by artifactual signals such as genetic drift, which shapes a protein's phylogenetic history and gives rise to concurrent variation between protein sequences that is not driven by selection for function. Here, we introduce a background model for phylogenetic contributions of statistical coupling that separates the coevolution signal due to inter-clade and intra-clade sequence comparisons and demonstrate that coevolution can be measured on multiple phylogenetic timescales within a single protein. Our method, nested coevolution (NC), can be applied as an extension to any coevolution metric. We use NC to demonstrate that poorly conserved residues can nonetheless have important roles in protein function. Moreover, NC improved the structural-contact predictions of several coevolution-based methods, particularly in subsampled alignments with fewer sequences. NC also lowered the noise in detecting functional sectors of collectively coevolving residues. Sectors of coevolving residues identified after application of NC were more spatially compact and phylogenetically distinct from the rest of the protein, and strongly enriched for mutations that disrupt protein activity. Thus, our conceptualization of the phylogenetic separation of coevolution provides the potential to further elucidate relationships among protein evolution, function, and genetic diseases.
11
On the effect of phylogenetic correlations in coevolution-based contact prediction in proteins. PLoS Comput Biol 2021; 17:e1008957. [PMID: 34029316 PMCID: PMC8177639 DOI: 10.1371/journal.pcbi.1008957] [Received: 07/15/2020] [Revised: 06/04/2021] [Accepted: 04/09/2021] [Indexed: 12/04/2022] Open
Abstract
Coevolution-based contact prediction, either directly by coevolutionary couplings resulting from global statistical sequence models or using structural supervision and deep learning, has found widespread application in protein-structure prediction from sequence. However, one of the basic assumptions in global statistical modeling is that sequences form an at least approximately independent sample of an unknown probability distribution, which is to be learned from data. In the case of protein families, this assumption is obviously violated by phylogenetic relations between protein sequences. It has turned out to be notoriously difficult to take phylogenetic correlations into account in coevolutionary model learning. Here, we propose a complementary approach: we develop strategies to randomize or resample sequence data, such that conservation patterns and phylogenetic relations are preserved, while intrinsic (i.e. structure- or function-based) coevolutionary couplings are removed. A comparison between the results of Direct Coupling Analysis applied to real and to resampled data shows that the largest coevolutionary couplings, i.e. those used for contact prediction, are only weakly influenced by phylogeny. However, the phylogeny-induced spurious couplings in the resampled data are compatible in size with the first false-positive contact predictions from real data. Dissecting functional from phylogeny-induced couplings might therefore extend accurate contact predictions to the range of intermediate-size couplings. Many homologous protein families contain thousands of highly diverged amino-acid sequences, which fold into close-to-identical three-dimensional structures and fulfill almost identical biological tasks. 
Global coevolutionary models, like those inferred by the Direct Coupling Analysis (DCA), assume that families can be considered as samples of some unknown statistical model, and that the parameters of these models represent evolutionary constraints acting on protein sequences. To learn these models from data, DCA and related approaches have to also assume that the distinct sequences in a protein family are close to independent, while in reality they are characterized by involved hierarchical phylogenetic relationships. Here we propose null models for sequence alignments, which maintain patterns of amino-acid conservation and phylogeny contained in the data, but destroy any coevolutionary couplings, frequently used in protein structure prediction. We find that phylogeny actually induces spurious non-zero couplings. These are, however, significantly smaller than the largest couplings derived from natural sequences, and therefore have only little influence on the first predicted contacts. However, in the range of intermediate couplings, they may lead to statistically significant effects. Dissecting phylogenetic from functional couplings might therefore extend the range of accurately predicted structural contacts down to smaller coupling strengths than those currently used.
12
Information Theory in Molecular Evolution: From Models to Structures and Dynamics. Entropy 2021; 23:e23040482. [PMID: 33921557 PMCID: PMC8073717 DOI: 10.3390/e23040482] [Received: 04/15/2021] [Accepted: 04/15/2021] [Indexed: 11/27/2022]
13
Zeng HL, Dichio V, Rodríguez Horta E, Thorell K, Aurell E. Global analysis of more than 50,000 SARS-CoV-2 genomes reveals epistasis between eight viral genes. Proc Natl Acad Sci U S A 2020; 117:31519-31526. [PMID: 33203681 PMCID: PMC7733830 DOI: 10.1073/pnas.2012331117] [Indexed: 12/19/2022] Open
Abstract
Genome-wide epistasis analysis is a powerful tool to infer gene interactions, which can guide drug and vaccine development and lead to deeper understanding of microbial pathogenesis. We have considered all complete severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genomes deposited in the Global Initiative on Sharing All Influenza Data (GISAID) repository until four different cutoff dates, and used direct coupling analysis together with an assumption of quasi-linkage equilibrium to infer epistatic contributions to fitness from polymorphic loci. We find eight interactions, of which three are between pairs where one locus lies in gene ORF3a, both loci holding nonsynonymous mutations. We also find interactions between two loci in gene nsp13, both holding nonsynonymous mutations, and four interactions involving one locus holding a synonymous mutation. Altogether, we infer interactions between loci in viral genes ORF3a and nsp2, nsp12, and nsp6, between ORF8 and nsp4, and between loci in genes nsp2, nsp13, and nsp14. The paper opens the prospect to use prominent epistatically linked pairs as a starting point to search for combinatorial weaknesses of recombinant viral pathogens.
Affiliation(s)
- Hong-Li Zeng
- New Energy Technology Engineering Laboratory of Jiangsu Province, School of Science, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China
- Nordic Institute for Theoretical Physics, Royal Institute of Technology and Stockholm University, 10691 Stockholm, Sweden
- Vito Dichio
- Nordic Institute for Theoretical Physics, Royal Institute of Technology and Stockholm University, 10691 Stockholm, Sweden
- Department of Physics, University of Trieste, 34151 Trieste, Italy
- Department of Computational Science and Technology, AlbaNova University Center, 10691 Stockholm, Sweden
- Edwin Rodríguez Horta
- Group of Complex Systems and Statistical Physics, Department of Theoretical Physics, Physics Faculty, University of Havana, 10400 Havana, Cuba
- Kaisa Thorell
- Department of Infectious Diseases, Institute of Biomedicine, Sahlgrenska Academy, University of Gothenburg, 40530 Gothenburg, Sweden
- Center for Translational Microbiome Research, Department of Microbiology, Cell and Tumor Biology, Karolinska Institutet, 17177 Stockholm, Sweden
- Erik Aurell
- Department of Computational Science and Technology, AlbaNova University Center, 10691 Stockholm, Sweden