1
Rix G, Williams RL, Hu VJ, Spinner H, Pisera A(O, Marks DS, Liu CC. Continuous evolution of user-defined genes at 1 million times the genomic mutation rate. Science 2024; 386:eadm9073. [PMID: 39509492 PMCID: PMC11750425 DOI: 10.1126/science.adm9073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/12/2023] [Accepted: 09/10/2024] [Indexed: 11/15/2024]
Abstract
When nature evolves a gene over eons at scale, it produces a diversity of homologous sequences with patterns of conservation and change that contain rich structural, functional, and historical information about the gene. However, natural gene diversity accumulates slowly and likely excludes large regions of functional sequence space, limiting the information that is encoded and extractable. We introduce upgraded orthogonal DNA replication (OrthoRep) systems that radically accelerate the evolution of chosen genes under selection in yeast. When applied to a maladapted biosynthetic enzyme, we obtained collections of extensively diverged sequences with patterns that revealed structural and environmental constraints shaping the enzyme's activity. Our upgraded OrthoRep systems should support the discovery of factors influencing gene evolution, uncover previously unknown regions of fitness landscapes, and find broad applications in biomolecular engineering.
Affiliation(s)
- Gordon Rix
  - Department of Molecular Biology and Biochemistry, University of California; Irvine, CA, 92617, USA
- Rory L. Williams
  - Department of Biomedical Engineering, University of California; Irvine, CA, 92617, USA
- Vincent J. Hu
  - Department of Biomedical Engineering, University of California; Irvine, CA, 92617, USA
- Han Spinner
  - Department of Systems Biology, Harvard Medical School; Boston, MA, 02115, USA
- Debora S. Marks
  - Department of Systems Biology, Harvard Medical School; Boston, MA, 02115, USA
  - Broad Institute of Harvard and MIT; Cambridge, MA, 02142, USA
- Chang C. Liu
  - Department of Molecular Biology and Biochemistry, University of California; Irvine, CA, 92617, USA
  - Department of Biomedical Engineering, University of California; Irvine, CA, 92617, USA
  - Department of Chemistry, University of California; Irvine, CA, 92617, USA
  - Center for Synthetic Biology, University of California; Irvine, CA, 92617, USA
2
Porter LL, Artsimovitch I, Ramírez-Sarmiento CA. Metamorphic proteins and how to find them. Curr Opin Struct Biol 2024; 86:102807. [PMID: 38537533 PMCID: PMC11102287 DOI: 10.1016/j.sbi.2024.102807] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2024] [Revised: 03/05/2024] [Accepted: 03/06/2024] [Indexed: 04/04/2024]
Abstract
In the last two decades, our existing notion that most foldable proteins have a unique native state has been challenged by the discovery of metamorphic proteins, which reversibly interconvert between multiple, sometimes highly dissimilar, native states. As the number of known metamorphic proteins increases, several computational and experimental strategies have emerged for gaining insights about their refolding processes and identifying unknown metamorphic proteins amongst the known proteome. In this review, we describe the current advances in biophysically and functionally ascertaining the structural interconversions of metamorphic proteins and how coevolution can be harnessed to identify novel metamorphic proteins from sequence information. We also discuss the challenges and ongoing efforts in using artificial intelligence-based protein structure prediction methods to discover metamorphic proteins and predict their corresponding three-dimensional structures.
Affiliation(s)
- Lauren L Porter
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA; Biochemistry and Biophysics Center, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, MD 20892, USA.
- Irina Artsimovitch
  - Department of Microbiology and Center for RNA Biology, The Ohio State University, Columbus, OH 43210, USA.
- César A Ramírez-Sarmiento
  - Institute for Biological and Medical Engineering, Schools of Engineering, Medicine and Biological Sciences, Pontificia Universidad Católica de Chile, Santiago 7820436, Chile; ANID, Millennium Science Initiative Program, Millennium Institute for Integrative Biology (iBio), Santiago 833150, Chile.
3
Kalogeropoulos K, Bohn MF, Jenkins DE, Ledergerber J, Sørensen CV, Hofmann N, Wade J, Fryer T, Thi Tuyet Nguyen G, Auf dem Keller U, Laustsen AH, Jenkins TP. A comparative study of protein structure prediction tools for challenging targets: Snake venom toxins. Toxicon 2024; 238:107559. [PMID: 38113945 DOI: 10.1016/j.toxicon.2023.107559] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2023] [Revised: 12/06/2023] [Accepted: 12/08/2023] [Indexed: 12/21/2023]
Abstract
Protein structure determination is a critical aspect of biological research, enabling us to understand protein function and potential applications. Recent advances in deep learning and artificial intelligence have led to the development of several protein structure prediction tools, such as AlphaFold2 and ColabFold. However, their performance has primarily been evaluated on well-characterised proteins and their ability to predict structures of proteins lacking experimental structures, such as many snake venom toxins, has been less scrutinised. In this study, we evaluated three modelling tools on their prediction of over 1000 snake venom toxin structures for which no experimental structures exist. Our findings show that AlphaFold2 (AF2) performed the best across all assessed parameters. We also observed that ColabFold (CF) only scored slightly worse than AF2, while being computationally less intensive. All tools struggled with regions of intrinsic disorder, such as loops and propeptide regions, and performed well in predicting the structure of functional domains. Overall, our study highlights the importance of exercising caution when working with proteins with no experimental structures available, particularly those that are large and contain flexible regions. Nonetheless, leveraging computational structure prediction tools can provide valuable insights into the modelling of protein interactions with different targets and reveal potential binding sites, active sites, and conformational changes, as well as into the design of potential molecular binders for reagent, diagnostic, or therapeutic purposes.
Affiliation(s)
- Markus-Frederik Bohn
  - Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
- Jann Ledergerber
  - Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark; Department of Chemistry and Applied Bioscience, ETH Zurich, Zurich, Switzerland
- Christoffer V Sørensen
  - Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
- Nils Hofmann
  - Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
- Jack Wade
  - Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
- Thomas Fryer
  - Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
- Giang Thi Tuyet Nguyen
  - Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
- Ulrich Auf dem Keller
  - Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
- Andreas H Laustsen
  - Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
- Timothy P Jenkins
  - Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark.
4
Wayment-Steele HK, Ojoawo A, Otten R, Apitz JM, Pitsawong W, Hömberger M, Ovchinnikov S, Colwell L, Kern D. Predicting multiple conformations via sequence clustering and AlphaFold2. Nature 2024; 625:832-839. [PMID: 37956700 PMCID: PMC10808063 DOI: 10.1038/s41586-023-06832-9] [Citation(s) in RCA: 102] [Impact Index Per Article: 102.0] [Received: 07/07/2023] [Accepted: 11/03/2023] [Indexed: 11/15/2023]
Abstract
AlphaFold2 (ref. 1) has revolutionized structural biology by accurately predicting single structures of proteins. However, a protein's biological function often depends on multiple conformational substates (ref. 2), and disease-causing point mutations often cause population changes within these substates (refs. 3, 4). We demonstrate that clustering a multiple-sequence alignment by sequence similarity enables AlphaFold2 to sample alternative states of known metamorphic proteins with high confidence. Using this method, named AF-Cluster, we investigated the evolutionary distribution of predicted structures for the metamorphic protein KaiB (ref. 5) and found that predictions of both conformations were distributed in clusters across the KaiB family. We used nuclear magnetic resonance spectroscopy to confirm an AF-Cluster prediction: a cyanobacteria KaiB variant is stabilized in the opposite state compared with the more widely studied variant. To test AF-Cluster's sensitivity to point mutations, we designed and experimentally verified a set of three mutations predicted to flip KaiB from Rhodobacter sphaeroides from the ground to the fold-switched state. Finally, screening for alternative states in protein families without known fold switching identified a putative alternative state for the oxidoreductase Mpt53 in Mycobacterium tuberculosis. Further development of such bioinformatic methods in tandem with experiments will probably have a considerable impact on predicting protein energy landscapes, essential for illuminating biological function.
Affiliation(s)
- Hannah K Wayment-Steele
  - Department of Biochemistry, Brandeis University and Howard Hughes Medical Institute, Waltham, MA, USA
- Adedolapo Ojoawo
  - Department of Biochemistry, Brandeis University and Howard Hughes Medical Institute, Waltham, MA, USA
- Renee Otten
  - Department of Biochemistry, Brandeis University and Howard Hughes Medical Institute, Waltham, MA, USA
  - Treeline Biosciences, Watertown, MA, USA
- Julia M Apitz
  - Department of Biochemistry, Brandeis University and Howard Hughes Medical Institute, Waltham, MA, USA
- Warintra Pitsawong
  - Department of Biochemistry, Brandeis University and Howard Hughes Medical Institute, Waltham, MA, USA
  - Biomolecular Discovery, Relay Therapeutics, Cambridge, MA, USA
- Marc Hömberger
  - Department of Biochemistry, Brandeis University and Howard Hughes Medical Institute, Waltham, MA, USA
  - Treeline Biosciences, Watertown, MA, USA
- Lucy Colwell
  - Google Research, Cambridge, MA, USA
  - Cambridge University, Cambridge, UK
- Dorothee Kern
  - Department of Biochemistry, Brandeis University and Howard Hughes Medical Institute, Waltham, MA, USA.
5
Rix G, Williams RL, Spinner H, Hu VJ, Marks DS, Liu CC. Continuous evolution of user-defined genes at 1-million-times the genomic mutation rate. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.11.13.566922. [PMID: 38014077 PMCID: PMC10680746 DOI: 10.1101/2023.11.13.566922] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2023]
Abstract
When nature maintains or evolves a gene's function over millions of years at scale, it produces a diversity of homologous sequences whose patterns of conservation and change contain rich structural, functional, and historical information about the gene. However, natural gene diversity likely excludes vast regions of functional sequence space and includes phylogenetic and evolutionary eccentricities, limiting what information we can extract. We introduce an accessible experimental approach for compressing long-term gene evolution to laboratory timescales, allowing for the direct observation of extensive adaptation and divergence followed by inference of structural, functional, and environmental constraints for any selectable gene. To enable this approach, we developed a new orthogonal DNA replication (OrthoRep) system that durably hypermutates chosen genes at a rate of >10⁻⁴ substitutions per base in vivo. When OrthoRep was used to evolve a conditionally essential maladapted enzyme, we obtained thousands of unique multi-mutation sequences with many pairs >60 amino acids apart (>15% divergence), revealing known and new factors influencing enzyme adaptation. The fitness of evolved sequences was not predictable by advanced machine learning models trained on natural variation. We suggest that OrthoRep supports the prospective and systematic discovery of constraints shaping gene evolution, uncovering of new regions in fitness landscapes, and general applications in biomolecular engineering.
6
Schafer JW, Porter LL. Evolutionary selection of proteins with two folds. Nat Commun 2023; 14:5478. [PMID: 37673981 PMCID: PMC10482954 DOI: 10.1038/s41467-023-41237-2] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 02/14/2023] [Accepted: 08/24/2023] [Indexed: 09/08/2023]
Abstract
Although most globular proteins fold into a single stable structure, an increasing number have been shown to remodel their secondary and tertiary structures in response to cellular stimuli. State-of-the-art algorithms predict that these fold-switching proteins adopt only one stable structure, missing their functionally critical alternative folds. Why these algorithms predict a single fold is unclear, but all of them infer protein structure from coevolved amino acid pairs. Here, we hypothesize that coevolutionary signatures are being missed. Suspecting that single-fold variants could be masking these signatures, we developed an approach, called Alternative Contact Enhancement (ACE), to search both highly diverse protein superfamilies (composed of single-fold and fold-switching variants) and protein subfamilies with more fold-switching variants. ACE successfully revealed coevolution of amino acid pairs uniquely corresponding to both conformations of 56/56 fold-switching proteins from distinct families. Then, we used ACE-derived contacts to (1) predict two experimentally consistent conformations of a candidate protein with unsolved structure and (2) develop a blind prediction pipeline for fold-switching proteins. The discovery of widespread dual-fold coevolution indicates that fold-switching sequences have been preserved by natural selection, implying that their functionalities provide evolutionary advantage and paving the way for predictions of diverse protein structures from single sequences.
Affiliation(s)
- Joseph W Schafer
  - National Library of Medicine, National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD, 20894, USA
- Lauren L Porter
  - National Library of Medicine, National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD, 20894, USA.
  - National Heart, Lung, and Blood Institute, Biochemistry and Biophysics Center, National Institutes of Health, Bethesda, MD, 20892, USA.
7
Konecki DM, Hamrick S, Wang C, Agosto MA, Wensel TG, Lichtarge O. CovET: A covariation-evolutionary trace method that identifies protein structure-function modules. J Biol Chem 2023; 299:104896. [PMID: 37290531 PMCID: PMC10338321 DOI: 10.1016/j.jbc.2023.104896] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/25/2023] [Revised: 06/01/2023] [Accepted: 06/02/2023] [Indexed: 06/10/2023]
Abstract
Measuring the relative effect that any two sequence positions have on each other may improve protein design or help better interpret coding variants. Current approaches use statistics and machine learning but rarely consider phylogenetic divergences which, as shown by Evolutionary Trace studies, provide insight into the functional impact of sequence perturbations. Here, we reframe covariation analyses in the Evolutionary Trace framework to measure the relative tolerance to perturbation of each residue pair during evolution. This approach (CovET) systematically accounts for phylogenetic divergences: at each divergence event, we penalize covariation patterns that belie evolutionary coupling. We find that while CovET approximates the performance of existing methods to predict individual structural contacts, it performs significantly better at finding structural clusters of coupled residues and ligand binding sites. For example, CovET found more functionally critical residues when we examined the RNA recognition motif and WW domains. It correlates better with large-scale epistasis screen data. In the dopamine D2 receptor, top CovET residue pairs recovered accurately the allosteric activation pathway characterized for Class A G protein-coupled receptors. These data suggest that CovET ranks highest the sequence position pairs that play critical functional roles through epistatic and allosteric interactions in evolutionarily relevant structure-function motifs. CovET complements current methods and may shed light on fundamental molecular mechanisms of protein structure and function.
Affiliation(s)
- Daniel M Konecki
  - Quantitative and Computational Biosciences Graduate Program, Baylor College of Medicine, Houston, Texas, USA
- Spencer Hamrick
  - Chemical, Physical, and Structural Biology Graduate Program, Baylor College of Medicine, Houston, Texas, USA
- Chen Wang
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
- Melina A Agosto
  - Verna and Marrs McLean Department of Biochemistry and Molecular Biology, Baylor College of Medicine, Houston, Texas, USA
- Theodore G Wensel
  - Quantitative and Computational Biosciences Graduate Program, Baylor College of Medicine, Houston, Texas, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA; Verna and Marrs McLean Department of Biochemistry and Molecular Biology, Baylor College of Medicine, Houston, Texas, USA; Cancer and Cell Biology Graduate Program, Baylor College of Medicine, Houston, Texas, USA
- Olivier Lichtarge
  - Quantitative and Computational Biosciences Graduate Program, Baylor College of Medicine, Houston, Texas, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA; Verna and Marrs McLean Department of Biochemistry and Molecular Biology, Baylor College of Medicine, Houston, Texas, USA; Cancer and Cell Biology Graduate Program, Baylor College of Medicine, Houston, Texas, USA; Computational and Integrative Biomedical Research Center, Baylor College of Medicine, Houston, Texas, USA.
8
Nigam D, Muthukrishnan E, Flores-López LF, Nigam M, Wamaitha MJ. Comparative Genome Analysis of Old World and New World TYLCV Reveals a Biasness toward Highly Variable Amino Acids in Coat Protein. PLANTS (BASEL, SWITZERLAND) 2023; 12:1995. [PMID: 37653912 PMCID: PMC10223811 DOI: 10.3390/plants12101995] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2023] [Revised: 05/01/2023] [Accepted: 05/08/2023] [Indexed: 09/02/2023]
Abstract
Begomoviruses, belonging to the family Geminiviridae and the genus Begomovirus, are DNA viruses that are transmitted by whitefly Bemisia tabaci (Gennadius) in a circulative persistent manner. They can easily adapt to new hosts and environments due to their wide host range and global distribution. However, the factors responsible for their adaptability and coevolutionary forces are yet to be explored. Among BGVs, TYLCV exhibits the broadest range of hosts. In this study, we have identified variable and coevolving amino acid sites in the proteins of Tomato yellow leaf curl virus (TYLCV) isolates from Old World (African, Indian, Japanese, and Oceania) and New World (Central and Southern America). We focused on mutations in the coat protein (CP), as it is highly variable and interacts with both vectors and host plants. Our observations indicate that some mutations were accumulating in Old World TYLCV isolates due to positive selection, with the S149N mutation being of particular interest. This mutation is associated with TYLCV isolates that have spread in Europe and Asia and is dominant in 78% of TYLCV isolates. On the other hand, the S149T mutation is restricted to isolates from Saudi Arabia. We further explored the implications of these amino acid changes through structural modeling. The results presented in this study suggest that certain hypervariable regions in the genome of TYLCV are conserved and may be important for adapting to different host environments. These regions could contribute to the mutational robustness of the virus, allowing it to persist in different host populations.
Affiliation(s)
- Deepti Nigam
  - Institute for Genomics of Crop Abiotic Stress Tolerance, Department of Plant and Soil Science, Texas Tech University (TTU), Lubbock, TX 79409, USA
  - Plant Pathology and Plant-Microbe Biology Section, School of Integrative Plant Science, Cornell University, Ithaca, NY 14850, USA
- Luis Fernando Flores-López
  - Departamento de Biotecnología y Bioquímica, Centro de Investigación y de Estudios Avanzados de IPN (CINVESTAV) Unidad Irapuato, Irapuato 368224, Mexico
- Manisha Nigam
  - Department of Biochemistry, Hemvati Nandan Bahuguna Garhwal University, Srinagar 246174, Uttarakhand, India
- Mwathi Jane Wamaitha
  - Kenya Agricultural and Livestock Research Organization (KALRO), Nairobi P.O. Box 14733-00800, Kenya
9
Durham J, Zhang J, Humphreys IR, Pei J, Cong Q. Recent advances in predicting and modeling protein-protein interactions. Trends Biochem Sci 2023; 48:527-538. [PMID: 37061423 DOI: 10.1016/j.tibs.2023.03.003] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 08/01/2022] [Revised: 03/03/2023] [Accepted: 03/17/2023] [Indexed: 04/17/2023]
Abstract
Protein-protein interactions (PPIs) drive biological processes, and disruption of PPIs can cause disease. With recent breakthroughs in structure prediction and a deluge of genomic sequence data, computational methods to predict PPIs and model spatial structures of protein complexes are now approaching the accuracy of experimental approaches for permanent interactions and show promise for elucidating transient interactions. As we describe here, the key to this success is rich evolutionary information deciphered from thousands of homologous sequences that coevolve in interacting partners. This covariation signal, revealed by sophisticated statistical and machine learning (ML) algorithms, predicts physiological interactions. Accurate artificial intelligence (AI)-based modeling of protein structures promises to provide accurate 3D models of PPIs at a proteome-wide scale.
Affiliation(s)
- Jesse Durham
  - Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA; Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA; Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Jing Zhang
  - Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA; Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA; Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Ian R Humphreys
  - Department of Biochemistry, University of Washington, Seattle, WA, USA; Institute for Protein Design, University of Washington, Seattle, WA, USA
- Jimin Pei
  - Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA; Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA; Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Qian Cong
  - Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA; Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA; Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA.
10
Baltoumas FA, Karatzas E, Paez-Espino D, Venetsianou NK, Aplakidou E, Oulas A, Finn RD, Ovchinnikov S, Pafilis E, Kyrpides NC, Pavlopoulos GA. Exploring microbial functional biodiversity at the protein family level-From metagenomic sequence reads to annotated protein clusters. FRONTIERS IN BIOINFORMATICS 2023; 3:1157956. [PMID: 36959975 PMCID: PMC10029925 DOI: 10.3389/fbinf.2023.1157956] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2023] [Accepted: 02/21/2023] [Indexed: 03/06/2023]
Abstract
Metagenomics has enabled accessing the genetic repertoire of natural microbial communities. Metagenome shotgun sequencing has become the method of choice for studying and classifying microorganisms from various environments. To this end, several methods have been developed to process and analyze the sequence data from raw reads to end-products such as predicted protein sequences or families. In this article, we provide a thorough review to simplify such processes and discuss the alternative methodologies that can be followed in order to explore biodiversity at the protein family level. We provide details for analysis tools and we comment on their scalability as well as their advantages and disadvantages. Finally, we report the available data repositories and recommend various approaches for protein family annotation related to phylogenetic distribution, structure prediction and metadata enrichment.
Affiliation(s)
- Fotis A. Baltoumas
  - Institute for Fundamental Biomedical Research, BSRC “Alexander Fleming”, Vari, Greece
- Evangelos Karatzas
  - Institute for Fundamental Biomedical Research, BSRC “Alexander Fleming”, Vari, Greece
- David Paez-Espino
  - Lawrence Berkeley National Laboratory, DOE Joint Genome Institute, Berkeley, CA, United States
- Nefeli K. Venetsianou
  - Institute for Fundamental Biomedical Research, BSRC “Alexander Fleming”, Vari, Greece
- Eleni Aplakidou
  - Institute for Fundamental Biomedical Research, BSRC “Alexander Fleming”, Vari, Greece
- Anastasis Oulas
  - The Cyprus Institute of Neurology and Genetics, Nicosia, Cyprus
- Robert D. Finn
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge, United Kingdom
- Sergey Ovchinnikov
  - John Harvard Distinguished Science Fellowship Program, Harvard University, Cambridge, MA, United States
- Evangelos Pafilis
  - Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Heraklion, Greece
- Nikos C. Kyrpides
  - Lawrence Berkeley National Laboratory, DOE Joint Genome Institute, Berkeley, CA, United States
- Georgios A. Pavlopoulos
  - Institute for Fundamental Biomedical Research, BSRC “Alexander Fleming”, Vari, Greece
  - Center of New Biotechnologies and Precision Medicine, Department of Medicine, School of Health Sciences, National and Kapodistrian University of Athens, Athens, Greece
  - Hellenic Army Academy, Vari, Greece
11
Schafer JW, Porter LL. Evolutionary selection of proteins with two folds. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.01.18.524637. [PMID: 36789442 PMCID: PMC9928049 DOI: 10.1101/2023.01.18.524637] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 01/22/2023]
Abstract
Although most globular proteins fold into a single stable structure (ref. 1), an increasing number have been shown to remodel their secondary and tertiary structures in response to cellular stimuli (ref. 2). State-of-the-art algorithms (refs. 3-5) predict that these fold-switching proteins assume only one stable structure (refs. 6, 7), missing their functionally critical alternative folds. Why these algorithms predict a single fold is unclear, but all of them infer protein structure from coevolved amino acid pairs. Here, we hypothesize that coevolutionary signatures are being missed. Suspecting that over-represented single-fold sequences may be masking these signatures, we developed an approach to search both highly diverse protein superfamilies (composed of single-fold and fold-switching variants) and protein subfamilies with more fold-switching variants. This approach successfully revealed coevolution of amino acid pairs uniquely corresponding to both conformations of 56/58 fold-switching proteins from distinct families. Then, using a set of coevolved amino acid pairs predicted by our approach, we successfully biased AlphaFold2 (ref. 5) to predict two experimentally consistent conformations of a candidate protein with unsolved structure. The discovery of widespread dual-fold coevolution indicates that fold-switching sequences have been preserved by natural selection, implying that their functionalities provide evolutionary advantage and paving the way for predictions of diverse protein structures from single sequences.
Affiliation(s)
- Joseph W. Schafer
  - National Library of Medicine, National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD 20894, USA
- Lauren L. Porter
  - National Library of Medicine, National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD 20894, USA
  - National Heart, Lung, and Blood Institute, Biochemistry and Biophysics Center, National Institutes of Health, Bethesda, MD 20892, USA
12
García-Jacas CR, García-González LA, Martinez-Rios F, Tapia-Contreras IP, Brizuela CA. Handcrafted versus non-handcrafted (self-supervised) features for the classification of antimicrobial peptides: complementary or redundant? Brief Bioinform 2022; 23:6754757. [PMID: 36215083 DOI: 10.1093/bib/bbac428] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/22/2022] [Revised: 08/28/2022] [Accepted: 09/02/2022] [Indexed: 12/14/2022]
Abstract
Antimicrobial peptides (AMPs) have received a great deal of attention given their potential to become a plausible option to fight multi-drug resistant bacteria as well as other pathogens. Quantitative sequence-activity models (QSAMs) have been helpful to discover new AMPs because they allow to explore a large universe of peptide sequences and help reduce the number of wet lab experiments. A main aspect in the building of QSAMs based on shallow learning is to determine an optimal set of protein descriptors (features) required to discriminate between sequences with different antimicrobial activities. These features are generally handcrafted from peptide sequence datasets that are labeled with specific antimicrobial activities. However, recent developments have shown that unsupervised approaches can be used to determine features that outperform human-engineered (handcrafted) features. Thus, knowing which of these two approaches contribute to a better classification of AMPs, it is a fundamental question in order to design more accurate models. Here, we present a systematic and rigorous study to compare both types of features. Experimental outcomes show that non-handcrafted features lead to achieve better performances than handcrafted features. However, the experiments also prove that an improvement in performance is achieved when both types of features are merged. A relevance analysis reveals that non-handcrafted features have higher information content than handcrafted features, while an interaction-based importance analysis reveals that handcrafted features are more important. These findings suggest that there is complementarity between both types of features. Comparisons regarding state-of-the-art deep models show that shallow models yield better performances both when fed with non-handcrafted features alone and when fed with non-handcrafted and handcrafted features together.
Affiliation(s)
- César R García-Jacas
  - Cátedras CONACYT - Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Baja California, México
- Luis A García-González
  - Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Baja California, México
- Issac P Tapia-Contreras
  - Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Baja California, México
- Carlos A Brizuela
  - Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), 22860 Ensenada, Baja California, México
|
13
|
Braberg H, Echeverria I, Kaake RM, Sali A, Krogan NJ. From systems to structure - using genetic data to model protein structures. Nat Rev Genet 2022; 23:342-354. [PMID: 35013567 PMCID: PMC8744059 DOI: 10.1038/s41576-021-00441-w] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Accepted: 11/25/2021] [Indexed: 12/11/2022]
Abstract
Understanding the effects of genetic variation is a fundamental problem in biology that requires methods to analyse both physical and functional consequences of sequence changes at systems-wide and mechanistic scales. To achieve a systems view, protein interaction networks map which proteins physically interact, while genetic interaction networks inform on the phenotypic consequences of perturbing these protein interactions. Until recently, understanding the molecular mechanisms that underlie these interactions often required biophysical methods to determine the structures of the proteins involved. The past decade has seen the emergence of new approaches based on coevolution, deep mutational scanning and genome-scale genetic or chemical-genetic interaction mapping that enable modelling of the structures of individual proteins or protein complexes. Here, we review the emerging use of large-scale genetic datasets and deep learning approaches to model protein structures and their interactions, and discuss the integration of structural data from different sources.
Affiliation(s)
- Hannes Braberg
  - Department of Cellular and Molecular Pharmacology, University of California, San Francisco, San Francisco, CA, USA
  - Quantitative Biosciences Institute, University of California, San Francisco, San Francisco, CA, USA
- Ignacia Echeverria
  - Department of Cellular and Molecular Pharmacology, University of California, San Francisco, San Francisco, CA, USA
  - Quantitative Biosciences Institute, University of California, San Francisco, San Francisco, CA, USA
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, USA
- Robyn M Kaake
  - Department of Cellular and Molecular Pharmacology, University of California, San Francisco, San Francisco, CA, USA
  - Quantitative Biosciences Institute, University of California, San Francisco, San Francisco, CA, USA
  - Gladstone Institutes, San Francisco, CA, USA
- Andrej Sali
  - Quantitative Biosciences Institute, University of California, San Francisco, San Francisco, CA, USA
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, USA
  - Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, CA, USA
- Nevan J Krogan
  - Department of Cellular and Molecular Pharmacology, University of California, San Francisco, San Francisco, CA, USA
  - Quantitative Biosciences Institute, University of California, San Francisco, San Francisco, CA, USA
  - Gladstone Institutes, San Francisco, CA, USA
  - Department of Microbiology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
|
14
|
Dumont ME, Konopka JB. Comparison of Experimental Approaches Used to Determine the Structure and Function of the Class D G Protein-Coupled Yeast α-Factor Receptor. Biomolecules 2022; 12:biom12060761. [PMID: 35740886 PMCID: PMC9220813 DOI: 10.3390/biom12060761] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2022] [Revised: 05/26/2022] [Accepted: 05/27/2022] [Indexed: 02/01/2023]
Abstract
The Saccharomyces cerevisiae α-factor mating pheromone receptor (Ste2p) has been studied as a model for the large medically important family of G protein-coupled receptors. Diverse yeast genetic screens and high-throughput mutagenesis of STE2 identified a large number of loss-of-function, constitutively-active, dominant-negative, and intragenic second-site suppressor mutants as well as mutations that specifically affect pheromone binding. Facile genetic manipulation of Ste2p also aided in targeted biochemical approaches, such as probing the aqueous accessibility of substituted cysteine residues in order to identify the boundaries of the seven transmembrane segments, and the use of cysteine disulfide crosslinking to identify sites of intramolecular contacts in the transmembrane helix bundle of Ste2p and sites of contacts between the monomers in a Ste2p dimer. Recent publication of a series of high-resolution cryo-EM structures of Ste2p in ligand-free, agonist-bound and antagonist-bound states now makes it possible to evaluate the results of these genetic and biochemical strategies, in comparison to three-dimensional structures showing activation-related conformational changes. The results indicate that the genetic and biochemical strategies were generally effective, and provide guidance as to how best to apply these experimental strategies to other proteins. These strategies continue to be useful in defining mechanisms of signal transduction in the context of the available structures and suggest aspects of receptor function beyond what can be discerned from the available structures.
Affiliation(s)
- Mark E. Dumont
  - Department of Biochemistry and Biophysics, University of Rochester Medical Center, Rochester, NY 14642, USA
  - Correspondence: Tel.: +1-585-275-2466
- James B. Konopka
  - Department of Microbiology and Immunology, Stony Brook University, Stony Brook, NY 11794-5222, USA
|
15
|
Sanchez-Pulido L, Ponting CP. Extending the Horizon of Homology Detection with Coevolution-based Structure Prediction. J Mol Biol 2021; 433:167106. [PMID: 34139218 PMCID: PMC8527833 DOI: 10.1016/j.jmb.2021.167106] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 04/28/2021] [Revised: 06/09/2021] [Accepted: 06/09/2021] [Indexed: 12/12/2022]
Abstract
Traditional sequence analysis algorithms fail to identify distant homologies when they lie beyond a detection horizon. In this review, we discuss how co-evolution-based contact and distance prediction methods are pushing back this homology detection horizon, thereby yielding new functional insights and experimentally testable hypotheses. Based on correlated substitutions, these methods divine three-dimensional constraints among amino acids in protein sequences that were previously devoid of all annotated domains and repeats. The new algorithms discern hidden structure in an otherwise featureless sequence landscape. Their revelatory impact promises to be as profound as the use, by archaeologists, of ground-penetrating radar to discern long-hidden, subterranean structures. As examples of this, we describe how triplicated structures reflecting longin domains in MON1A-like proteins, or UVR-like repeats in DISC1, emerge from their predicted contact and distance maps. These methods also help to resolve structures that do not conform to a "beads-on-a-string" model of protein domains. In one such example, we describe CFAP298 whose ubiquitin-like domain was previously challenging to perceive owing to a large sequence insertion within it. More generally, the new algorithms permit an easier appreciation of domain families and folds whose evolution involved structural insertion or rearrangement. As we exemplify with α1-antitrypsin, coevolution-based predicted contacts may also yield insights into protein dynamics and conformational change. This new combination of structure prediction (using innovative co-evolution based methods) and homology inference (using more traditional sequence analysis approaches) shows great promise for bringing into view a sea of evolutionary relationships that had hitherto lain far beyond the horizon of homology detection.
Affiliation(s)
- Luis Sanchez-Pulido
  - Medical Research Council Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh EH4 2XU, UK
- Chris P Ponting
  - Medical Research Council Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh EH4 2XU, UK
|
16
|
Burkholz S, Pokhrel S, Kraemer BR, Mochly-Rosen D, Carback RT, Hodge T, Harris P, Ciotlos S, Wang L, Herst CV, Rubsamen R. Paired SARS-CoV-2 spike protein mutations observed during ongoing SARS-CoV-2 viral transfer from humans to minks and back to humans. Infect Genet Evol 2021; 93:104897. [PMID: 33971305 PMCID: PMC8103774 DOI: 10.1016/j.meegid.2021.104897] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 02/14/2021] [Revised: 05/01/2021] [Accepted: 05/05/2021] [Indexed: 12/24/2022]
Abstract
A mutation analysis of SARS-CoV-2 genomes collected around the world sorted by sequence, date, geographic location, and species has revealed a large number of variants from the initial reference sequence in Wuhan. This analysis also reveals that humans infected with SARS-CoV-2 have infected mink populations in the Netherlands, Denmark, United States, and Canada. In these animals, a small set of mutations in the spike protein receptor binding domain (RBD), often occurring in specific combinations, has transferred back into humans. The viral genomic mutations in minks observed in the Netherlands and Denmark show the potential for new mutations on the SARS-CoV-2 spike protein RBD to be introduced into humans by zoonotic transfer. Our data suggests that close attention to viral transfer from humans to farm animals and pets will be required to prevent build-up of a viral reservoir for potential future zoonotic transfer.
Affiliation(s)
- Scott Burkholz
  - Flow Pharma, Inc., 4829 Galaxy Parkway, Suite K, Warrensville Heights, OH 44128, United States of America
- Suman Pokhrel
  - Department of Chemical and Systems Biology, Stanford University School of Medicine, 291 Campus Drive, Stanford, CA 94305, United States of America
- Benjamin R Kraemer
  - Department of Chemical and Systems Biology, Stanford University School of Medicine, 291 Campus Drive, Stanford, CA 94305, United States of America
- Daria Mochly-Rosen
  - Department of Chemical and Systems Biology, Stanford University School of Medicine, 291 Campus Drive, Stanford, CA 94305, United States of America
- Richard T Carback
  - Flow Pharma, Inc., 4829 Galaxy Parkway, Suite K, Warrensville Heights, OH 44128, United States of America
- Tom Hodge
  - Flow Pharma, Inc., 4829 Galaxy Parkway, Suite K, Warrensville Heights, OH 44128, United States of America
- Paul Harris
  - Department of Medicine, College of Physicians and Surgeons, Columbia University, 630 W 168th St, New York, NY 10032, United States of America
- Serban Ciotlos
  - Flow Pharma, Inc., 4829 Galaxy Parkway, Suite K, Warrensville Heights, OH 44128, United States of America
- Lu Wang
  - Flow Pharma, Inc., 4829 Galaxy Parkway, Suite K, Warrensville Heights, OH 44128, United States of America
- C V Herst
  - Flow Pharma, Inc., 4829 Galaxy Parkway, Suite K, Warrensville Heights, OH 44128, United States of America
- Reid Rubsamen
  - Flow Pharma, Inc., 4829 Galaxy Parkway, Suite K, Warrensville Heights, OH 44128, United States of America; Department of Anesthesiology and Perioperative Medicine, University Hospitals, Cleveland Medical Center, Case Western Reserve School of Medicine, 11100 Euclid Ave, Cleveland, OH 44106, United States of America; Department of Anesthesia, Critical Care and Pain Medicine, Massachusetts General Hospital, 55 Fruit St, Boston, MA 02114, United States of America
|
17
|
Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, Romera-Paredes B, Nikolov S, Jain R, Adler J, Back T, Petersen S, Reiman D, Clancy E, Zielinski M, Steinegger M, Pacholska M, Berghammer T, Bodenstein S, Silver D, Vinyals O, Senior AW, Kavukcuoglu K, Kohli P, Hassabis D. Highly accurate protein structure prediction with AlphaFold. Nature 2021; 596:583-589. [PMID: 34265844 PMCID: PMC8371605 DOI: 10.1038/s41586-021-03819-2] [Citation(s) in RCA: 19341] [Impact Index Per Article: 4835.3] [Received: 05/11/2021] [Accepted: 07/12/2021] [Indexed: 02/07/2023]
Abstract
Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort, the structures of around 100,000 unique proteins have been determined, but this represents a small fraction of the billions of known protein sequences. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence (the structure prediction component of the 'protein folding problem') has been an important open research problem for more than 50 years. Despite recent progress, existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14), demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.
Affiliation(s)
- Martin Steinegger
  - School of Biological Sciences, Seoul National University, Seoul, South Korea
  - Artificial Intelligence Institute, Seoul National University, Seoul, South Korea
|
18
|
Ju F, Zhu J, Shao B, Kong L, Liu TY, Zheng WM, Bu D. CopulaNet: Learning residue co-evolution directly from multiple sequence alignment for protein structure prediction. Nat Commun 2021; 12:2535. [PMID: 33953201 PMCID: PMC8100175 DOI: 10.1038/s41467-021-22869-8] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Received: 10/15/2020] [Accepted: 03/28/2021] [Indexed: 11/29/2022]
Abstract
Residue co-evolution has become the primary principle for estimating inter-residue distances of a protein, which are crucially important for predicting protein structure. Most existing approaches adopt an indirect strategy, i.e., inferring residue co-evolution based on some hand-crafted features, say, a covariance matrix, calculated from multiple sequence alignment (MSA) of target protein. This indirect strategy, however, cannot fully exploit the information carried by MSA. Here, we report an end-to-end deep neural network, CopulaNet, to estimate residue co-evolution directly from MSA. The key elements of CopulaNet include: (i) an encoder to model context-specific mutation for each residue; (ii) an aggregator to model residue co-evolution, and thereafter estimate inter-residue distances. Using CASP13 (the 13th Critical Assessment of Protein Structure Prediction) target proteins as representatives, we demonstrate that CopulaNet can predict protein structure with improved accuracy and efficiency. This study represents a step toward improved end-to-end prediction of inter-residue distances and protein tertiary structures.
Affiliation(s)
- Fusong Ju
  - Key Lab of Intelligent Information Processing, State Key Lab of Computer Architecture, Big-data Academy, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Bin Shao
  - Microsoft Research Asia, Beijing, China
- Lupeng Kong
  - Key Lab of Intelligent Information Processing, State Key Lab of Computer Architecture, Big-data Academy, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Wei-Mou Zheng
  - University of Chinese Academy of Sciences, Beijing, China
  - Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing, China
- Dongbo Bu
  - Key Lab of Intelligent Information Processing, State Key Lab of Computer Architecture, Big-data Academy, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
|
19
|
Fleishman SJ, Horovitz A. Extending the New Generation of Structure Predictors to Account for Dynamics and Allostery. J Mol Biol 2021; 433:167007. [PMID: 33901536 DOI: 10.1016/j.jmb.2021.167007] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 03/14/2021] [Revised: 04/18/2021] [Accepted: 04/19/2021] [Indexed: 10/21/2022]
Abstract
Recent progress in structure-prediction methods that rely on deep learning suggests that the atomic structure of almost any protein may soon be predictable directly from its amino acid sequence. This much-awaited revolution was driven by substantial improvements in the reliability of methods for inferring the spatial distances between amino acid pairs from an analysis of homologous sequences. Improved reliability has been accompanied, however, by a reduced ability to detect amino acid relationships that are not due to direct spatial contacts, such as those that arise from protein dynamics or allostery. Given the central importance of dynamics and allostery to protein activity, we argue that an important future advance would extend modeling beyond predicting a single static structure. Here, we briefly review some of the developments that have led to the remarkable recent achievement in structure prediction and speculate what methods and sources of information may be leveraged in the future to develop a modeling framework that addresses protein dynamics and allostery.
Affiliation(s)
- Sarel J Fleishman
  - Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot 7600001, Israel
- Amnon Horovitz
  - Department of Chemical and Structural Biology, Weizmann Institute of Science, Rehovot 7600001, Israel
|
20
|
Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, Guo D, Ott M, Zitnick CL, Ma J, Fergus R. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci U S A 2021. [PMID: 33876751 DOI: 10.1101/622803] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Indexed: 05/15/2023]
Abstract
In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end, we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multiscale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure and improving state-of-the-art features for long-range contact prediction.
Affiliation(s)
- Alexander Rives
  - Facebook AI Research, New York, NY 10003
  - Department of Computer Science, New York University, New York, NY 10012
- Tom Sercu
  - Facebook AI Research, New York, NY 10003
- Zeming Lin
  - Department of Computer Science, New York University, New York, NY 10012
- Jason Liu
  - Facebook AI Research, New York, NY 10003
- Demi Guo
  - Harvard University, Cambridge, MA 02138
- Myle Ott
  - Facebook AI Research, New York, NY 10003
- Jerry Ma
  - Booth School of Business, University of Chicago, Chicago, IL 60637
  - Yale Law School, New Haven, CT 06511
- Rob Fergus
  - Department of Computer Science, New York University, New York, NY 10012
|
21
|
Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, Guo D, Ott M, Zitnick CL, Ma J, Fergus R. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci U S A 2021; 118:e2016239118. [PMID: 33876751 PMCID: PMC8053943 DOI: 10.1073/pnas.2016239118] [Citation(s) in RCA: 950] [Impact Index Per Article: 237.5] [Indexed: 12/18/2022]
|
22
|
Werner M, Gapsys V, de Groot BL. One Plus One Makes Three: Triangular Coupling of Correlated Amino Acid Mutations. J Phys Chem Lett 2021; 12:3195-3201. [PMID: 33760609 PMCID: PMC8041375 DOI: 10.1021/acs.jpclett.1c00380] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 02/02/2021] [Accepted: 03/17/2021] [Indexed: 06/12/2023]
Abstract
Correlated mutations have played a pivotal role in the recent success in protein fold prediction. Understanding nonadditive effects of mutations is crucial for altering protein structure, as mutations of multiple residues may change protein stability or binding affinity in a manner unforeseen by the investigation of single mutants. While the couplings between amino acids can be inferred from homologous protein sequences, the physical mechanisms underlying these correlations remain elusive. In this work we demonstrate that calculations based on the first principles of statistical mechanics are capable of capturing the effects of nonadditivities in protein mutations. The identified thermodynamic couplings cover the short-range as well as previously unknown long-range correlations. We further explore a set of mutations in staphylococcal nuclease to unravel an intricate interaction pathway underlying the correlations between amino acid mutations.
Affiliation(s)
- Martin Werner
  - Computational Biomolecular Dynamics Group, Max-Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
- Vytautas Gapsys
  - Computational Biomolecular Dynamics Group, Max-Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
- Bert L. de Groot
  - Computational Biomolecular Dynamics Group, Max-Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
|
23
|
Thadani NN, Zhou Q, Reyes Gamas K, Butler S, Bueno C, Schafer NP, Morcos F, Wolynes PG, Suh J. Frustration and Direct-Coupling Analyses to Predict Formation and Function of Adeno-Associated Virus. Biophys J 2020; 120:489-503. [PMID: 33359833 DOI: 10.1016/j.bpj.2020.12.018] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 07/06/2020] [Revised: 11/08/2020] [Accepted: 12/08/2020] [Indexed: 01/03/2023]
Abstract
Adeno-associated virus (AAV) is a promising gene therapy vector because of its efficient gene delivery and relatively mild immunogenicity. To improve delivery target specificity, researchers use combinatorial and rational library design strategies to generate novel AAV capsid variants. These approaches frequently propose high proportions of nonforming or noninfective capsid protein sequences that reduce the effective depth of synthesized vector DNA libraries, thereby raising the discovery cost of novel vectors. We evaluated two computational techniques for their ability to estimate the impact of residue mutations on AAV capsid protein-protein interactions and thus predict changes in vector fitness, reasoning that these approaches might inform the design of functionally enriched AAV libraries and accelerate therapeutic candidate identification. The Frustratometer computes an energy function derived from the energy landscape theory of protein folding. Direct-coupling analysis (DCA) is a statistical framework that captures residue coevolution within proteins. We applied the Frustratometer to select candidate protein residues predicted to favor assembled or disassembled capsid states, then predicted mutation effects at these sites using the Frustratometer and DCA. Capsid mutants were experimentally assessed for changes in virus formation, stability, and transduction ability. The Frustratometer-based metric showed a counterintuitive correlation with viral stability, whereas a DCA-derived metric was highly correlated with virus transduction ability in the small population of residues studied. Our results suggest that coevolutionary models may be able to elucidate complex capsid residue-residue interaction networks essential for viral function, but further study is needed to understand the relationship between protein energy simulations and viral capsid metastability.
Affiliation(s)
- Qin Zhou
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, Texas
- Susan Butler
  - Department of Bioengineering, Rice University, Houston, Texas
- Carlos Bueno
  - Center for Theoretical Biological Physics, Rice University, Houston, Texas; Department of Chemical and Biomolecular Engineering, Rice University, Houston, Texas
- Nicholas P Schafer
  - Center for Theoretical Biological Physics, Rice University, Houston, Texas; Department of Chemistry, Rice University, Houston, Texas
- Faruck Morcos
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, Texas; Center for Systems Biology, University of Texas at Dallas, Richardson, Texas; Department of Bioengineering, University of Texas at Dallas, Richardson, Texas
- Peter G Wolynes
  - Center for Theoretical Biological Physics, Rice University, Houston, Texas; Department of Chemistry, Rice University, Houston, Texas; Department of Biosciences, Rice University, Houston, Texas; Department of Physics, Rice University, Houston, Texas
- Junghae Suh
  - Department of Bioengineering, Rice University, Houston, Texas; Department of Biosciences, Rice University, Houston, Texas; Department of Chemical and Biomolecular Engineering, Rice University, Houston, Texas; Systems, Synthetic, and Physical Biology Program, Rice University, Houston, Texas
|
24
|
Qiu J, Nechaev D, Rost B. Protein-protein and protein-nucleic acid binding residues important for common and rare sequence variants in human. BMC Bioinformatics 2020; 21:452. [PMID: 33050876 PMCID: PMC7557062 DOI: 10.1186/s12859-020-03759-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 05/01/2020] [Accepted: 09/16/2020] [Indexed: 11/10/2022]
Abstract
BACKGROUND Any two unrelated people differ by about 20,000 missense mutations (also referred to as SAVs: Single Amino acid Variants or missense SNV). Many SAVs have been predicted to strongly affect molecular protein function. Common SAVs (> 5% of population) were predicted to have, on average, more effect on molecular protein function than rare SAVs (< 1% of population). We hypothesized that the prevalence of effect in common over rare SAVs might partially be caused by common SAVs more often occurring at interfaces of proteins with other proteins, DNA, or RNA, thereby creating subgroup-specific phenotypes. We analyzed SAVs from 60,706 people through the lens of two prediction methods, one (SNAP2) predicting the effects of SAVs on molecular protein function, the other (ProNA2020) predicting residues in DNA-, RNA- and protein-binding interfaces. RESULTS Three results stood out. Firstly, SAVs predicted to occur at binding interfaces were predicted to more likely affect molecular function than those predicted as not binding (p value < 2.2 × 10^-16). Secondly, for SAVs predicted to occur at binding interfaces, common SAVs were predicted more strongly with effect on protein function than rare SAVs (p value < 2.2 × 10^-16). Restriction to SAVs with experimental annotations confirmed all results, although the resulting subsets were too small to establish statistical significance for any result. Thirdly, the fraction of SAVs predicted at binding interfaces differed significantly between tissues, e.g. urinary bladder tissue was found abundant in SAVs predicted at protein-binding interfaces, and reproductive tissues (ovary, testis, vagina, seminal vesicle and endometrium) in SAVs predicted at DNA-binding interfaces. CONCLUSIONS Overall, the results suggested that residues at protein-, DNA-, and RNA-binding interfaces contributed toward predicting that common SAVs more likely affect molecular function than rare SAVs.
Affiliation(s)
- Jiajun Qiu
- Department of Informatics, I12-Chair of Bioinformatics and Computational Biology, Technical University of Munich (TUM), Boltzmannstrasse 3, 85748, Garching, Munich, Germany; TUM Graduate School, Center of Doctoral Studies in Informatics and Its Applications (CeDoSIA), 85748, Garching, Germany; Biobank of Ninth People's Hospital, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, 200125, China.
- Dmitrii Nechaev
- Department of Informatics, I12-Chair of Bioinformatics and Computational Biology, Technical University of Munich (TUM), Boltzmannstrasse 3, 85748, Garching, Munich, Germany; TUM Graduate School, Center of Doctoral Studies in Informatics and Its Applications (CeDoSIA), 85748, Garching, Germany
- Burkhard Rost
- Department of Informatics, I12-Chair of Bioinformatics and Computational Biology, Technical University of Munich (TUM), Boltzmannstrasse 3, 85748, Garching, Munich, Germany; Institute of Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748, Garching, Munich, Germany; Institute for Food and Plant Sciences (WZW) Weihenstephan, Alte Akademie 8, 85354, Freising, Germany
25
Fantini M, Lisi S, De Los Rios P, Cattaneo A, Pastore A. Protein Structural Information and Evolutionary Landscape by In Vitro Evolution. Mol Biol Evol 2020; 37:1179-1192. [PMID: 31670785 PMCID: PMC7086169 DOI: 10.1093/molbev/msz256] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Indexed: 11/13/2022] Open
Abstract
Protein structure is tightly intertwined with function according to the laws of evolution. Understanding how structure determines function has been the aim of structural biology for decades. Here, we have wondered instead whether it is possible to exploit the function for which a protein was evolutionarily selected to gain information on protein structure and on the landscape explored during the early stages of molecular and natural evolution. To answer this question, we developed a new methodology, which we named CAMELS (Coupling Analysis by Molecular Evolution Library Sequencing), that is able to obtain the in vitro evolution of a protein from an artificial selection based on function. We were able to observe with CAMELS many features of the TEM-1 beta-lactamase local fold exclusively by generating and sequencing large libraries of mutational variants. We demonstrated that we can, whenever a functional phenotypic selection of a protein is available, sketch the structural and evolutionary landscape of a protein without utilizing purified proteins, collecting physical measurements, or relying on the pool of natural protein variants.
Affiliation(s)
- Marco Fantini
- BioSNS Laboratory of Biology, Scuola Normale Superiore (SNS), Pisa, Italy
- Simonetta Lisi
- BioSNS Laboratory of Biology, Scuola Normale Superiore (SNS), Pisa, Italy
- Paolo De Los Rios
- Institute of Physics, School of Basic Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- Antonino Cattaneo
- BioSNS Laboratory of Biology, Scuola Normale Superiore (SNS), Pisa, Italy
- European Brain Research Institute, Rome, Italy
- Annalisa Pastore
- Department of Clinical and Basic Neuroscience, Maurice Wohl Institute, King's College London, London, United Kingdom
- Dementia Research Institute, King's College London, London, United Kingdom
26
Bahrami A, Najafi A, Hashemi M, Miraie-Ashtiani SR. PSSP: Protein splice site prediction algorithm using Bayesian approach. J Bioinform Comput Biol 2020; 17:1950034. [PMID: 32019415 DOI: 10.1142/s0219720019500343] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
This study aimed to introduce an algorithm to identify the intein motifs and blocks involved in protein splicing, and to explore the underlying methods for the detection of protein motifs. Inteins are mobile protein splicing elements capable of self-splicing post-translationally. They exist in viruses and bacteriophages; notwithstanding this broad phylogenetic distribution, all inteins share common structural features. A method was developed to predict inteins in a raw sequence, using a ranking and scoring scheme based on amino acid θ value tables. This method aided in the identification and assessment of patterns characterizing the intein sequences. New intein conserved properties are revealed and the known ones are described and localized. We have computed the θ value of each amino acid at block A positions +1 to +13, block B positions l+13 to l+26 and block G positions -7 to +1 for the three categories. The consensus amino acids thus found are listed at the end of each row. We gave statistics for the distance between the blocks, block A to B, block B to F, and block F to G, with the averages being 66.1, 294, and 10.2 amino acids, respectively. The actual blocks A, B, and G of the one intein found in vacuolar membrane ATPase subunit, a precursor protein, are ranked 1. The results indicate that all of the block sequences found in nine proteins are ranked at the top of the list. The intein sequence is used to search the databases for intein-like proteins. Understanding the functional, structural, and dynamical aspects of inteins is important for intein engineering and the improvement of intein databases.
Affiliation(s)
- Abolfazl Bahrami
- Department of Animal Science, University College of Agriculture and Natural Resources, University of Tehran, Karaj, Islamic Republic of Iran
- Ali Najafi
- Molecular Biology Research Center, Systems Biology and Poisonings Institute, Baqiyatallah University of Medical Sciences, Tehran, Iran
- Mohammadreza Hashemi
- Department of Animal Science, University College of Agriculture and Natural Resources, University of Tehran, Karaj, Islamic Republic of Iran
- Seyed Reza Miraie-Ashtiani
- Department of Animal Science, University College of Agriculture and Natural Resources, University of Tehran, Karaj, Islamic Republic of Iran
27
Machine learning for protein folding and dynamics. Curr Opin Struct Biol 2020; 60:77-84. [DOI: 10.1016/j.sbi.2019.12.005] [Citation(s) in RCA: 60] [Impact Index Per Article: 12.0] [Received: 09/13/2019] [Revised: 11/21/2019] [Accepted: 12/05/2019] [Indexed: 12/17/2022]
28
Affiliation(s)
- Melissa Chiasson
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Douglas M Fowler
- Department of Genome Sciences, University of Washington, Seattle, WA, USA; Department of Bioengineering, University of Washington, Seattle, WA, USA; Genetic Networks Program, CIFAR, Toronto, Ontario, Canada.
29
Senior AW, Evans R, Jumper J, Kirkpatrick J, Sifre L, Green T, Qin C, Žídek A, Nelson AWR, Bridgland A, Penedones H, Petersen S, Simonyan K, Crossan S, Kohli P, Jones DT, Silver D, Kavukcuoglu K, Hassabis D. Improved protein structure prediction using potentials from deep learning. Nature 2020; 577:706-710. [PMID: 31942072 DOI: 10.1038/s41586-019-1923-7] [Citation(s) in RCA: 1501] [Impact Index Per Article: 300.2] [Received: 04/02/2019] [Accepted: 12/10/2019] [Indexed: 12/16/2022]
Abstract
Protein structure prediction can be used to determine the three-dimensional shape of a protein from its amino acid sequence. This problem is of fundamental importance as the structure of a protein largely determines its function; however, protein structures can be difficult to determine experimentally. Considerable progress has recently been made by leveraging genetic information. It is possible to infer which amino acid residues are in contact by analysing covariation in homologous sequences, which aids in the prediction of protein structures. Here we show that we can train a neural network to make accurate predictions of the distances between pairs of residues, which convey more information about the structure than contact predictions. Using this information, we construct a potential of mean force that can accurately describe the shape of a protein. We find that the resulting potential can be optimized by a simple gradient descent algorithm to generate structures without complex sampling procedures. The resulting system, named AlphaFold, achieves high accuracy, even for sequences with fewer homologous sequences. In the recent Critical Assessment of Protein Structure Prediction (CASP13), a blind assessment of the state of the field, AlphaFold created high-accuracy structures (with template modelling (TM) scores of 0.7 or higher) for 24 out of 43 free modelling domains, whereas the next best method, which used sampling and contact information, achieved such accuracy for only 14 out of 43 domains. AlphaFold represents a considerable advance in protein-structure prediction. We expect this increased accuracy to enable insights into the function and malfunction of proteins, especially in cases for which no structures for homologous proteins have been experimentally determined.
Affiliation(s)
- David T Jones
- The Francis Crick Institute, London, UK; University College London, London, UK
30
Grazhdankin E, Stepniewski M, Xhaard H. Modeling membrane proteins: The importance of cysteine amino-acids. J Struct Biol 2020; 209:107400. [DOI: 10.1016/j.jsb.2019.10.002] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 04/09/2019] [Revised: 09/11/2019] [Accepted: 10/03/2019] [Indexed: 12/14/2022]
31
He Z, Dong T, Wu W, Chen W, Liu X, Li L. Evolutionary Rates and Phylogeographical Analysis of Odontoglossum Ringspot Virus Based on the 166 Coat Protein Gene Sequences. THE PLANT PATHOLOGY JOURNAL 2019; 35:498-507. [PMID: 31632224 PMCID: PMC6788419 DOI: 10.5423/ppj.oa.04.2019.0113] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 04/22/2019] [Revised: 07/10/2019] [Accepted: 08/19/2019] [Indexed: 06/10/2023]
Abstract
Odontoglossum ringspot virus (ORSV) is a member of the genus Tobamovirus. It is one of the most prevalent viruses infecting orchids worldwide. Earlier studies reported the genetic variability of ORSV isolates from Korea and China. However, the evolutionary rate, timescale, and phylogeography of ORSV were unclear. Twenty-one coat protein (CP) gene sequences of ORSV were determined in this study and used together with 145 CP sequences obtained from GenBank to infer the genetic diversity, evolutionary rate, timescale and migration of ORSV populations. The evolutionary rate of ORSV populations was 1.25 × 10^-3 nucleotides/site/y. The most recent common ancestor dates to 30 years ago (95% confidence interval, 26-40). Based on the CP gene, ORSV migrated from mainland China and South Korea to Taiwan island, Germany, Australia, Singapore, and Indonesia, and it also circulated within east Asia. Our study is the first attempt to evaluate the evolutionary rates, timescales and migration dynamics of ORSV.
Affiliation(s)
- Liangjun Li
- Corresponding author: Phone +86-514-87979394, Fax +86-514-87347537
32
Szurmant H. Evolutionary couplings of amino acid residues reveal structure and function of bacterial signaling proteins. Mol Microbiol 2019; 112:432-437. [PMID: 31102561 DOI: 10.1111/mmi.14282] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Accepted: 05/15/2019] [Indexed: 12/12/2022]
Abstract
The genomic era along with major advances in high-throughput sequencing technology has led to a rapid expansion of the genomic and consequently the protein sequence space. Bacterial extracytoplasmic function sigma factors have emerged as an important group of signaling proteins in bacteria involved in many regulatory decisions, most notably the adaptation to cell envelope stress. Their wide prevalence and amplification among bacterial genomes has led to sub-group classification and the realization of diverse signaling mechanisms. Mathematical frameworks have been developed to utilize extensive protein sequence alignments to extract co-evolutionary signals of interaction. This has proven useful in a number of different biological fields, including de novo structure prediction, protein-protein partner identification and the elucidation of alternative protein conformations for signal proteins, to name a few. The mathematical tools, commonly referred to under the name 'Direct Coupling Analysis' have now been applied to deduce molecular mechanisms of activation for sub-groups of extracytoplasmic sigma factors adding to previous successes on bacterial two-component signaling proteins. The amplification of signal transduction protein genes in bacterial genomes made them the first to be amenable to this approach but the sequences are available now to aid the molecular microbiologist, no matter their protein pathway of interest.
Affiliation(s)
- Hendrik Szurmant
- Basic Medical Science, College of Osteopathic Medicine of the Pacific, Western University of Health Sciences, Pomona, CA, USA
33
Schmiedel JM, Lehner B. Determining protein structures using deep mutagenesis. Nat Genet 2019; 51:1177-1186. [PMID: 31209395 PMCID: PMC7610650 DOI: 10.1038/s41588-019-0431-x] [Citation(s) in RCA: 96] [Impact Index Per Article: 16.0] [Received: 10/09/2018] [Accepted: 04/29/2019] [Indexed: 12/12/2022]
Abstract
Determining the three-dimensional structures of macromolecules is a major goal of biological research, because of the close relationship between structure and function; however, thousands of protein domains still have unknown structures. Structure determination usually relies on physical techniques including X-ray crystallography, NMR spectroscopy and cryo-electron microscopy. Here we present a method that allows the high-resolution three-dimensional backbone structure of a biological macromolecule to be determined only from measurements of the activity of mutant variants of the molecule. This genetic approach to structure determination relies on the quantification of genetic interactions (epistasis) between mutations and the discrimination of direct from indirect interactions. This provides an alternative experimental strategy for structure determination, with the potential to reveal functional and in vivo structures.
Affiliation(s)
- Jörn M Schmiedel
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Ben Lehner
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain.
- Universitat Pompeu Fabra (UPF), Barcelona, Spain.
- ICREA, Barcelona, Spain.
34
Goedert M. Aaron Klug and the study of Alzheimer's disease. Alzheimers Dement 2019; 15:859-861. [PMID: 31010788 DOI: 10.1016/j.jalz.2019.03.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
35
Chisholm PJ, Busch JW, Crowder DW. Effects of life history and ecology on virus evolutionary potential. Virus Res 2019; 265:1-9. [PMID: 30831177 DOI: 10.1016/j.virusres.2019.02.018] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 11/19/2018] [Revised: 02/27/2019] [Accepted: 02/28/2019] [Indexed: 11/28/2022]
Abstract
The life history traits of viruses pose many consequences for viral population structure. In turn, population structure may influence the evolutionary trajectory of a virus. Here we review factors that affect the evolutionary potential of viruses, including rates of mutation and recombination, bottlenecks, selection pressure, and ecological factors such as the requirement for hosts and vectors. Mutation, while supplying a pool of raw genetic material, also results in the generation of numerous unfit mutants. The infection of multiple host species may expand a virus' ecological niche, although it may come at a cost to genetic diversity. Vector-borne viruses often experience a diminished frequency of positive selection and exhibit little diversity, and resistance against vector-borne viruses may thus be more durable than against non-vectored viruses. Evidence indicates that adaptation to a vector is more evolutionarily difficult than adaptation to a host. Overall, a better understanding of how various factors influence viral dynamics in both plant and animal pathosystems will lead to more effective anti-viral treatments and countermeasures.
Affiliation(s)
- Paul J Chisholm
- Department of Entomology, Washington State University, 166 FSHN Building, Pullman, WA, 99164, USA.
- Jeremiah W Busch
- School of Biological Sciences, Washington State University, PO Box 644236, Pullman, WA, 99164, USA.
- David W Crowder
- Department of Entomology, Washington State University, 166 FSHN Building, Pullman, WA, 99164, USA.
36
Wuyun Q, Zheng W, Peng Z, Yang J. A large-scale comparative assessment of methods for residue-residue contact prediction. Brief Bioinform 2019; 19:219-230. [PMID: 27802931 DOI: 10.1093/bib/bbw106] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 07/19/2016] [Indexed: 11/14/2022] Open
Abstract
Sequence-based prediction of residue-residue contacts in proteins is becoming increasingly important for improving protein structure prediction in the big data era. In this study, we performed a large-scale comparative assessment of 15 locally installed contact predictors. To assess these methods, we collected a large data set consisting of 680 nonredundant proteins covering different structural classes and target difficulties. We investigated a wide range of factors that may influence the precision of contact prediction, including target difficulty, structural class, the alignment depth and the distribution of contact pairs in a protein structure. We found that: (1) the machine learning-based methods outperform the direct-coupling-based methods for short-range contact prediction, while the latter are significantly better for long-range contact prediction. The consensus-based methods, which combine machine learning and direct-coupling methods, perform the best. (2) The target difficulty does not have a clear influence on the machine learning-based methods, while it does affect the direct-coupling and consensus-based methods significantly. (3) The alignment depth has a relatively weak effect on the machine learning-based methods. However, for the direct-coupling-based methods and consensus-based methods, the predicted contacts for targets with deeper alignments tend to be more accurate. (4) All methods perform relatively better on β and α + β proteins than on α proteins. (5) Residues buried in the core of the protein structure are more prone to be in contact than residues on the surface (22 versus 6%). We believe these are useful results for guiding the future development of new approaches to contact prediction.
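The precision comparison by contact range described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function name `contact_precision`, the top-L selection rule, and the separation cutoffs (24 or more residues for long-range, 11 or fewer for short-range) are assumptions chosen for illustration, and the toy predictions are made up.

```python
def contact_precision(scored_pairs, true_contacts, seq_len,
                      min_sep=24, max_sep=None, top_fraction=1.0):
    """Precision of the top (seq_len * top_fraction) predicted pairs
    whose sequence separation |i - j| lies in [min_sep, max_sep].

    scored_pairs  : iterable of (i, j, score) predictions (hypothetical format)
    true_contacts : set of (i, j) tuples with i < j taken from a known structure
    """
    eligible = [
        (i, j, score) for i, j, score in scored_pairs
        if abs(i - j) >= min_sep and (max_sep is None or abs(i - j) <= max_sep)
    ]
    # Rank the eligible pairs from most to least confident.
    eligible.sort(key=lambda pair: pair[2], reverse=True)
    n_top = max(1, int(seq_len * top_fraction))
    top = eligible[:n_top]
    if not top:
        return 0.0
    hits = sum(1 for i, j, _ in top if (min(i, j), max(i, j)) in true_contacts)
    return hits / len(top)


# Toy data (invented for illustration only).
predicted = [(0, 30, 0.9), (2, 40, 0.8), (5, 8, 0.95), (1, 50, 0.1)]
observed = {(0, 30), (5, 8)}

long_range = contact_precision(predicted, observed, seq_len=4, top_fraction=0.5)
short_range = contact_precision(predicted, observed, seq_len=4,
                                min_sep=1, max_sep=11, top_fraction=0.5)
```

Evaluating the same predictor with different `min_sep` and `max_sep` windows is one simple way to separate short-range from long-range performance, which is the distinction the abstract draws between machine learning-based and direct-coupling-based methods.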
Affiliation(s)
- Qiqige Wuyun
- School of Mathematical Sciences, Nankai University, Tianjin, China
- Wei Zheng
- School of Mathematical Sciences, Nankai University, Tianjin, China
- Zhenling Peng
- Center for Applied Mathematics, Tianjin University, Tianjin, China
- Jianyi Yang
- School of Mathematical Sciences, Nankai University, Tianjin, China
37
Jarmolinska AI, Zhou Q, Sulkowska JI, Morcos F. DCA-MOL: A PyMOL Plugin To Analyze Direct Evolutionary Couplings. J Chem Inf Model 2019; 59:625-629. [DOI: 10.1021/acs.jcim.8b00690] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 01/06/2023]
Affiliation(s)
- Aleksandra I. Jarmolinska
- Centre of New Technologies, University of Warsaw, Banacha 2c, 02-097, Warsaw, Poland
- College of Inter-Faculty Individual Studies in Mathematics and Natural Sciences, Banacha 2c, 02-097 Warsaw, Poland
- Qin Zhou
- Department of Biological Sciences, University of Texas at Dallas, Richardson, Texas 75080, United States
- Joanna I. Sulkowska
- Centre of New Technologies, University of Warsaw, Banacha 2c, 02-097, Warsaw, Poland
- Faculty of Chemistry, University of Warsaw, Pasteura 1, 02-093 Warsaw, Poland
- Faruck Morcos
- Department of Biological Sciences, University of Texas at Dallas, Richardson, Texas 75080, United States
- Center for Systems Biology, University of Texas at Dallas, Richardson, Texas 75080, United States
38
Koehl P, Orland H, Delarue M. Numerical Encodings of Amino Acids in Multivariate Gaussian Modeling of Protein Multiple Sequence Alignments. Molecules 2018; 24:E104. [PMID: 30597916 PMCID: PMC6337344 DOI: 10.3390/molecules24010104] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/02/2018] [Revised: 12/21/2018] [Accepted: 12/24/2018] [Indexed: 11/17/2022] Open
Abstract
Residues in proteins that are in close spatial proximity are more prone to covary as their interactions are likely to be preserved due to structural and evolutionary constraints. If we can detect and quantify such covariation, physical contacts may then be predicted in the structure of a protein solely from the sequences that decorate it. To carry out such predictions, and following the work of others, we have implemented a multivariate Gaussian model to analyze correlation in multiple sequence alignments. We have explored and tested several numerical encodings of amino acids within this model. We have shown that 1D encodings based on amino acid biochemical and biophysical properties, as well as higher dimensional encodings computed from the principal components of experimentally derived mutation/substitution matrices, do not perform as well as a simple 20-dimensional encoding with each amino acid represented by a vector of one along its own dimension and zero elsewhere. The optimum obtained from representations based on substitution matrices is reached by using 10 to 12 principal components; the corresponding performance is less than the performance obtained with the 20-dimensional binary encoding. We also highlight the importance of the prior when constructing the multivariate Gaussian model of a multiple sequence alignment.
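The 20-dimensional binary encoding and the role of a prior described in this abstract can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the identity-shrinkage prior, its weight `prior_weight`, the Frobenius-norm coupling score, and the function names `one_hot_msa` and `coupling_scores` are all hypothetical choices made for this example.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: k for k, aa in enumerate(ALPHABET)}


def one_hot_msa(msa):
    """Encode each residue as a 20-dimensional vector with a single 1.

    msa is a list of equal-length aligned sequences. Gaps and unknown
    symbols are left as all-zero vectors (an assumption of this sketch).
    """
    n_seq, n_pos = len(msa), len(msa[0])
    x = np.zeros((n_seq, n_pos * 20))
    for s, seq in enumerate(msa):
        for p, aa in enumerate(seq):
            k = AA_INDEX.get(aa)
            if k is not None:
                x[s, p * 20 + k] = 1.0
    return x


def coupling_scores(msa, prior_weight=0.5):
    """Score position pairs from the inverse covariance of the encoded MSA."""
    x = one_hot_msa(msa)
    n_pos = len(msa[0])
    cov = np.cov(x, rowvar=False, bias=True)
    # Assumed prior: shrink the empirical covariance toward the identity,
    # which also guarantees that the matrix is invertible.
    shrunk = (1.0 - prior_weight) * cov + prior_weight * np.eye(cov.shape[0])
    precision = np.linalg.inv(shrunk)
    scores = np.zeros((n_pos, n_pos))
    for i in range(n_pos):
        for j in range(i + 1, n_pos):
            block = precision[i * 20:(i + 1) * 20, j * 20:(j + 1) * 20]
            # Frobenius norm of the 20 x 20 coupling block for positions i, j.
            scores[i, j] = scores[j, i] = np.linalg.norm(block)
    return scores


# Toy alignment: columns 0 and 1 covary perfectly, column 2 is independent.
toy_msa = ["ACA", "ACD", "DEA", "DED"]
toy_scores = coupling_scores(toy_msa)
```

In the toy alignment the score for the covarying pair (0, 1) exceeds the score for the independent pair (0, 2). Changing `prior_weight` changes how strongly the empirical covariance is regularized, which is one simple way to see why the choice of prior matters for such a model.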
Affiliation(s)
- Patrice Koehl
- Department of Computer Science, University of California, Davis, CA 95211, USA.
- Henri Orland
- Institut de Physique Théorique, CEA Saclay, 91191 Gif-sur-Yvette CEDEX, France.
- Marc Delarue
- Department of Structural Biology and Chemistry and UMR 3528 du CNRS, Institut Pasteur, 75015 Paris, France.
39
Zhou PY, Sze-To A, Wong AKC. Discovery and disentanglement of aligned residue associations from aligned pattern clusters to reveal subgroup characteristics. BMC Med Genomics 2018; 11:103. [PMID: 30453949 PMCID: PMC6245498 DOI: 10.1186/s12920-018-0417-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/23/2022] Open
Abstract
Background: A protein family has similar and diverse functions that are locally conserved. An aligned pattern cluster (APC) can reflect the conserved functionality. Discovering aligned residue associations (ARAs) in APCs can reveal subtle inner working characteristics of conserved regions of protein families. However, ARAs corresponding to different functionalities/subgroups/classes could be entangled because of subtle multiple entwined factors. Methods: To discover and disentangle patterns from mixed-mode datasets, such as APCs when the residues are replaced by their fundamental biochemical properties list, this paper presents a novel method, Extended Aligned Residual Association Discovery and Disentanglement (E-ARADD). E-ARADD discretizes the numerical dataset to transform the mixed-mode dataset into an event-value dataset, constructs an ARA Frequency Matrix and then converts it into an adjusted Statistical Residual (SR) Vector Space (SRV) capturing statistical deviation from randomness. By applying Principal Component (PC) Decomposition on SRV, PCs ranked by their variance are obtained. Finally, the disentangled ARAs are discovered when the projections on a PC are re-projected to a vector space with the same basis vectors as SRV. Results: Experiments on synthetic, cytochrome c and class A scavenger data have shown that E-ARADD can a) disentangle the entwined ARAs in APCs (with residues or biochemical properties), b) reveal subtle AR clusters relating to classes, subtle subgroups or specific functionalities. Conclusions: E-ARADD can discover and disentangle ARs and ARAs entangled in functionality and location of protein families to reveal functional subgroups and subgroup characteristics of biological conserved regions. Experimental results on synthetic data provide the proof-of-concept validation of the successful disentanglement that reveals class-associated ARAs with or without class labels as input. Experiments on cytochrome c data proved the efficacy of E-ARADD in handling both types of residue data. Our novel methodology is not only able to discover and disentangle ARs and ARAs in specific statistical/functional (PCs and RSRVs) spaces, but also their locations in the protein family functional domains. The success of E-ARADD shows its great potential for proteomic research, drug discovery and precision and personalized genetic medicine.
Affiliation(s)
- Pei-Yuan Zhou
- Systems Design Engineering, University of Waterloo, Waterloo, ON, Canada
- Antonio Sze-To
- Systems Design Engineering, University of Waterloo, Waterloo, ON, Canada
- Andrew K C Wong
- Systems Design Engineering, University of Waterloo, Waterloo, ON, Canada.
40
Lomonossoff GP, Wege C. TMV Particles: The Journey From Fundamental Studies to Bionanotechnology Applications. Adv Virus Res 2018; 102:149-176. [PMID: 30266172 PMCID: PMC7112118 DOI: 10.1016/bs.aivir.2018.06.003] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Indexed: 12/14/2022]
Abstract
Ever since its initial characterization in the 19th century, tobacco mosaic virus (TMV) has played a prominent role in the development of modern virology and molecular biology. In particular, research on the three-dimensional structure of the virus particles and the mechanism by which these assemble from their constituent protein and RNA components has made TMV a paradigm for our current view of the morphogenesis of self-assembling structures, including viral particles. More recently, this knowledge has been applied to the development of novel reagents and structures for applications in biomedicine and bionanotechnology. In this article, we review how fundamental science has led to TMV being at the vanguard of these new technologies.
Affiliation(s)
- Christina Wege
- Department of Molecular Biology and Plant Virology, Institute of Biomaterials and Biomolecular Systems, University of Stuttgart, Stuttgart, Germany
41
Kassem MM, Christoffersen LB, Cavalli A, Lindorff-Larsen K. Enhancing coevolution-based contact prediction by imposing structural self-consistency of the contacts. Sci Rep 2018; 8:11112. [PMID: 30042380 PMCID: PMC6057941 DOI: 10.1038/s41598-018-29357-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 03/26/2018] [Accepted: 07/10/2018] [Indexed: 11/29/2022] Open
Abstract
Based on the development of new algorithms and growth of sequence databases, it has recently become possible to build robust higher-order sequence models based on sets of aligned protein sequences. Such models have proven useful in de novo structure prediction, where the sequence models are used to find pairs of residues that co-vary during evolution, and hence are likely to be in spatial proximity in the native protein. The accuracy of these algorithms, however, drops dramatically when the number of sequences in the alignment is small. We have developed a method that we termed CE-YAPP (CoEvolution-YAPP), which is based on YAPP (Yet Another Peak Processor), a method that has been shown to solve a similar problem in NMR spectroscopy. By simultaneously performing structure prediction and contact assignment, CE-YAPP uses structural self-consistency as a filter to remove false positive contacts. Furthermore, CE-YAPP solves another problem, namely how many contacts to choose from the ordered list of covarying amino acid pairs. We show that CE-YAPP consistently improves contact prediction from multiple sequence alignments, in particular for proteins that are difficult targets. We further show that the structures determined from CE-YAPP are also in better agreement with those determined using traditional methods in structural biology.
Affiliation(s)
- Maher M Kassem
  - Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Ole Maaløes Vej 5, Copenhagen, DK, 2200, Denmark
- Lars B Christoffersen
  - Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Ole Maaløes Vej 5, Copenhagen, DK, 2200, Denmark
- Andrea Cavalli
  - Institute for Research in Biomedicine, Università della Svizzera italiana (USI), Via Vincenzo Vela 6, 6500, Bellinzona, Switzerland
- Kresten Lindorff-Larsen
  - Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Ole Maaløes Vej 5, Copenhagen, DK, 2200, Denmark
|
42
|
Delarue M, Koehl P. Combined approaches from physics, statistics, and computer science for ab initio protein structure prediction: ex unitate vires (unity is strength)? F1000Res 2018; 7. [PMID: 30079234 PMCID: PMC6058471 DOI: 10.12688/f1000research.14870.1] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Accepted: 07/19/2018] [Indexed: 11/20/2022] Open
Abstract
Connecting the dots among the amino acid sequence of a protein, its structure, and its function remains a central theme in molecular biology, as it would have many applications in the treatment of illnesses related to misfolding or protein instability. As a result of high-throughput sequencing methods, biologists currently live in a protein sequence-rich world. However, our knowledge of protein structure based on experimental data remains comparatively limited. As a consequence, protein structure prediction has established itself as a very active field of research to fill in this gap. This field, once thought to be reserved for theoretical biophysicists, is constantly reinventing itself, borrowing ideas informed by an ever-increasing assembly of scientific domains, from biology, chemistry, (statistical) physics, mathematics, computer science, statistics, bioinformatics, and more recently data sciences. We review the recent progress arising from this integration of knowledge, from the development of specific computer architecture to allow for longer timescales in physics-based simulations of protein folding to the recent advances in predicting contacts in proteins based on detection of coevolution using very large data sets of aligned protein sequences.
Affiliation(s)
- Marc Delarue
  - Unité Dynamique Structurale des Macromolécules, Institut Pasteur, and UMR 3528 du CNRS, Paris, France
- Patrice Koehl
  - Department of Computer Science, Genome Center, University of California, Davis, Davis, California, USA
|
43
|
Gil N, Fiser A. Identifying functionally informative evolutionary sequence profiles. Bioinformatics 2018; 34:1278-1286. [PMID: 29211823 PMCID: PMC5905606 DOI: 10.1093/bioinformatics/btx779] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 06/28/2017] [Accepted: 11/29/2017] [Indexed: 01/06/2023] Open
Abstract
Motivation: Multiple sequence alignments (MSAs) can provide essential input to many bioinformatics applications, including protein structure prediction and functional annotation. However, the optimal selection of sequences to obtain biologically informative MSAs for such purposes is poorly explored, and has traditionally been performed manually. Results: We present Selection of Alignment by Maximal Mutual Information (SAMMI), an automated, sequence-based approach to objectively select an optimal MSA from a large set of alternatives sampled from a general sequence database search. The hypothesis of this approach is that the mutual information among MSA columns will be maximal for those MSAs that contain the most diverse set possible of the most structurally and functionally homogeneous protein sequences. SAMMI was tested to select MSAs for functional site residue prediction by analysis of conservation patterns on a set of 435 proteins obtained from protein-ligand (peptides, nucleic acids and small substrates) and protein-protein interaction databases. Availability and implementation: A freely accessible program, including source code, implementing SAMMI is available at https://github.com/nelsongil92/SAMMI.git. Contact: andras.fiser@einstein.yu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Nelson Gil
  - Department of Systems & Computational Biology, Albert Einstein College of Medicine, Bronx, NY 10461, USA
- Andras Fiser
  - Department of Systems & Computational Biology, Albert Einstein College of Medicine, Bronx, NY 10461, USA
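
As a rough illustration of the quantity the Gil and Fiser abstract refers to, the sketch below computes the mutual information between two columns of an alignment and averages it over all column pairs. It is not the SAMMI implementation: the abstract does not state how SAMMI scores or samples alignments, so the function names, the plain (uncorrected) mutual information, the treatment of gaps as ordinary symbols, and the averaging over pairs are all assumptions made for this example.

```python
from collections import Counter
from math import log2


def column_mutual_information(msa, i, j):
    """Mutual information, in bits, between columns i and j of an MSA.

    `msa` is a list of equal-length aligned sequences (hypothetical input
    format). Gap characters are treated like any other symbol, which is a
    simplification.
    """
    n = len(msa)
    pair_counts = Counter((seq[i], seq[j]) for seq in msa)
    counts_i = Counter(seq[i] for seq in msa)
    counts_j = Counter(seq[j] for seq in msa)
    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with raw counts
        mi += p_ab * log2(count * n / (counts_i[a] * counts_j[b]))
    return mi


def mean_pairwise_mi(msa):
    """Average mutual information over all distinct column pairs (assumed score)."""
    length = len(msa[0])
    pairs = [(i, j) for i in range(length) for j in range(i + 1, length)]
    if not pairs:
        return 0.0
    return sum(column_mutual_information(msa, i, j) for i, j in pairs) / len(pairs)


if __name__ == "__main__":
    toy_msa = ["AC", "AC", "GT", "GT"]  # two perfectly covarying columns
    print(mean_pairwise_mi(toy_msa))
```

Under these assumptions, candidate alignments could be ranked by `mean_pairwise_mi`; small alignments inflate raw mutual information, so a practical score would likely need a finite-sample correction.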
|
44
|
Luria N, Smith E, Sela N, Lachman O, Bekelman I, Koren A, Dombrovsky A. A local strain of Paprika mild mottle virus breaks L3 resistance in peppers and is accelerated in Tomato brown rugose fruit virus-infected Tm-2²-resistant tomatoes. Virus Genes 2018; 54:280-289. [DOI: 10.1007/s11262-018-1539-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 10/03/2017] [Accepted: 02/02/2018] [Indexed: 10/18/2022]
|
45
|
Zhou PY, Lee ESA, Sze-To A, Wong AKC. Revealing Subtle Functional Subgroups in Class A Scavenger Receptors by Pattern Discovery and Disentanglement of Aligned Pattern Clusters. Proteomes 2018; 6:proteomes6010010. [PMID: 29419792 PMCID: PMC5874769 DOI: 10.3390/proteomes6010010] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 11/24/2017] [Revised: 02/01/2018] [Accepted: 02/01/2018] [Indexed: 11/16/2022] Open
Abstract
A protein family has similar and diverse functions locally conserved as aligned sequence segments. Further discovering their association patterns could reveal subtle family subgroup characteristics. Since aligned residue associations (ARAs) in Aligned Pattern Clusters (APCs) are complex and intertwined due to entangled function, factors, and variance in the source environment, we have recently developed a novel method, Aligned Residue Association Discovery and Disentanglement (ARADD), to solve this problem. ARADD first obtains an ARA Frequency Matrix from an APC and converts it to an adjusted statistical residual vector space (SRV). It then disentangles the SRV into Principal Components (PCs) and re-projects their vectors to an SRV to reveal succinct orthogonal AR groups. In this study, we applied ARADD to class A scavenger receptors (SR-A), a subclass of a diverse protein family binding to modified lipoproteins with diverse biological functionalities not explicitly known. Our experimental results demonstrated that ARADD can unveil subtle subgroups in sequence segments with diverse functionality and highly variable sequence lengths. We also demonstrated that the ARAs captured in a Position Weight Matrix or an APC were entangled in biological function and domain location but disentangled by ARADD to reveal different subclasses without knowing their actual occurrence positions.
Affiliation(s)
- Pei-Yuan Zhou
  - VaryWave Technology Co., Ltd., 538A, Core Building 2, Hong Kong Science Park, Shatin, NT, Hong Kong
- En-Shiun Annie Lee
  - VerticalScope Inc., 111 Peter Street, Suite 900, Toronto, ON M5V 2H1, Canada
- Antonio Sze-To
  - Systems Design Engineering, 5th, 6th Floor, 200 University Avenue West, University of Waterloo, Waterloo, ON N2L 3G1, Canada
- Andrew K C Wong
  - Systems Design Engineering, 5th, 6th Floor, 200 University Avenue West, University of Waterloo, Waterloo, ON N2L 3G1, Canada
|
46
|
Abstract
Covariance analysis of protein sequence alignments uses coevolving pairs of sequence positions to predict features of protein structure and function. However, current methods ignore the phylogenetic relationships between sequences, potentially corrupting the identification of covarying positions. Here, we use random matrix theory to demonstrate the existence of a power law tail that distinguishes the spectrum of covariance caused by phylogeny from that caused by structural interactions. The power law is essentially independent of the phylogenetic tree topology, depending on just two parameters: the sequence length and the average branch length. We demonstrate that these power law tails are ubiquitous in the large protein sequence alignments used to predict contacts in 3D structure, as predicted by our theory. This suggests that to decouple phylogenetic effects from the interactions between sequence-distal sites that control biological function, it is necessary to remove or down-weight the eigenvectors of the covariance matrix with the largest eigenvalues. We confirm that truncating these eigenvectors improves contact prediction.
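
The truncation step named in the abstract above can be sketched as follows. This is only a minimal illustration of removing the largest-eigenvalue modes from a symmetric covariance matrix; it assumes NumPy is available, and the function name and the parameter `k` are hypothetical. How many modes to drop, and how the covariance matrix is built from an alignment, are not specified in the abstract and are left to the caller here.

```python
import numpy as np


def truncate_top_modes(cov, k):
    """Return a copy of symmetric matrix `cov` with its k largest-eigenvalue
    modes removed (their eigenvalues set to zero).

    `k` is a free parameter in this sketch; the abstract does not say how
    the number of phylogeny-dominated modes is chosen.
    """
    cov = np.asarray(cov, dtype=float)
    if k <= 0:
        return cov.copy()
    # eigh returns eigenvalues in ascending order for symmetric matrices
    eigvals, eigvecs = np.linalg.eigh(cov)
    kept = eigvals.copy()
    kept[-k:] = 0.0
    # Rebuild the matrix from the remaining modes: V diag(kept) V^T
    return (eigvecs * kept) @ eigvecs.T


if __name__ == "__main__":
    toy_cov = np.diag([5.0, 1.0, 0.5])
    print(truncate_top_modes(toy_cov, 1))
```

Down-weighting rather than removing the top modes would be a small variation: scale `kept[-k:]` by a factor between 0 and 1 instead of setting it to zero.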
|
47
|
Abstract
Sequence and structure space are nowadays sufficiently large that we can use computational methods to model the structure of proteins based on sequence similarity alone. Not only useful as a standalone tool, homology modelling has also had a transformative effect on the ease with which we can solve crystal structures and electron density maps. Another technique, molecular dynamics, aims to model protein structures from first principles and, thanks to increases in computational power, is slowly becoming a viable tool for studying protein complexes. Finally, the prediction of protein assembly pathways from three-dimensional structures of complexes is also now becoming possible.
Affiliation(s)
- Jonathan N Wells
  - MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, UK
- L Therese Bergendahl
  - MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, UK
- Joseph A Marsh
  - MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, UK
|
48
|
Fantini M, Malinverni D, De Los Rios P, Pastore A. New Techniques for Ancient Proteins: Direct Coupling Analysis Applied on Proteins Involved in Iron Sulfur Cluster Biogenesis. Front Mol Biosci 2017; 4:40. [PMID: 28664160 PMCID: PMC5471300 DOI: 10.3389/fmolb.2017.00040] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 03/19/2017] [Accepted: 05/24/2017] [Indexed: 12/01/2022] Open
Abstract
Direct coupling analysis (DCA) is a powerful statistical inference tool used to study protein evolution. It was introduced to predict protein folds and protein-protein interactions, and has also been applied to the prediction of entire interactomes. Here, we have used it to analyze three proteins of the iron-sulfur biogenesis machine, an essential metabolic pathway conserved in all organisms. We show that DCA can correctly reproduce structural features of the CyaY/frataxin family (a protein involved in the human disease Friedreich's ataxia) despite being based on the relatively small number of sequences allowed by its genomic distribution. This result gives us confidence in the method. Its application to the iron-sulfur cluster scaffold protein IscU, which has been suggested to function both as an ordered and a disordered form, allows us to distinguish evolutionary traces of the structured species, suggesting that, if present in the cell, the disordered form has not left evolutionary imprinting. We observe instead, for the first time, direct indications of how the protein can dimerize head-to-head and bind 4Fe4S clusters. Analysis of the alternative scaffold protein IscA provides strong support to a coordination of the cluster by a dimeric form rather than a tetramer, as previously suggested. Our analysis also suggests the presence in solution of a mixture of monomeric and dimeric species, and guides us to the prevalent one. Finally, we used DCA to analyze interactions between some of these proteins, and discuss the potentials and limitations of the method.
Affiliation(s)
- Marco Fantini
  - BioSNS, Faculty of Mathematical and Natural Sciences, Scuola Normale Superiore, Pisa, Italy
- Duccio Malinverni
  - Institute of Physics, School of Basic Sciences, and Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
- Paolo De Los Rios
  - Institute of Physics, School of Basic Sciences, and Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
- Annalisa Pastore
  - Maurice Wohl Institute, King's College London, United Kingdom
  - Molecular Medicine Department, University of Pavia, Pavia, Italy
|
49
|
Murray JM, Maher S, Mota T, Suzuki K, Kelleher AD, Center RJ, Purcell D. Differentiating founder and chronic HIV envelope sequences. PLoS One 2017; 12:e0171572. [PMID: 28187204 PMCID: PMC5302377 DOI: 10.1371/journal.pone.0171572] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 07/21/2016] [Accepted: 01/23/2017] [Indexed: 11/27/2022] Open
Abstract
Significant progress has been made in characterizing broadly neutralizing antibodies against the HIV envelope glycoprotein Env, but an effective vaccine has proven elusive. Vaccine development would be facilitated if common features of early founder virus required for transmission could be identified. Here we employ a combination of bioinformatic and operations research methods to determine the most prevalent features that distinguish 78 subtype B and 55 subtype C founder Env sequences from an equal number of chronic sequences. There were a number of equivalent optimal networks (based on the fewest covarying amino acid (AA) pairs or a measure of maximal covariance) that separated founders from chronics: 13 pairs for subtype B and 75 for subtype C. Every subtype B optimal solution contained the founder pairs 178–346 Asn-Val, 232–236 Thr-Ser, 240–340 Lys-Lys, 279–315 Asp-Lys, 291–792 Ala-Ile, 322–347 Asp-Thr, 535–620 Leu-Asp, 742–837 Arg-Phe, and 750–836 Asp-Ile; the most common optimal pairs for subtype C were 644–781 Lys-Ala (74 of 75 networks), 133–287 Ala-Gln (73/75) and 307–337 Ile-Gln (73/75). No pair was present in all optimal subtype C solutions highlighting the difficulty in targeting transmission with a single vaccine strain. Relative to the size of its domain (0.35% of Env), the α4β7 binding site occurred most frequently among optimal pairs, especially for subtype C: 4.2% of optimal pairs (1.2% for subtype B). Early sequences from 5 subtype B pre-seroconverters each exhibited at least one clone containing an optimal feature 553–624 (Ser-Asn), 724–747 (Arg-Arg), or 46–293 (Arg-Glu).
Affiliation(s)
- John M. Murray
  - School of Mathematics and Statistics, UNSW Sydney, Sydney, New South Wales, Australia
- Stephen Maher
  - School of Mathematics and Statistics, UNSW Sydney, Sydney, New South Wales, Australia
  - Zuse Institute Berlin, Berlin, Germany
- Talia Mota
  - Department of Microbiology and Immunology, Peter Doherty Institute for Infection and Immunity, University of Melbourne, Melbourne, Victoria, Australia
- Kazuo Suzuki
  - The Kirby Institute, UNSW Sydney, Sydney, New South Wales, Australia
- Rob J. Center
  - Department of Microbiology and Immunology, Peter Doherty Institute for Infection and Immunity, University of Melbourne, Melbourne, Victoria, Australia
- Damian Purcell
  - Department of Microbiology and Immunology, Peter Doherty Institute for Infection and Immunity, University of Melbourne, Melbourne, Victoria, Australia
|
50
|
Abstract
Specific protein-protein interactions are crucial in the cell, both to ensure the formation and stability of multiprotein complexes and to enable signal transduction in various pathways. Functional interactions between proteins result in coevolution between the interaction partners, causing their sequences to be correlated. Here we exploit these correlations to accurately identify, from sequence data alone, which proteins are specific interaction partners. Our general approach, which employs a pairwise maximum entropy model to infer couplings between residues, has been successfully used to predict the 3D structures of proteins from sequences. Thus inspired, we introduce an iterative algorithm to predict specific interaction partners from two protein families whose members are known to interact. We first assess the algorithm's performance on histidine kinases and response regulators from bacterial two-component signaling systems. We obtain a striking 0.93 true positive fraction on our complete dataset without any a priori knowledge of interaction partners, and we uncover the origin of this success. We then apply the algorithm to proteins from ATP-binding cassette (ABC) transporter complexes, and obtain accurate predictions in these systems as well. Finally, we present two metrics that accurately distinguish interacting protein families from noninteracting ones, using only sequence data.
|