1
Gracia B, Montes P, Gutierrez AM, Arun B, Karras GI. Protein-folding chaperones predict structure-function relationships and cancer risk in BRCA1 mutation carriers. Cell Rep 2024; 43:113803. [PMID: 38368609 PMCID: PMC10941025 DOI: 10.1016/j.celrep.2024.113803] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2023] [Revised: 12/28/2023] [Accepted: 02/01/2024] [Indexed: 02/20/2024]
Abstract
Predicting the risk of cancer mutations is critical for early detection and prevention, but differences in allelic severity of human carriers confound risk predictions. Here, we elucidate protein folding as a cellular mechanism driving differences in mutation severity of tumor suppressor BRCA1. Using a high-throughput protein-protein interaction assay, we show that protein-folding chaperone binding patterns predict the pathogenicity of variants in the BRCA1 C-terminal (BRCT) domain. HSP70 selectively binds 94% of pathogenic BRCA1-BRCT variants, most of which engage HSP70 more than HSP90. Remarkably, the magnitude of HSP70 binding linearly correlates with loss of folding and function. We identify a prevalent class of human hypomorphic BRCA1 variants that bind moderately to chaperones and retain partial folding and function. Furthermore, chaperone binding signifies greater mutation penetrance and earlier cancer onset in the clinic. Our findings demonstrate the utility of chaperones as quantitative cellular biosensors of variant folding, phenotypic severity, and cancer risk.
Affiliation(s)
- Brant Gracia
  - Department of Genetics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- Patricia Montes
  - Department of Genetics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- Angelica Maria Gutierrez
  - Department of Breast Medical Oncology and Clinical Cancer Genetics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- Banu Arun
  - Department of Breast Medical Oncology and Clinical Cancer Genetics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- Georgios Ioannis Karras
  - Department of Genetics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
  - Genetics and Epigenetics Graduate Program, The University of Texas MD Anderson Cancer Center UTHealth Houston Graduate School of Biomedical Sciences, Houston, TX, USA
2
Du K, Huang H. Development of anti-PD-L1 antibody based on structure prediction of AlphaFold2. Front Immunol 2023; 14:1275999. [PMID: 37942332 PMCID: PMC10628240 DOI: 10.3389/fimmu.2023.1275999] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2023] [Accepted: 10/11/2023] [Indexed: 11/10/2023]
Abstract
Accurate structural information plays a crucial role in comprehending biological processes and designing drugs. Indeed, the remarkable precision of AlphaFold2 has facilitated significant advancements in predicting molecular structures, encompassing antibodies and antigens. This breakthrough has paved the way for rational drug design, ushering in new possibilities in the field of pharmaceutical development. In this study, antibody analysis and humanization were guided by the structures predicted by AlphaFold2. Notably, the resulting humanized antibody, h3D5-hIgG1, demonstrated exceptional binding affinity to the PD-L1 protein. The KD value of parental antibody 3D5-hIgG1 was increased by nearly 7 times after humanization. Both h3D5-hIgG1 and 3D5-hIgG1 bound to cells expressing human PD-L1, with EC50 values of 5.13 and 9.92 nM, respectively. Humanization resulted in a twofold increase in the binding capacity of the antibody, with h3D5-hIgG1 exhibiting superior performance compared to the parental antibody 3D5-hIgG1. Furthermore, h3D5-hIgG1 promoted cytokine secretion of T cells and significantly suppressed MC38-hPD-L1 tumor growth. This study highlights the potential for artificial intelligence-assisted drug development, which is poised to become a prominent trend in the future.
Affiliation(s)
- Kun Du
  - School of Chemical Engineering and Technology, Tianjin University, Tianjin, China
  - Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin, China
- He Huang
  - School of Chemical Engineering and Technology, Tianjin University, Tianjin, China
  - Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin, China
3
Gracia B, Montes P, Gutierrez AM, Arun B, Karras GI. Protein-Folding Chaperones Predict Structure-Function Relationships and Cancer Risk in BRCA1 Mutation Carriers. bioRxiv 2023:2023.09.14.557795. [PMID: 37745493 PMCID: PMC10515940 DOI: 10.1101/2023.09.14.557795] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/26/2023]
Abstract
Identifying pathogenic mutations and predicting their impact on protein structure, function and phenotype remain major challenges in genome sciences. Protein-folding chaperones participate in structure-function relationships by facilitating the folding of protein variants encoded by mutant genes. Here, we utilize a high-throughput protein-protein interaction assay to test HSP70 and HSP90 chaperone interactions as predictors of pathogenicity for variants in the tumor suppressor BRCA1. Chaperones bind 77% of pathogenic BRCA1-BRCT variants, most of which engaged HSP70 more than HSP90. Remarkably, the magnitude of chaperone binding to variants is proportional to the degree of structural and phenotypic defect induced by BRCA1 mutation. Quantitative chaperone interactions identified BRCA1-BRCT separation-of-function variants and hypomorphic alleles missed by pathogenicity prediction algorithms. Furthermore, increased chaperone binding signified greater cancer risk in human BRCA1 carriers. Altogether, our study showcases the utility of chaperones as quantitative cellular biosensors of variant folding and phenotypic severity.
Highlights
- Chaperones detect an abundance of pathogenic folding variants of BRCA1-BRCT.
- Degree of chaperone binding reflects severity of structural and phenotypic defect.
- Chaperones identify separation-of-function and hypomorphic variants.
- Chaperone interactions indicate penetrance and expressivity of BRCA1 alleles.
4
Reuss JM, Alonso-Gamo L, Garcia-Aranda M, Reuss D, Albi M, Albi B, Vilaboa D, Vilaboa B. Oral Mucosa in Cancer Patients-Putting the Pieces Together: A Narrative Review and New Perspectives. Cancers (Basel) 2023; 15:3295. [PMID: 37444405 DOI: 10.3390/cancers15133295] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2023] [Revised: 06/15/2023] [Accepted: 06/18/2023] [Indexed: 07/15/2023]
Abstract
The oral mucosa is a key player in cancer patients and during cancer treatment. The increasing prevalence of cancer and cancer-therapy-associated side effects are behind the major role that oral mucosa plays in oncological patients. Oral mucositis is a debilitating, severe complication caused by the early toxicity of chemotherapy and/or radiotherapy that can restrict treatment outcome possibilities, even challenging a patient's survival. It has been referred to as the most feared cancer treatment complication. Predictive variables as to who will be affected, and to what extent, are still unclear. Additionally, oral mucositis is one of the sources of the increasing economic burden of cancer, not only for patients and their families but also for institutions and governments. All efforts should be implemented in the search for new approaches to minimize the apparently ineluctable outburst of oral mucositis during cancer treatment. New perspectives derived from different approaches to explaining the interrelation between oral mucositis and the oral microbiome, or the similarities with genitourinary mucosa, may help elucidate the biomolecular pathways and mechanisms behind cancer-therapy-related toxicity of the oral mucosa and, more importantly, guide its management to minimize treatment side effects and provide enhanced cancer support.
Affiliation(s)
- Jose Manuel Reuss
  - Department of Postgraduate Prosthodontics, Universidad Complutense de Madrid, 28040 Madrid, Spain
- Laura Alonso-Gamo
  - Department of Pediatrics, Hospital Infanta Cristina, 28981 Madrid, Spain
- Mariola Garcia-Aranda
  - Centro Integral Oncológico Clara Campal, Department of Oncologic Radiotherapy, Hospital Universitario Sanchinarro, 28050 Madrid, Spain
- Debora Reuss
  - Lecturer Dental School, Universidad San Pablo CEU, 28003 Madrid, Spain
- Manuel Albi
  - Department of Gynecology and Obstetrics, Quironsalud Group Public Hospitals, 28223 Madrid, Spain
- Beatriz Albi
  - Department of Gynecology and Obstetrics, Hospital Universitario Fundación Jiménez Díaz, 28040 Madrid, Spain
- Debora Vilaboa
  - Aesthetic Dentistry Department, Universidad San Pablo CEU, 28003 Madrid, Spain
5
Kuang D, Issakova D, Kim J. Learning Proteome Domain Folding Using LSTMs in an Empirical Kernel Space. J Mol Biol 2022; 434:167686. [PMID: 35716781 DOI: 10.1016/j.jmb.2022.167686] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/19/2022] [Revised: 06/08/2022] [Accepted: 06/10/2022] [Indexed: 11/30/2022]
Abstract
The recognition of protein structural folds is the starting point for protein function inference and for many structural prediction tools. We previously introduced the idea of using empirical comparisons to create a data-augmented feature space called PESS (Protein Empirical Structure Space) as a novel approach for protein structure prediction. Here, we extend the previous approach by generating the PESS feature space over fixed-length subsequences of query peptides, and applying a sequential neural network model, with one long short-term memory cell layer followed by a fully connected layer. Using this approach, we show that only a small group of domains as a training set is needed to achieve near state-of-the-art accuracy on fold recognition. Our method improves on the previous approach by reducing the training set required and improving the model's ability to generalize across species, which will help fold prediction for newly discovered proteins.
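The abstract above mentions computing features over fixed-length subsequences of a query peptide. As a minimal sketch of that windowing step only, and not the authors' implementation, the following assumes a hypothetical helper `fixed_length_windows` with illustrative `window` and `stride` parameters that the abstract does not specify.

```python
from typing import List


def fixed_length_windows(sequence: str, window: int, stride: int = 1) -> List[str]:
    """Split a peptide sequence into fixed-length, possibly overlapping windows.

    `window` and `stride` are illustrative parameters, not values from the paper.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    # A sequence shorter than the window yields no full-length subsequence.
    return [
        sequence[start:start + window]
        for start in range(0, len(sequence) - window + 1, stride)
    ]


# Example: an 8-residue toy peptide, window of 4, stride of 2.
print(fixed_length_windows("MKTAYIAK", window=4, stride=2))
# ['MKTA', 'TAYI', 'YIAK']
```

Each window would then be mapped to a feature vector and the resulting sequence of vectors passed to a recurrent model; those later steps are not shown here.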
Affiliation(s)
- Da Kuang
  - Department of Computer and Information Science, University of Pennsylvania, Philadelphia, USA
- Dina Issakova
  - Department of Biology, University of Pennsylvania, Philadelphia, USA
- Junhyong Kim
  - Department of Computer and Information Science, University of Pennsylvania, Philadelphia, USA
  - Department of Biology, University of Pennsylvania, Philadelphia, USA
6
Tunyasuvunakool K, Adler J, Wu Z, Green T, Zielinski M, Žídek A, Bridgland A, Cowie A, Meyer C, Laydon A, Velankar S, Kleywegt GJ, Bateman A, Evans R, Pritzel A, Figurnov M, Ronneberger O, Bates R, Kohl SAA, Potapenko A, Ballard AJ, Romera-Paredes B, Nikolov S, Jain R, Clancy E, Reiman D, Petersen S, Senior AW, Kavukcuoglu K, Birney E, Kohli P, Jumper J, Hassabis D. Highly accurate protein structure prediction for the human proteome. Nature 2021; 596:590-596. [PMID: 34293799 PMCID: PMC8387240 DOI: 10.1038/s41586-021-03828-1] [Citation(s) in RCA: 1378] [Impact Index Per Article: 459.3] [Received: 05/11/2021] [Accepted: 07/16/2021] [Indexed: 02/07/2023]
Abstract
Protein structures can provide invaluable information, both for reasoning about biological processes and for enabling interventions such as structure-based drug development or targeted mutagenesis. After decades of effort, 17% of the total residues in human protein sequences are covered by an experimentally determined structure. Here we markedly expand the structural coverage of the proteome by applying the state-of-the-art machine learning method, AlphaFold2, at a scale that covers almost the entire human proteome (98.5% of human proteins). The resulting dataset covers 58% of residues with a confident prediction, of which a subset (36% of all residues) have very high confidence. We introduce several metrics developed by building on the AlphaFold model and use them to interpret the dataset, identifying strong multi-domain predictions as well as regions that are likely to be disordered. Finally, we provide some case studies to illustrate how high-quality predictions could be used to generate biological hypotheses. We are making our predictions freely available to the community and anticipate that routine large-scale and high-accuracy structure prediction will become an important tool that will allow new questions to be addressed from a structural perspective.
Affiliation(s)
- Sameer Velankar
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
- Gerard J Kleywegt
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
- Alex Bateman
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
- Ewan Birney
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
7
Pliss A, Kuzmin AN, Lita A, Kumar R, Celiku O, Atilla-Gokcumen GE, Gokcumen O, Chandra D, Larion M, Prasad PN. A Single-Organelle Optical Omics Platform for Cell Science and Biomarker Discovery. Anal Chem 2021; 93:8281-8290. [PMID: 34048235 DOI: 10.1021/acs.analchem.1c01131] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 11/28/2022]
Abstract
Research in fundamental cell biology and pathology could be revolutionized by developing the capacity for quantitative molecular analysis of subcellular structures. To that end, we introduce the Ramanomics platform, based on confocal Raman microspectrometry coupled to a biomolecular component analysis algorithm, which together enable us to molecularly profile single organelles in a live-cell environment. This emerging omics approach categorizes the entire molecular makeup of a sample into about a dozen general classes and subclasses of biomolecules and quantifies their amounts in submicrometer volumes. A major contribution of our study is an attempt to bridge Raman spectrometry with big-data analysis in order to identify complex patterns of biomolecules in a single cellular organelle and leverage discovery of disease biomarkers. Our data reveal significant variations in organellar composition between different cell lines. We also demonstrate the merits of Ramanomics for identifying diseased cells by using prostate cancer as an example. We report large-scale molecular transformations in the mitochondria, Golgi apparatus, and endoplasmic reticulum that accompany the development of prostate cancer. Based on these findings, we propose that Ramanomics datasets in distinct organelles constitute signatures of cellular metabolism in healthy and diseased states.
Affiliation(s)
- Artem Pliss
  - Institute for Lasers, Photonics and Biophotonics and Department of Chemistry, Natural Science Complex, University at Buffalo, The State University of New York, Buffalo, New York 14260, United States
- Andrey N Kuzmin
  - Institute for Lasers, Photonics and Biophotonics and Department of Chemistry, Natural Science Complex, University at Buffalo, The State University of New York, Buffalo, New York 14260, United States
- Adrian Lita
  - Neuro-Oncology Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, United States
- Rahul Kumar
  - Department of Pharmacology and Therapeutics, Roswell Park Comprehensive Cancer Center, Buffalo, New York 14263, United States
- Orieta Celiku
  - Neuro-Oncology Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, United States
- G Ekin Atilla-Gokcumen
  - Department of Chemistry, Natural Science Complex, University at Buffalo, The State University of New York, Buffalo, New York 14260, United States
- Omer Gokcumen
  - Department of Biological Sciences, Cooke Hall, University at Buffalo, The State University of New York, Buffalo, New York 14260, United States
- Dhyan Chandra
  - Department of Pharmacology and Therapeutics, Roswell Park Comprehensive Cancer Center, Buffalo, New York 14263, United States
- Mioara Larion
  - Neuro-Oncology Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, United States
- Paras N Prasad
  - Institute for Lasers, Photonics and Biophotonics and Department of Chemistry, Natural Science Complex, University at Buffalo, The State University of New York, Buffalo, New York 14260, United States
8
Pechmann S. Programmed Trade-offs in Protein Folding Networks. Structure 2020; 28:1361-1375.e4. [PMID: 33053320 DOI: 10.1016/j.str.2020.09.009] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 05/05/2020] [Revised: 07/25/2020] [Accepted: 09/23/2020] [Indexed: 12/14/2022]
Abstract
Molecular chaperones as specialized protein quality control enzymes form the core of cellular protein homeostasis. How chaperones selectively interact with their substrate proteins, and thus allocate their overall limited capacity, remains poorly understood. Here, I present an integrated analysis of sequence and structural determinants that define interactions of protein domains as the basic protein folding unit with the Saccharomyces cerevisiae Hsp70 Ssb. Structural homologs of single-domain proteins that differentially interact with Ssb for de novo folding were found to systematically differ in complexity of their folding landscapes, selective use of nonoptimal codons, and presence of short discriminative sequences, thus highlighting pervasive trade-offs in chaperone-assisted protein folding landscapes. However, short discriminative sequences were found to contribute by far the strongest signal toward explaining Ssb interactions. This observation suggested that some chaperone interactions may be directly programmed in the amino acid sequences rather than responding to folding challenges, possibly for regulatory advantages.
Affiliation(s)
- Sebastian Pechmann
  - Département de biochimie, Université de Montréal, 2900 Boulevard Edouard-Montpetit, Montréal, QC H3T 1J4, Canada
9
Koehler Leman J, Weitzner BD, Renfrew PD, Lewis SM, Moretti R, Watkins AM, Mulligan VK, Lyskov S, Adolf-Bryfogle J, Labonte JW, Krys J, Bystroff C, Schief W, Gront D, Schueler-Furman O, Baker D, Bradley P, Dunbrack R, Kortemme T, Leaver-Fay A, Strauss CEM, Meiler J, Kuhlman B, Gray JJ, Bonneau R. Better together: Elements of successful scientific software development in a distributed collaborative community. PLoS Comput Biol 2020; 16:e1007507. [PMID: 32365137 PMCID: PMC7197760 DOI: 10.1371/journal.pcbi.1007507] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Indexed: 12/14/2022]
Abstract
Many scientific disciplines rely on computational methods for data analysis, model generation, and prediction. Implementing these methods is often accomplished by researchers with domain expertise but without formal training in software engineering or computer science. This arrangement has led to underappreciation of sustainability and maintainability of scientific software tools developed in academic environments. Some software tools have avoided this fate, including the scientific library Rosetta. We use this software and its community as a case study to show how modern software development can be accomplished successfully, irrespective of subject area. Rosetta is one of the largest software suites for macromolecular modeling, with 3.1 million lines of code and many state-of-the-art applications. Since the mid 1990s, the software has been developed collaboratively by the RosettaCommons, a community of academics from over 60 institutions worldwide with diverse backgrounds including chemistry, biology, physiology, physics, engineering, mathematics, and computer science. Developing this software suite has provided us with more than two decades of experience in how to effectively develop advanced scientific software in a global community with hundreds of contributors. Here we illustrate the functioning of this development community by addressing technical aspects (like version control, testing, and maintenance), community-building strategies, diversity efforts, software dissemination, and user support. We demonstrate how modern computational research can thrive in a distributed collaborative community. The practices described here are independent of subject area and can be readily adopted by other software development communities.
Affiliation(s)
- Julia Koehler Leman
  - Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, United States of America
  - Dept of Biology, New York University, New York, NY, United States of America
- Brian D. Weitzner
  - Dept of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD, United States of America
  - Dept of Biochemistry, University of Washington, Seattle, WA, United States of America
  - Institute for Protein Design, University of Washington, Seattle, WA, United States of America
  - Lyell Immunopharma, Seattle, WA, United States of America
- P. Douglas Renfrew
  - Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, United States of America
- Steven M. Lewis
  - Dept of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States of America
  - Dept of Biochemistry, Duke University, Durham, NC, United States of America
  - Cyrus Biotechnology, Seattle, WA, United States of America
- Rocco Moretti
  - Dept of Chemistry, Vanderbilt University, Nashville, TN, United States of America
- Andrew M. Watkins
  - Dept of Biochemistry, Stanford University School of Medicine, Stanford, CA, United States of America
- Vikram Khipple Mulligan
  - Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, United States of America
  - Dept of Biochemistry, University of Washington, Seattle, WA, United States of America
  - Institute for Protein Design, University of Washington, Seattle, WA, United States of America
- Sergey Lyskov
  - Dept of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD, United States of America
- Jared Adolf-Bryfogle
  - Dept of Immunology and Microbiology, The Scripps Research Institute, La Jolla, CA, United States of America
- Jason W. Labonte
  - Dept of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD, United States of America
  - Dept of Chemistry, Franklin & Marshall College, Lancaster, PA, United States of America
- Justyna Krys
  - Dept of Chemistry, University of Warsaw, Warsaw, Poland
- Christopher Bystroff
  - Dept of Biological Sciences, Rensselaer Polytechnic Institute, Troy, NY, United States of America
- William Schief
  - Dept of Immunology and Microbiology, The Scripps Research Institute, La Jolla, CA, United States of America
- Dominik Gront
  - Dept of Chemistry, University of Warsaw, Warsaw, Poland
- Ora Schueler-Furman
  - Dept of Microbiology and Molecular Genetics, IMRIC, Ein Kerem Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem, Israel
- David Baker
  - Dept of Biochemistry, University of Washington, Seattle, WA, United States of America
  - Institute for Protein Design, University of Washington, Seattle, WA, United States of America
- Philip Bradley
  - Fred Hutchinson Cancer Research Center, Seattle, WA, United States of America
- Roland Dunbrack
  - Institute for Cancer Research, Fox Chase Cancer Center, Philadelphia, PA, United States of America
- Tanja Kortemme
  - Dept of Bioengineering and Therapeutic Sciences, University of California San Francisco, CA, United States of America
- Andrew Leaver-Fay
  - Dept of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States of America
- Charlie E. M. Strauss
  - Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, United States of America
- Jens Meiler
  - Depts of Chemistry, Pharmacology and Biomedical Informatics, Vanderbilt University, Nashville, TN, United States of America
  - Center for Structural Biology, Vanderbilt University, Nashville, TN, United States of America
  - Institute for Chemical Biology, Vanderbilt University, Nashville, TN, United States of America
  - Institute for Drug Discovery, Leipzig University, Leipzig, Germany
- Brian Kuhlman
  - Dept of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States of America
- Jeffrey J. Gray
  - Dept of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD, United States of America
- Richard Bonneau
  - Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, United States of America
  - Dept of Biology, New York University, New York, NY, United States of America
  - Dept of Computer Science, New York University, New York, NY, United States of America
  - Center for Data Science, New York University, New York, NY, United States of America
10
Abstract
Next-generation sequencing has made major strides in the past decade. Studies based on large sequencing data sets are growing in number, and public archives for raw sequencing data have been doubling in size every 18 months. Leveraging these data requires researchers to use large-scale computational resources. Cloud computing, a model whereby users rent computers and storage from large data centres, is a solution that is gaining traction in genomics research. Here, we describe how cloud computing is used in genomics for research and large-scale collaborations, and argue that its elasticity, reproducibility and privacy features make it ideally suited for the large-scale reanalysis of publicly available archived data, including privacy-protected data.
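To make the "doubling in size every 18 months" figure concrete, the following is a small sketch of the implied exponential growth. It assumes a constant doubling time, which is a simplification, and the function name `archive_growth_factor` is hypothetical.

```python
def archive_growth_factor(months: float, doubling_months: float = 18.0) -> float:
    """Multiplicative growth after `months`, assuming a constant doubling time."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    # Size scales as 2 ** (elapsed time / doubling time).
    return 2.0 ** (months / doubling_months)


# Two doubling periods (36 months) give a 4-fold increase.
print(archive_growth_factor(36))   # 4.0
# Over a decade (120 months) the same rate implies roughly a 100-fold increase.
print(round(archive_growth_factor(120)))
```

Under this assumption, storage and compute needs grow by about two orders of magnitude per decade, which is the scaling pressure the abstract cites as motivation for cloud resources.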
Affiliation(s)
- Ben Langmead
  - Department of Computer Science, Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Abhinav Nellore
  - Department of Biomedical Engineering, Department of Surgery, Computational Biology Program, Oregon Health and Science University, Portland, OR, USA
11
Nardo AE, Añón MC, Parisi G. Large-scale mapping of bioactive peptides in structural and sequence space. PLoS One 2018; 13:e0191063. [PMID: 29351315 PMCID: PMC5774755 DOI: 10.1371/journal.pone.0191063] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 10/26/2017] [Accepted: 12/27/2017] [Indexed: 12/11/2022]
Abstract
The health-enhancing potential of bioactive peptides (BPs) has driven interest in food proteins as well as in the development of predictive methods. Research in this area has been especially active in using them as components in functional foods. Apparently, BPs do not have a given biological function in the containing proteins and they do not evolve under independent evolutionary constraints. In this work we performed a large-scale mapping of BPs in sequence and structural space. Using well-curated BPs deposited in the BIOPEP database, we searched for exact matches in non-redundant sequence databases. Proteins containing BPs were used in fold-recognition methods to predict the corresponding folds, and BP occurrences were mapped. We found that the fold distribution of BP occurrences possibly reflects relative sequence abundance in databases. However, we also found that proteins with five or more BPs in their sequences correspond to well-populated protein folds, called superfolds. Also, we found that in well-populated superfamilies, BPs tend to adopt similar locations in the protein fold, suggesting the existence of hotspots. We think that our results could contribute to the development of new bioinformatics pipelines to improve BP detection.
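The exact-match search described in the abstract above can be illustrated with a short sketch. This is not the authors' pipeline: the function name `map_peptides`, the toy protein string, and the peptide list are all hypothetical, and a real search over non-redundant databases would need indexing rather than a per-peptide scan.

```python
from typing import Dict, List


def map_peptides(protein: str, peptides: List[str]) -> Dict[str, List[int]]:
    """Return 0-based start positions of every exact peptide occurrence."""
    hits: Dict[str, List[int]] = {}
    for peptide in peptides:
        if not peptide:
            continue  # an empty pattern would match at every position
        positions = []
        start = protein.find(peptide)
        while start != -1:
            positions.append(start)
            # Advance by one so overlapping occurrences are also found.
            start = protein.find(peptide, start + 1)
        if positions:
            hits[peptide] = positions
    return hits


# Toy example: "VPY" occurs twice, "PQR" once, "KKK" not at all.
print(map_peptides("AVPYPQRVPYA", ["VPY", "PQR", "KKK"]))
# {'VPY': [1, 7], 'PQR': [4]}
```

Counting the entries per protein gives the number of distinct BPs it contains, which is the quantity the abstract thresholds at five or more.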
Affiliation(s)
- Agustina E. Nardo
  - Departamento de Ciencia y Tecnología, Universidad Nacional de Quilmes, CONICET, Bernal, Argentina
  - Centro de Investigación y Desarrollo en Criotecnología de Alimentos, Facultad de Ciencia Exactas, Universidad Nacional de la Plata - Comisión de Investigaciones Científicas - CONICET, La Plata, Argentina
- M. Cristina Añón
  - Centro de Investigación y Desarrollo en Criotecnología de Alimentos, Facultad de Ciencia Exactas, Universidad Nacional de la Plata - Comisión de Investigaciones Científicas - CONICET, La Plata, Argentina
- Gustavo Parisi
  - Departamento de Ciencia y Tecnología, Universidad Nacional de Quilmes, CONICET, Bernal, Argentina
12
Monzon AM, Zea DJ, Marino-Buslje C, Parisi G. Homology modeling in a dynamical world. Protein Sci 2017; 26:2195-2206. [PMID: 28815769 DOI: 10.1002/pro.3274] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 06/06/2017] [Revised: 08/09/2017] [Accepted: 08/09/2017] [Indexed: 12/31/2022]
Abstract
A key concept in template-based modeling (TBM) is the high correlation between sequence and structural divergence, with the practical consequence that homologous proteins that are similar at the sequence level will also be similar at the structural level. However, conformational diversity of the native state will reduce the correlation between structural and sequence divergence, because structural variation can appear without sequence diversity. In this work, we explore the impact that conformational diversity has on the relationship between structural and sequence divergence. We find that the extent of conformational diversity can be as high as the maximum structural divergence among families. Also, as expected, conformational diversity impairs the well-established correlation between sequence and structural divergence, which is noisier than previously suggested. However, we found that this noise can be resolved using a priori information coming from the structure-function relationship. We show that protein families with low conformational diversity show a well-correlated relationship between sequence and structural divergence, which is severely reduced in proteins with larger conformational diversity. This lack of correlation could impair TBM results in highly dynamical proteins. Finally, we also find that the presence of order/disorder can provide useful prior information for better TBM performance.
Affiliation(s)
- Alexander Miguel Monzon
- Departamento de Ciencia y Tecnología, Universidad Nacional de Quilmes, CONICET, B1876BXD, Bernal, Argentina
| | - Diego Javier Zea
- Structural Bioinformatics Unit, Fundación Instituto Leloir, CONICET, C1405BWE Ciudad Autónoma de Buenos Aires, Argentina
| | - Cristina Marino-Buslje
- Structural Bioinformatics Unit, Fundación Instituto Leloir, CONICET, C1405BWE Ciudad Autónoma de Buenos Aires, Argentina
| | - Gustavo Parisi
- Departamento de Ciencia y Tecnología, Universidad Nacional de Quilmes, CONICET, B1876BXD, Bernal, Argentina
|
13
|
Bao W, Wang D, Chen Y. Classification of Protein Structure Classes on Flexible Neutral Tree. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2017; 14:1122-1133. [PMID: 28113983 DOI: 10.1109/tcbb.2016.2610967] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Indexed: 06/06/2023]
Abstract
Accurate classification of protein structures plays an important role in bioinformatics, and a growing body of evidence shows that a variety of classification methods have been applied in this field. In this research, the features used are amino acid composition, secondary-structure features, and correlation coefficients of amino acid dimers and amino acid triplets. The flexible neutral tree (FNT), a tree-structured neural network, is employed as the classification model in the protein structure classification framework. Because different feature groups play different roles in the model, impact factors for the groups are introduced. To evaluate these impact factors, an Impact Factors Scaling (IFS) algorithm, which aims to reduce redundant information in the selected features to some degree, is proposed. To examine the performance of the framework, the 640, 1189, and ASTRAL datasets are employed as low-homology protein structure benchmark datasets. Experimental results demonstrate that the proposed method performs better than the other methods on low-homology protein tertiary structures.
|
14
|
Middleton SA, Illuminati J, Kim J. Complete fold annotation of the human proteome using a novel structural feature space. Sci Rep 2017; 7:46321. [PMID: 28406174 PMCID: PMC5390313 DOI: 10.1038/srep46321] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 01/04/2017] [Accepted: 03/14/2017] [Indexed: 11/11/2022] Open
Abstract
Recognition of protein structural fold is the starting point for many structure prediction tools and protein function inference. Fold prediction is computationally demanding and recognizing novel folds is difficult such that the majority of proteins have not been annotated for fold classification. Here we describe a new machine learning approach using a novel feature space that can be used for accurate recognition of all 1,221 currently known folds and inference of unknown novel folds. We show that our method achieves better than 94% accuracy even when many folds have only one training example. We demonstrate the utility of this method by predicting the folds of 34,330 human protein domains and showing that these predictions can yield useful insights into potential biological function, such as prediction of RNA-binding ability. Our method can be applied to de novo fold prediction of entire proteomes and identify candidate novel fold families.
Affiliation(s)
- Sarah A Middleton
- Genomics and Computational Biology Program, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Joseph Illuminati
- Department of Computer Science, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Junhyong Kim
- Genomics and Computational Biology Program, University of Pennsylvania, Philadelphia, PA 19104, USA; Department of Biology, University of Pennsylvania, Philadelphia, PA 19104, USA
|
15
|
Van Holle S, Rougé P, Van Damme EJM. Evolution and structural diversification of Nictaba-like lectin genes in food crops with a focus on soybean (Glycine max). Annals of Botany 2017; 119:901-914. [PMID: 28087663 PMCID: PMC5379587 DOI: 10.1093/aob/mcw259] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 08/02/2016] [Revised: 10/24/2016] [Accepted: 11/17/2016] [Indexed: 05/10/2023]
Abstract
Background and Aims The Nictaba family groups all proteins that show homology to Nictaba, the tobacco lectin. So far, Nictaba and an Arabidopsis thaliana homologue have been shown to be implicated in the plant stress response. The availability of more than 50 sequenced plant genomes provided the opportunity for a genome-wide identification of Nictaba-like genes in 15 species, representing members of the Fabaceae, Poaceae, Solanaceae, Musaceae, Arecaceae, Malvaceae and Rubiaceae. Additionally, phylogenetic relationships between the different species were explored. Furthermore, this study included domain organization analysis, searching for orthologous genes in the legume family and transcript profiling of the Nictaba-like lectin genes in soybean. Methods Using a combination of BLASTp, InterPro analysis and hidden Markov models, the genomes of Medicago truncatula, Cicer arietinum, Lotus japonicus, Glycine max, Cajanus cajan, Phaseolus vulgaris, Theobroma cacao, Solanum lycopersicum, Solanum tuberosum, Coffea canephora, Oryza sativa, Zea mays, Sorghum bicolor, Musa acuminata and Elaeis guineensis were searched for Nictaba-like genes. Phylogenetic analysis was performed using RAxML and additional protein domains in the Nictaba-like sequences were identified using InterPro. Expression analysis of the soybean Nictaba-like genes was investigated using microarray data. Key Results Nictaba-like genes were identified in all studied species and analysis of the duplication events demonstrated that both tandem and segmental duplication contributed to the expansion of the Nictaba gene family in angiosperms. The single-domain Nictaba protein and the multi-domain F-box Nictaba architectures are ubiquitous among all analysed species and microarray analysis revealed differential expression patterns for all soybean Nictaba-like genes.
Conclusions Taken together, the comparative genomics data contribute to our understanding of the Nictaba-like gene family in species for which the occurrence of Nictaba domains had not yet been investigated. Given the ubiquitous nature of these genes, they have probably acquired new functions over time and are expected to take on various roles in plant development and defence.
Affiliation(s)
- Sofie Van Holle
- Laboratory of Biochemistry and Glycobiology, Department of Molecular Biotechnology, Ghent University, Coupure Links 653, 9000 Ghent, Belgium
| | - Pierre Rougé
- UMR 152 PHARMA-DEV, Université de Toulouse, IRD, UPS, Chemin des Maraîchers 35, 31400 Toulouse, France
| | - Els J. M. Van Damme
- Laboratory of Biochemistry and Glycobiology, Department of Molecular Biotechnology, Ghent University, Coupure Links 653, 9000 Ghent, Belgium
|
16
|
Sheu MJ, Hsieh MJ, Chou YE, Wang PH, Yeh CB, Yang SF, Lee HL, Liu YF. Effects of ADAMTS14 genetic polymorphism and cigarette smoking on the clinicopathologic development of hepatocellular carcinoma. PLoS One 2017; 12:e0172506. [PMID: 28231306 PMCID: PMC5322915 DOI: 10.1371/journal.pone.0172506] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 12/01/2016] [Accepted: 02/05/2017] [Indexed: 01/12/2023] Open
Abstract
Background ADAMTS14 is a member of the ADAMTS (a disintegrin and metalloproteinase with thrombospondin motifs) family, proteolytic enzymes that carry a variety of ancillary domains in the C-terminal region for substrate specificity and enzyme localization via extracellular matrix association. However, whether ADAMTS14 genetic variants play a role in hepatocellular carcinoma (HCC) susceptibility remains unknown. Methodology/Principal findings Four non-synonymous single-nucleotide polymorphisms (nsSNPs) of the ADAMTS14 gene were examined in 680 controls and 340 patients with HCC. Among 141 HCC patients with smoking behaviour, we found significant associations of the rs12774070 (CC+AA vs CC) and rs61573157 (CT+TT vs CC) variants with the clinical stage of HCC (OR: 2.500 and 2.767; 95% CI: 1.148–5.446 and 1.096–6.483; P = 0.019 and 0.026, respectively) and tumour size (OR: 2.387 and 2.659; 95% CI: 1.098–5.188 and 1.055–6.704; P = 0.026 and 0.034, respectively), but not with lymph node metastasis or other clinical statuses. Moreover, an additional integrated in silico analysis proposed that rs12774070 and rs61573157 affect an essential post-translational O-glycosylation site within the 3rd thrombospondin type 1 repeat and a novel proline-rich region embedded within the C-terminal extension, respectively. Conclusions Taken together, our results suggest an involvement of ADAMTS14 SNPs rs12774070 and rs61573157 in liver tumorigenesis and implicate ADAMTS14 gene polymorphism as a predictive factor during the progression of HCC.
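For readers unfamiliar with how an odds ratio and its 95% confidence interval are obtained from genotype counts, the following is a minimal sketch using the standard unadjusted log-odds (Woolf) interval. The function name `odds_ratio_ci` and the counts passed to it are hypothetical and are not taken from the study, which may have used an adjusted regression model instead.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Woolf (log-based) confidence interval.

    a: variant carriers with the outcome
    b: variant carriers without the outcome
    c: non-carriers with the outcome
    d: non-carriers without the outcome
    All four counts must be positive.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) for a 2x2 table.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, chosen only to illustrate the calculation.
or_value, ci_low, ci_high = odds_ratio_ci(20, 10, 10, 20)
print(f"OR = {or_value:.3f}, 95% CI: {ci_low:.3f}-{ci_high:.3f}")
```

An interval that excludes 1, as in the values reported in the abstract, is what corresponds to a significant association at the 5% level.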
Affiliation(s)
- Ming-Jen Sheu
- Department of Gastroenterology and Hepatology, Chi Mei Medical Center, Tainan, Taiwan
| | - Ming-Ju Hsieh
- Institute of Medicine, Chung Shan Medical University, Taichung, Taiwan
- Cancer Research Center, Changhua Christian Hospital, Changhua, Taiwan
- Graduate Institute of Biomedical Sciences, China Medical University, Taichung, Taiwan
| | - Ying-Erh Chou
- School of Medicine, Chung Shan Medical University, Taichung, Taiwan
- Department of Medical Research, Chung Shan Medical University Hospital, Taichung, Taiwan
| | - Po-Hui Wang
- Institute of Medicine, Chung Shan Medical University, Taichung, Taiwan
- Department of Obstetrics and Gynecology, Chung Shan Medical University Hospital, Taichung, Taiwan
| | - Chao-Bin Yeh
- School of Medicine, Chung Shan Medical University, Taichung, Taiwan
- Department of Emergency Medicine, Chung Shan Medical University Hospital, Taichung, Taiwan
| | - Shun-Fa Yang
- Institute of Medicine, Chung Shan Medical University, Taichung, Taiwan
- Department of Medical Research, Chung Shan Medical University Hospital, Taichung, Taiwan
| | - Hsiang-Lin Lee
- School of Medicine, Chung Shan Medical University, Taichung, Taiwan
- Department of Surgery, Chung Shan Medical University Hospital, Taichung, Taiwan
| | - Yu-Fan Liu
- Department of Biomedical Sciences, College of Medicine Sciences and Technology, Chung Shan Medical University, Taichung, Taiwan
- Division of Allergy, Department of Pediatrics, Chung-Shan Medical University Hospital, Taichung, Taiwan
|
17
|
Taghipour S, Zarrineh P, Ganjtabesh M, Nowzari-Dalini A. Improving protein complex prediction by reconstructing a high-confidence protein-protein interaction network of Escherichia coli from different physical interaction data sources. BMC Bioinformatics 2017; 18:10. [PMID: 28049415 PMCID: PMC5209909 DOI: 10.1186/s12859-016-1422-x] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 08/14/2016] [Accepted: 12/12/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Although different protein-protein physical interaction (PPI) datasets exist for Escherichia coli, no common methodology exists to integrate these datasets and extract reliable modules reflecting the existing biological processes and protein complexes. The naïve Bayesian formula is a widely accepted method to integrate different PPI datasets into a single weighted PPI network, but detecting proper weights in such a network is still a major problem. RESULTS In this paper, we proposed a new methodology to integrate various physical PPI datasets into a single weighted PPI network in such a way that the modules detected in the PPI network exhibit the highest similarity to available functional modules. We used co-expression modules as functional modules, and we showed that direct functional modules detected from Gene Ontology terms could be used as an alternative dataset. After running this integration methodology over six different physical PPI datasets, orthologous high-confidence interactions from a related organism and two AP-MS PPI datasets gained high weights in the integrated networks, while the weights for one AP-MS PPI dataset and two other datasets derived from public databases converged to zero. The majority of detected modules formed around one or a few hub proteins. Still, a large number of highly interacting protein modules were detected which are functionally relevant and are likely to form protein complexes. CONCLUSIONS We provided a new high-confidence protein complex prediction method supported by functional studies and literature mining.
Affiliation(s)
- Shirin Taghipour
- Department of Computer Science, School of Mathematics, Statistics, and Computer Science, University of Tehran, P.O.Box: 14155-6455, Tehran, Iran
| | - Peyman Zarrineh
- Department of Computer Science, School of Mathematics, Statistics, and Computer Science, University of Tehran, P.O.Box: 14155-6455, Tehran, Iran
| | - Mohammad Ganjtabesh
- Department of Computer Science, School of Mathematics, Statistics, and Computer Science, University of Tehran, P.O.Box: 14155-6455, Tehran, Iran.
| | - Abbas Nowzari-Dalini
- Department of Computer Science, School of Mathematics, Statistics, and Computer Science, University of Tehran, P.O.Box: 14155-6455, Tehran, Iran
|
18
|
Jing R, Sun J, Wang Y, Li M. Domain position prediction based on sequence information by using fuzzy mean operator. Proteins 2015; 83:1462-9. [PMID: 26009844 DOI: 10.1002/prot.24833] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2015] [Revised: 04/23/2015] [Accepted: 05/17/2015] [Indexed: 11/09/2022]
Abstract
The prediction of protein domain regions is valuable for the study of protein structure and function. In this study, we propose a new method, composed of a fuzzy mean operator and region division, to predict the positions of domains in a target protein based on its sequence. The whole sequence is aligned and scored using the fuzzy mean operator, and the final domain region positions are determined by region division. A published benchmark is used for comparison with previous studies. In addition, we generate two extra datasets to examine the stability of this method. Finally, the prediction accuracy achieved by our method on the independent test dataset was up to 84.13%. We hope that this method will be useful for related research.
Affiliation(s)
- Runyu Jing
- Chemical Information Center (CIC), College of Chemistry, Sichuan University, Chengdu, 610064, China
| | - Jing Sun
- Chemical Information Center (CIC), College of Chemistry, Sichuan University, Chengdu, 610064, China
| | - Yuelong Wang
- Chemical Information Center (CIC), College of Chemistry, Sichuan University, Chengdu, 610064, China
| | - Menglong Li
- Chemical Information Center (CIC), College of Chemistry, Sichuan University, Chengdu, 610064, China
|
19
|
Secondary and Tertiary Structure Prediction of Proteins: A Bioinformatic Approach. Complex System Modelling and Control Through Intelligent Soft Computations 2015. [DOI: 10.1007/978-3-319-12883-2_19] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 12/27/2022]
|
20
|
Zou T, Williams N, Ozkan SB, Ghosh K. Proteome folding kinetics is limited by protein halflife. PLoS One 2014; 9:e112701. [PMID: 25393560 PMCID: PMC4231061 DOI: 10.1371/journal.pone.0112701] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 07/23/2014] [Accepted: 10/10/2014] [Indexed: 12/29/2022] Open
Abstract
How heterogeneous are proteome folding timescales and what physical principles, if any, dictate their limits? We answer this by predicting the copy number weighted folding speed distribution – using the native topology – for the E. coli and Yeast proteomes. The E. coli and Yeast proteomes yield very similar distributions with average folding times of 100 milliseconds and 170 milliseconds, respectively. The topology-based folding time distribution is well described by a diffusion-drift mutation model on a flat-fitness landscape in free energy barrier between two boundaries: i) the lowest barrier height determined by the upper limit of folding speed and ii) the highest barrier height governed by the lower speed limit of folding. While the fastest time scale of the distribution is near the experimentally measured speed limit of 1 microsecond (typical of barrier-less folders), we find the slowest folding time to be around seconds (8 seconds for the Yeast distribution), approximately an order of magnitude less than the fastest halflife (approximately 2 minutes) in the Yeast proteome. This separation of timescales implies that even the fastest degrading protein will have a moderately high (96%) probability of folding before degradation. The overall agreement with the flat-fitness landscape model further hints that proteome folding times did not undergo additional major selection pressures – to make proteins fold faster – other than the primary requirement to “sufficiently beat the clock” against their lifetime. Direct comparison between the predicted folding time and experimentally measured halflife further shows that 99% of the proteome has a folding time less than its corresponding lifetime. These two findings together suggest that proteome folding kinetics may be bounded by protein halflife.
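The 96% figure can be reproduced with a simple back-of-the-envelope model. The sketch below assumes, as an illustration only, that folding and degradation are independent first-order processes competing for the same unfolded protein, so that the chance of folding first is k_fold / (k_fold + k_deg). The function name is hypothetical, and the authors' own derivation may differ.

```python
import math


def prob_fold_before_degradation(folding_time_s, half_life_s):
    """Probability that folding happens before degradation.

    Assumes two competing, independent first-order processes:
    folding with mean time `folding_time_s`, and degradation with
    half-life `half_life_s` (rate constant ln(2) / half-life).
    """
    k_fold = 1.0 / folding_time_s
    k_deg = math.log(2) / half_life_s
    return k_fold / (k_fold + k_deg)


# Values quoted in the abstract: slowest folding time of about 8 s,
# fastest half-life of about 2 minutes (120 s).
p_slowest = prob_fold_before_degradation(8.0, 120.0)
print(f"P(fold before degradation) = {p_slowest:.3f}")
```

Under these assumptions the slowest folder paired with the fastest half-life gives a probability of roughly 0.956, consistent with the value reported above; faster folders or longer-lived proteins push the probability closer to 1.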
Affiliation(s)
- Taisong Zou
- Center for Biological Physics, Department of Physics, Arizona State University, Tempe, Arizona, United States of America
| | - Nickolas Williams
- Department of Physics and Astronomy, University of Denver, Denver, Colorado, United States of America
| | - S. Banu Ozkan
- Center for Biological Physics, Department of Physics, Arizona State University, Tempe, Arizona, United States of America
| | - Kingshuk Ghosh
- Department of Physics and Astronomy, University of Denver, Denver, Colorado, United States of America
|
21
|
Mahajan S, de Brevern AG, Sanejouand YH, Srinivasan N, Offmann B. Use of a structural alphabet to find compatible folds for amino acid sequences. Protein Sci 2014; 24:145-53. [PMID: 25297700 DOI: 10.1002/pro.2581] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 07/15/2014] [Accepted: 10/06/2014] [Indexed: 01/01/2023]
Abstract
The structural annotation of proteins with no detectable homologs of known 3D structure identified using sequence-search methods is a major challenge today. We propose an original method that computes the conditional probabilities for the amino-acid sequence of a protein to fit to known protein 3D structures using a structural alphabet, known as "Protein Blocks" (PBs). PBs constitute a library of 16 local structural prototypes that approximate every part of protein backbone structures. It is used to encode 3D protein structures into 1D PB sequences and to capture sequence-to-structure relationships. Our method relies on amino acid occurrence matrices, one for each PB, to score global and local threading of query amino acid sequences to protein folds encoded into PB sequences. It does not use any information from residue contacts or sequence-search methods or explicit incorporation of the hydrophobic effect. The performance of the method was assessed with independent test datasets derived from SCOP 1.75A. With a Z-score cutoff that achieved 95% specificity (i.e., less than 5% false positives), global and local threading showed sensitivity of 64.1% and 34.2%, respectively. We further tested its performance on 57 difficult CASP10 targets that had no known homologs in PDB: 38 compatible templates were identified by our approach and 66% of these hits yielded correctly predicted structures. This method scales up well and offers promising perspectives for structural annotations at genomic level. It has been implemented in the form of a web-server that is freely available at http://www.bo-protscience.fr/forsa.
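The idea of scoring a query sequence against a fold encoded as a PB string can be illustrated with a heavily simplified sketch. It assumes one log-odds value per (PB, amino acid) pair and an ungapped, equal-length alignment; the counts, the background frequencies, the reduced five-letter amino acid alphabet, and the function names are all hypothetical toy values, not the matrices or the scoring scheme of the published method.

```python
import math

# Toy amino acid occurrence counts for two PB letters (hypothetical values).
# In the PB alphabet, "m" is typically helical and "d" typically strand-like.
TOY_COUNTS = {
    "m": {"A": 40, "L": 35, "V": 10, "G": 5, "P": 10},
    "d": {"A": 10, "L": 15, "V": 50, "G": 15, "P": 10},
}
# Toy background amino acid frequencies (hypothetical values).
BACKGROUND = {"A": 0.25, "L": 0.25, "V": 0.30, "G": 0.10, "P": 0.10}


def log_odds(pb, aa):
    """Log-odds of seeing amino acid `aa` in block `pb` versus background."""
    counts = TOY_COUNTS[pb]
    frequency = counts[aa] / sum(counts.values())
    return math.log(frequency / BACKGROUND[aa])


def threading_score(sequence, pb_sequence):
    """Sum of per-position log-odds for an ungapped, equal-length threading."""
    if len(sequence) != len(pb_sequence):
        raise ValueError("sequence and PB string must have equal length")
    return sum(log_odds(pb, aa) for aa, pb in zip(sequence, pb_sequence))


# The same query threaded onto a helix-like and a strand-like PB string.
helix_score = threading_score("ALLA", "mmmm")
strand_score = threading_score("ALLA", "dddd")
print(helix_score, strand_score)
```

In this toy setting the alanine- and leucine-rich query scores higher on the helix-like template. A real application would compare such raw scores against a background distribution, for example through the Z-score cutoff mentioned in the abstract.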
Affiliation(s)
- Swapnil Mahajan
- Université de La Réunion, DSIMB, UMR-S S1134, Saint Denis Messag Cedex 09, La Réunion, F-97715, France; INSERM, UMR-S 1134, DSIMB, F-75739, Paris, France; Laboratoire d'Excellence, GR-Ex, Paris, F-75739, France; Université de Nantes, UFIP CNRS UMR 6286 Faculté des Sciences et Techniques, 2 rue de la Houssinière, 44392, Nantes Cedex 03, France
|
22
|
Joseph AP, de Brevern AG. From local structure to a global framework: recognition of protein folds. J R Soc Interface 2014; 11:20131147. [PMID: 24740960 DOI: 10.1098/rsif.2013.1147] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 12/21/2022] Open
Abstract
Protein folding has been a major area of research for many years. Nonetheless, the mechanisms leading to the formation of an active biological fold are still not fully apprehended. The huge amount of available sequence and structural information provides hints to identify the putative fold for a given sequence. Indeed, protein structures prefer a limited number of local backbone conformations, some being characterized by preferences for certain amino acids. These preferences largely depend on the local structural environment. The prediction of local backbone conformations has become an important factor to correctly identifying the global protein fold. Here, we review the developments in the field of local structure prediction and especially their implication in protein fold recognition.
Affiliation(s)
- Agnel Praveen Joseph
- Science and Technology Facilities Council, Rutherford Appleton Laboratory, Harwell Oxford, Didcot OX11 0QX, UK
|
23
|
Rhee SY, Mutwil M. Towards revealing the functions of all genes in plants. Trends in Plant Science 2014; 19:212-21. [PMID: 24231067 DOI: 10.1016/j.tplants.2013.10.006] [Citation(s) in RCA: 146] [Impact Index Per Article: 14.6] [Received: 08/06/2013] [Revised: 10/10/2013] [Accepted: 10/16/2013] [Indexed: 05/19/2023]
Abstract
The great recent progress made in identifying the molecular parts lists of organisms revealed the paucity of our understanding of what most of the parts do. In this review, we introduce computational and statistical approaches and omics data used for inferring gene function in plants, with an emphasis on network-based inference. We also discuss caveats associated with network-based function predictions such as performance assessment, annotation propagation, the guilt-by-association concept, and the meaning of hubs. Finally, we note the current limitations and possible future directions such as the need for gold standard data from several species, unified access to data and tools, quantitative comparison of data and tool quality, and high-throughput experimental validation platforms for systematic gene function elucidation in plants.
Affiliation(s)
- Seung Yon Rhee
- Carnegie Institution for Science, Department of Plant Biology, 260 Panama St, Stanford, CA 94305, USA.
| | - Marek Mutwil
- Max Planck Institute for Molecular Plant Physiology, 14476 Potsdam, Germany.
|
24
|
Abrusán G, Zhang Y, Szilágyi A. Structure prediction and analysis of DNA transposon and LINE retrotransposon proteins. J Biol Chem 2013; 288:16127-38. [PMID: 23530042 PMCID: PMC3668768 DOI: 10.1074/jbc.m113.451500] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Received: 01/08/2013] [Revised: 03/21/2013] [Indexed: 01/15/2023] Open
Abstract
Despite the considerable amount of research on transposable elements, no large-scale structural analyses of the TE proteome have been performed so far. We predicted the structures of hundreds of proteins from a representative set of DNA and LINE transposable elements and used the obtained structural data to provide the first general structural characterization of TE proteins and to estimate the frequency of TE domestication and horizontal transfer events. We show that 1) ORF1 and Gag proteins of retrotransposons contain high amounts of structural disorder; thus, despite their very low conservation, the presence of disordered regions and probably their chaperone function is conserved. 2) The distribution of SCOP classes in DNA transposons and LINEs indicates that the proteins of DNA transposons are more ancient, containing folds that already existed when the first cellular organisms appeared. 3) DNA transposon proteins have lower contact order than randomly selected reference proteins, indicating rapid folding, most likely to avoid protein aggregation. 4) Structure-based searches for TE homologs indicate that the overall frequency of TE domestication events is low, whereas we found a relatively high number of cases where horizontal transfer, frequently involving parasites, is the most likely explanation for the observed homology.
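Contact order, used in point 3 above as a proxy for folding speed, is commonly computed as the average sequence separation of contacting residue pairs divided by chain length. The sketch below implements that commonly used relative contact order definition; the function name and the example contact list are hypothetical, and the abstract does not state which contact definition or cutoff the authors applied.

```python
def relative_contact_order(contacts, chain_length):
    """Relative contact order of a protein chain.

    contacts: iterable of (i, j) residue index pairs that are in contact
              (how contacts are defined, e.g. a distance cutoff, is left
              to the caller and is an assumption of this sketch).
    chain_length: total number of residues in the chain.

    Returns the mean sequence separation |i - j| over all contacts,
    normalised by chain length. Lower values indicate mostly local
    contacts, which tend to be associated with faster folding.
    """
    contacts = list(contacts)
    if not contacts or chain_length <= 0:
        raise ValueError("need at least one contact and a positive length")
    total_separation = sum(abs(i - j) for i, j in contacts)
    return total_separation / (chain_length * len(contacts))


# Hypothetical 20-residue chain with two contacts.
example_co = relative_contact_order([(1, 5), (2, 10)], chain_length=20)
print(example_co)
```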
Affiliation(s)
- György Abrusán
- Synthetic and Systems Biology Unit, Institute of Biochemistry, Biological Research Centre of the Hungarian Academy of Sciences, 6701 Szeged, Hungary.
|
25
|
Youngs N, Penfold-Brown D, Drew K, Shasha D, Bonneau R. Parametric Bayesian priors and better choice of negative examples improve protein function prediction. Bioinformatics 2013; 29:1190-8. [PMID: 23511543 PMCID: PMC3634187 DOI: 10.1093/bioinformatics/btt110] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Indexed: 12/21/2022] Open
Abstract
MOTIVATION Computational biologists have demonstrated the utility of using machine learning methods to predict protein function from an integration of multiple genome-wide data types. Yet, even the best performing function prediction algorithms rely on heuristics for important components of the algorithm, such as choosing negative examples (proteins without a given function) or determining key parameters. The improper choice of negative examples, in particular, can hamper the accuracy of protein function prediction. RESULTS We present a novel approach for choosing negative examples, using a parameterizable Bayesian prior computed from all observed annotation data, which also generates priors used during function prediction. We incorporate this new method into the GeneMANIA function prediction algorithm and demonstrate improved accuracy of our algorithm over current top-performing function prediction methods on the yeast and mouse proteomes across all metrics tested. AVAILABILITY Code and Data are available at: http://bonneaulab.bio.nyu.edu/funcprop.html
Affiliation(s)
- Noah Youngs
- Department of Computer Science, Center for Genomics and Systems Biology, New York University, New York, NY 10003, USA
|
26
|
Abstract
A 3D atomistic model of a plant cellulose synthase (CESA) has remained elusive despite over forty years of experimental effort. Here, we report a computationally predicted 3D structure of 506 amino acids of cotton CESA within the cytosolic region. Comparison of the predicted plant CESA structure with the solved structure of a bacterial cellulose-synthesizing protein validates the overall fold of the modeled glycosyltransferase (GT) domain. The coaligned plant and bacterial GT domains share a six-stranded β-sheet, five α-helices, and conserved motifs similar to those required for catalysis in other GT-2 glycosyltransferases. Extending beyond the cross-kingdom similarities related to cellulose polymerization, the predicted structure of cotton CESA reveals that plant-specific modules (plant-conserved region and class-specific region) fold into distinct subdomains on the periphery of the catalytic region. Computational results support the importance of the plant-conserved region and/or class-specific region in CESA oligomerization to form the multimeric cellulose-synthesis complexes that are characteristic of plants. Relatively high sequence conservation between plant CESAs allowed mapping of known mutations and two previously undescribed mutations that perturb cellulose synthesis in Arabidopsis thaliana to their analogous positions in the modeled structure. Most of these mutation sites are near the predicted catalytic region, and the confluence of other mutation sites supports the existence of previously undefined functional nodes within the catalytic core of CESA. Overall, the predicted tertiary structure provides a platform for the biochemical engineering of plant CESAs.
|
27
|
Brylinski M. The utility of artificially evolved sequences in protein threading and fold recognition. J Theor Biol 2013; 328:77-88. [PMID: 23542050 DOI: 10.1016/j.jtbi.2013.03.018] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 10/19/2012] [Revised: 01/24/2013] [Accepted: 03/18/2013] [Indexed: 12/23/2022]
Abstract
Template-based protein structure prediction plays an important role in Functional Genomics by providing structural models of gene products, which can be utilized by structure-based approaches to function inference. From a systems level perspective, the high structural coverage of gene products in a given organism is critical. Despite continuous efforts towards the development of more sensitive threading approaches, confident structural models cannot be constructed for a considerable fraction of proteins due to difficulties in recognizing low-sequence identity templates with a similar fold to the target. Here we introduce a new modeling stratagem, which employs a library of synthetic sequences to improve template ranking in fold recognition by sequence profile-based methods. We developed a new method for the optimization of generic protein-like amino acid sequences to stabilize the respective structures using a combined empirical scoring function, which is compatible with those commonly used in protein threading and fold recognition. We show that the artificially evolved sequences, whose average sequence identity to the wild-type sequences is as low as 13.8%, have significant capabilities to recognize the correct structures. Importantly, the quality of the corresponding threading alignments is comparable to those constructed using conventional wild-type approaches (the average TM-score is 0.48 and 0.54, respectively). Fold recognition that uses data fusion to combine ranks calculated for both wild-type and synthetic template libraries systematically improves the detection of structural analogs. Depending on the threading algorithm used, it yields on average 4-16% higher recognition rates than using the wild-type template library alone. Synthetic sequences artificially evolved for the template structures provide an orthogonal source of signal that could be exploited to detect those templates unrecognized by standard modeling techniques. This opens up new directions in the development of more sensitive threading methods with enhanced capabilities for targeting difficult, midnight zone templates.
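The rank-combination idea in this abstract can be illustrated with a minimal sketch. The abstract does not say which data fusion rule was used, so the mean-rank rule below is an assumption chosen for simplicity. The function name `fuse_ranks` and the template labels are hypothetical.

```python
from typing import Dict, List


def fuse_ranks(wild_type: Dict[str, int], synthetic: Dict[str, int]) -> List[str]:
    """Order templates by the mean of their two ranks (1 = best).

    Assumption: mean-rank fusion stands in for the unspecified fusion rule.
    A template missing from one library gets a rank one worse than the
    longest list, so it is penalized rather than dropped.
    """
    worst = max(len(wild_type), len(synthetic)) + 1
    templates = set(wild_type) | set(synthetic)

    def fused(template: str) -> float:
        return (wild_type.get(template, worst) + synthetic.get(template, worst)) / 2.0

    # Ties are broken alphabetically so the output is deterministic.
    return sorted(templates, key=lambda t: (fused(t), t))


# Hypothetical ranks for three templates from the two libraries.
wt_ranks = {"tplA": 1, "tplB": 2, "tplC": 3}
syn_ranks = {"tplC": 1, "tplA": 2, "tplB": 3}
print(fuse_ranks(wt_ranks, syn_ranks))  # ['tplA', 'tplC', 'tplB']
```

In this toy case tplC ranks last in the wild-type library but moves ahead of tplB after fusion because the synthetic library ranks it first. This is the kind of complementary signal the abstract describes.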
Affiliation(s)
- Michal Brylinski
- Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA.
|
28
|
Abstract
Background: Computational/manual annotations of protein functions are one of the first routes to making sense of a newly sequenced genome. Protein domain predictions form an essential part of this annotation process. This is due to the natural modularity of proteins, with domains as structural, evolutionary and functional units. Sometimes two, three, or more adjacent domains (called supra-domains) are the operational unit responsible for a function, e.g. via a binding site at the interface. These supra-domains have contributed to functional diversification in higher organisms. Traditionally, functional ontologies have been applied to individual proteins, rather than families of related domains and supra-domains. We expect, however, that to some extent functional signals can be carried by protein domains and supra-domains, and consequently used in function prediction and functional genomics. Results: Here we present a domain-centric Gene Ontology (dcGO) perspective. We generalize a framework for automatically inferring ontological terms associated with domains and supra-domains from full-length sequence annotations. This general framework has been applied specifically to primary protein-level annotations from UniProtKB-GOA, generating GO term associations with SCOP domains and supra-domains. The resulting 'dcGO Predictor' can be used to provide functional annotation to protein sequences. The functional annotation of sequences in the Critical Assessment of Function Annotation (CAFA) has been used as a valuable opportunity to validate our method and to be assessed by the community. The functional annotation of all completely sequenced genomes has demonstrated the potential for domain-centric GO enrichment analysis to yield functional insights into newly sequenced or yet-to-be-annotated genomes. The generalized framework we have presented has also been applied to other domain classifications such as InterPro and Pfam, and other ontologies such as mammalian phenotype and disease ontology. The dcGO and its predictor are available at http://supfam.org/SUPERFAMILY/dcGO including an enrichment analysis tool. Conclusions: As functional units, domains offer a unique perspective on function prediction regardless of whether proteins are multi-domain or single-domain. The 'dcGO Predictor' holds great promise for contributing to a domain-centric functional understanding of genomes in the next generation sequencing era.
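Inferring a term for a domain from protein-level annotations can be sketched as an over-representation test. The abstract does not describe the statistical procedure dcGO uses, so the one-sided hypergeometric test below is an assumed stand-in. The function names, the protein and domain labels, and the term label `term_x` are hypothetical.

```python
from math import comb
from typing import Dict, Set, Tuple


def hypergeom_tail(N: int, K: int, n: int, k: int) -> float:
    """P(X >= k) when n items are drawn from N, of which K are 'successes'."""
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / comb(N, n)


def domain_term_pvalue(
    proteins: Dict[str, Tuple[Set[str], Set[str]]], domain: str, term: str
) -> float:
    """Test whether proteins containing `domain` are enriched for `term`.

    `proteins` maps a protein id to (its domains, its annotated terms).
    Assumption: a plain hypergeometric test, not the published dcGO method.
    """
    N = len(proteins)
    K = sum(1 for _, terms in proteins.values() if term in terms)
    n = sum(1 for doms, _ in proteins.values() if domain in doms)
    k = sum(1 for doms, terms in proteins.values() if domain in doms and term in terms)
    return hypergeom_tail(N, K, n, k)


# Toy annotation set: both proteins carrying d1 are annotated with term_x.
toy = {
    "p1": ({"d1"}, {"term_x"}),
    "p2": ({"d1"}, {"term_x"}),
    "p3": ({"d2"}, {"term_y"}),
    "p4": ({"d2"}, set()),
}
print(domain_term_pvalue(toy, "d1", "term_x"))  # 1/6, about 0.167
```

A real analysis would also need multiple-testing correction and would have to respect the ontology hierarchy. Both are outside the scope of this sketch.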
Affiliation(s)
- Hai Fang
- Department of Computer Science, University of Bristol, The Merchant Venturers Building, Bristol BS8 1UB, UK.
|
29
|
Fey P, Dodson RJ, Basu S, Chisholm RL. One stop shop for everything Dictyostelium: dictyBase and the Dicty Stock Center in 2012. Methods Mol Biol 2013; 983:59-92. [PMID: 23494302 DOI: 10.1007/978-1-62703-302-2_4] [Citation(s) in RCA: 120] [Impact Index Per Article: 10.9] [Indexed: 12/31/2022]
Abstract
dictyBase (http://dictybase.org), the model organism database for Dictyostelium discoideum, includes the complete genome sequence and expression data for this organism. Relevant literature is integrated into the database, and gene models and functional annotation are manually curated from experimental results and comparative multigenome analyses. dictyBase has recently expanded to include the genome sequences of three additional Dictyostelids and has added new software tools to facilitate multigenome comparisons. The Dicty Stock Center, a strain and plasmid repository for Dictyostelium research, relocated to Northwestern University in 2009. This allowed us to integrate all Dictyostelium resources to better serve the research community. In this chapter, we describe how to navigate the Web site and highlight some of our newer improvements.
Affiliation(s)
- Petra Fey
- dictyBase and the Dicty Stock Center, Center for Genetic Medicine, Northwestern University, Chicago, IL, USA.
|
30
|
The mRNA-bound proteome and its global occupancy profile on protein-coding transcripts. Mol Cell 2012; 46:674-90. [PMID: 22681889 DOI: 10.1016/j.molcel.2012.05.021] [Citation(s) in RCA: 877] [Impact Index Per Article: 73.1] [Received: 03/20/2012] [Revised: 05/14/2012] [Accepted: 05/17/2012] [Indexed: 01/17/2023]
Abstract
Protein-RNA interactions are fundamental to core biological processes, such as mRNA splicing, localization, degradation, and translation. We developed a photoreactive nucleotide-enhanced UV crosslinking and oligo(dT) purification approach to identify the mRNA-bound proteome using quantitative proteomics and to display the protein occupancy on mRNA transcripts by next-generation sequencing. Application to a human embryonic kidney cell line identified close to 800 proteins. To our knowledge, nearly one-third were not previously annotated as RNA binding, and about 15% were not predictable by computational methods to interact with RNA. Protein occupancy profiling provides a transcriptome-wide catalog of potential cis-regulatory regions on mammalian mRNAs and showed that large stretches in 3' UTRs can be contacted by the mRNA-bound proteome, with numerous putative binding sites in regions harboring disease-associated nucleotide polymorphisms. Our observations indicate the presence of a large number of mRNA binders with diverse molecular functions participating in combinatorial posttranscriptional gene-expression networks.
|
31
|
Pentony MM, Winters P, Penfold-Brown D, Drew K, Narechania A, DeSalle R, Bonneau R, Purugganan MD. The plant proteome folding project: structure and positive selection in plant protein families. Genome Biol Evol 2012; 4:360-71. [PMID: 22345424 PMCID: PMC3318447 DOI: 10.1093/gbe/evs015] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Indexed: 11/16/2022] Open
Abstract
Despite its importance, relatively little is known about the relationship between the structure, function, and evolution of proteins, particularly in land plant species. We have developed a database with predicted protein domains for five plant proteomes (http://pfp.bio.nyu.edu) and used both protein structural fold recognition and de novo Rosetta-based protein structure prediction to predict protein structure for Arabidopsis and rice proteins. Based on sequence similarity, we have identified ∼15,000 orthologous/paralogous protein family clusters among these species and used codon-based models to predict positive selection in protein evolution within 175 of these sequence clusters. Our results show that codons that display positive selection appear to be less frequent in helical and strand regions and are overrepresented in amino acid residues that are associated with a change in protein secondary structure. As in other organisms, disordered protein regions also appear to have more selected sites. Structural information provides new functional insights into specific plant proteins and allows us to map positively selected amino acid sites onto protein structures and view these sites in a structural and functional context.
Affiliation(s)
- M M Pentony
- Center for Genomics and Systems Biology, Department of Biology, New York University, NY, USA
|
32
|
Ashworth J, Wurtmann EJ, Baliga NS. Reverse engineering systems models of regulation: discovery, prediction and mechanisms. Curr Opin Biotechnol 2011; 23:598-603. [PMID: 22209016 DOI: 10.1016/j.copbio.2011.12.005] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 11/29/2011] [Accepted: 12/08/2011] [Indexed: 10/14/2022]
Abstract
Biological systems can now be understood in comprehensive and quantitative detail using systems biology approaches. Putative genome-scale models can be built rapidly based upon biological inventories and strategic system-wide molecular measurements. Current models combine statistical associations, causative abstractions, and known molecular mechanisms to explain and predict quantitative and complex phenotypes. This top-down 'reverse engineering' approach generates useful organism-scale models despite noise and incompleteness in data and knowledge. Here we review and discuss the reverse engineering of biological systems using top-down data-driven approaches, in order to improve discovery, hypothesis generation, and the inference of biological properties.
|