1
Wang L, Sun H, Yue Z, Xia J, Li X. CDMPred: a tool for predicting cancer driver missense mutations with high-quality passenger mutations. PeerJ 2024; 12:e17991. [PMID: 39253604 PMCID: PMC11382650 DOI: 10.7717/peerj.17991] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/10/2024] [Accepted: 08/07/2024] [Indexed: 09/11/2024]
Abstract
Most computational methods for predicting driver mutations have been trained using positive samples, while negative samples are typically derived from statistical methods or putative samples. The representativeness of these negative samples in capturing the diversity of passenger mutations remains to be determined. To tackle these issues, we curated a balanced dataset comprising driver mutations sourced from the COSMIC database and high-quality passenger mutations obtained from the Cancer Passenger Mutation database. Subsequently, we encoded the distinctive features of these mutations. Utilizing feature correlation analysis, we developed a cancer driver missense mutation predictor called CDMPred employing feature selection through the ensemble learning technique XGBoost. The proposed CDMPred method, utilizing the top 10 features and XGBoost, achieved an area under the receiver operating characteristic curve (AUC) value of 0.83 and 0.80 on the training and independent test sets, respectively. Furthermore, CDMPred demonstrated superior performance compared to existing state-of-the-art methods for cancer-specific and general diseases, as measured by AUC and area under the precision-recall curve. Including high-quality passenger mutations in the training data proves advantageous for CDMPred's prediction performance. We anticipate that CDMPred will be a valuable tool for predicting cancer driver mutations, furthering our understanding of personalized therapy.
Affiliation(s)
- Lihua Wang
  - Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology, Anhui University, Hefei, Anhui, China
  - School of Information Engineering, Huangshan University, Huangshan, Anhui, China
- Haiyang Sun
  - State Key Laboratory of Medicinal Chemical Biology, NanKai University, Tianjin, Tianjin, China
- Zhenyu Yue
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui, China
- Junfeng Xia
  - Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology, Anhui University, Hefei, Anhui, China
- Xiaoyan Li
  - Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology, Anhui University, Hefei, Anhui, China
2
Villalobos-Alva J, Ochoa-Toledo L, Villalobos-Alva MJ, Aliseda A, Pérez-Escamirosa F, Altamirano-Bustamante NF, Ochoa-Fernández F, Zamora-Solís R, Villalobos-Alva S, Revilla-Monsalve C, Kemper-Valverde N, Altamirano-Bustamante MM. Protein Science Meets Artificial Intelligence: A Systematic Review and a Biochemical Meta-Analysis of an Inter-Field. Front Bioeng Biotechnol 2022; 10:788300. [PMID: 35875501 PMCID: PMC9301016 DOI: 10.3389/fbioe.2022.788300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2021] [Accepted: 05/25/2022] [Indexed: 11/23/2022]
Abstract
Proteins are some of the most fascinating and challenging molecules in the universe, and they pose a big challenge for artificial intelligence. The implementation of machine learning/AI in protein science gives rise to a world of knowledge adventures in the workhorse of the cell and proteome homeostasis, which are essential for making life possible. This opens up epistemic horizons thanks to a coupling of human tacit-explicit knowledge with machine learning power, the benefits of which are already tangible, such as important advances in protein structure prediction. Moreover, the driving force behind the protein processes of self-organization, adjustment, and fitness requires a space corresponding to gigabytes of life data in its order of magnitude. There are many tasks such as novel protein design, protein folding pathways, and synthetic metabolic routes, as well as protein-aggregation mechanisms, pathogenesis of protein misfolding and disease, and proteostasis networks that are currently unexplored or unrevealed. In this systematic review and biochemical meta-analysis, we aim to contribute to bridging the gap between what we call binomial artificial intelligence (AI) and protein science (PS), a growing research enterprise with exciting and promising biotechnological and biomedical applications. We undertake our task by exploring "the state of the art" in AI and machine learning (ML) applications to protein science in the scientific literature to address some critical research questions in this domain, including What kind of tasks are already explored by ML approaches to protein sciences? What are the most common ML algorithms and databases used? What is the situational diagnostic of the AI-PS inter-field? What do ML processing steps have in common? We also formulate novel questions such as Is it possible to discover what the rules of protein evolution are with the binomial AI-PS? How do protein folding pathways evolve? What are the rules that dictate the folds? 
What are the minimal nuclear protein structures? How do protein aggregates form and why do they exhibit different toxicities? What are the structural properties of amyloid proteins? How can we design an effective proteostasis network to deal with misfolded proteins? We are a cross-functional group of scientists from several academic disciplines, and we have conducted the systematic review using a variant of the PICO and PRISMA approaches. The search was carried out in four databases (PubMed, Bireme, OVID, and EBSCO Web of Science), resulting in 144 research articles. After three rounds of quality screening, 93 articles were finally selected for further analysis. A summary of our findings is as follows: regarding AI applications, there are mainly four types: 1) genomics, 2) protein structure and function, 3) protein design and evolution, and 4) drug design. In terms of the ML algorithms and databases used, supervised learning was the most common approach (85%). As for the databases used for the ML models, PDB and UniprotKB/Swissprot were the most common ones (21 and 8%, respectively). Moreover, we identified that approximately 63% of the articles organized their results into three steps, which we labeled pre-process, process, and post-process. A few studies combined data from several databases or created their own databases after the pre-process. Our main finding is that, as of today, there are no research road maps serving as guides to address gaps in our knowledge of the AI-PS binomial. All research efforts to collect, integrate multidimensional data features, and then analyze and validate them are, so far, uncoordinated and scattered throughout the scientific literature without a clear epistemic goal or connection between the studies. 
Therefore, our main contribution to the scientific literature is to offer a road map to help solve problems in drug design, protein structures, design, and function prediction while also presenting the "state of the art" on research in the AI-PS binomial until February 2021. Thus, we pave the way toward future advances in the synthetic redesign of novel proteins and protein networks and artificial metabolic pathways, learning lessons from nature for the welfare of humankind. Many of the novel proteins and metabolic pathways are currently non-existent in nature, nor are they used in the chemical industry or biomedical field.
Affiliation(s)
- Jalil Villalobos-Alva
  - Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Luis Ochoa-Toledo
  - Instituto de Ciencias Aplicadas y Tecnología (ICAT), Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
- Mario Javier Villalobos-Alva
  - Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Atocha Aliseda
  - Instituto de Investigaciones Filosóficas, Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
- Fernando Pérez-Escamirosa
  - Instituto de Ciencias Aplicadas y Tecnología (ICAT), Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
- Francine Ochoa-Fernández
  - Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Ricardo Zamora-Solís
  - Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Sebastián Villalobos-Alva
  - Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Cristina Revilla-Monsalve
  - Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
- Nicolás Kemper-Valverde
  - Instituto de Ciencias Aplicadas y Tecnología (ICAT), Universidad Nacional Autónoma de México (UNAM), Mexico City, Mexico
- Myriam M. Altamirano-Bustamante
  - Unidad de Investigación en Enfermedades Metabólicas, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Mexico City, Mexico
3
Sidi T, Keasar C. Redundancy-weighting the PDB for detailed secondary structure prediction using deep-learning models. Bioinformatics 2020; 36:3733-3738. [PMID: 32186698 DOI: 10.1093/bioinformatics/btaa196] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 12/10/2019] [Revised: 03/12/2020] [Accepted: 03/16/2020] [Indexed: 11/15/2022]
Abstract
MOTIVATION: The Protein Data Bank (PDB), the ultimate source for data in structural biology, is inherently imbalanced. To alleviate biases, virtually all structural biology studies use nonredundant (NR) subsets of the PDB, which include only a fraction of the available data. An alternative approach, dubbed redundancy-weighting (RW), down-weights redundant entries rather than discarding them. This approach may be particularly helpful for machine-learning (ML) methods that use the PDB as their source for data. Methods for secondary structure prediction (SSP) have greatly improved over the years with recent studies achieving above 70% accuracy for eight-class (DSSP) prediction. As these methods typically incorporate ML techniques, training on RW datasets might improve accuracy, as well as pave the way toward larger and more informative secondary structure classes.
RESULTS: This study compares the SSP performances of deep-learning models trained on either RW or NR datasets. We show that training on RW sets consistently results in better prediction of 3- (HCE), 8- (DSSP) and 13-class (STR2) secondary structures.
AVAILABILITY AND IMPLEMENTATION: The ML models, the datasets used for their derivation and testing, and a stand-alone SSP program for DSSP and STR2 predictions, are freely available under LGPL license in http://meshi1.cs.bgu.ac.il/rw.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Tomer Sidi
  - Department of Computer Science, Ben-Gurion University, P.O.B 653, Be'er Sheva 84105, Israel
- Chen Keasar
  - Department of Computer Science, Ben-Gurion University, P.O.B 653, Be'er Sheva 84105, Israel
4
Thomas JMH, Simkovic F, Keegan R, Mayans O, Zhang C, Zhang Y, Rigden DJ. Approaches to ab initio molecular replacement of α-helical transmembrane proteins. Acta Crystallogr D Struct Biol 2017; 73:985-996. [PMID: 29199978 PMCID: PMC5713875 DOI: 10.1107/s2059798317016436] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 06/14/2017] [Accepted: 11/15/2017] [Indexed: 02/06/2023]
Abstract
α-Helical transmembrane proteins are a ubiquitous and important class of proteins, but present difficulties for crystallographic structure solution. Here, the effectiveness of the AMPLE molecular replacement pipeline in solving α-helical transmembrane-protein structures is assessed using a small library of eight ideal helices, as well as search models derived from ab initio models generated both with and without evolutionary contact information. The ideal helices prove to be surprisingly effective at solving higher resolution structures, but ab initio-derived search models are able to solve structures that could not be solved with the ideal helices. The addition of evolutionary contact information results in a marked improvement in the modelling and makes additional solutions possible.
Affiliation(s)
- Jens M. H. Thomas
  - Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
- Felix Simkovic
  - Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
- Ronan Keegan
  - Research Complex at Harwell, STFC Rutherford Appleton Laboratory, Didcot OX11 0FA, England
- Olga Mayans
  - Fachbereich Biologie, Universität Konstanz, D-78457 Konstanz, Germany
- Chengxin Zhang
  - Department of Computational Medicine and Bioinformatics, Department of Biological Chemistry, Medical School, University of Michigan, 100 Washtenaw Avenue, Ann Arbor, MI 48109-2218, USA
- Yang Zhang
  - Department of Computational Medicine and Bioinformatics, Department of Biological Chemistry, Medical School, University of Michigan, 100 Washtenaw Avenue, Ann Arbor, MI 48109-2218, USA
- Daniel J. Rigden
  - Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
5
Protein secondary structure prediction: A survey of the state of the art. J Mol Graph Model 2017; 76:379-402. [DOI: 10.1016/j.jmgm.2017.07.015] [Citation(s) in RCA: 50] [Impact Index Per Article: 7.1] [Received: 06/05/2017] [Revised: 07/14/2017] [Accepted: 07/17/2017] [Indexed: 11/21/2022]
6
Beltrandi M, Blocquel D, Erales J, Barbier P, Cavalli A, Longhi S. Insights into the coiled-coil organization of the Hendra virus phosphoprotein from combined biochemical and SAXS studies. Virology 2015; 477:42-55. [PMID: 25637789 DOI: 10.1016/j.virol.2014.12.029] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 09/20/2014] [Revised: 10/21/2014] [Accepted: 12/19/2014] [Indexed: 10/24/2022]
Abstract
Nipah and Hendra viruses are recently emerged paramyxoviruses belonging to the Henipavirus genus. The Henipavirus phosphoprotein (P) consists of a large intrinsically disordered domain and a C-terminal domain (PCT) containing alternating disordered and ordered regions. Among these latter is the P multimerization domain (PMD). Using biochemical, analytical ultracentrifugation and small-angle X-ray scattering (SAXS) studies, we show that Hendra virus (HeV) PMD forms an elongated coiled-coil homotrimer in solution, in agreement with our previous findings on Nipah virus (NiV) PMD. However, the orientation of the N-terminal region differs from that observed in solution for NiV PMD, consistent with the ability of this region to adopt different conformations. SAXS studies provided evidence for a trimeric organization also in the case of PCT, thus extending and strengthening our findings on PMD. The present results are discussed in light of conflicting reports in the literature pointing to a tetrameric organization of paramyxoviral P proteins.
Affiliation(s)
- Matilde Beltrandi
  - Aix-Marseille University, Architecture et Fonction des Macromolécules Biologiques (AFMB) UMR 7257, 13288 Marseille, France; CNRS, AFMB UMR 7257, 13288 Marseille, France
- David Blocquel
  - Aix-Marseille University, Architecture et Fonction des Macromolécules Biologiques (AFMB) UMR 7257, 13288 Marseille, France; CNRS, AFMB UMR 7257, 13288 Marseille, France
- Jenny Erales
  - Aix-Marseille University, Architecture et Fonction des Macromolécules Biologiques (AFMB) UMR 7257, 13288 Marseille, France; CNRS, AFMB UMR 7257, 13288 Marseille, France
- Pascale Barbier
  - Aix-Marseille University, INSERM, CRO2 UMR_S911, Faculté de Pharmacie, 13385 Marseille, France
- Andrea Cavalli
  - Institute for Research in Biomedicine, Via Vincenzo Vela 6, 6500 Bellinzona, Switzerland; Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, United Kingdom
- Sonia Longhi
  - Aix-Marseille University, Architecture et Fonction des Macromolécules Biologiques (AFMB) UMR 7257, 13288 Marseille, France; CNRS, AFMB UMR 7257, 13288 Marseille, France
7
Meier A, Söding J. Context similarity scoring improves protein sequence alignments in the midnight zone. Bioinformatics 2014; 31:674-81. [PMID: 25338715 DOI: 10.1093/bioinformatics/btu697] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Indexed: 11/14/2022]
Abstract
MOTIVATION: High-quality protein sequence alignments are essential for a number of downstream applications such as template-based protein structure prediction. In addition to the similarity score between sequence profile columns, many current profile-profile alignment tools use extra terms that compare 1D-structural properties such as secondary structure and solvent accessibility, which are predicted from short profile windows around each sequence position. Such scores add non-redundant information by evaluating the conservation of local patterns of hydrophobicity and other amino acid properties and thus exploiting correlations between profile columns.
RESULTS: Here, instead of predicting and comparing known 1D properties, we follow an agnostic approach. We learn in an unsupervised fashion a set of maximally conserved patterns represented by 13-residue sequence profiles, without the need to know the cause of the conservation of these patterns. We use a maximum likelihood approach to train a set of 32 such profiles that can best represent patterns conserved within pairs of remotely homologous, structurally aligned training profiles. We include the new context score in our HMM-HMM alignment tool HHsearch and significantly improve the quality of alignments, especially difficult ones.
CONCLUSION: The context similarity score improves the quality of homology models and other methods that depend on accurate pairwise alignments.
Affiliation(s)
- Armin Meier
  - Gene Center, LMU Munich, 81377 Munich and Max Planck Institute for Biophysical Chemistry, 37077 Göttingen, Germany
- Johannes Söding
  - Gene Center, LMU Munich, 81377 Munich and Max Planck Institute for Biophysical Chemistry, 37077 Göttingen, Germany
8
Hussain RH, Zawawi M, Bayfield MA. Conservation of RNA chaperone activity of the human La-related proteins 4, 6 and 7. Nucleic Acids Res 2013; 41:8715-25. [PMID: 23887937 PMCID: PMC3794603 DOI: 10.1093/nar/gkt649] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Received: 04/03/2013] [Revised: 06/28/2013] [Accepted: 07/03/2013] [Indexed: 12/22/2022]
Abstract
The La module is a conserved tandem arrangement of a La motif and RNA recognition motif whose function has been best characterized in genuine La proteins. The best-characterized substrates of La proteins are pre-tRNAs, and previous work using tRNA mediated suppression in Schizosaccharomyces pombe has demonstrated that yeast and human La enhance the maturation of these using two distinguishable activities: UUU-3'OH-dependent trailer binding/protection and a UUU-3'OH independent activity related to RNA chaperone function. The La module has also been identified in several conserved families of La-related proteins (LARPs) that engage other RNAs, but their mode of RNA binding and function(s) are not well understood. We demonstrate that the La modules of the human LARPs 4, 6 and 7 are also active in tRNA-mediated suppression, even in the absence of stable UUU-3'OH trailer protection. Rather, the capacity of these to enhance pre-tRNA maturation is associated with RNA chaperone function, which we demonstrate to be a conserved activity for each hLARP in vitro. Our work reveals insight into the mechanisms by which La module containing proteins discriminate RNA targets and demonstrates that RNA chaperone activity is a conserved function across representative members of the La motif-containing superfamily.
Affiliation(s)
- Mark A. Bayfield
  - Department of Biology, York University, Toronto, Ontario M3J 1P3, Canada
9
Blocquel D, Beltrandi M, Erales J, Barbier P, Longhi S. Biochemical and structural studies of the oligomerization domain of the Nipah virus phosphoprotein: evidence for an elongated coiled-coil homotrimer. Virology 2013; 446:162-72. [PMID: 24074578 DOI: 10.1016/j.virol.2013.07.031] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Received: 06/15/2013] [Revised: 07/08/2013] [Accepted: 07/24/2013] [Indexed: 12/19/2022]
Abstract
Nipah virus (NiV) is a recently emerged severe human pathogen that belongs to the Henipavirus genus within the Paramyxoviridae family. The NiV genome is encapsidated by the nucleoprotein (N) within a helical nucleocapsid that is the substrate used by the polymerase for transcription and replication. The polymerase is recruited onto the nucleocapsid via its cofactor, the phosphoprotein (P). The NiV P protein has a modular organization, with alternating disordered and ordered domains. Among the latter is the P multimerization domain (PMD), which was predicted to adopt a coiled-coil conformation. Using both biochemical and biophysical approaches, we show that NiV PMD forms a highly stable and elongated coiled-coil trimer, a finding in striking contrast with the PMDs of Paramyxoviridae members investigated so far, all of which were found to tetramerize. The present results therefore represent the first report of a paramyxoviral P protein forming trimers.
Affiliation(s)
- David Blocquel
  - CNRS and Aix-Marseille Université, Architecture et Fonction des Macromolécules Biologiques (AFMB), UMR 7257, 13288 Marseille, France
10
Trotta AP, Need EF, Butler LM, Selth LA, O'Loughlin MA, Coetzee GA, Tilley WD, Buchanan G. Subdomain structure of the co-chaperone SGTA and activity of its androgen receptor client. J Mol Endocrinol 2012; 49:57-68. [PMID: 22693264 DOI: 10.1530/jme-11-0152] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Indexed: 01/31/2023]
Abstract
Ligand-dependent activity of steroid receptors is affected by tetratricopeptide repeat (TPR)-containing co-chaperones, such as small glutamine-rich tetratricopeptide repeat-containing alpha (SGTA). However, the precise mechanisms by which the predominantly cytoplasmic TPR proteins affect downstream transcriptional outcomes of steroid signaling remain unclear. In this study, we assessed how SGTA affects ligand sensitivity and action of the androgen receptor (AR) using a transactivation profiling approach. Deletion mapping coupled with structural prediction, transcriptional assays, and in vivo regulation of AR-responsive promoters were used to assess the role of SGTA domains in AR responses. At subsaturating ligand concentrations of ≤ 0.1 nM 5α-dihydrotestosterone, SGTA overexpression constricted AR activity by an average of 32% (P<0.002) across the majority of androgen-responsive loci tested, as well as on endogenous promoters in vivo. The strength of the SGTA effect was associated with the presence or absence of bioinformatically predicted transcription factor motifs at each site. Homodimerization of SGTA, which is thought to be necessary for chaperone complex formation, was found to be dependent on the structural integrity of amino acids 1-80, and a core evolutionarily conserved peptide within this region (amino acids 21-40) was necessary for an effect of SGTA on the activity of both exogenous and endogenous AR. This study provides new insights into the subdomain structure of SGTA and how SGTA acts as a regulator of AR ligand sensitivity. A change in AR:SGTA ratio will impact the cellular and molecular response of prostate cancer cells to maintain androgenic signals, which may influence tumor progression.
Affiliation(s)
- Andrew P Trotta
  - Cancer Biology Group, Level 1 Basil Hetzel Institute for Translational Health Research, Freemasons Foundation Centre for Men's Health, Queen Elizabeth Hospital, University of Adelaide, 28 Woodville Road, Woodville South, Adelaide, South Australia 5011, Australia
11
Al Rayyan N, Wankhade UD, Bush K, Good DJ. Two single nucleotide polymorphisms in the human nescient helix-loop-helix 2 (NHLH2) gene reduce mRNA stability and DNA binding. Gene 2012; 512:134-42. [PMID: 23026212 DOI: 10.1016/j.gene.2012.09.068] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 06/20/2012] [Revised: 08/07/2012] [Accepted: 09/12/2012] [Indexed: 01/17/2023]
Abstract
Nescient helix-loop-helix-2 (NHLH2) is a basic helix-loop-helix transcription factor, which has been implicated, using mouse knockouts, in adult body weight regulation and fertility. A scan of the known single nucleotide polymorphisms (SNPs) in the NHLH2 gene revealed one in the 3' untranslated region (3'UTR), which lies within an AUUUA RNA stability motif. A second SNP is nonsynonymous within the coding region of NHLH2, and was found in a genome-wide association study for obesity. Both of these SNPs were examined for their effect on NHLH2 by creating mouse mimics and examining mRNA stability and protein function in mouse hypothalamic cell lines. The 3'UTR SNP causes increased instability and, when the SNP-containing Nhlh2 3'UTR is attached to luciferase mRNA, reduced protein levels in cells. The nonsynonymous SNP at position 83 in the protein changes an alanine residue, conserved in NHLH2 orthologs through Drosophila sp., to a proline residue. This change affects migration of the protein on an SDS-PAGE gel, and appears to alter secondary structure of the protein, as predicted using in silico methods. These results provide functional information on two rare human SNPs in the NHLH2 gene. One of these has been linked to human obese phenotypes, while the other is present in a relatively high proportion of individuals. Given their effects on NHLH2 protein levels, both SNPs deserve further analysis of whether they are causative and/or additive for human body weight and fertility phenotypes.
Affiliation(s)
- Numan Al Rayyan
  - Department of Human Nutrition, Foods and Exercise, Virginia Tech University, Blacksburg, VA 24061, USA
12
Gront D, Blaszczyk M, Wojciechowski P, Kolinski A. BioShell Threader: protein homology detection based on sequence profiles and secondary structure profiles. Nucleic Acids Res 2012; 40:W257-62. [PMID: 22693216 PMCID: PMC3394251 DOI: 10.1093/nar/gks555] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 11/21/2022]
Abstract
The BioShell package has recently been extended with a web server for protein homology detection based on profile-to-profile alignment (known as 1D threading). Its aim is to assign structural templates to each domain of the query. The server uses sequence profiles that describe observed sequence variability and secondary structure profiles that provide the expected probability of a given secondary structure type at each position in a protein. Three independent predictors are used to increase the rate of successful predictions. Careful evaluation shows that there is a nearly 80% chance that the query sequence belongs to the same SCOP family as the top-scoring template. The BioShell Threader server is freely available at: http://www.bioshell.pl/threader/.
Affiliation(s)
- Dominik Gront
  - University of Warsaw, Faculty of Chemistry, Pasteura 1, 02-093 Warsaw, Poland
13
Qi Y, Oja M, Weston J, Noble WS. A unified multitask architecture for predicting local protein properties. PLoS One 2012; 7:e32235. [PMID: 22461885 PMCID: PMC3312883 DOI: 10.1371/journal.pone.0032235] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.2] [Received: 04/27/2011] [Accepted: 01/25/2012] [Indexed: 01/27/2023]
Abstract
A variety of functionally important protein properties, such as secondary structure, transmembrane topology and solvent accessibility, can be encoded as a labeling of amino acids. Indeed, the prediction of such properties from the primary amino acid sequence is one of the core projects of computational biology. Accordingly, a panoply of approaches have been developed for predicting such properties; however, most such approaches focus on solving a single task at a time. Motivated by recent, successful work in natural language processing, we propose to use multitask learning to train a single, joint model that exploits the dependencies among these various labeling tasks. We describe a deep neural network architecture that, given a protein sequence, outputs a host of predicted local properties, including secondary structure, solvent accessibility, transmembrane topology, signal peptides and DNA-binding residues. The network is trained jointly on all these tasks in a supervised fashion, augmented with a novel form of semi-supervised learning in which the model is trained to distinguish between local patterns from natural and synthetic protein sequences. The task-independent architecture of the network obviates the need for task-specific feature engineering. We demonstrate that, for all of the tasks that we considered, our approach leads to statistically significant improvements in performance, relative to a single task neural network approach, and that the resulting model achieves state-of-the-art performance.
Affiliation(s)
- Yanjun Qi
- Machine Learning Department, NEC Labs America, Princeton, New Jersey, United States of America
- Merja Oja
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
- Jason Weston
- Google, New York, New York, United States of America
- William Stafford Noble
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
|
14
|
Naeeni AR, Conte MR, Bayfield MA. RNA chaperone activity of human La protein is mediated by variant RNA recognition motif. J Biol Chem 2012; 287:5472-82. [PMID: 22203678 PMCID: PMC3285324 DOI: 10.1074/jbc.m111.276071] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.8] [Received: 06/24/2011] [Revised: 12/23/2011] [Indexed: 02/05/2023] Open
Abstract
La proteins are conserved factors in eukaryotes that bind and protect the 3' trailers of pre-tRNAs from exonuclease digestion via sequence-specific recognition of UUU-3'OH. La has also been hypothesized to assist pre-tRNAs in attaining their native fold through RNA chaperone activity. In addition to binding polymerase III transcripts, human La has also been shown to enhance the translation of several internal ribosome entry sites and upstream ORF-containing mRNA targets, also potentially through RNA chaperone activity. Using in vitro FRET-based assays, we show that human and Schizosaccharomyces pombe La proteins harbor RNA chaperone activity by enhancing RNA strand annealing and strand dissociation. We use various RNA substrates and La mutants to show that UUU-3'OH-dependent La-RNA binding is not required for this function, and we map RNA chaperone activity to its RRM1 motif including a noncanonical α3-helix. We validate the importance of this α3-helix by appending it to the RRM of the unrelated U1A protein and show that this fusion protein acquires significant strand annealing activity. Finally, we show that residues required for La-mediated RNA chaperone activity in vitro are required for La-dependent rescue of tRNA-mediated suppression via a mutated suppressor tRNA in vivo. This work delineates the structural elements required for La-mediated RNA chaperone activity and provides a basis for understanding how La can enhance the folding of its various RNA targets.
Affiliation(s)
- Amir R. Naeeni
- Department of Biology, York University, Toronto, Ontario M3J 1P3, Canada
- Maria R. Conte
- Randall Division of Cell and Molecular Biophysics, King's College London, London SE1 1UL, United Kingdom
- Mark A. Bayfield
- Department of Biology, York University, Toronto, Ontario M3J 1P3, Canada
|
15
|
Vornam B, Gailing O, Derory J, Plomion C, Kremer A, Finkeldey R. Characterisation and natural variation of a dehydrin gene in Quercus petraea (Matt.) Liebl. Plant Biology (Stuttgart, Germany) 2011; 13:881-887. [PMID: 21973280 DOI: 10.1111/j.1438-8677.2011.00446.x] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 05/26/2023]
Abstract
For the first time in sessile oak [Quercus petraea (Matt.) Liebl.], the isolation and characterisation of a full-length dehydrin gene and its promoter region, as well as its allelic variation in natural populations, are reported. Dehydrins (Dhn) are stress-related genes important for the survival of perennial plants in a seasonal climate. A full-length dehydrin gene (Dhn3) was characterised at the nucleotide level and the protein structure was modelled. Additionally, the allelic variation was analysed in five natural populations of Quercus petraea (Matt.) Liebl. sampled along an altitudinal gradient in the French Pyrenees. The analysed sequences contain typical domains of the K(n) class of dehydrins in the coding region. Also, the 5' untranslated region (promoter) of the gene was amplified, which shows typical motifs essential for drought- and cold-responsive gene expression. Single nucleotide substitutions and indels (insertions/deletions) within the coding region determine large biochemical differences at the protein level. However, only low levels of genetic differentiation between populations from different altitudes were detectable.
Affiliation(s)
- B Vornam
- Buesgen-Institute, Department of Forest Genetics and Forest Tree Breeding, University of Göttingen, Germany.
|
16
|
Identification of a chemoreceptor zinc-binding domain common to cytoplasmic bacterial chemoreceptors. J Bacteriol 2011; 193:4338-45. [PMID: 21725005 DOI: 10.1128/jb.05140-11] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.5] [Indexed: 01/08/2023] Open
Abstract
We report the identification and characterization of a previously unidentified protein domain found in bacterial chemoreceptors and other bacterial signal transduction proteins. This domain contains a motif of three noncontiguous histidines and one cysteine, arranged as Hxx[WFYL]x(21-28)Cx[LFMVI]Gx[WFLVI]x(18-27)HxxxH (boldface type indicates residues that are nearly 100% conserved). This domain was first identified in the soluble Helicobacter pylori chemoreceptor TlpD. Using inductively coupled plasma mass spectrometry on heterologously and natively expressed TlpD, we determined that this domain binds zinc with a subfemtomolar dissociation constant. We thus named the domain CZB, for chemoreceptor zinc binding. Further analysis showed that many bacterial signaling proteins contain the CZB domain, most commonly proteins that participate in chemotaxis but also those that participate in c-di-GMP signaling and nitrate/nitrite sensing, among others. Proteins bearing the CZB domain are found in several bacterial phyla. The variety of signaling proteins using the CZB domain suggests that it plays a critical role in several signal transduction pathways.
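As a minimal sketch, the motif quoted in this abstract can be turned into a regular expression by reading `x` as any residue and `x(m-n)` as a run of m to n arbitrary residues. That PROSITE-style reading is an assumption here, not something the abstract states. The names `CZB_PATTERN`, `find_czb_motifs` and `make_synthetic_czb` are hypothetical, and the test sequence is synthetic filler, not a real TlpD sequence.

```python
import re

# Hxx[WFYL]x(21-28)Cx[LFMVI]Gx[WFLVI]x(18-27)HxxxH, assuming x = any residue.
CZB_PATTERN = re.compile(
    r"H.{2}[WFYL].{21,28}C.[LFMVI]G.[WFLVI].{18,27}H.{3}H"
)


def find_czb_motifs(sequence):
    """Return (start, end) spans of non-overlapping motif matches (0-based, end exclusive)."""
    return [match.span() for match in CZB_PATTERN.finditer(sequence.upper())]


def make_synthetic_czb(core="C"):
    """Build an artificial 54-residue string that satisfies the motif when core == 'C'."""
    return "HAAW" + "A" * 21 + core + "ALGAW" + "A" * 18 + "HAAAH"


if __name__ == "__main__":
    with_motif = "MK" + make_synthetic_czb() + "GS"
    without_cys = "MK" + make_synthetic_czb(core="A") + "GS"
    print(find_czb_motifs(with_motif))   # one span covering the synthetic motif
    print(find_czb_motifs(without_cys))  # empty list: the conserved cysteine is missing
```

A pattern scan of this kind only flags candidate sequences; the abstract's zinc-binding claim rests on the mass spectrometry measurements, not on the pattern match.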
|
17
|
Strunk T, Hamacher K, Hoffgaard F, Engelhardt H, Zillig MD, Faist K, Wenzel W, Pfeifer F. Structural model of the gas vesicle protein GvpA and analysis of GvpA mutants in vivo. Mol Microbiol 2011; 81:56-68. [PMID: 21542854 DOI: 10.1111/j.1365-2958.2011.07669.x] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.5] [Indexed: 10/18/2022]
Abstract
Gas vesicles are gas-filled protein structures that increase the buoyancy of cells. The gas vesicle envelope is mainly constituted by the 8 kDa protein GvpA, which forms a wall with a water-excluding inner surface. A structure of GvpA is not available; recent solid-state NMR results suggest a coil-α-β-β-α-coil fold. We obtained a first structural model of GvpA by high-performance de novo modelling. Attenuated total reflection (ATR)-Fourier transform infrared spectroscopy (FTIR) supported this structure. A dimer of GvpA was derived that could explain the formation of the protein monolayer in the gas vesicle wall. The hydrophobic inner surface is mainly constituted by anti-parallel β-strands. The proposed structure allows the pinpointing of contact sites that were mutated and tested for the ability to form gas vesicles in haloarchaea. Mutations in α-helix I and α-helix II, but also in the β-turn, affected gas vesicle formation, whereas other alterations had no effect. All mutants supported the structural features deduced from the model. The proposed GvpA dimers allow the formation of a monolayer protein wall, also consistent with protease treatments of isolated gas vesicles.
Affiliation(s)
- Timo Strunk
- Institute for Nanotechnology, Karlsruhe Institute of Technology, PO Box 3640, D-76021 Karlsruhe, Germany
|
18
|
Samayoa J, Yildiz FH, Karplus K. Identification of prokaryotic small proteins using a comparative genomic approach. Bioinformatics 2011; 27:1765-71. [PMID: 21551138 DOI: 10.1093/bioinformatics/btr275] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.5] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Accurate prediction of genes encoding small proteins (on the order of 50 amino acids or fewer) remains an elusive open problem in bioinformatics. Some of the best methods for gene prediction use either sequence composition analysis or sequence similarity to a known protein coding sequence. These methods often fail for small proteins, however, either due to a lack of experimentally verified small protein coding genes or due to the limited statistical significance of statistics on small sequences. Our approach is based upon the hypothesis that true small proteins will be under selective pressure for encoding the particular amino acid sequence, for ease of translation by the ribosome and for structural stability. This stability can be achieved either independently or as part of a larger protein complex. Given this assumption, it follows that small proteins should display conserved local protein structure properties much like larger proteins. Our method incorporates neural-net predictions for three local structure alphabets within a comparative genomic approach using a genomic alignment of 22 closely related bacterial genomes to generate predictions for whether or not a given open reading frame (ORF) encodes a small protein. RESULTS: We have applied this method to the complete genome for Escherichia coli strain K12 and looked at how well our method performed on a set of 60 experimentally verified small proteins from this organism. Out of a total of 11 407 possible ORFs, we found that 6 of the top 10 and 27 of the top 100 predictions belonged to the set of 60 experimentally verified small proteins. We found 35 of all the true small proteins within the top 200 predictions. We compared our method to Glimmer, using a default Glimmer protocol and a modified small ORF Glimmer protocol with a lower minimum size cutoff. The default Glimmer protocol identified 16 of the true small proteins (all in the top 200 predictions), but failed to predict on 34 due to size cutoffs. The small ORF Glimmer protocol made predictions for all the experimentally verified small proteins but only contained 9 of the 60 true small proteins within the top 200 predictions. CONTACT: jsamayoa@jhu.edu
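The evaluation reported in this abstract ("6 of the top 10", "27 of the top 100", "35 within the top 200") counts how many verified small proteins appear among the k highest-ranked predictions. The following is a minimal sketch of that counting under stated assumptions: the function name `hits_in_top_k` and the ORF identifiers are hypothetical toy values, not data from the paper.

```python
def hits_in_top_k(ranked_ids, verified_ids, k):
    """Count how many of the k best-ranked predictions belong to the verified set.

    ranked_ids must already be sorted from most to least confident prediction.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    verified = set(verified_ids)
    return sum(1 for orf_id in ranked_ids[:k] if orf_id in verified)


if __name__ == "__main__":
    # Toy ranking of six hypothetical ORFs, best prediction first.
    ranking = ["orf3", "orf9", "orf1", "orf7", "orf4", "orf2"]
    verified = {"orf3", "orf1", "orf2"}
    for k in (2, 4, 6):
        print(k, hits_in_top_k(ranking, verified, k))
```

Reporting the count at several cutoffs, as the abstract does, shows how quickly true positives accumulate as the ranked list is read from the top.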
Affiliation(s)
- Josue Samayoa
- Department of Biomedical Engineering, Institute for Computational Medicine, Johns Hopkins University, Baltimore, MD 21218, USA.
|
19
|
Identification of missense mutation (I12T) in the BSND gene and bioinformatics analysis. J Biomed Biotechnol 2011; 2011:304612. [PMID: 21541222 PMCID: PMC3085335 DOI: 10.1155/2011/304612] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 10/20/2010] [Revised: 12/21/2010] [Accepted: 02/04/2011] [Indexed: 11/18/2022] Open
Abstract
Nonsyndromic hearing loss is a paradigm of genetic heterogeneity with 85 loci and 39 nuclear disease genes reported so far. Mutations of BSND have been shown to cause Bartter syndrome type IV, characterized by significant renal abnormalities and deafness, and nonsyndromic hearing loss. We studied a Pakistani consanguineous family. Clinical examinations of affected individuals did not reveal the presence of any associated signs that are hallmarks of Bartter syndrome type IV. Linkage analysis identified an area of 18.36 Mb shared by all affected individuals between markers D1S2706 and D1S1596. A maximum two-point LOD score of 2.55 with marker D1S2700 and a multipoint LOD score of 3.42 with marker D1S1661 were obtained. The BSND mutation p.I12T cosegregated in all extant members of our pedigree. BSND mutations can cause nonsyndromic hearing loss, and this is the second report of this mutation. The respective protein, BSND, was first modeled, and the identified mutation was then analyzed using different bioinformatics tools; finally, the protein and its mutant were docked with CLCNKB and REN, interactors of BSND, respectively.
|
20
|
Sagemark J, Kraulis P, Weigelt J. A software tool to accelerate design of protein constructs for recombinant expression. Protein Expr Purif 2010; 72:175-8. [PMID: 20359538 DOI: 10.1016/j.pep.2010.03.020] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Received: 12/03/2009] [Revised: 03/19/2010] [Accepted: 03/25/2010] [Indexed: 11/26/2022]
Abstract
Structural and biochemical analysis of proteins requires access to purified protein material. Modern molecular biology technologies facilitate straightforward molecular cloning and expression analysis of multiple protein constructs in parallel, and such approaches have proven very efficient for identifying samples suitable for further analysis. A variety of information can be used to support rational design of protein constructs. This includes, e.g., prediction of secondary structure elements, protein domain predictions, and structure prediction methods such as threading. To fully access the available information, collation of data extracted from several different sources is required. This can be cumbersome and sometimes also confusing due, for example, to different implementations of amino acid residue numbering schemes. The SGC Domain Boundary Analyser tool provides a graphical interface that simplifies and accelerates rational design of protein expression constructs.
Affiliation(s)
- Johanna Sagemark
- Structural Genomics Consortium, Karolinska Institutet, Department of Medical Biochemistry and Biophysics, 171 77 Stockholm, Sweden
|
21
|
Madera M, Calmus R, Thiltgen G, Karplus K, Gough J. Improving protein secondary structure prediction using a simple k-mer model. Bioinformatics 2010; 26:596-602. [PMID: 20130034 PMCID: PMC2828123 DOI: 10.1093/bioinformatics/btq020] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.9] [Indexed: 12/04/2022] Open
Abstract
Motivation: Some first-order methods for protein sequence analysis inherently treat each position as independent. We develop a general framework for introducing longer range interactions. We then demonstrate the power of our approach by applying it to secondary structure prediction; under the independence assumption, sequences produced by existing methods can contain features that are not protein-like, an extreme example being a helix of length 1. Our goal was to make the predictions from state-of-the-art methods more realistic, without loss of performance by other measures. Results: Our framework for longer range interactions is described as a k-mer order model. We succeeded in applying our model to the specific problem of secondary structure prediction, to be used as an additional layer on top of existing methods. We achieved our goal of making the predictions more realistic and protein-like, and remarkably this also improved the overall performance. We improve the Segment OVerlap (SOV) score by 1.8%, but more importantly we radically improve the probability of the real sequence given a prediction from an average of 0.271 per residue to 0.385. Crucially, this improvement is obtained using no additional information. Availability: http://supfam.cs.bris.ac.uk/kmer Contact: gough@cs.bris.ac.uk
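To make the "helix of length 1" example concrete, the sketch below scans a predicted secondary structure string for runs of a given state that are shorter than a chosen minimum. It only detects such implausible segments; it is not the k-mer model described in the abstract. The function name `short_segments`, the three-letter H/E/C alphabet and the `min_len` threshold are assumptions made for illustration.

```python
from itertools import groupby


def short_segments(ss_string, state="H", min_len=2):
    """Return (start, length) for every run of `state` shorter than `min_len`.

    ss_string is a per-residue label string such as "CCHHHHCC"
    (H = helix, E = strand, C = coil in this illustrative alphabet).
    """
    found = []
    position = 0
    for label, run in groupby(ss_string):
        length = len(list(run))
        if label == state and length < min_len:
            found.append((position, length))
        position += length
    return found


if __name__ == "__main__":
    prediction = "CCHCCEEEECCHHHHC"
    # The isolated H at index 2 is a helix of length 1; the HHHH run is not flagged.
    print(short_segments(prediction))
```

A position-independent predictor can emit such isolated labels because nothing ties the label at one residue to its neighbours, which is the limitation the abstract's longer-range framework targets.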
Affiliation(s)
- Martin Madera
- Department of Computer Science, University of Bristol, Woodland Road, Bristol BS8 1UB, UK
|
22
|
Abstract
Undertaker is a program designed to help predict protein structure using alignments to proteins of known structure and fragment assembly. The program generates conformations and uses cost functions to select the best structures from among the generated conformations. This paper describes the use of Undertaker's cost functions for model quality assessment. We achieve an accuracy that is similar to other methods, without using consensus-based techniques. Adding consensus-based features further improves our approach substantially. We report several correlation measures, including a new weighted version of Kendall's tau (tau(3)) and show model quality assessment results superior to previously published results on all correlation measures when using only models with no missing atoms.
Affiliation(s)
- John Archie
- University of California at Santa Cruz, Biomolecular Engineering, Santa Cruz, CA, USA
|
23
|
Paluszewski M, Karplus K. Model quality assessment using distance constraints from alignments. Proteins 2009; 75:540-9. [PMID: 19003987 DOI: 10.1002/prot.22262] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 11/09/2022]
Abstract
Given a set of alternative models for a specific protein sequence, the model quality assessment (MQA) problem asks for an assignment of scores to each model in the set. A good MQA program assigns these scores such that they correlate well with the real quality of the models, ideally giving the best score to the model that is closest to the true structure. In this article, we present a new approach for addressing the MQA problem. It is based on distance constraints extracted from alignments to templates of known structure, and is implemented in the Undertaker program for protein structure prediction. One novel feature is that we extract noncontact constraints as well as contact constraints. We describe how the distance constraint extraction is done and we show how the constraints can be used to address the MQA problem. We have compared our method on CASP7 targets and the results show that our method is at least comparable with the best MQA methods that were assessed at CASP7. We also propose a new evaluation measure, Kendall's tau, that is more interpretable than conventional measures used for evaluating MQA methods (Pearson's r and Spearman's rho). We show clear examples where Kendall's tau agrees much more with our intuition of a correct MQA, and we therefore propose that Kendall's tau be used for future CASP MQA assessments.
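Kendall's tau, proposed in this abstract as an MQA evaluation measure, compares every pair of models and asks whether the predicted scores order the pair the same way as the true quality does. The sketch below implements the simple tau-a variant, in which tied pairs count as neither concordant nor discordant; the abstract does not say which variant the authors use, so that choice and the name `kendall_tau_a` are assumptions.

```python
from itertools import combinations


def kendall_tau_a(predicted, actual):
    """Kendall's tau-a: (concordant - discordant) / (number of pairs)."""
    if len(predicted) != len(actual):
        raise ValueError("inputs must have the same length")
    n = len(predicted)
    if n < 2:
        raise ValueError("need at least two items to form a pair")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # Positive product: both lists order the pair the same way.
        sign = (predicted[i] - predicted[j]) * (actual[i] - actual[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


if __name__ == "__main__":
    mqa_scores = [1, 2, 3, 4]    # hypothetical predicted quality
    true_quality = [1, 3, 2, 4]  # hypothetical real quality
    print(kendall_tau_a(mqa_scores, true_quality))
```

The value has a direct reading: with 6 pairs, 5 ordered correctly and 1 incorrectly, tau is (5 - 1) / 6, which is the kind of interpretability the abstract argues for.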
Affiliation(s)
- Martin Paluszewski
- Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
|
24
|
Helles G, Fonseca R. Predicting dihedral angle probability distributions for protein coil residues from primary sequence using neural networks. BMC Bioinformatics 2009; 10:338. [PMID: 19835576 PMCID: PMC2771020 DOI: 10.1186/1471-2105-10-338] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Received: 04/06/2009] [Accepted: 10/16/2009] [Indexed: 11/10/2022] Open
Abstract
Background: Predicting the three-dimensional structure of a protein from its amino acid sequence is currently one of the most challenging problems in bioinformatics. The internal structure of helices and sheets is highly recurrent and helps reduce the search space significantly. However, random coil segments make up nearly 40% of proteins and they do not have any apparent recurrent patterns, which reduces the overall accuracy of protein structure prediction methods. Luckily, previous work has indicated that coil segments are in fact not completely random in structure and flanking residues do seem to have a significant influence on the dihedral angles adopted by the individual amino acids in coil segments. In this work we attempt to predict a probability distribution of these dihedral angles based on the flanking residues. While attempts to predict dihedral angles of coil segments have been made previously, none have, to our knowledge, presented comparable results for the probability distribution of dihedral angles. Results: In this paper we develop an artificial neural network that uses an input-window of amino acids to predict a dihedral angle probability distribution for the middle residue in the input-window. The trained neural network shows a significant improvement (4-68%) in predicting the most probable bin (covering a 30° × 30° area of the dihedral angle space) for all amino acids in the data set compared to baseline statistics. An accuracy comparable to that of secondary structure prediction (≈ 80%) is achieved by observing the 20 bins with highest output values. Conclusion: Many different protein structure prediction methods exist and each uses different tools and auxiliary predictions to help determine the native structure. In this work the sequence is used to predict local context-dependent dihedral angle propensities in coil regions. This predicted distribution can potentially improve tertiary structure prediction methods that are based on sampling the backbone dihedral angles of individual amino acids. The predicted distribution may also help predict local structure fragments used in fragment assembly methods.
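The 30° × 30° bins mentioned in this abstract partition the (phi, psi) dihedral angle space into a 12 × 12 grid, that is 144 bins, since each angle spans 360°. The sketch below shows one way to map an angle pair to a bin; the indexing convention (bin 0 starts at -180°, and +180° wraps to -180°) and the name `dihedral_bin` are assumptions, as the abstract does not specify how bins are numbered.

```python
BIN_WIDTH = 30                      # degrees, as stated in the abstract
BINS_PER_AXIS = 360 // BIN_WIDTH    # 12 bins along phi and along psi
TOTAL_BINS = BINS_PER_AXIS ** 2     # 144 bins in the phi/psi plane


def dihedral_bin(phi, psi, width=BIN_WIDTH):
    """Map (phi, psi) in degrees to a (phi_index, psi_index) grid cell.

    Angles are shifted from [-180, 180) to [0, 360) and wrapped, so that
    +180 lands in the same bin as -180.
    """
    phi_index = int(((phi + 180) % 360) // width)
    psi_index = int(((psi + 180) % 360) // width)
    return phi_index, psi_index


if __name__ == "__main__":
    print(TOTAL_BINS)
    print(dihedral_bin(-60.0, -45.0))   # a helix-like angle pair
    print(dihedral_bin(180.0, -180.0))  # both angles wrap to the first bin
```

With 144 bins in total, the 20 highest-scoring bins that the abstract uses to reach about 80% accuracy cover roughly 14% of the angle space.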
Affiliation(s)
- Glennie Helles
- University of Copenhagen, Department of Computer Science, Universitetsparken 1, 2100 Copenhagen, Denmark.
|
25
|
Lippi M, Frasconi P. Prediction of protein beta-residue contacts by Markov logic networks with grounding-specific weights. Bioinformatics 2009; 25:2326-33. [PMID: 19592394 DOI: 10.1093/bioinformatics/btp421] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.5] [Indexed: 11/13/2022]
Abstract
MOTIVATION: Accurate prediction of contacts between beta-strand residues can significantly contribute towards ab initio prediction of the 3D structure of many proteins. Contacts in the same protein are highly interdependent. Therefore, significant improvements can be expected by applying statistical relational learners that overcome the usual machine learning assumption that examples are independent and identically distributed. Furthermore, the dependencies among beta-residue contacts are subject to strong regularities, many of which are known a priori. In this article, we take advantage of Markov logic, a statistical relational learning framework that is able to capture dependencies between contacts, and constrain the solution according to domain knowledge expressed by means of weighted rules in a logical language. RESULTS: We introduce a novel hybrid architecture based on neural and Markov logic networks with grounding-specific weights. On a non-redundant dataset, our method achieves 44.9% F1 measure, with 47.3% precision and 42.7% recall, which is significantly better (P < 0.01) than previously reported performance obtained by 2D recursive neural networks. Our approach also significantly improves the number of chains for which beta-strands are nearly perfectly paired (36% of the chains are predicted with F1 ≥ 70% on coarse map). It also outperforms more general contact predictors on recent CASP 2008 targets.
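The F1 measure quoted in this abstract is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick check, the short sketch below recomputes it from the reported precision and recall; the function name `f1_score` is a hypothetical helper, not part of the authors' software.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision 47.3% and recall 42.7%, as reported in the abstract.
    print(round(f1_score(0.473, 0.427), 3))
```

The result rounds to 0.449, which agrees with the 44.9% F1 stated in the abstract.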
Affiliation(s)
- Marco Lippi
- Dipartimento di Sistemi e Informatica, Università degli Studi di Firenze, Firenze, Italy.
|