1
Wenzel M, Grüner E, Strodthoff N. Insights into the inner workings of transformer models for protein function prediction. Bioinformatics 2024; 40:btae031. [PMID: 38244570 PMCID: PMC10950482 DOI: 10.1093/bioinformatics/btae031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2023] [Revised: 12/14/2023] [Accepted: 01/16/2024] [Indexed: 01/22/2024] Open
Abstract
MOTIVATION: We explored how explainable artificial intelligence (XAI) can help to shed light on the inner workings of neural networks for protein function prediction, by extending the widely used XAI method of integrated gradients such that latent representations inside transformer models, which were finetuned for Gene Ontology term and Enzyme Commission number prediction, can be inspected too.
RESULTS: The approach enabled us to identify amino acids in the sequences that the transformers pay particular attention to, and to show that these relevant sequence parts reflect expectations from biology and chemistry, both in the embedding layer and inside the model, where we identified transformer heads with a statistically significant correspondence of attribution maps with ground truth sequence annotations (e.g. transmembrane regions, active sites) across many proteins.
AVAILABILITY AND IMPLEMENTATION: Source code can be accessed at https://github.com/markuswenzel/xai-proteins.
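For orientation, standard integrated gradients (the baseline method, not the authors' extension to latent representations) attribute a prediction to input features by averaging gradients along a straight path from a baseline to the input. The sketch below is a minimal NumPy illustration on a toy quadratic function. The function `f`, its weights `w`, the zero baseline, and the helper name `integrated_gradients` are assumptions made for illustration and are not taken from the paper or its repository.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint Riemann approximation of integrated gradients."""
    # Interpolation coefficients at the midpoints of `steps` equal intervals.
    alphas = (np.arange(steps) + 0.5) / steps
    # Gradient of the model at each point on the straight path.
    grads = np.stack([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    # Scale the path-averaged gradient by the input difference.
    return (x - baseline) * grads.mean(axis=0)

# Hypothetical toy model: f(x) = sum(w * x**2), with an analytic gradient.
w = np.array([1.0, -2.0, 0.5])

def f(x):
    return float(np.sum(w * x ** 2))

def grad_f(x):
    return 2.0 * w * x

x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros_like(x)
attributions = integrated_gradients(grad_f, x, baseline)
print(attributions, attributions.sum(), f(x) - f(baseline))
```

A useful sanity check is the completeness property: the attributions should sum (up to discretization error) to the difference between the model output at the input and at the baseline.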
Affiliation(s)
- Markus Wenzel
  - Department of Artificial Intelligence, Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut, HHI, Einsteinufer 37, 10587 Berlin, Germany
- Erik Grüner
  - Department of Artificial Intelligence, Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut, HHI, Einsteinufer 37, 10587 Berlin, Germany
- Nils Strodthoff
  - School VI - Medicine and Health Services, Carl von Ossietzky University of Oldenburg, Ammerländer Heerstr. 114-118, 26129 Oldenburg, Germany
2
Hall MWJ, Shorthouse D, Alcraft R, Jones PH, Hall BA. Mutations observed in somatic evolution reveal underlying gene mechanisms. Commun Biol 2023; 6:753. [PMID: 37468606 PMCID: PMC10356810 DOI: 10.1038/s42003-023-05136-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/17/2023] [Accepted: 07/11/2023] [Indexed: 07/21/2023] Open
Abstract
Highly sensitive DNA sequencing techniques have allowed the discovery of large numbers of somatic mutations in normal tissues. Some mutations confer a competitive advantage over wild-type cells, generating expanding clones that spread through the tissue. Competition between mutant clones leads to selection. This process can be considered a large scale, in vivo screen for mutations increasing cell fitness. It follows that somatic missense mutations may offer new insights into the relationship between protein structure, function and cell fitness. We present a flexible statistical method for exploring the selection of structural features in data sets of somatic mutants. We show how this approach can evidence selection of specific structural features in key drivers in aged tissues. Finally, we show how drivers may be classified as fitness-enhancing and fitness-suppressing through different patterns of mutation enrichment. This method offers a route to understanding the mechanism of protein function through in vivo mutant selection.
Affiliation(s)
- David Shorthouse
  - Department of Medical Physics and Biomedical Engineering, Malet Place Engineering Building, University College London, Gower Street, London, WC1E 6BT, UK
- Rachel Alcraft
  - Advanced Research Computing, University College London, London, UK
- Philip H Jones
  - Wellcome Sanger Institute, Hinxton, CB10 1SA, UK
  - Department of Oncology, University of Cambridge, Cambridge, CB2 0XZ, UK
- Benjamin A Hall
  - Department of Medical Physics and Biomedical Engineering, Malet Place Engineering Building, University College London, Gower Street, London, WC1E 6BT, UK
3
Batool Z, Qureshi U, Mushtaq M, Ahmed S, Nur-E-Alam M, Ul-Haq Z. Structural basis for the mutation-induced dysfunction of the human IL-15/IL-15α receptor complex. Phys Chem Chem Phys 2023; 25:3020-3030. [PMID: 36607223 DOI: 10.1039/d2cp03012h] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/24/2022]
Abstract
In silico strategies offer a reliable, fast, and inexpensive alternative to cumbersome in vitro approaches for understanding the effect of amino acid substitutions on the structure, and consequently the associated function, of proteins. In the present work, we report an atomistic, reliable in silico structural and energetic framework of the interactions between the receptor-binding domain of the interleukin-15 (IL-15) protein and its receptor interleukin-15α (IL-15α), providing qualitative and quantitative details of the key molecular determinants in ligand/receptor recognition. Molecular dynamics simulations were used to investigate the dynamic behavior of the specific binding between IL-15 and IL-15α, followed by estimation of the free energies via molecular mechanics/generalized Born surface area (MM/GBSA). In particular, residues Y26, E46, E53, and E89 of the IL-15 receptor-binding domain are identified as the main hot spots, shaping and governing the stability of the assembly. These results can be used for the development of neutralizing antibodies and the effective structure-based design of protein-protein interaction inhibitors against the so-called orphan disease, vitiligo.
Affiliation(s)
- Zahida Batool
  - H.E.J Research Institute of Chemistry, International Center for Chemical and Biological Sciences, University of Karachi, Karachi-75270, Pakistan
- Urooj Qureshi
  - H.E.J Research Institute of Chemistry, International Center for Chemical and Biological Sciences, University of Karachi, Karachi-75270, Pakistan
- Mamona Mushtaq
  - Dr. Panjwani Center for Molecular Medicine and Drug Research, International Center for Chemical and Biological Sciences, University of Karachi, Karachi-75270, Pakistan
- Sarfaraz Ahmed
  - Department of Pharmacognosy, College of Pharmacy, King Saud University, P.O. Box. 2457, Riyadh 11451, Kingdom of Saudi Arabia
- Mohammad Nur-E-Alam
  - Department of Pharmacognosy, College of Pharmacy, King Saud University, P.O. Box. 2457, Riyadh 11451, Kingdom of Saudi Arabia
- Zaheer Ul-Haq
  - H.E.J Research Institute of Chemistry, International Center for Chemical and Biological Sciences, University of Karachi, Karachi-75270, Pakistan
  - Dr. Panjwani Center for Molecular Medicine and Drug Research, International Center for Chemical and Biological Sciences, University of Karachi, Karachi-75270, Pakistan
4
i4mC-Deep: An Intelligent Predictor of N4-Methylcytosine Sites Using a Deep Learning Approach with Chemical Properties. Genes (Basel) 2021; 12:genes12081117. [PMID: 34440291 PMCID: PMC8393747 DOI: 10.3390/genes12081117] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 06/07/2021] [Revised: 07/15/2021] [Accepted: 07/16/2021] [Indexed: 01/26/2023] Open
Abstract
DNA is subject to epigenetic modification by the molecule N4-methylcytosine (4mC). N4-methylcytosine plays a crucial role in DNA repair and replication, protects host DNA from degradation, and regulates DNA expression. However, though current experimental techniques can identify 4mC sites, such techniques are expensive and laborious. Therefore, computational tools that can predict 4mC sites would be very useful for understanding the biological mechanism of this vital type of DNA modification. Conventional machine-learning-based methods rely on hand-crafted features, but the new method saves time and computational cost by making use of learned features instead. In this study, we propose i4mC-Deep, an intelligent predictor based on a convolutional neural network (CNN) that predicts 4mC modification sites in DNA samples. The CNN is capable of automatically extracting important features from input samples during training. Nucleotide chemical properties and nucleotide density, which together represent a DNA sequence, act as CNN input data. The outcome of the proposed method outperforms several state-of-the-art predictors. When i4mC-Deep was used to analyze G. subterraneus DNA, the accuracy of the results was improved by 3.9% and MCC increased by 10.5% compared to a conventional predictor.
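To make the input representation concrete, the following is a minimal sketch of a nucleotide chemical property plus nucleotide density encoding. The property triplets (ring structure, functional group, hydrogen bonding) follow a convention that is common in the sequence-encoding literature; whether i4mC-Deep uses exactly this mapping is an assumption, and the function name `encode_ncp_nd` is hypothetical.

```python
import numpy as np

# Assumed chemical-property triplets: (ring structure, functional group,
# hydrogen bonding). Commonly used convention, not copied from the paper.
NCP = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "T": (0, 0, 1)}

def encode_ncp_nd(seq):
    """Return an (L, 4) array: three property bits plus a density value."""
    seq = seq.upper()
    counts = {}
    rows = []
    for position, nucleotide in enumerate(seq, start=1):
        # Density: how often this nucleotide occurred up to and including
        # the current position, divided by the position index.
        counts[nucleotide] = counts.get(nucleotide, 0) + 1
        density = counts[nucleotide] / position
        rows.append((*NCP[nucleotide], density))
    return np.array(rows, dtype=float)

features = encode_ncp_nd("ACGA")
print(features)
```

Each row of the resulting matrix describes one sequence position, so a sequence of length L yields an L x 4 input that a 1D CNN can consume.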
5
Echave J. Fast computational mutation-response scanning of proteins. PeerJ 2021; 9:e11330. [PMID: 33976988 PMCID: PMC8067912 DOI: 10.7717/peerj.11330] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/18/2020] [Accepted: 03/31/2021] [Indexed: 12/21/2022] Open
Abstract
Studying the effect of perturbations on protein structure is a basic approach in protein research. Important problems, such as predicting pathological mutations and understanding patterns of structural evolution, have been addressed by computational simulations that model mutations using forces and predict the resulting deformations. In single mutation-response scanning simulations, a sensitivity matrix is obtained by averaging deformations over point mutations. In double mutation-response scanning simulations, a compensation matrix is obtained by minimizing deformations over pairs of mutations. These very useful simulation-based methods may be too slow to deal with large proteins, protein complexes, or large protein databases. To address this issue, I derived analytical closed formulas to calculate the sensitivity and compensation matrices directly, without simulations. Here, I present these derivations and show that the resulting analytical methods are much faster than their simulation counterparts.
Affiliation(s)
- Julian Echave
  - Instituto de Ciencias Físicas, Escuela de Ciencia y Tecnología, Universidad Nacional de San Martín, San Martín, Buenos Aires, Argentina
6
Wahab A, Tayara H, Xuan Z, Chong KT. DNA sequences performs as natural language processing by exploiting deep learning algorithm for the identification of N4-methylcytosine. Sci Rep 2021; 11:212. [PMID: 33420191 PMCID: PMC7794489 DOI: 10.1038/s41598-020-80430-x] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 10/01/2020] [Accepted: 12/14/2020] [Indexed: 12/17/2022] Open
Abstract
N4-methylcytosine (4mC) is a biochemical alteration of DNA that affects genetic operations such as gene expression, genomic imprinting, chromosome stability, and cell development without modifying the DNA nucleotides. In the proposed work, a computational model, 4mCNLP-Deep, uses word embeddings as the vector representation and a deep-learning-based CNN to predict 4mC and non-4mC sites in the C. elegans genome dataset. A range of settings for the k-mer corpus and k-fold cross-validation was explored to find the best-performing configuration. 4mCNLP-Deep outperforms the state-of-the-art predictor on five evaluation metrics: accuracy (ACC) 0.9354, Matthews correlation coefficient (MCC) 0.8608, specificity (Sp) 0.8996, sensitivity (Sn) 0.9563, and area under the curve (AUC) 0.9731, obtained with a 3-mer word2vec corpus and 3-fold cross-validation, an increase of 1.1%, 0.6%, 0.58%, 0.77%, and 4.89%, respectively. Finally, we developed the online web server http://nsclbio.jbnu.ac.kr/tools/4mCNLP-Deep/ so that experimental researchers can obtain results easily.
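The k-mer corpus idea can be illustrated with a short tokenizer that turns a DNA sequence into a "sentence" of 3-mer "words", which a word-embedding model such as word2vec could then be trained on. This is a minimal sketch: the overlapping stride of 1 and the function name `kmer_tokens` are assumptions, since the abstract does not state how the k-mers were extracted.

```python
def kmer_tokens(seq, k=3):
    """Split a DNA sequence into overlapping k-mers (stride 1, assumed)."""
    seq = seq.upper()
    # A sequence of length L yields L - k + 1 tokens; shorter inputs yield none.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

sentence = kmer_tokens("ACGTAC", k=3)
print(sentence)
```

Each token list plays the role of one sentence in the training corpus, and the learned embedding of each k-mer becomes one input vector for the downstream CNN.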
Affiliation(s)
- Abdul Wahab
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju, 54896, South Korea
- Hilal Tayara
  - School of International Engineering and Science, Jeonbuk National University, Jeonju, 54896, South Korea
- Zhenyu Xuan
  - Department of Biological Sciences, The University of Texas at Dallas, Richardson, 75080, USA
- Kil To Chong
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju, 54896, South Korea
  - Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju, 54896, South Korea
7
Wahab A, Mahmoudi O, Kim J, Chong KT. DNC4mC-Deep: Identification and Analysis of DNA N4-Methylcytosine Sites Based on Different Encoding Schemes By Using Deep Learning. Cells 2020; 9:E1756. [PMID: 32707969 PMCID: PMC7465362 DOI: 10.3390/cells9081756] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 06/26/2020] [Revised: 07/17/2020] [Accepted: 07/17/2020] [Indexed: 11/24/2022] Open
Abstract
N4-methylcytosine (4mC), one kind of DNA modification, plays a critical role in altering genetic performance, such as protein interactions, conformation, and stability of DNA, as well as the regulation of gene expression, cell development, and genomic imprinting. Several 4mC site identifiers have been proposed for various species. Herein, we propose a computational model, DNC4mC-Deep, which combines six encoding techniques with a deep learning model to predict 4mC sites in the genomes of F. vesca and R. chinensis and in a cross-species dataset. A 10-fold cross-validation test demonstrated superior performance. DNC4mC-Deep obtained an MCC of 0.829 and 0.929 on the F. vesca and R. chinensis training datasets, respectively, and 0.814 on the cross-species dataset. The proposed method thus outperforms the state-of-the-art predictors by at least 0.284 and 0.265 on the F. vesca and R. chinensis training datasets, respectively. Furthermore, DNC4mC-Deep achieved an MCC of 0.635 and 0.565 on the F. vesca and R. chinensis independent datasets, respectively, and 0.562 on the cross-species dataset, which shows that it achieves the best performance in predicting 4mC sites compared with the state-of-the-art predictor.
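Because results here are reported as the Matthews correlation coefficient (MCC), a short sketch of how MCC and the related accuracy, sensitivity, and specificity follow from a binary confusion matrix may help when comparing predictors. The counts in the example are hypothetical and do not come from any of the cited papers, and the function name `binary_metrics` is illustrative.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Compute ACC, Sn, Sp, and MCC from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when a marginal count is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"ACC": accuracy, "Sn": sensitivity, "Sp": specificity, "MCC": mcc}

# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=90, tn=80, fp=20, fn=10)
print(metrics)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix and ranges from -1 to 1, which makes it less sensitive to class imbalance between 4mC and non-4mC sites.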
Affiliation(s)
- Abdul Wahab
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea
- Omid Mahmoudi
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea
- Jeehong Kim
  - Department of New & Renewable Energy, VISION College of Jeonju, Jeonju 55069, Korea
- Kil To Chong
  - Department of Electronics Engineering, Jeonbuk National University, Jeonju 54896, Korea
  - Advance Electronics & Information Research Center, Jeonbuk National University, Jeonju 54896, Korea
8
Stein A, Fowler DM, Hartmann-Petersen R, Lindorff-Larsen K. Biophysical and Mechanistic Models for Disease-Causing Protein Variants. Trends Biochem Sci 2019; 44:575-588. [PMID: 30712981 PMCID: PMC6579676 DOI: 10.1016/j.tibs.2019.01.003] [Citation(s) in RCA: 95] [Impact Index Per Article: 19.0] [Received: 11/20/2018] [Revised: 01/04/2019] [Accepted: 01/08/2019] [Indexed: 12/13/2022]
Abstract
The rapid decrease in DNA sequencing cost is revolutionizing medicine and science. In medicine, genome sequencing has revealed millions of missense variants that change protein sequences, yet we only understand the molecular and phenotypic consequences of a small fraction. Within protein science, high-throughput deep mutational scanning experiments enable us to probe thousands of variants in a single, multiplexed experiment. We review efforts that bring together these topics via experimental and computational approaches to determine the consequences of missense variants in proteins. We focus on the role of changes in protein stability as a driver for disease, and how experiments, biophysical models, and computation are providing a framework for understanding and predicting how changes in protein sequence affect cellular protein stability.
Affiliation(s)
- Amelie Stein
  - Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Douglas M Fowler
  - Departments of Genome Sciences and Bioengineering, University of Washington, Seattle, WA, USA
- Rasmus Hartmann-Petersen
  - Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Kresten Lindorff-Larsen
  - Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark