1
Soleymani F, Paquet E, Viktor HL, Michalowski W. Structure-based protein and small molecule generation using EGNN and diffusion models: A comprehensive review. Comput Struct Biotechnol J 2024; 23:2779-2797. [PMID: 39050782 PMCID: PMC11268121 DOI: 10.1016/j.csbj.2024.06.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2024] [Revised: 06/13/2024] [Accepted: 06/18/2024] [Indexed: 07/27/2024] Open
Abstract
Recent breakthroughs in deep learning have revolutionized protein sequence and structure prediction. These advancements are built on decades of protein design efforts, and are overcoming traditional time and cost limitations. Diffusion models, at the forefront of these innovations, significantly enhance design efficiency by automating knowledge acquisition. In the field of de novo protein design, the goal is to create entirely novel proteins with predetermined structures. Given the arbitrary positions of proteins in 3-D space, graph representations and their properties are widely used in protein generation studies. A critical requirement in protein modelling is maintaining spatial relationships under transformations (rotations, translations, and reflections). This property, known as equivariance, ensures that predicted protein characteristics adapt seamlessly to changes in orientation or position. Equivariant graph neural networks offer a solution to this challenge. By incorporating equivariant graph neural networks to learn the score of the probability density function in diffusion models, one can generate proteins with robust 3-D structural representations. This review examines the latest deep learning advancements, specifically focusing on frameworks that combine diffusion models with equivariant graph neural networks for protein generation.
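The equivariance property described in this abstract can be checked numerically. The sketch below is a toy illustration only: `coordinate_update` is a hypothetical, hand-written layer loosely modelled on an EGNN-style coordinate update, not code from the reviewed paper. It moves each node along relative position vectors, weighted by a gate that depends only on invariant quantities (squared distances and scalar node features), and then verifies that transforming the input with a rotation, reflection, and translation gives the same result as transforming the output.

```python
import math


def coordinate_update(coords, feats):
    """Toy EGNN-style coordinate update (hypothetical, for illustration).

    coords: list of [x, y, z] positions; feats: list of scalar node features.
    Each node is shifted along (x_i - x_j), scaled by a gate that depends
    only on invariant inputs, so the update is E(3)-equivariant.
    """
    n = len(coords)
    updated = []
    for i in range(n):
        shift = [0.0, 0.0, 0.0]
        for j in range(n):
            if i == j:
                continue
            diff = [coords[i][k] - coords[j][k] for k in range(3)]
            dist2 = sum(d * d for d in diff)               # invariant
            gate = math.tanh(feats[i] + feats[j] - dist2)  # invariant
            for k in range(3):
                shift[k] += gate * diff[k] / (n - 1)
        updated.append([coords[i][k] + shift[k] for k in range(3)])
    return updated


def transform(coords, matrix, translation):
    """Apply p -> matrix @ p + translation to every point."""
    return [
        [sum(matrix[r][c] * p[c] for c in range(3)) + translation[r]
         for r in range(3)]
        for p in coords
    ]


def max_abs_diff(a, b):
    """Largest coordinate-wise absolute difference between two point sets."""
    return max(abs(u - v) for p, q in zip(a, b) for u, v in zip(p, q))


# Orthogonal matrix: rotation about z composed with a reflection (z -> -z).
angle = 0.7
Q = [
    [math.cos(angle), -math.sin(angle), 0.0],
    [math.sin(angle), math.cos(angle), 0.0],
    [0.0, 0.0, -1.0],
]
t = [1.0, -2.0, 0.5]

coords = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.5, 0.0], [0.3, 0.2, 2.0]]
feats = [0.1, -0.4, 0.8, 0.0]

# Equivariance: f(Qx + t) should equal Q f(x) + t up to rounding error.
lhs = coordinate_update(transform(coords, Q, t), feats)
rhs = transform(coordinate_update(coords, feats), Q, t)
equivariance_error = max_abs_diff(lhs, rhs)
print(equivariance_error < 1e-9)
```

The gate is built from distances and scalar features precisely because those do not change under rotation, reflection, or translation; any dependence on raw coordinates there would break the check.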
Affiliation(s)
- Farzan Soleymani
  - Telfer School of Management, University of Ottawa, ON, K1N 6N5, Canada
- Eric Paquet
  - National Research Council, 1200 Montreal Road, Ottawa, ON, K1A 0R6, Canada
  - School of Electrical Engineering and Computer Science, University of Ottawa, ON, K1N 6N5, Canada
- Herna Lydia Viktor
  - School of Electrical Engineering and Computer Science, University of Ottawa, ON, K1N 6N5, Canada
2
Li C, Luo Y, Xie Y, Zhang Z, Liu Y, Zou L, Xiao F. Structural and functional prediction, evaluation, and validation in the post-sequencing era. Comput Struct Biotechnol J 2024; 23:446-451. [PMID: 38223342 PMCID: PMC10787220 DOI: 10.1016/j.csbj.2023.12.031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2023] [Revised: 12/20/2023] [Accepted: 12/22/2023] [Indexed: 01/16/2024] Open
Abstract
The surge of genome sequencing data has highlighted a substantial number of genetic variants of uncertain significance (VUS). The decryption of VUS discovered by sequencing poses a major challenge in the post-sequencing era. Although experimental assays have progressed in classifying VUS, only a tiny fraction of human genes have been explored experimentally. Thus, there is an urgent need to generate state-of-the-art functional predictors of VUS in silico. Artificial intelligence (AI) is an invaluable tool to assist in the identification of VUS with high efficiency and accuracy. An increasing number of studies indicate that AI has brought an exciting acceleration in the interpretation of VUS, and our group has already used AI to develop protein structure-based prediction models. In this review, we provide an overview of the previous research on AI-based prediction of missense variants, and elucidate the challenges and opportunities for protein structure-based variant prediction in the post-sequencing era.
Affiliation(s)
- Chang Li
  - Clinical Biobank, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
  - The Key Laboratory of Geriatrics, Beijing Institute of Geriatrics, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
- Yixuan Luo
  - Beijing Normal University, Beijing, China
- Yibo Xie
  - Information Center, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
- Zaifeng Zhang
  - The Key Laboratory of Geriatrics, Beijing Institute of Geriatrics, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
- Ye Liu
  - The Key Laboratory of Geriatrics, Beijing Institute of Geriatrics, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
- Lihui Zou
  - The Key Laboratory of Geriatrics, Beijing Institute of Geriatrics, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
- Fei Xiao
  - Clinical Biobank, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
  - The Key Laboratory of Geriatrics, Beijing Institute of Geriatrics, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
  - Beijing Normal University, Beijing, China
3
Carpenter KA, Altman RB. Databases of ligand-binding pockets and protein-ligand interactions. Comput Struct Biotechnol J 2024; 23:1320-1338. [PMID: 38585646 PMCID: PMC10997877 DOI: 10.1016/j.csbj.2024.03.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2024] [Revised: 03/16/2024] [Accepted: 03/17/2024] [Indexed: 04/09/2024] Open
Abstract
Many research groups and institutions have created a variety of databases curating experimental and predicted data related to protein-ligand binding. The landscape of available databases is dynamic, with new databases emerging and established databases becoming defunct. Here, we review the current state of databases that contain binding pockets and protein-ligand binding interactions. We have compiled a list of such databases, fifty-three of which are currently available for use. We discuss variation in how binding pockets are defined and summarize pocket-finding methods. We organize the fifty-three databases into subgroups based on goals and contents, and describe standard use cases. We also illustrate that pockets within the same protein are characterized differently across different databases. Finally, we assess critical issues of sustainability, accessibility and redundancy.
Affiliation(s)
- Kristy A. Carpenter
  - Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
- Russ B. Altman
  - Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
  - Department of Bioengineering, Stanford University, Stanford, CA 94305, USA
  - Department of Genetics, Stanford University, Stanford, CA 94305, USA
  - Department of Medicine, Stanford University, Stanford, CA 94305, USA
4
Mi Y, Marcu SB, Tabirca S, Yallapragada VV. PS-GO parametric protein search engine. Comput Struct Biotechnol J 2024; 23:1499-1509. [PMID: 38633387 PMCID: PMC11021831 DOI: 10.1016/j.csbj.2024.04.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2024] [Revised: 04/01/2024] [Accepted: 04/01/2024] [Indexed: 04/19/2024] Open
Abstract
With the explosive growth of protein-related data, we are confronted with a critical scientific inquiry: How can we effectively retrieve, compare, and profoundly comprehend these protein structures to maximize the utilization of such data resources? PS-GO, a parametric protein search engine, has been specifically designed and developed to maximize the utilization of the rapidly growing volume of protein-related data. This innovative tool addresses the critical need for effective retrieval, comparison, and deep understanding of protein structures. By integrating computational biology, bioinformatics, and data science, PS-GO is capable of managing large-scale data and accurately predicting and comparing protein structures and functions. The engine is built upon the concept of parametric protein design, a computer-aided method that adjusts and optimizes protein structures and sequences to achieve desired biological functions and structural stability. PS-GO utilizes key parameters such as amino acid sequence, side chain angle, and solvent accessibility, which have a significant influence on protein structure and function. Additionally, PS-GO leverages computable parameters, derived computationally, which are crucial for understanding and predicting protein behavior. The development of PS-GO underscores the potential of parametric protein design in a variety of applications, including enhancing enzyme activity, improving antibody affinity, and designing novel functional proteins. This advancement not only provides a robust theoretical foundation for the field of protein engineering and biotechnology but also offers practical guidelines for future progress in this domain.
Affiliation(s)
- Yanlin Mi
  - School of Computer Science and Information Technology, University College Cork, Cork, Ireland
  - SFI Centre for Research Training in Artificial Intelligence, University College Cork, Cork, Ireland
- Stefan-Bogdan Marcu
  - School of Computer Science and Information Technology, University College Cork, Cork, Ireland
- Sabin Tabirca
  - School of Computer Science and Information Technology, University College Cork, Cork, Ireland
  - Faculty of Mathematics and Informatics, Transylvania University of Brasov, Brasov, Romania
- Venkata V.B. Yallapragada
  - Centre for Advanced Photonics and Process Analytics, Munster Technological University, Cork, Ireland
5
Gong X, Zhang J, Gan Q, Teng Y, Hou J, Lyu Y, Liu Z, Wu Z, Dai R, Zou Y, Wang X, Zhu D, Zhu H, Liu T, Yan Y. Advancing microbial production through artificial intelligence-aided biology. Biotechnol Adv 2024; 74:108399. [PMID: 38925317 DOI: 10.1016/j.biotechadv.2024.108399] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2024] [Revised: 05/20/2024] [Accepted: 06/23/2024] [Indexed: 06/28/2024]
Abstract
Microbial cell factories (MCFs) have been leveraged to construct sustainable platforms for value-added compound production. To optimize metabolism and reach optimal productivity, synthetic biology has developed various genetic devices to engineer microbial systems by gene editing, high-throughput protein engineering, and dynamic regulation. However, current synthetic biology methodologies still rely heavily on manual design, laborious testing, and exhaustive analysis. The emerging interdisciplinary field of artificial intelligence (AI) and biology has become pivotal in addressing the remaining challenges. AI-aided microbial production harnesses the power of processing, learning, and predicting vast amounts of biological data within seconds, providing outputs with high probability. With well-trained AI models, the conventional Design-Build-Test (DBT) cycle has been transformed into a multidimensional Design-Build-Test-Learn-Predict (DBTLP) workflow, leading to significantly improved operational efficiency and reduced labor consumption. Here, we comprehensively review the main components and recent advances in AI-aided microbial production, focusing on genome annotation, AI-aided protein engineering, artificial functional protein design, and AI-enabled pathway prediction. Finally, we discuss the challenges of integrating novel AI techniques into biology and propose the potential of large language models (LLMs) in advancing microbial production.
Affiliation(s)
- Xinyu Gong
  - School of Chemical, Materials, and Biomedical Engineering, College of Engineering, The University of Georgia, Athens, GA 30602, USA
- Jianli Zhang
  - School of Chemical, Materials, and Biomedical Engineering, College of Engineering, The University of Georgia, Athens, GA 30602, USA
- Qi Gan
  - School of Chemical, Materials, and Biomedical Engineering, College of Engineering, The University of Georgia, Athens, GA 30602, USA
- Yuxi Teng
  - School of Chemical, Materials, and Biomedical Engineering, College of Engineering, The University of Georgia, Athens, GA 30602, USA
- Jixin Hou
  - School of ECAM, College of Engineering, University of Georgia, Athens, GA 30602, USA
- Yanjun Lyu
  - Department of Computer Science and Engineering, The University of Texas at Arlington, Arlington 76019, USA
- Zhengliang Liu
  - School of Computing, The University of Georgia, Athens, GA 30602, USA
- Zihao Wu
  - School of Computing, The University of Georgia, Athens, GA 30602, USA
- Runpeng Dai
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Yusong Zou
  - School of Chemical, Materials, and Biomedical Engineering, College of Engineering, The University of Georgia, Athens, GA 30602, USA
- Xianqiao Wang
  - School of ECAM, College of Engineering, University of Georgia, Athens, GA 30602, USA
- Dajiang Zhu
  - Department of Computer Science and Engineering, The University of Texas at Arlington, Arlington 76019, USA
- Hongtu Zhu
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Tianming Liu
  - School of Computing, The University of Georgia, Athens, GA 30602, USA
- Yajun Yan
  - School of Chemical, Materials, and Biomedical Engineering, College of Engineering, The University of Georgia, Athens, GA 30602, USA.
6
Boadu F, Lee A, Cheng J. Deep learning methods for protein function prediction. Proteomics 2024:e2300471. [PMID: 38996351 DOI: 10.1002/pmic.202300471] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2024] [Revised: 06/15/2024] [Accepted: 06/18/2024] [Indexed: 07/14/2024]
Abstract
Predicting protein function from protein sequence, structure, interaction, and other relevant information is important for generating hypotheses for biological experiments and studying biological systems, and therefore has been a major challenge in protein bioinformatics. Numerous computational methods have been developed over the last two decades, gradually advancing protein function prediction. In recent years in particular, leveraging the revolutionary advances in artificial intelligence (AI), more and more deep learning methods have been developed to improve protein function prediction at a faster pace. Here, we provide an in-depth review of the recent developments of deep learning methods for protein function prediction. We summarize the significant advances in the field, identify several remaining major challenges to be tackled, and suggest some potential directions to explore. The data sources and evaluation metrics widely used in protein function prediction are also discussed to assist the machine learning, AI, and bioinformatics communities in developing more cutting-edge methods to advance protein function prediction.
Affiliation(s)
- Frimpong Boadu
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
- Ahhyun Lee
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
- Jianlin Cheng
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
7
Mietzsch M, Kailasan S, Bennett A, Chipman P, Fane B, Huiskonen JT, Clarke IN, McKenna R. The Structure of Spiroplasma Virus 4: Exploring the Capsid Diversity of the Microviridae. Viruses 2024; 16:1103. [PMID: 39066266 DOI: 10.3390/v16071103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/11/2024] [Revised: 07/02/2024] [Accepted: 07/06/2024] [Indexed: 07/28/2024] Open
Abstract
Spiroplasma virus 4 (SpV4) is a bacteriophage of the Microviridae, which packages circular ssDNA within non-enveloped T = 1 icosahedral capsids. It infects spiroplasmas, which are known pathogens of honeybees. Here, the structure of the SpV4 virion is determined using cryo-electron microscopy to a resolution of 2.5 Å. A striking feature of the SpV4 capsid is the mushroom-like protrusions at the 3-fold axes, which are common to all members of the subfamily Gokushovirinae. While the function of the protrusion is currently unknown, this feature varies widely in this subfamily and is therefore possibly an adaptation for host recognition. Furthermore, on the interior of the SpV4 capsid, the location of DNA-binding protein VP8 was identified and shown to have low structural conservation to the capsids of other viruses in the family. The structural characterization of SpV4 will aid future studies analyzing the virus-host interaction, to understand disease mechanisms at a molecular level. Furthermore, the structural comparisons in this study, including a low-resolution structure of the chlamydia phage 2, provide an overview of the structural repertoire of the viruses in this family that infect various bacterial hosts, which in turn infect a wide range of animals and plants.
Affiliation(s)
- Mario Mietzsch
  - Department of Biochemistry and Molecular Biology, College of Medicine, Center for Structural Biology, McKnight Brain Institute, University of Florida, Gainesville, FL 32610, USA
- Shweta Kailasan
  - Department of Biochemistry and Molecular Biology, College of Medicine, Center for Structural Biology, McKnight Brain Institute, University of Florida, Gainesville, FL 32610, USA
- Antonette Bennett
  - Department of Biochemistry and Molecular Biology, College of Medicine, Center for Structural Biology, McKnight Brain Institute, University of Florida, Gainesville, FL 32610, USA
- Paul Chipman
  - Department of Biochemistry and Molecular Biology, College of Medicine, Center for Structural Biology, McKnight Brain Institute, University of Florida, Gainesville, FL 32610, USA
- Bentley Fane
  - The BIO5 Institute, Keating Building, University of Arizona, Tucson, AZ 85721, USA
- Juha T Huiskonen
  - Institute of Biotechnology, Helsinki Institute of Life Science HiLIFE, University of Helsinki, 00014 Helsinki, Finland
- Ian N Clarke
  - Molecular Microbiology Group, Faculty of Medicine, University of Southampton, Southampton General Hospital, Southampton SO16 6YD, UK
- Robert McKenna
  - Department of Biochemistry and Molecular Biology, College of Medicine, Center for Structural Biology, McKnight Brain Institute, University of Florida, Gainesville, FL 32610, USA
8
Hu X, Sun Z, Nian Y, Wang Y, Dang Y, Li F, Feng J, Yu E, Tao C. Self-Explainable Graph Neural Network for Alzheimer Disease and Related Dementias Risk Prediction: Algorithm Development and Validation Study. JMIR Aging 2024; 7:e54748. [PMID: 38976869 PMCID: PMC11263893 DOI: 10.2196/54748] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2023] [Revised: 03/31/2024] [Accepted: 06/02/2024] [Indexed: 07/10/2024] Open
Abstract
BACKGROUND: Alzheimer disease and related dementias (ADRD) rank as the sixth leading cause of death in the United States, underlining the importance of accurate ADRD risk prediction. While recent advancements in ADRD risk prediction have primarily relied on imaging analysis, not all patients undergo medical imaging before an ADRD diagnosis. Merging machine learning with claims data can reveal additional risk factors and uncover interconnections among diverse medical codes.
OBJECTIVE: The study aims to use graph neural networks (GNNs) with claims data for ADRD risk prediction. Addressing the lack of human-interpretable reasons behind these predictions, we introduce an innovative, self-explainable method to evaluate relationship importance and its influence on ADRD risk prediction.
METHODS: We used a variationally regularized encoder-decoder GNN (variational GNN [VGNN]) integrated with our proposed relation importance method for estimating ADRD likelihood. This self-explainable method can provide a feature-important explanation in the context of ADRD risk prediction, leveraging relational information within a graph. Three scenarios with 1-year, 2-year, and 3-year prediction windows were created to assess the model's efficiency. Random forest (RF) and light gradient boost machine (LGBM) were used as baselines. By using this method, we further clarify the key relationships for ADRD risk prediction.
RESULTS: In scenario 1, the VGNN model showed area under the receiver operating characteristic (AUROC) scores of 0.7272 and 0.7480 for the small subset and the matched cohort data set. It outperformed RF and LGBM by 10.6% and 9.1%, respectively, on average. In scenario 2, it achieved AUROC scores of 0.7125 and 0.7281, surpassing the other models by 10.5% and 8.9%, respectively. Similarly, in scenario 3, AUROC scores of 0.7001 and 0.7187 were obtained, exceeding the baseline models by 10.1% and 8.5%, respectively. These results clearly demonstrate the significant superiority of the graph-based approach over the tree-based models (RF and LGBM) in predicting ADRD. Furthermore, the integration of the VGNN model and our relation importance interpretation could provide valuable insight into paired factors that may contribute to or delay ADRD progression.
CONCLUSIONS: Using our innovative self-explainable method with claims data enhances ADRD risk prediction and provides insights into the impact of interconnected medical code relationships. This methodology not only enables ADRD risk modeling but also shows potential for other image analysis predictions using claims data.
Affiliation(s)
- Xinyue Hu
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL, United States
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
- Zenan Sun
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
- Yi Nian
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
- Yichen Wang
  - Division of Hospital Medicine at Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA, United States
- Yifang Dang
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
- Fang Li
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL, United States
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
- Jingna Feng
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL, United States
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
- Evan Yu
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
- Cui Tao
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL, United States
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
9
Cheng P, Mao C, Tang J, Yang S, Cheng Y, Wang W, Gu Q, Han W, Chen H, Li S, Chen Y, Zhou J, Li W, Pan A, Zhao S, Huang X, Zhu S, Zhang J, Shu W, Wang S. Zero-shot prediction of mutation effects with multimodal deep representation learning guides protein engineering. Cell Res 2024:10.1038/s41422-024-00989-2. [PMID: 38969803 DOI: 10.1038/s41422-024-00989-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Accepted: 06/03/2024] [Indexed: 07/07/2024] Open
Abstract
Mutations in amino acid sequences can provoke changes in protein function. Accurate and unsupervised prediction of mutation effects is critical in biotechnology and biomedicine, but remains a fundamental challenge. To resolve this challenge, here we present Protein Mutational Effect Predictor (ProMEP), a general and multiple sequence alignment-free method that enables zero-shot prediction of mutation effects. A multimodal deep representation learning model embedded in ProMEP was developed to comprehensively learn both sequence and structure contexts from ~160 million proteins. ProMEP achieves state-of-the-art performance in mutational effect prediction and accomplishes a tremendous improvement in speed, enabling efficient and intelligent protein engineering. Specifically, ProMEP accurately forecasts mutational consequences on the gene-editing enzymes TnpB and TadA, and successfully guides the development of high-performance gene-editing tools with their engineered variants. The gene-editing efficiency of a 5-site mutant of TnpB reaches up to 74.04% (vs 24.66% for the wild type); and the base editing tool developed on the basis of a TadA 15-site mutant (in addition to the A106V/D108N double mutation that renders deoxyadenosine deaminase activity to TadA) exhibits an A-to-G conversion frequency of up to 77.27% (vs 69.80% for ABE8e, a previous TadA-based adenine base editor) with significantly reduced bystander and off-target effects compared to ABE8e. ProMEP not only showcases superior performance in predicting mutational effects on proteins but also demonstrates a great capability to guide protein engineering. Therefore, ProMEP enables efficient exploration of the gigantic protein space and facilitates practical design of proteins, thereby advancing studies in biomedicine and synthetic biology.
Affiliation(s)
- Peng Cheng
  - Bioinformatics Center of AMMS, Beijing, China
- Cong Mao
  - State Key Laboratory of Reproductive Medicine and Offspring Health, Women's Hospital of Nanjing Medical University, Nanjing Maternity and Child Health Care Hospital, Nanjing Medical University, Nanjing, Jiangsu, China
- Jin Tang
  - Zhejiang Lab, Hangzhou, Zhejiang, China
- Sen Yang
  - Bioinformatics Center of AMMS, Beijing, China
- Yu Cheng
  - State Key Laboratory of Reproductive Medicine and Offspring Health, Women's Hospital of Nanjing Medical University, Nanjing Maternity and Child Health Care Hospital, Nanjing Medical University, Nanjing, Jiangsu, China
- Wuke Wang
  - Zhejiang Lab, Hangzhou, Zhejiang, China
- Qiuxi Gu
  - State Key Laboratory of Reproductive Medicine and Offspring Health, Women's Hospital of Nanjing Medical University, Nanjing Maternity and Child Health Care Hospital, Nanjing Medical University, Nanjing, Jiangsu, China
- Wei Han
  - Zhejiang Lab, Hangzhou, Zhejiang, China
- Hao Chen
  - State Key Laboratory of Reproductive Medicine and Offspring Health, Women's Hospital of Nanjing Medical University, Nanjing Maternity and Child Health Care Hospital, Nanjing Medical University, Nanjing, Jiangsu, China
- Sihan Li
  - State Key Laboratory of Reproductive Medicine and Offspring Health, Women's Hospital of Nanjing Medical University, Nanjing Maternity and Child Health Care Hospital, Nanjing Medical University, Nanjing, Jiangsu, China
- Wuju Li
  - Bioinformatics Center of AMMS, Beijing, China
- Aimin Pan
  - Zhejiang Lab, Hangzhou, Zhejiang, China
- Suwen Zhao
  - iHuman Institute, ShanghaiTech University, Shanghai, China
  - School of Life Science and Technology, ShanghaiTech University, Shanghai, China
- Xingxu Huang
  - Zhejiang Lab, Hangzhou, Zhejiang, China
  - School of Life Science and Technology, ShanghaiTech University, Shanghai, China
- Jun Zhang
  - State Key Laboratory of Reproductive Medicine and Offspring Health, Women's Hospital of Nanjing Medical University, Nanjing Maternity and Child Health Care Hospital, Nanjing Medical University, Nanjing, Jiangsu, China.
- Wenjie Shu
  - Bioinformatics Center of AMMS, Beijing, China.
10
Yuan Q, Tian C, Song Y, Ou P, Zhu M, Zhao H, Yang Y. GPSFun: geometry-aware protein sequence function predictions with language models. Nucleic Acids Res 2024; 52:W248-W255. [PMID: 38738636 PMCID: PMC11223820 DOI: 10.1093/nar/gkae381] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2024] [Revised: 04/22/2024] [Accepted: 04/26/2024] [Indexed: 05/14/2024] Open
Abstract
Knowledge of protein function is essential for elucidating disease mechanisms and discovering new drug targets. However, there is a widening gap between the exponential growth of protein sequences and their limited function annotations. In our prior studies, we have developed a series of methods including GraphPPIS, GraphSite, LMetalSite and SPROF-GO for protein function annotations at residue or protein level. To further enhance their applicability and performance, we now present GPSFun, a versatile web server for Geometry-aware Protein Sequence Function annotations, which equips our previous tools with language models and geometric deep learning. Specifically, GPSFun employs large language models to efficiently predict 3D conformations of the input protein sequences and extract informative sequence embeddings. Subsequently, geometric graph neural networks are utilized to capture the sequence and structure patterns in the protein graphs, facilitating various downstream predictions including protein-ligand binding sites, gene ontologies, subcellular locations and protein solubility. Notably, GPSFun achieves superior performance to state-of-the-art methods across diverse tasks without requiring multiple sequence alignments or experimental protein structures. GPSFun is freely available to all users at https://bio-web1.nscc-gz.cn/app/GPSFun with user-friendly interfaces and rich visualizations.
Affiliation(s)
- Qianmu Yuan
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong 510000, China
- Chong Tian
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong 510000, China
- Yidong Song
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong 510000, China
- Peihua Ou
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong 510000, China
- Mingming Zhu
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong 510000, China
- Huiying Zhao
  - Sun Yat-sen Memorial Hospital, Sun Yat-sen University, Guangzhou, Guangdong 510000, China
- Yuedong Yang
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong 510000, China
11
|
Ye B, Tian W, Wang B, Liang J. CASTpFold: Computed Atlas of Surface Topography of the universe of protein Folds. Nucleic Acids Res 2024; 52:W194-W199. [PMID: 38783102 PMCID: PMC11223844 DOI: 10.1093/nar/gkae415] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2024] [Revised: 04/25/2024] [Accepted: 05/03/2024] [Indexed: 05/25/2024] Open
Abstract
Geometric and topological properties of protein structures, including surface pockets, interior cavities and cross channels, are of fundamental importance for proteins to carry out their functions. Computed Atlas of Surface Topography of proteins (CASTp) is a widely used web server for locating, delineating, and measuring these geometric and topological properties of protein structures. Recent developments in AI-based protein structure prediction such as AlphaFold2 (AF2) have significantly expanded our knowledge on protein structures. Here we present CASTpFold, a continuation of CASTp that provides accurate and comprehensive identifications and quantifications of protein topography. It now provides (i) results on an expanded database of proteins, including the Protein Data Bank (PDB) and non-singleton representative structures of AlphaFold2 structures, covering 183 million AF2 structures; (ii) functional pockets prediction with corresponding Gene Ontology (GO) terms or Enzyme Commission (EC) numbers for AF2-predicted structures and (iii) pocket similarity search function for surface and protein-protein interface pockets. The CASTpFold web server is freely accessible at https://cfold.bme.uic.edu/castpfold/.
Affiliation(s)
- Bowei Ye
- Center for Bioinformatics and Quantitative Biology, and Richard and Loan Hill Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607, USA
- Wei Tian
- Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Boshen Wang
- UT Southwestern Medical Center, Dallas, TX 75390, USA
- Jie Liang
- Center for Bioinformatics and Quantitative Biology, and Richard and Loan Hill Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607, USA
- University of Illinois Cancer Center, Chicago, IL 60612, USA
|
12
|
de Crécy-Lagard V, Dias R, Friedberg I, Yuan Y, Swairjo MA. Limitations of Current Machine-Learning Models in Predicting Enzymatic Functions for Uncharacterized Proteins. bioRxiv 2024:2024.07.01.601547. [PMID: 39005379 PMCID: PMC11244979 DOI: 10.1101/2024.07.01.601547] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/16/2024]
Abstract
Thirty to seventy percent of proteins in any given genome have no assigned function and have been labeled as the protein "unknome". This large knowledge gap prevents the biological community from fully leveraging the plethora of genomic data that is now available. Machine-learning approaches are showing some promise in propagating functional knowledge from experimentally characterized proteins to the correct set of isofunctional orthologs. However, they largely fail to predict enzymatic functions unseen in the training set, as shown by dissecting the predictions made for 450 enzymes of unknown function from the model bacterium Escherichia coli using the DeepECTransformer platform. Lessons from these failures can help the community develop machine-learning methods that assist domain experts in making testable functional predictions for more members of the uncharacterized proteome.
|
13
|
Haghayegh F, Norouziazad A, Haghani E, Feygin AA, Rahimi RH, Ghavamabadi HA, Sadighbayan D, Madhoun F, Papagelis M, Felfeli T, Salahandish R. Revolutionary Point-of-Care Wearable Diagnostics for Early Disease Detection and Biomarker Discovery through Intelligent Technologies. Adv Sci (Weinh) 2024:e2400595. [PMID: 38958517 DOI: 10.1002/advs.202400595] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2024] [Revised: 06/19/2024] [Indexed: 07/04/2024]
Abstract
Early-stage disease detection, particularly in Point-Of-Care (POC) wearable formats, assumes a pivotal role in advancing healthcare services and precision medicine. Public benefits of early detection extend beyond cost-effectively promoting healthcare outcomes, to also include reducing the risk of comorbid diseases. Technological advancements enabling POC biomarker recognition empower discovery of new markers for various health conditions. Integration of POC wearables for biomarker detection with intelligent frameworks represents a ground-breaking innovation, enabling automation of operations, advanced large-scale data analysis, generation of predictive models, and remote and guided clinical decision-making. These advancements substantially alleviate socioeconomic burdens, creating a paradigm shift in diagnostics, and revolutionizing medical assessments and technology development. This review explores critical topics and recent progress in 1) the development of POC systems and wearable solutions for early disease detection and physiological monitoring, 2) current trends in the adoption of smart technologies within clinical settings and in developing biological assays, and 3) the utility of POC systems and smart platforms for biomarker discovery. Additionally, the review explores technology translation from research labs to broader applications. It also addresses associated risks, biases, and challenges of widespread Artificial Intelligence (AI) integration in diagnostics systems, while systematically outlining potential prospects, current challenges, and opportunities.
Affiliation(s)
- Fatemeh Haghayegh
- Laboratory of Advanced Biotechnologies for Health Assessments (Lab-HA), Biomedical Engineering Program, Lassonde School of Engineering, York University, Toronto, M3J 1P3, Canada
- Department of Electrical Engineering and Computer Science (EECS), Lassonde School of Engineering, York University, Toronto, ON, M3J 1P3, Canada
- Alireza Norouziazad
- Laboratory of Advanced Biotechnologies for Health Assessments (Lab-HA), Biomedical Engineering Program, Lassonde School of Engineering, York University, Toronto, M3J 1P3, Canada
- Department of Electrical Engineering and Computer Science (EECS), Lassonde School of Engineering, York University, Toronto, ON, M3J 1P3, Canada
- Elnaz Haghani
- Laboratory of Advanced Biotechnologies for Health Assessments (Lab-HA), Biomedical Engineering Program, Lassonde School of Engineering, York University, Toronto, M3J 1P3, Canada
- Department of Electrical Engineering and Computer Science (EECS), Lassonde School of Engineering, York University, Toronto, ON, M3J 1P3, Canada
- Ariel Avraham Feygin
- Laboratory of Advanced Biotechnologies for Health Assessments (Lab-HA), Biomedical Engineering Program, Lassonde School of Engineering, York University, Toronto, M3J 1P3, Canada
- Department of Electrical Engineering and Computer Science (EECS), Lassonde School of Engineering, York University, Toronto, ON, M3J 1P3, Canada
- Reza Hamed Rahimi
- Laboratory of Advanced Biotechnologies for Health Assessments (Lab-HA), Biomedical Engineering Program, Lassonde School of Engineering, York University, Toronto, M3J 1P3, Canada
- Department of Electrical Engineering and Computer Science (EECS), Lassonde School of Engineering, York University, Toronto, ON, M3J 1P3, Canada
- Hamidreza Akbari Ghavamabadi
- Laboratory of Advanced Biotechnologies for Health Assessments (Lab-HA), Biomedical Engineering Program, Lassonde School of Engineering, York University, Toronto, M3J 1P3, Canada
- Department of Electrical Engineering and Computer Science (EECS), Lassonde School of Engineering, York University, Toronto, ON, M3J 1P3, Canada
- Deniz Sadighbayan
- Department of Biology, Faculty of Science, York University, Toronto, ON, M3J 1P3, Canada
- Faress Madhoun
- Laboratory of Advanced Biotechnologies for Health Assessments (Lab-HA), Biomedical Engineering Program, Lassonde School of Engineering, York University, Toronto, M3J 1P3, Canada
- Department of Electrical Engineering and Computer Science (EECS), Lassonde School of Engineering, York University, Toronto, ON, M3J 1P3, Canada
- Manos Papagelis
- Department of Electrical Engineering and Computer Science (EECS), Lassonde School of Engineering, York University, Toronto, ON, M3J 1P3, Canada
- Tina Felfeli
- Department of Ophthalmology and Vision Sciences, University of Toronto, Ontario, M5T 3A9, Canada
- Institute of Health Policy, Management and Evaluation, University of Toronto, Ontario, M5T 3M6, Canada
- Razieh Salahandish
- Laboratory of Advanced Biotechnologies for Health Assessments (Lab-HA), Biomedical Engineering Program, Lassonde School of Engineering, York University, Toronto, M3J 1P3, Canada
- Department of Electrical Engineering and Computer Science (EECS), Lassonde School of Engineering, York University, Toronto, ON, M3J 1P3, Canada
|
14
|
Nestl BM, Nebel BA, Resch V, Schürmann M, Tischler D. The Development and Opportunities of Predictive Biotechnology. Chembiochem 2024; 25:e202300863. [PMID: 38713151 DOI: 10.1002/cbic.202300863] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2023] [Revised: 04/05/2024] [Indexed: 05/08/2024]
Abstract
Recent advances in bioeconomy allow a holistic view of existing and new process chains and enable novel production routines continuously advanced by academia and industry. All this progress benefits from a growing number of prediction tools that have found their way into the field. For example, automated genome annotations, tools for building model structures of proteins, and structural protein prediction methods such as AlphaFold2™ or RoseTTAFold have gained popularity in recent years. Recently, it has become apparent that more and more AI-based tools are being developed and used for biocatalysis and biotechnology. This is an excellent opportunity for academia and industry to accelerate advancements in the field further. Biotechnology, as a rapidly growing interdisciplinary field, stands to benefit greatly from these developments.
Affiliation(s)
- Bettina M Nestl
- Joint working group on biotransformations of the Association for General and Applied Microbiology VAAM, the Society for Chemical Engineering, Biotechnology DECHEMA, Theodor-Heuss-Allee 25, 60486, Frankfurt, Germany
- Innophore GmbH, Am Eisernen Tor 3, 8010, Graz, Austria
- Bernd A Nebel
- Innophore GmbH, Am Eisernen Tor 3, 8010, Graz, Austria
- Verena Resch
- Innophore GmbH, Am Eisernen Tor 3, 8010, Graz, Austria
- Martin Schürmann
- Joint working group on biotransformations of the Association for General and Applied Microbiology VAAM, the Society for Chemical Engineering, Biotechnology DECHEMA, Theodor-Heuss-Allee 25, 60486, Frankfurt, Germany
- InnoSyn B. V., Urmonderbaan 22, 6167 RD, Geleen, The Netherlands
- SynSilico B. V., Urmonderbaan 22, 6167 RD, Geleen, The Netherlands
- Dirk Tischler
- Joint working group on biotransformations of the Association for General and Applied Microbiology VAAM, the Society for Chemical Engineering, Biotechnology DECHEMA, Theodor-Heuss-Allee 25, 60486, Frankfurt, Germany
- Microbial Biotechnology, Ruhr University Bochum, Universitätsstrasse 150, 44780, Bochum, Germany
|
15
|
Chen Z, Luo Q. DualNetGO: a dual network model for protein function prediction via effective feature selection. Bioinformatics 2024; 40:btae437. [PMID: 38963311 DOI: 10.1093/bioinformatics/btae437] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Revised: 06/05/2024] [Accepted: 07/03/2024] [Indexed: 07/05/2024]
Abstract
MOTIVATION Protein-protein interaction (PPI) networks are crucial for automatically annotating protein functions. As multiple PPI networks exist for the same set of proteins that capture properties from different aspects, it is a challenging task to effectively utilize these heterogeneous networks. Recently, several deep learning models have combined PPI networks from all evidence, or concatenated all graph embeddings for protein function prediction. However, the lack of a judicious selection procedure prevents the effective harnessing of information from different PPI networks, as these networks vary in densities, structures, and noise levels. Consequently, combining protein features indiscriminately could increase the noise level, leading to decreased model performance.
RESULTS We develop DualNetGO, a dual-network model comprised of a Classifier and a Selector, to predict protein functions by effectively selecting features from different sources including graph embeddings of PPI networks, protein domain, and subcellular location information. Evaluation of DualNetGO on human and mouse datasets in comparison with other network-based models shows at least 4.5%, 6.2%, and 14.2% improvement on Fmax in BP, MF, and CC gene ontology categories, respectively, for human, and 3.3%, 10.6%, and 7.7% improvement on Fmax for mouse. We demonstrate the generalization capability of our model by training and testing on the CAFA3 data, and show its versatility by incorporating Esm2 embeddings. We further show that our model is insensitive to the choice of graph embedding method and is time- and memory-saving. These results demonstrate that combining a subset of features including PPI networks and protein attributes selected by our model is more effective in utilizing PPI network information than only using one kind of or concatenating graph embeddings from all kinds of PPI networks.
AVAILABILITY AND IMPLEMENTATION The source code of DualNetGO and some of the experiment data are available at: https://github.com/georgedashen/DualNetGO.
Affiliation(s)
- Zhuoyang Chen
- Data Science and Analytics Thrust, Information Hub, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, Guangdong, 511400, China
- Qiong Luo
- Data Science and Analytics Thrust, Information Hub, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, Guangdong, 511400, China
- HKUST, Hong Kong SAR, China
|
16
|
Dong Y, Quan H, Ma C, Shan L, Deng L. TGC-ARG: Anticipating Antibiotic Resistance via Transformer-Based Modeling and Contrastive Learning. Int J Mol Sci 2024; 25:7228. [PMID: 39000335 PMCID: PMC11241484 DOI: 10.3390/ijms25137228] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2024] [Revised: 06/25/2024] [Accepted: 06/27/2024] [Indexed: 07/16/2024] Open
Abstract
In various domains, including everyday activities, agricultural practices, and medical treatments, the escalating challenge of antibiotic resistance poses a significant concern. Traditional approaches to studying antibiotic resistance genes (ARGs) often require substantial time and effort and are limited in accuracy. Moreover, the decentralized nature of existing data repositories complicates comprehensive analysis of antibiotic resistance gene sequences. In this study, we introduce a novel computational framework named TGC-ARG designed to predict potential ARGs. This framework takes protein sequences as input, utilizes SCRATCH-1D for protein secondary structure prediction, and employs feature extraction techniques to derive distinctive features from both sequence and structural data. Subsequently, a Siamese network is employed to foster a contrastive learning environment, enhancing the model's ability to effectively represent the data. Finally, a multi-layer perceptron (MLP) integrates and processes sequence embeddings alongside predicted secondary structure embeddings to forecast ARG presence. To evaluate our approach, we curated a pioneering open dataset termed ARSS (Antibiotic Resistance Sequence Statistics). Comprehensive comparative experiments demonstrate that our method surpasses current state-of-the-art methodologies. Additionally, through detailed case studies, we illustrate the efficacy of our approach in predicting potential ARGs.
Affiliation(s)
- Lei Deng
- School of Computer Science and Engineering, Central South University, Changsha 410083, China; (Y.D.); (H.Q.); (C.M.); (L.S.)
|
17
|
Jamasb AR, Morehead A, Joshi CK, Zhang Z, Didi K, Mathis S, Harris C, Tang J, Cheng J, Liò P, Blundell TL. Evaluating Representation Learning on the Protein Structure Universe. arXiv 2024:arXiv:2406.13864v1. [PMID: 38947934 PMCID: PMC11213157] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/02/2024]
Abstract
We introduce ProteinWorkshop, a comprehensive benchmark suite for representation learning on protein structures with Geometric Graph Neural Networks. We consider large-scale pre-training and downstream tasks on both experimental and predicted structures to enable the systematic evaluation of the quality of the learned structural representation and their usefulness in capturing functional relationships for downstream tasks. We find that: (1) large-scale pretraining on AlphaFold structures and auxiliary tasks consistently improve the performance of both rotation-invariant and equivariant GNNs, and (2) more expressive equivariant GNNs benefit from pretraining to a greater extent compared to invariant models. We aim to establish a common ground for the machine learning and computational biology communities to rigorously compare and advance protein structure representation learning. Our open-source codebase reduces the barrier to entry for working with large protein structure datasets by providing: (1) storage-efficient dataloaders for large-scale structural databases including AlphaFoldDB and ESM Atlas, as well as (2) utilities for constructing new tasks from the entire PDB. ProteinWorkshop is available at: github.com/a-r-j/ProteinWorkshop.
|
18
|
Kwon JJ, Pan J, Gonzalez G, Hahn WC, Zitnik M. On knowing a gene: A distributional hypothesis of gene function. Cell Syst 2024; 15:488-496. [PMID: 38810640 PMCID: PMC11189734 DOI: 10.1016/j.cels.2024.04.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2023] [Revised: 02/25/2024] [Accepted: 04/30/2024] [Indexed: 05/31/2024]
Abstract
As words can have multiple meanings that depend on sentence context, genes can have various functions that depend on the surrounding biological system. This pleiotropic nature of gene function is limited by ontologies, which annotate gene functions without considering biological contexts. We contend that the gene function problem in genetics may be informed by recent technological leaps in natural language processing, in which representations of word semantics can be automatically learned from diverse language contexts. In contrast to efforts to model semantics as "is-a" relationships in the 1990s, modern distributional semantics represents words as vectors in a learned semantic space and fuels current advances in transformer-based models such as large language models and generative pre-trained transformers. A similar shift in thinking of gene functions as distributions over cellular contexts may enable a similar breakthrough in data-driven learning from large biological datasets to inform gene function.
Affiliation(s)
- Jason J Kwon
- Dana-Farber Cancer Institute and Harvard Medical School, Department of Medical Oncology, Boston, MA 02215, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Joshua Pan
- Dana-Farber Cancer Institute and Harvard Medical School, Department of Medical Oncology, Boston, MA 02215, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Guadalupe Gonzalez
- Department of Computing, Faculty of Engineering, Imperial College, London SW7 2AZ, UK
- William C Hahn
- Dana-Farber Cancer Institute and Harvard Medical School, Department of Medical Oncology, Boston, MA 02215, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Marinka Zitnik
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Harvard Medical School, Department of Biomedical Informatics, Boston, MA 02115, USA; Harvard Data Science Initiative, Harvard University, Cambridge, MA 02138, USA; Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, Allston, MA 02134, USA
|
19
|
Padalko A, Nair G, Sousa FL. Fusion/fission protein family identification in Archaea. mSystems 2024; 9:e0094823. [PMID: 38700364 PMCID: PMC11237513 DOI: 10.1128/msystems.00948-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2023] [Accepted: 04/02/2024] [Indexed: 05/05/2024] Open
Abstract
The majority of newly discovered archaeal lineages remain without a cultivated representative, but scarce experimental data from the cultivated organisms show that they harbor distinct functional repertoires. To unveil the ecological as well as evolutionary impact of Archaea from metagenomics, new computational methods need to be developed, followed by in-depth analysis. Among them is the genome-wide protein fusion screening performed here. Natural fusions and fissions of genes not only contribute to microbial evolution but also complicate the correct identification and functional annotation of sequences. The products of these processes can be defined as fusion (or composite) proteins, which consist of two or more domains originally encoded by different genes, and split proteins, which originate from the separation of a gene in two (fission). Fusion identifications are required for proper phylogenetic reconstructions and metabolic pathway completeness assessments, while mappings between fused and unfused proteins can fill some of the existing gaps in metabolic models. In the archaeal genome-wide screening, more than 1,900 fusion/fission protein clusters were identified, belonging to both newly sequenced and well-studied lineages. These protein families are mainly associated with different types of metabolism, genetic, and cellular processes. Moreover, 162 of the identified fusion/fission protein families are archaeal specific, having no identified fused homolog within the bacterial domain. Our approach was validated by the identification of experimentally characterized fusion/fission cases. However, around 25% of the identified fusion/fission families lack functional annotations for both composite and split states, showing the need for experimental characterization in Archaea.
IMPORTANCE Genome-wide fusion screening has never been performed in Archaea on a broad taxonomic scale.
The overlay of multiple computational techniques allows the detection of a fine-grained set of predicted fusion/fission families, instead of rough estimations based on conserved domain annotations only. The exhaustive mapping of fused proteins to bacterial organisms allows us to capture fusion/fission families that are specific to archaeal biology, as well as to identify links between bacterial and archaeal lineages based on cooccurrence of taxonomically restricted proteins and their sequence features. Furthermore, the identification of poorly characterized lineage-specific fusion proteins opens up possibilities for future experimental and computational investigations. This approach enhances our understanding of Archaea in general and provides potential candidates for in-depth studies in the future.
Affiliation(s)
- Anastasiia Padalko
- Genome Evolution and Ecology Group, Department of Functional and Evolutionary Ecology, University of Vienna, Vienna, Austria
- Vienna Doctoral School of Ecology and Evolution, University of Vienna, Vienna, Austria
- Govind Nair
- Genome Evolution and Ecology Group, Department of Functional and Evolutionary Ecology, University of Vienna, Vienna, Austria
- Filipa L. Sousa
- Genome Evolution and Ecology Group, Department of Functional and Evolutionary Ecology, University of Vienna, Vienna, Austria
|
20
|
Guo J, Chen PK, Chang S. Molecular-Scale Electronics: From Individual Molecule Detection to the Application of Recognition Sensing. Anal Chem 2024; 96:9303-9316. [PMID: 38809941 DOI: 10.1021/acs.analchem.3c04656] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/31/2024]
|
21
|
Ingelman H, Heffernan JK, Harris A, Brown SD, Shaikh KM, Saqib AY, Pinheiro MJ, de Lima LA, Martinez KR, Gonzalez-Garcia RA, Hawkins G, Daleiden J, Tran L, Zeleznik H, Jensen RO, Reynoso V, Schindel H, Jänes J, Simpson SD, Köpke M, Marcellin E, Valgepea K. Autotrophic adaptive laboratory evolution of the acetogen Clostridium autoethanogenum delivers the gas-fermenting strain LAbrini with superior growth, products, and robustness. N Biotechnol 2024; 83:S1871-6784(24)00023-2. [PMID: 38871051 DOI: 10.1016/j.nbt.2024.06.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2024] [Revised: 06/05/2024] [Accepted: 06/10/2024] [Indexed: 06/15/2024]
Abstract
Microbes able to convert gaseous one-carbon (C1) waste feedstocks are increasingly important to transition to the sustainable production of renewable chemicals and fuels. Acetogens are interesting biocatalysts since gas fermentation using Clostridium autoethanogenum has been commercialised. However, most acetogen strains need complex nutrients, display slow growth, and are not robust for bioreactor fermentations. In this work, we used three different and independent adaptive laboratory evolution (ALE) strategies to evolve the wild-type C. autoethanogenum to grow faster, without yeast extract and to be robust in operating continuous bioreactor cultures. Multiple evolved strains with improved phenotypes were isolated on minimal media with one strain, named "LAbrini", exhibiting superior performance regarding the maximum specific growth rate, product profile, and robustness in continuous cultures. Whole-genome sequencing of the evolved strains identified 25 mutations. Of particular interest are two genes that acquired seven different mutations across the three ALE strategies, potentially as a result of convergent evolution. Reverse genetic engineering of mutations in potentially sporulation-related genes CLAU_3129 (spo0A) and CLAU_1957 recovered all three superior features of our ALE strains through triggering significant proteomic rearrangements. This work provides a robust C. autoethanogenum strain "LAbrini" to accelerate phenotyping and genetic engineering and to better understand acetogen metabolism.
Affiliation(s)
- Henri Ingelman
- ERA Chair in Gas Fermentation Technologies, Institute of Bioengineering, University of Tartu, 50411 Tartu, Estonia
- James K Heffernan
- Australian Institute for Bioengineering and Nanotechnology (AIBN), The University of Queensland, 4072 St. Lucia, Australia
- Asfand Yar Saqib
- ERA Chair in Gas Fermentation Technologies, Institute of Bioengineering, University of Tartu, 50411 Tartu, Estonia
- Marina J Pinheiro
- ERA Chair in Gas Fermentation Technologies, Institute of Bioengineering, University of Tartu, 50411 Tartu, Estonia
- Lorena Azevedo de Lima
- ERA Chair in Gas Fermentation Technologies, Institute of Bioengineering, University of Tartu, 50411 Tartu, Estonia
- Karen Rodriguez Martinez
- Australian Institute for Bioengineering and Nanotechnology (AIBN), The University of Queensland, 4072 St. Lucia, Australia
- Ricardo A Gonzalez-Garcia
- Australian Institute for Bioengineering and Nanotechnology (AIBN), The University of Queensland, 4072 St. Lucia, Australia
- Jürgen Jänes
- Institute of Molecular Systems Biology, ETH Zürich, 8049 Zürich, Switzerland
- Esteban Marcellin
- Australian Institute for Bioengineering and Nanotechnology (AIBN), The University of Queensland, 4072 St. Lucia, Australia
- Kaspar Valgepea
- ERA Chair in Gas Fermentation Technologies, Institute of Bioengineering, University of Tartu, 50411 Tartu, Estonia
|
22
|
Hamamsy T, Morton JT, Blackwell R, Berenberg D, Carriero N, Gligorijevic V, Strauss CEM, Leman JK, Cho K, Bonneau R. Protein remote homology detection and structural alignment using deep learning. Nat Biotechnol 2024; 42:975-985. [PMID: 37679542 PMCID: PMC11180608 DOI: 10.1038/s41587-023-01917-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 09/07/2022] [Accepted: 07/26/2023] [Indexed: 09/09/2023]
Abstract
Exploiting sequence-structure-function relationships in biotechnology requires improved methods for aligning proteins that have low sequence similarity to previously annotated proteins. We develop two deep learning methods to address this gap, TM-Vec and DeepBLAST. TM-Vec allows searching for structure-structure similarities in large sequence databases. It is trained to accurately predict TM-scores as a metric of structural similarity directly from sequence pairs without the need for intermediate computation or solution of structures. Once structurally similar proteins have been identified, DeepBLAST can structurally align proteins using only sequence information by identifying structurally homologous regions between proteins. It outperforms traditional sequence alignment methods and performs similarly to structure-based alignment methods. We show the merits of TM-Vec and DeepBLAST on a variety of datasets, including better identification of remotely homologous proteins compared with state-of-the-art sequence alignment and structure prediction methods.
Grants
- R35GM122515 National Science Foundation (NSF)
- IOS-1546218 National Science Foundation (NSF)
- R35 GM122515 NIGMS NIH HHS
- R01 DK103358 NIDDK NIH HHS
- CBET-1728858 National Science Foundation (NSF)
- R01 AI130945 NIAID NIH HHS
- This research was supported by NIH R01DK103358, the Simons Foundation, NSF IOS-1546218, R35GM122515, NSF CBET-1728858, NIH R01AI130945, to T.H. This research was supported by the intramural research program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) to J.T.M. This research was supported by the Flatiron Institute as part of the Simons Foundation to Robert Blackwell, J.K.L., and N.C. This research was supported by Los Alamos National Lab to C.S. This research was supported by the Samsung Advanced Institute of Technology (Next Generation Deep Learning: from pattern recognition to AI), Samsung Research (Improving Deep Learning using Latent Structure), and NSF Award 1922658 to K.C.
- Simons Foundation
- U.S. Department of Health & Human Services | NIH | Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD)
Affiliation(s)
- Tymor Hamamsy
- Center for Data Science, New York University, New York, NY, USA
- James T Morton
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
- Biostatistics and Bioinformatics Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD, USA
- Robert Blackwell
- Scientific Computing Core, Flatiron Institute, Simons Foundation, New York, NY, USA
- Daniel Berenberg
- Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA
- Prescient Design, New York, NY, USA
- Nicholas Carriero
- Scientific Computing Core, Flatiron Institute, Simons Foundation, New York, NY, USA
- Julia Koehler Leman
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
- Kyunghyun Cho
- Center for Data Science, New York University, New York, NY, USA.
- Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA.
- Prescient Design, New York, NY, USA.
- CIFAR, Toronto, Ontario, Canada.
- Richard Bonneau
- Center for Data Science, New York University, New York, NY, USA.
- Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA.
- Prescient Design, New York, NY, USA.
- Department of Biology, New York University, New York, NY, USA.
|
23
|
Liu Y, Zhang Y, Chen Z, Peng J. POLAT: Protein function prediction based on soft mask graph network and residue-Label ATtention. Comput Biol Chem 2024; 110:108064. [PMID: 38677014 DOI: 10.1016/j.compbiolchem.2024.108064] [Received: 08/12/2023] [Revised: 01/19/2024] [Accepted: 03/26/2024] [Indexed: 04/29/2024]
Abstract
MOTIVATION Elucidating protein function is a central problem in biochemistry, genetics, and molecular biology. Developing computational methods for protein function prediction is critical due to the significant gap between sequence and functional data. Recent advances in protein structure prediction, which strongly correlates with function, make it feasible to use structure to predict function. However, current structure-based methods overlook the fact that individual residues may contribute differently to the protein's function and do not take into account the correlation between protein residues and their functions. The challenge of effectively utilizing the relationship between protein residues and function-level information to predict protein function remains unsolved. RESULT We proposed a protein function prediction method based on Soft Mask Graph Networks and Residue-Label Attention (POLAT), which could combine sequence features, predicted structure features, and function-level information to get an accurate prediction. We use soft mask graph networks to adaptively extract the residues relevant to functions. A residue-label attention mechanism is adopted to obtain the protein-level encoded features of a protein, which are then concatenated with a protein-level embedding and fed into a dense classifier to determine the probabilities of each function. POLAT achieves 0.670, 0.515, 0.578 Fmax and 0.677, 0.409, 0.507 AUPR on the PDB cdhit test set for the MFO, BPO, and CCO domains, respectively, outperforming the existing structure-based SOTA method GAT-GO (Fmax 0.633, 0.492, 0.547; AUPR 0.660, 0.381, 0.479). POLAT is also competitive in extensive experiments among sequence-based and multimodal methods and achieves the SOTA performance in three out of six metrics.
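The Fmax values quoted above are easier to interpret with the metric written out. The sketch below follows the protein-centric definition commonly used in function-prediction benchmarks: at each score threshold, precision is averaged over proteins with at least one predicted term, recall is averaged over all proteins, and Fmax is the best resulting F-score. It is an illustration only, not POLAT's evaluation code. The function name, the threshold grid, and the toy term labels are assumptions, and every protein is assumed to have at least one true term.

```python
def fmax(true_terms_per_protein, scores_per_protein, thresholds=None):
    """Protein-centric Fmax.

    true_terms_per_protein: list of sets of true function terms, one set per protein.
    scores_per_protein: list of dicts mapping term -> predicted score in [0, 1].
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 100)]  # assumed grid: 0.01 .. 0.99
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for true_terms, scores in zip(true_terms_per_protein, scores_per_protein):
            predicted = {term for term, s in scores.items() if s >= t}
            tp = len(predicted & true_terms)
            if predicted:
                # Precision only counts proteins with at least one prediction.
                precisions.append(tp / len(predicted))
            # Recall counts every protein (each must have >= 1 true term).
            recalls.append(tp / len(true_terms))
        if not precisions:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best


if __name__ == "__main__":
    truth = [{"term_a", "term_b"}, {"term_c"}]
    preds = [
        {"term_a": 0.9, "term_b": 0.8, "term_d": 0.1},
        {"term_c": 0.7, "term_a": 0.2},
    ]
    print(fmax(truth, preds))
```

Because the threshold is chosen to maximize the score, Fmax summarizes the best achievable precision-recall trade-off rather than performance at a fixed operating point.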
Affiliation(s)
- Yang Liu
- Intelligent Bioinformatics Laboratory, School of Computer and Artificial Intelligence, Wuhan University of Technology, Wuhan, 430070, China.
- Yi Zhang
- Intelligent Bioinformatics Laboratory, School of Computer and Artificial Intelligence, Wuhan University of Technology, Wuhan, 430070, China.
- ZiHao Chen
- Intelligent Bioinformatics Laboratory, School of Computer and Artificial Intelligence, Wuhan University of Technology, Wuhan, 430070, China.
- Jing Peng
- Intelligent Bioinformatics Laboratory, School of Computer and Artificial Intelligence, Wuhan University of Technology, Wuhan, 430070, China.
|
24
|
Ito S, Matsunaga R, Nakakido M, Komura D, Katoh H, Ishikawa S, Tsumoto K. High-throughput system for the thermostability analysis of proteins. Protein Sci 2024; 33:e5029. [PMID: 38801228 PMCID: PMC11129621 DOI: 10.1002/pro.5029] [Received: 03/28/2024] [Revised: 04/30/2024] [Accepted: 05/06/2024] [Indexed: 05/29/2024]
Abstract
Thermal stability of proteins is a primary metric for evaluating their physical properties. Although researchers have attempted to predict it using machine learning frameworks, the performance of these models has depended on the quality and quantity of published data. This is because high-throughput thermodynamic characterization of protein denaturation by fluorescence or calorimetry has been technically challenging. Obtaining a melting curve that derives solely from the target protein requires laborious purification, making it impractical to prepare a hundred or more samples in a single workflow. Here, we aimed to overcome this throughput limitation by leveraging the high protein secretion efficacy of Brevibacillus together with consecutive plate-scale purification steps. By handling the entire process of expression, purification, and analysis on a per-plate basis, we enabled the direct observation of protein denaturation in 384 samples within 4 days. To demonstrate a practical application of the system, we conducted a comprehensive analysis of 186 single mutants of a single-chain variable fragment of nivolumab, obtaining melting temperature (Tm) shifts ranging from -9.3 to +10.8 °C relative to the wild-type sequence. Our findings will enable data-driven stabilization in protein design and streamline rational approaches.
Affiliation(s)
- Sae Ito
- Department of Bioengineering, School of Engineering, The University of Tokyo, Tokyo, Japan
- Ryo Matsunaga
- Department of Bioengineering, School of Engineering, The University of Tokyo, Tokyo, Japan
- Department of Chemistry and Biotechnology, School of Engineering, The University of Tokyo, Tokyo, Japan
- Makoto Nakakido
- Department of Bioengineering, School of Engineering, The University of Tokyo, Tokyo, Japan
- Department of Chemistry and Biotechnology, School of Engineering, The University of Tokyo, Tokyo, Japan
- Daisuke Komura
- Department of Preventive Medicine, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Hiroto Katoh
- Department of Preventive Medicine, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Shumpei Ishikawa
- Department of Preventive Medicine, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Kouhei Tsumoto
- Department of Bioengineering, School of Engineering, The University of Tokyo, Tokyo, Japan
- Department of Chemistry and Biotechnology, School of Engineering, The University of Tokyo, Tokyo, Japan
- The Institute of Medical Science, The University of Tokyo, Tokyo, Japan
|
25
|
Chen N, Yu J, Zhe L, Wang F, Li X, Wong KC. TP-LMMSG: a peptide prediction graph neural network incorporating flexible amino acid property representation. Brief Bioinform 2024; 25:bbae308. [PMID: 38920345 PMCID: PMC11200197 DOI: 10.1093/bib/bbae308] [Received: 04/08/2024] [Revised: 05/28/2024] [Accepted: 06/10/2024] [Indexed: 06/27/2024] Open
Abstract
Bioactive peptide therapeutics have been a long-standing research topic. Notably, antimicrobial peptides (AMPs) have been extensively studied for their therapeutic potential. Meanwhile, the demand for annotating other therapeutic peptides, such as antiviral peptides (AVPs) and anticancer peptides (ACPs), has also increased in recent years. However, we believe that the structure of peptide chains and the intrinsic information between amino acids have not been fully investigated by existing protocols. Therefore, we develop a new graph deep learning model, TP-LMMSG, which is lightweight and easy to deploy while improving annotation performance in a generalizable manner. The results indicate that our model can accurately predict the properties of different peptides. The model surpasses other state-of-the-art models on AMP, AVP and ACP prediction across multiple experimentally validated datasets. Moreover, TP-LMMSG also addresses the challenge of time-consuming pre-processing in graph neural network frameworks. With its flexibility in integrating heterogeneous peptide features, our model can have a substantial impact on the screening and discovery of therapeutic peptides. The source code is available at https://github.com/NanjunChen37/TP_LMMSG.
Affiliation(s)
- Nanjun Chen
- Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Kowloon, Hong Kong SAR
- Jixiang Yu
- Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Kowloon, Hong Kong SAR
- Liu Zhe
- Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Kowloon, Hong Kong SAR
- Fuzhou Wang
- Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Kowloon, Hong Kong SAR
- Xiangtao Li
- School of Artificial Intelligence, Jilin University, Chang Chun, Ji Lin, China
- Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Kowloon, Hong Kong SAR
- Shenzhen Research Institute, City University of Hong Kong, Shenzhen, Guang Dong, China
|
26
|
Joho Y, Royan S, Caputo AT, Newton S, Peat TS, Newman J, Jackson C, Ardevol A. Enhancing PET Degrading Enzymes: A Combinatory Approach. Chembiochem 2024; 25:e202400084. [PMID: 38584134 DOI: 10.1002/cbic.202400084] [Received: 01/30/2024] [Revised: 04/02/2024] [Accepted: 04/04/2024] [Indexed: 04/09/2024]
Abstract
Plastic waste has become a substantial environmental issue. A potential strategy to mitigate this problem is to use enzymatic hydrolysis of plastics to depolymerize post-consumer waste and allow it to be reused. Over the last few decades, PET-degrading enzymes have shown promise as a solution for creating a circular plastic waste economy. PsPETase from Piscinibacter sakaiensis has been identified as an enzyme with tremendous potential for such applications. To improve its efficiency, enzyme engineering has been applied to enhance its thermal stability, enzymatic activity, and ease of production. Here, we combine different strategies such as structure-based rational design, ancestral sequence reconstruction and machine learning to engineer a more active Combi-PETase variant with a melting temperature of 70 °C and optimal performance at 60 °C. Furthermore, this study demonstrates that these approaches, commonly used in other enzyme engineering work, are most effective when used in combination, enabling the improvement of enzymes for industrial applications.
Affiliation(s)
- Yvonne Joho
- Manufacturing, Commonwealth Scientific and Industrial Research Organisation, Clayton, Victoria, 3168, Australia
- Research School of Chemistry, Australian National University, Canberra, ACT 2601, Australia
- CSIRO Advanced Engineering Biology Future Science Platform, GPO Box 1700, Canberra, ACT 2601, Australia
- Santana Royan
- Manufacturing, Commonwealth Scientific and Industrial Research Organisation, Clayton, Victoria, 3168, Australia
- Alessandro T Caputo
- Manufacturing, Commonwealth Scientific and Industrial Research Organisation, Clayton, Victoria, 3168, Australia
- Sophia Newton
- Manufacturing, Commonwealth Scientific and Industrial Research Organisation, Clayton, Victoria, 3168, Australia
- Thomas S Peat
- School of Biotechnology & Biomolecular Sciences, University of New South Wales, Sydney, NSW 2052, Australia
- Janet Newman
- School of Biotechnology & Biomolecular Sciences, University of New South Wales, Sydney, NSW 2052, Australia
- Colin Jackson
- Research School of Chemistry, Australian National University, Canberra, ACT 2601, Australia
- ARC Centre of Excellence for Innovations in Peptide & Protein Science, Research School of Chemistry, Australian National University, Canberra, ACT 2601, Australia
- ARC Centre of Excellence for Innovations in Synthetic Biology, Research School of Chemistry, Australian National University, Canberra, ACT 2601, Australia
- Albert Ardevol
- Manufacturing, Commonwealth Scientific and Industrial Research Organisation, Clayton, Victoria, 3168, Australia
- CSIRO Advanced Engineering Biology Future Science Platform, GPO Box 1700, Canberra, ACT 2601, Australia
|
27
|
Ye B, Tian W, Wang B, Liang J. CASTpFold: Computed Atlas of Surface Topography of the universe of protein Folds. bioRxiv 2024:2024.05.04.592496. [PMID: 38766001 PMCID: PMC11100609 DOI: 10.1101/2024.05.04.592496] [Indexed: 05/22/2024]
Abstract
Geometric and topological properties of protein structures, including surface pockets, interior cavities, and cross channels, are of fundamental importance for proteins to carry out their functions. Computed Atlas of Surface Topography of proteins (CASTp) is a widely used web server for locating, delineating, and measuring these geometric and topological properties of protein structures. Recent developments in AI-based protein structure prediction such as AlphaFold2 (AF2) have significantly expanded our knowledge on protein structures. Here we present CASTpFold, a continuation of CASTp that provides accurate and comprehensive identifications and quantifications of protein topography. It now provides (i) results on an expanded database of proteins, including the Protein Data Bank (PDB) and non-singleton representative structures of AlphaFold2 structures, covering 183 million AF2 structures; (ii) functional pockets prediction with corresponding Gene Ontology (GO) terms or Enzyme Commission (EC) numbers for AF2-predicted structures; and (iii) pocket similarity search function for surface and protein-protein interface pockets. The CASTpFold web server is freely accessible at https://cfold.bme.uic.edu/castpfold/.
Affiliation(s)
- Bowei Ye
- Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607, USA
- Wei Tian
- Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Boshen Wang
- UT Southwestern Medical Center, Dallas, TX 75390, USA
- Jie Liang
- Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607, USA
|
28
|
Huang J, Li J, Chen Q, Wang X, Chen G, Tang J. Freeprotmap: waiting-free prediction method for protein distance map. BMC Bioinformatics 2024; 25:176. [PMID: 38704533 PMCID: PMC11069170 DOI: 10.1186/s12859-024-05771-0] [Received: 12/20/2023] [Accepted: 04/09/2024] [Indexed: 05/06/2024] Open
Abstract
BACKGROUND Protein residue-residue distance maps are used for remote homology detection, protein information estimation, and protein structure research. However, existing prediction approaches are time-consuming, and hundreds of millions of proteins are discovered each year, necessitating the development of a rapid and reliable prediction method for protein residue-residue distances. Moreover, because many proteins lack known homologous sequences, a waiting-free and alignment-free deep learning method is needed. RESULT In this study, we propose a learning framework named FreeProtMap. For protein representation processing, the proposed group pooling in FreeProtMap effectively mitigates issues arising from high-dimensional sparseness in protein representation. For the model structure, we have made several careful design choices. First, the model is designed around the locality of protein structures and triangle inequality distance constraints to improve prediction accuracy. Second, inference speed is improved by using additive attention and a lightweight design. In addition, generalization ability is improved by using bottlenecks and a neural network block named local microformer. As a result, FreeProtMap can predict protein residue-residue distances in tens of milliseconds and has higher precision than the best structure prediction method. CONCLUSION Several groups of comparative experiments and ablation experiments verify the effectiveness of the designs. The results demonstrate that FreeProtMap significantly outperforms other state-of-the-art methods in accurate protein residue-residue distance prediction, which benefits many protein research tasks. It is worth mentioning that FreeProtMap could be used to scan all proteins discovered each year to find structurally similar proteins in a short time, because structure similarity calculation based on distance maps is much less time-consuming than algorithms based on 3D structures.
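To make the two objects mentioned in this abstract concrete, the sketch below builds a residue-residue distance map from 3-D coordinates and checks whether a distance map respects the triangle inequality. It is not FreeProtMap's implementation: the abstract only states that the model design uses triangle inequality distance constraints, so the function names, the tolerance, and the brute-force check are illustrative assumptions.

```python
import math
from itertools import combinations


def distance_map(coords):
    """Pairwise Euclidean distance matrix for a list of (x, y, z) residue coordinates."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def violates_triangle_inequality(dmap, tol=1e-9):
    """Return True if any residue triple (i, j, k) breaks d(i, k) <= d(i, j) + d(j, k).

    A map computed from real coordinates never violates this; a predicted map can.
    The check is O(n^3), so it is only meant for small illustrative inputs.
    """
    for i, j, k in combinations(range(len(dmap)), 3):
        a, b, c = dmap[i][j], dmap[j][k], dmap[i][k]
        if a > b + c + tol or b > a + c + tol or c > a + b + tol:
            return True
    return False


if __name__ == "__main__":
    coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
    dmap = distance_map(coords)
    print(dmap[0][2], violates_triangle_inequality(dmap))
```

A predicted map that fails this check cannot correspond to any set of points in space, which is one reason such a constraint is a natural inductive bias for distance prediction.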
Affiliation(s)
- Jiajian Huang
- Zhejiang Lab, Zhejiang, China.
- Dalian University of Technology, Liaoning, China.
- Jinpeng Li
- Zhejiang Lab, Zhejiang, China
- The Chinese University of Hong Kong, Hong Kong, China
- Xia Wang
- Zhejiang Lab, Zhejiang, China.
- Dalian University of Technology, Liaoning, China.
|
29
|
Pan H, Wu Z, Liu W, Zhang G. AlphaFun: Structural-Alignment-Based Proteome Annotation Reveals why the Functionally Unknown Proteins (uPE1) Are So Understudied. J Proteome Res 2024; 23:1593-1602. [PMID: 38626392 PMCID: PMC11078154 DOI: 10.1021/acs.jproteome.3c00678] [Received: 10/15/2023] [Revised: 03/27/2024] [Accepted: 04/03/2024] [Indexed: 04/18/2024]
Abstract
With the rapid expansion of genome sequencing, the functional annotation of proteins has become a bottleneck in understanding proteomes. The Chromosome-centric Human Proteome Project (C-HPP) aims to identify all proteins encoded by the human genome and find functional annotations for them. However, 1137 identified human proteins still lack functional annotation; these are called uPE1 proteins. Sequence alignment was insufficient to predict their functions, and the crystal structures of most proteins were unavailable. In this study, we demonstrated a new functional annotation strategy, AlphaFun, based on structural alignment using deep-learning-predicted protein structures. Using this strategy, we functionally annotated 99% of the human proteome, including the uPE1 proteins and missing proteins, which have not been identified yet. The accuracy of the functional annotations was validated using the known-function proteins. The uPE1 proteins shared similar functions with the known-function PE1 proteins and tend to be expressed only in very limited tissues. They are evolutionarily young genes and thus likely function only in specific tissues and conditions, limiting their occurrence in commonly studied biological models. Such functional annotations provide hints for functional investigations of the uPE1 proteins. This proteome-wide functional annotation strategy is also applicable to any other species.
Affiliation(s)
- Hengxin Pan
- MOE Key Laboratory of Tumor Molecular Biology and Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Institute of Life and Health Engineering, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
- Zhenqi Wu
- MOE Key Laboratory of Tumor Molecular Biology and Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Institute of Life and Health Engineering, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
- Wanting Liu
- MOE Key Laboratory of Tumor Molecular Biology and Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Institute of Life and Health Engineering, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
- Gong Zhang
- MOE Key Laboratory of Tumor Molecular Biology and Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Institute of Life and Health Engineering, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
|
30
|
Rollins ZA, Widatalla T, Waight A, Cheng AC, Metwally E. AbLEF: antibody language ensemble fusion for thermodynamically empowered property predictions. Bioinformatics 2024; 40:btae268. [PMID: 38627249 PMCID: PMC11256947 DOI: 10.1093/bioinformatics/btae268] [Received: 01/26/2024] [Revised: 03/27/2024] [Accepted: 04/23/2024] [Indexed: 05/08/2024] Open
Abstract
MOTIVATION Pre-trained protein language and/or structural models are often fine-tuned on drug development properties (i.e. developability properties) to accelerate drug discovery initiatives. However, these models generally rely on a single structural conformation and/or a single sequence as a molecular representation. We present a physics-based model, whereby 3D conformational ensemble representations are fused by a transformer-based architecture and concatenated to a language representation to predict antibody protein properties. Antibody language ensemble fusion enables the direct infusion of thermodynamic information into latent space and this enhances property prediction by explicitly infusing dynamic molecular behavior that occurs during experimental measurement. RESULTS We showcase the antibody language ensemble fusion model on two developability properties: hydrophobic interaction chromatography retention time and temperature of aggregation (Tagg). We find that (i) 3D conformational ensembles that are generated from molecular simulation can further improve antibody property prediction for small datasets, (ii) the performance benefit from 3D conformational ensembles matches shallow machine learning methods in the small data regime, and (iii) fine-tuned large protein language models can match smaller antibody-specific language models at predicting antibody properties. AVAILABILITY AND IMPLEMENTATION AbLEF codebase is available at https://github.com/merck/AbLEF.
Affiliation(s)
- Zachary A Rollins
- Modeling and Informatics, Merck & Co., Inc, South San Francisco, CA, 94080, United States
- Talal Widatalla
- Modeling and Informatics, Merck & Co., Inc, South San Francisco, CA, 94080, United States
- Andrew Waight
- Discovery Biologics, Merck & Co., Inc, South San Francisco, CA, 94080, United States
- Alan C Cheng
- Modeling and Informatics, Merck & Co., Inc, South San Francisco, CA, 94080, United States
- Essam Metwally
- Modeling and Informatics, Merck & Co., Inc, South San Francisco, CA, 94080, United States
|
31
|
Armah-Sekum RE, Szedmak S, Rousu J. Protein function prediction through multi-view multi-label latent tensor reconstruction. BMC Bioinformatics 2024; 25:174. [PMID: 38698340 PMCID: PMC11067221 DOI: 10.1186/s12859-024-05789-4] [Received: 02/23/2024] [Accepted: 04/17/2024] [Indexed: 05/05/2024] Open
Abstract
BACKGROUND In the last two decades, the use of high-throughput sequencing technologies has accelerated the pace of discovery of proteins. However, due to the time and resource limitations of rigorous experimental functional characterization, the functions of a vast majority of them remain unknown. As a result, computational methods offering accurate, fast and large-scale assignment of functions to new and previously unannotated proteins are sought after. Leveraging the underlying associations between the multiplicity of features that describe proteins could reveal functional insights into the diverse roles of proteins and improve performance on the automatic function prediction task. RESULTS We present GO-LTR, a multi-view multi-label prediction model that relies on a high-order tensor approximation of model weights combined with non-linear activation functions. The model is capable of learning high-order relationships between multiple input views representing the proteins and predicting high-dimensional multi-label output consisting of protein functional categories. We demonstrate the competitiveness of our method on various performance measures. Experiments show that GO-LTR learns polynomial combinations between different protein features, resulting in improved performance. Additional investigations establish GO-LTR's practical potential in assigning functions to proteins under diverse challenging scenarios: very low sequence similarity to previously observed sequences, and rarely observed and highly specific terms in the gene ontology. IMPLEMENTATION The code and data used for training GO-LTR are available at https://github.com/aalto-ics-kepaco/GO-LTR-prediction.
Affiliation(s)
- Robert Ebo Armah-Sekum
- Department of Computer Science, Aalto University, Konemiehentie 2, 02150, Espoo, Finland.
- Sandor Szedmak
- Department of Computer Science, Aalto University, Konemiehentie 2, 02150, Espoo, Finland
- Juho Rousu
- Department of Computer Science, Aalto University, Konemiehentie 2, 02150, Espoo, Finland.
|
32
|
Ravichandran A, Araque JC, Lawson JW. Predicting the functional state of protein kinases using interpretable graph neural networks from sequence and structural data. Proteins 2024; 92:623-636. [PMID: 38083830 DOI: 10.1002/prot.26641] [Received: 02/26/2023] [Revised: 10/13/2023] [Accepted: 11/09/2023] [Indexed: 04/13/2024]
Abstract
Protein kinases are central to cellular activities and are actively pursued as drug targets for several conditions including cancer and autoimmune diseases. Despite the availability of a large structural database for kinases, methodologies to elucidate the structure-function relationship of these proteins (without manual intervention) are lacking. Such techniques are essential in structural biology and to accelerate drug discovery efforts. Here, we implement an interpretable graph neural network (GNN) framework for classifying the functionally active and inactive states of a large set of protein kinases using only their tertiary structure and amino acid sequence. We show that the GNN models can classify kinase structures with high accuracy (>97%). We implement Gradient-weighted Class Activation Mapping for graphs (Graph Grad-CAM) to automatically identify structurally important residues and residue-residue contacts of the kinases without any a priori input. We show that the motifs identified through the Graph Grad-CAM methodology are functionally critical, consistent with the existing kinase literature. Notably, the highly conserved DFG and HRD motifs of the well-known hydrophobic spine are identified by the interpretable framework, in addition to some of the lesser-known motifs. Further, using Grad-CAM maps as the vector embedding of the protein structures, we identify subtle differences in the crystal structures among different sub-classes of kinases in the Protein Data Bank (PDB). Frameworks such as the one implemented here, for high-throughput identification of protein structure-function relationships, are essential in designing targeted small-molecule therapies as well as in engineering new proteins for novel applications.
Affiliation(s)
- Ashwin Ravichandran
- KBR Inc., Intelligent Systems Division, NASA Ames Research Center, Moffett Field, California, USA
- Juan C Araque
- KBR Inc., Intelligent Systems Division, NASA Ames Research Center, Moffett Field, California, USA
- John W Lawson
- Intelligent Systems Division, NASA Ames Research Center, Moffett Field, California, USA
|
33
|
Ma W, Bi X, Jiang H, Zhang S, Wei Z. CollaPPI: A Collaborative Learning Framework for Predicting Protein-Protein Interactions. IEEE J Biomed Health Inform 2024; 28:3167-3177. [PMID: 38466584 DOI: 10.1109/jbhi.2024.3375621] [Indexed: 03/13/2024]
Abstract
Exploring protein-protein interaction (PPI) is of paramount importance for elucidating the intrinsic mechanism of various biological processes. Nevertheless, experimental determination of PPI can be both time-consuming and expensive, motivating the exploration of data-driven deep learning technologies as a viable, efficient, and accurate alternative. Nonetheless, most current deep learning-based methods regarded a pair of proteins to be predicted for possible interaction as two separate entities when extracting PPI features, thus neglecting the knowledge sharing among the collaborative protein and the target protein. Aiming at the above issue, a collaborative learning framework CollaPPI was proposed in this study, where two kinds of collaboration, i.e., protein-level collaboration and task-level collaboration, were incorporated to achieve not only the knowledge-sharing between a pair of proteins, but also the complementation of such shared knowledge between biological domains closely related to PPI (i.e., protein function, and subcellular location). Evaluation results demonstrated that CollaPPI obtained superior performance compared to state-of-the-art methods on two PPI benchmarks. Besides, evaluation results of CollaPPI on the additional PPI type prediction task further proved its excellent generalization ability.
|
34
|
Hu F, Zhang W, Huang H, Li W, Li Y, Yin P. A Transferability-Based Method for Evaluating the Protein Representation Learning. IEEE J Biomed Health Inform 2024; 28:3158-3166. [PMID: 38416611 DOI: 10.1109/jbhi.2024.3370680] [Indexed: 03/01/2024]
Abstract
Self-supervised pre-trained language models have recently risen as a powerful approach in learning protein representations, showing exceptional effectiveness in various biological tasks, such as drug discovery. Amidst the evolving trend in protein language model development, there is an observable shift towards employing large-scale multimodal and multitask models. However, the predominant reliance on empirical assessments using specific benchmark datasets for evaluating these models raises concerns about the comprehensiveness and efficiency of current evaluation methods. Addressing this gap, our study introduces a novel quantitative approach for estimating the performance of transferring multi-task pre-trained protein representations to downstream tasks. This transferability-based method is designed to quantify the similarities in latent space distributions between pre-trained features and those fine-tuned for downstream tasks. It encompasses a broad spectrum, covering multiple domains and a variety of heterogeneous tasks. To validate this method, we constructed a diverse set of protein-specific pre-training tasks. The resulting protein representations were then evaluated across several downstream biological tasks. Our experimental results demonstrate a robust correlation between the transferability scores obtained using our method and the actual transfer performance observed. This significant correlation highlights the potential of our method as a more comprehensive and efficient tool for evaluating protein representation learning.
35
Ding K, Luo J, Luo Y. Leveraging conformal prediction to annotate enzyme function space with limited false positives. PLoS Comput Biol 2024; 20:e1012135. [PMID: 38809942 PMCID: PMC11164347 DOI: 10.1371/journal.pcbi.1012135] [Received: 09/02/2023] [Revised: 06/10/2024] [Accepted: 05/03/2024] [Indexed: 05/31/2024] Open
Abstract
Machine learning (ML) is increasingly being used to guide biological discovery in biomedicine such as prioritizing promising small molecules in drug discovery. In those applications, ML models are used to predict the properties of biological systems, and researchers use these predictions to prioritize candidates as new biological hypotheses for downstream experimental validations. However, when applied to unseen situations, these models can be overconfident and produce a large number of false positives. One solution to address this issue is to quantify the model's prediction uncertainty and provide a set of hypotheses with a controlled false discovery rate (FDR) pre-specified by researchers. We propose CPEC, an ML framework for FDR-controlled biological discovery. We demonstrate its effectiveness using enzyme function annotation as a case study, simulating the discovery process of identifying the functions of less-characterized enzymes. CPEC integrates a deep learning model with a statistical tool known as conformal prediction, providing accurate and FDR-controlled function predictions for a given protein enzyme. Conformal prediction provides rigorous statistical guarantees to the predictive model and ensures that the expected FDR will not exceed a user-specified level with high probability. Evaluation experiments show that CPEC achieves reliable FDR control, better or comparable prediction performance at a lower FDR than existing methods, and accurate predictions for enzymes under-represented in the training data. We expect CPEC to be a useful tool for biological discovery applications where a high yield rate in validation experiments is desired but the experimental budget is limited.
Affiliation(s)
- Kerr Ding
  - School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia, United States of America
- Jiaqi Luo
  - School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia, United States of America
- Yunan Luo
  - School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia, United States of America
36
Wang H, Chen M, Wei X, Xia R, Pei D, Huang X, Han B. Computational tools for plant genomics and breeding. SCIENCE CHINA. LIFE SCIENCES 2024:10.1007/s11427-024-2578-6. [PMID: 38676814 DOI: 10.1007/s11427-024-2578-6] [Received: 02/05/2024] [Accepted: 03/25/2024] [Indexed: 04/29/2024]
Abstract
Plant genomics and crop breeding are at the intersection of biotechnology and information technology. Driven by a combination of high-throughput sequencing, molecular biology and data science, great advances have been made in omics technologies at every step along the central dogma, especially in genome assembling, genome annotation, epigenomic profiling, and transcriptome profiling. These advances further revolutionized three directions of development. One is genetic dissection of complex traits in crops, along with genomic prediction and selection. The second is comparative genomics and evolution, which open up new opportunities to depict the evolutionary constraints of biological sequences for deleterious variant discovery. The third direction is the development of deep learning approaches for the rational design of biological sequences, especially proteins, for synthetic biology. All three directions of development serve as the foundation for a new era of crop breeding where agronomic traits are enhanced by genome design.
Affiliation(s)
- Hai Wang
  - State Key Laboratory of Maize Bio-breeding, Frontiers Science Center for Molecular Design Breeding, Joint International Research Laboratory of Crop Molecular Breeding, National Maize Improvement Center, College of Agronomy and Biotechnology, China Agricultural University, Beijing, 100193, China.
  - Sanya Institute of China Agricultural University, Sanya, 572025, China.
  - Hainan Yazhou Bay Seed Laboratory, Sanya, 572025, China.
- Mengjiao Chen
  - State Key Laboratory of Tree Genetics and Breeding, Key Laboratory of Tree Breeding and Cultivation of the State Forestry and Grassland Administration, Research Institute of Forestry, Chinese Academy of Forestry, Beijing, 100091, China
- Xin Wei
  - Shanghai Key Laboratory of Plant Molecular Sciences, College of Life Sciences, Shanghai Normal University, Shanghai, 200234, China
- Rui Xia
  - College of Horticulture, South China Agricultural University, Guangzhou, 510640, China
- Dong Pei
  - State Key Laboratory of Tree Genetics and Breeding, Key Laboratory of Tree Breeding and Cultivation of the State Forestry and Grassland Administration, Research Institute of Forestry, Chinese Academy of Forestry, Beijing, 100091, China
- Xuehui Huang
  - Shanghai Key Laboratory of Plant Molecular Sciences, College of Life Sciences, Shanghai Normal University, Shanghai, 200234, China
- Bin Han
  - National Center for Gene Research, CAS Center for Excellence in Molecular Plant Sciences, Chinese Academy of Sciences, Shanghai, 200233, China
37
Chitboonthavisuk C, Martin C, Huss P, Peters JM, Anantharaman K, Raman S. Systematic genome-wide discovery of host factors governing bacteriophage infectivity. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.04.20.590424. [PMID: 38659955 PMCID: PMC11042327 DOI: 10.1101/2024.04.20.590424] [Indexed: 04/26/2024]
Abstract
Bacterial host factors regulate the infection cycle of bacteriophages. Except for some well-studied host factors (e.g., receptors or restriction-modification systems), the contribution of the rest of the host genome on phage infection remains poorly understood. We developed PHAGEPACK, a pooled assay that systematically and comprehensively measures each host-gene impact on phage fitness. PHAGEPACK combines CRISPR interference with phage packaging to link host perturbation to phage fitness during active infection. Using PHAGEPACK, we constructed a genome-wide map of genes impacting T7 phage fitness in permissive E. coli, revealing pathways previously unknown to affect phage packaging. When applied to the non-permissive E. coli O121, PHAGEPACK identified pathways leading to host resistance; their removal increased phage susceptibility up to a billion-fold. Bioinformatic analysis indicates phage genomes carry homologs or truncations of key host factors, potentially for fitness advantage. In summary, PHAGEPACK offers valuable insights into phage-host interactions, phage evolution, and bacterial resistance.
38
Tripp A, Braun M, Wieser F, Oberdorfer G, Lechner H. Click, Compute, Create: A Review of Web-based Tools for Enzyme Engineering. Chembiochem 2024:e202400092. [PMID: 38634409 DOI: 10.1002/cbic.202400092] [Received: 01/31/2024] [Revised: 04/14/2024] [Accepted: 04/15/2024] [Indexed: 04/19/2024]
Abstract
Enzyme engineering, though pivotal across various biotechnological domains, is often plagued by its time-consuming and labor-intensive nature. This review aims to offer an overview of supportive in silico methodologies for this demanding endeavor. Starting from methods to predict protein structures, classify their activity, and even discover new enzymes, we continue by describing tools used to increase the thermostability and production yields of selected targets. Subsequently, we discuss computational methods to modulate both the activity and the selectivity of enzymes. Last, we present recent approaches based on cutting-edge machine learning methods to redesign enzymes. With the exception of the last chapter, there is a strong focus on methods easily accessible via web interfaces or simple Python scripts, and therefore readily usable by a diverse and broad community.
Affiliation(s)
- Adrian Tripp
  - Institute of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010, Graz, Austria
- Markus Braun
  - Institute of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010, Graz, Austria
- Florian Wieser
  - Institute of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010, Graz, Austria
- Gustav Oberdorfer
  - Institute of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010, Graz, Austria
  - BioTechMed, Graz, Austria
- Horst Lechner
  - Institute of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010, Graz, Austria
  - BioTechMed, Graz, Austria
39
Malatesta M, Fornasier E, Di Salvo ML, Tramonti A, Zangelmi E, Peracchi A, Secchi A, Polverini E, Giachin G, Battistutta R, Contestabile R, Percudani R. One substrate many enzymes virtual screening uncovers missing genes of carnitine biosynthesis in human and mouse. Nat Commun 2024; 15:3199. [PMID: 38615009 PMCID: PMC11016064 DOI: 10.1038/s41467-024-47466-3] [Received: 08/25/2023] [Accepted: 03/26/2024] [Indexed: 04/15/2024] Open
Abstract
The increasing availability of experimental and computational protein structures entices their use for function prediction. Here we develop an automated procedure to identify enzymes involved in metabolic reactions by assessing substrate conformations docked to a library of protein structures. By screening AlphaFold-modeled vitamin B6-dependent enzymes, we find that a metric based on catalytically favorable conformations at the enzyme active site performs best (AUROC Score=0.84) in identifying genes associated with known reactions. Applying this procedure, we identify the mammalian gene encoding hydroxytrimethyllysine aldolase (HTMLA), the second enzyme of carnitine biosynthesis. Upon experimental validation, we find that the top-ranked candidates, serine hydroxymethyl transferase (SHMT) 1 and 2, catalyze the HTMLA reaction. However, a mouse protein absent in humans (threonine aldolase; Tha1) catalyzes the reaction more efficiently. Tha1 did not rank highest based on the AlphaFold model, but its rank improved to second place using the experimental crystal structure we determined at 2.26 Å resolution. Our findings suggest that humans have lost a gene involved in carnitine biosynthesis, with HTMLA activity of SHMT partially compensating for its function.
Affiliation(s)
- Marco Malatesta
  - Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma, Italy
- Martino Luigi Di Salvo
  - Istituto Pasteur Italia-Fondazione Cenci Bolognetti and Department of Biochemical Sciences "A. Rossi Fanelli", Sapienza University of Rome, Rome, Italy
- Angela Tramonti
  - Institute of Molecular Biology and Pathology, Italian National Research Council, Rome, Italy
- Erika Zangelmi
  - Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma, Italy
- Alessio Peracchi
  - Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma, Italy
- Andrea Secchi
  - Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma, Italy
- Eugenia Polverini
  - Department of Mathematical, Physical and Computer Sciences, University of Parma, Parma, Italy
- Gabriele Giachin
  - Department of Chemical Sciences, University of Padua, Padova, Italy
- Roberto Contestabile
  - Istituto Pasteur Italia-Fondazione Cenci Bolognetti and Department of Biochemical Sciences "A. Rossi Fanelli", Sapienza University of Rome, Rome, Italy.
- Riccardo Percudani
  - Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma, Italy.
40
Brocidiacono M, Francoeur P, Aggarwal R, Popov KI, Koes DR, Tropsha A. BigBind: Learning from Nonstructural Data for Structure-Based Virtual Screening. J Chem Inf Model 2024; 64:2488-2495. [PMID: 38113513 DOI: 10.1021/acs.jcim.3c01211] [Indexed: 12/21/2023]
Abstract
Deep learning methods that predict protein-ligand binding have recently been used for structure-based virtual screening. Many such models have been trained using protein-ligand complexes with known crystal structures and activities from the PDBbind data set. However, because PDBbind only includes 20K complexes, models typically fail to generalize to new targets, and model performance is on par with models trained with only ligand information. Conversely, the ChEMBL database contains a wealth of chemical activity information but includes no information about binding poses. We introduce BigBind, a data set that maps ChEMBL activity data to proteins from the CrossDocked data set. BigBind comprises 583K ligand activities and includes 3D structures of the protein binding pockets. Additionally, we augmented the data by adding an equal number of putative inactives for each target. Using this data, we developed Banana (basic neural network for binding affinity), a neural network-based model to classify active from inactive compounds, defined by a 10 μM cutoff. Our model achieved an AUC of 0.72 on BigBind's test set, while a ligand-only model achieved an AUC of 0.59. Furthermore, Banana achieved competitive performance on the LIT-PCBA benchmark (median EF1% 1.81) while running 16,000 times faster than molecular docking with Gnina. We suggest that Banana, as well as other models trained on this data set, will significantly improve the outcomes of prospective virtual screening tasks.
Affiliation(s)
- Michael Brocidiacono
  - Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
- Paul Francoeur
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
- Rishal Aggarwal
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
- Konstantin I Popov
  - Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
- David Ryan Koes
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
- Alexander Tropsha
  - Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
41
Zhao Y, Yang Z, Wang L, Zhang Y, Lin H, Wang J. Predicting Protein Functions Based on Heterogeneous Graph Attention Technique. IEEE J Biomed Health Inform 2024; 28:2408-2415. [PMID: 38319781 DOI: 10.1109/jbhi.2024.3357834] [Indexed: 02/08/2024]
Abstract
In bioinformatics, protein function prediction stands as a fundamental area of research and plays a crucial role in addressing various biological challenges, such as the identification of potential targets for drug discovery and the elucidation of disease mechanisms. However, known functional annotation databases usually provide positive experimental annotations that proteins carry out a given function, and rarely record negative experimental annotations that proteins do not carry out a given function. Therefore, existing computational methods based on deep learning models focus on these positive annotations for prediction and ignore these scarce but informative negative annotations, leading to an underestimation of precision. To address this issue, we introduce a deep learning method that utilizes a heterogeneous graph attention technique. The method first constructs a heterogeneous graph that covers the protein-protein interaction network, ontology structure, and positive and negative annotation information. Then, it learns embedding representations of proteins and ontology terms by using the heterogeneous graph attention technique. Finally, it leverages these learned representations to reconstruct the positive protein-term associations and score unobserved functional annotations. It can enhance the predictive performance by incorporating these known limited negative annotations into the constructed heterogeneous graph. Experimental results on three species (i.e., Human, Mouse, and Arabidopsis) demonstrate that our method can achieve better performance in predicting new protein annotations than state-of-the-art methods.
42
Reveguk I, Simonson T. Classifying protein kinase conformations with machine learning. Protein Sci 2024; 33:e4918. [PMID: 38501429 PMCID: PMC10962494 DOI: 10.1002/pro.4918] [Received: 07/26/2023] [Revised: 01/02/2024] [Accepted: 01/22/2024] [Indexed: 03/20/2024]
Abstract
Protein kinases are key actors of signaling networks and important drug targets. They cycle between active and inactive conformations, distinguished by a few elements within the catalytic domain. One is the activation loop, whose conserved DFG motif can occupy DFG-in, DFG-out, and some rarer conformations. Annotation and classification of the structural kinome are important, as different conformations can be targeted by different inhibitors and activators. Valuable resources exist; however, large-scale applications will benefit from increased automation and interpretability of structural annotation. Interpretable machine learning models are described for this purpose, based on ensembles of decision trees. To train them, a set of catalytic domain sequences and structures was collected, somewhat larger and more diverse than existing resources. The structures were clustered based on the DFG conformation and manually annotated. They were then used as training input. Two main models were constructed, which distinguished active/inactive and in/out/other DFG conformations. They initially considered 1692 structural variables, spanning the whole catalytic domain, then identified ("learned") a small subset that sufficed for accurate classification. The first model correctly labeled all but 3 of 3289 structures as active or inactive, while the second assigned the correct DFG label to all but 17 of 8826 structures. The most potent classifying variables were all related to well-known structural elements in or near the activation loop and their ranking gives insights into the conformational preferences. The models were used to automatically annotate 3850 kinase structures predicted recently with the AlphaFold2 tool, showing that AlphaFold2 reproduced the active/inactive but not the DFG-in proportions seen in the Protein Data Bank. We expect the models will be useful for understanding and engineering kinases.
Affiliation(s)
- Ivan Reveguk
  - Laboratoire de Biologie Structurale de la Cellule (CNRS UMR7654), Ecole Polytechnique, Palaiseau, France
- Thomas Simonson
  - Laboratoire de Biologie Structurale de la Cellule (CNRS UMR7654), Ecole Polytechnique, Palaiseau, France
43
Wang JM, Cui RK, Qian ZK, Yang ZZ, Li Y. Mining channel-regulated peptides from animal venom by integrating sequence semantics and structural information. Comput Biol Chem 2024; 109:108027. [PMID: 38340414 DOI: 10.1016/j.compbiolchem.2024.108027] [Received: 11/07/2023] [Revised: 01/24/2024] [Accepted: 02/04/2024] [Indexed: 02/12/2024]
Abstract
Channel-regulated peptides (CRPs) derived from animal venom hold great promise as potential drug candidates for numerous diseases associated with channel proteins. However, discovering and identifying CRPs using traditional bio-experimental methods is a time-consuming and laborious process. While there were a few computational studies on CRPs, they were limited to specific channel proteins, relied heavily on complex feature engineering, and lacked the incorporation of multi-source information. To address these problems, we proposed a novel deep learning model, called DeepCRPs, based on graph neural networks for systematically mining CRPs from animal venom. By combining the sequence semantic and structural information, the classification performance of four CRPs was significantly enhanced, reaching an accuracy of 0.92. This performance surpassed baseline models with accuracies ranging from 0.77 to 0.89. Furthermore, we employed advanced interpretable techniques to explore sequence and structural determinants relevant to the classification of CRPs, yielding potentially valuable bio-function interpretations. Comprehensive experimental results demonstrated the precision and interpretive capability of DeepCRPs, making it an accurate and bio-explainable suite for the identification and categorization of CRPs. Our research will contribute to the discovery and development of toxin peptides targeting channel proteins. The source data and code are freely available at https://github.com/liyigerry/DeepCRPs.
Affiliation(s)
- Jian-Ming Wang
  - College of Mathematics and Computer Science, Dali University, Dali, China
- Rong-Kai Cui
  - College of Mathematics and Computer Science, Dali University, Dali, China
- Zheng-Kun Qian
  - College of Mathematics and Computer Science, Dali University, Dali, China
- Zi-Zhong Yang
  - Yunnan Provincial Key Laboratory of Entomological Biopharmaceutical R&D, College of Pharmacy, Dali University, Dali, China
- Yi Li
  - College of Mathematics and Computer Science, Dali University, Dali, China.
44
Waman VP, Bordin N, Alcraft R, Vickerstaff R, Rauer C, Chan Q, Sillitoe I, Yamamori H, Orengo C. CATH 2024: CATH-AlphaFlow Doubles the Number of Structures in CATH and Reveals Nearly 200 New Folds. J Mol Biol 2024:168551. [PMID: 38548261 DOI: 10.1016/j.jmb.2024.168551] [Received: 01/31/2024] [Revised: 03/20/2024] [Accepted: 03/22/2024] [Indexed: 04/07/2024]
Abstract
CATH (https://www.cathdb.info) classifies domain structures from experimental protein structures in the PDB and predicted structures in the AlphaFold Database (AFDB). To cope with the scale of the predicted data, a new Nextflow workflow (CATH-AlphaFlow) has been developed to classify high-quality domains into CATH superfamilies and identify novel fold groups and superfamilies. CATH-AlphaFlow uses a novel state-of-the-art structure-based domain boundary prediction method (ChainSaw) for identifying domains in multi-domain proteins. We applied CATH-AlphaFlow to process PDB structures not classified in CATH and AFDB structures from 21 model organisms, expanding CATH by over 100%. Domains not classified in existing CATH superfamilies or fold groups were used to seed novel folds, giving 253 new folds from PDB structures (September 2023 release) and 96 from AFDB structures of proteomes of 21 model organisms. Where possible, functional annotations were obtained using (i) predictions from publicly available methods and (ii) annotations from structural relatives in AFDB/UniProt50. We also predicted functional sites and highly conserved residues. Some folds are associated with important functions such as photosynthetic acclimation (in flowering plants), iron permease activity (in fungi) and post-natal spermatogenesis (in mice). CATH-AlphaFlow will allow us to identify many more CATH relatives in the AFDB, further characterising the protein structure landscape.
Affiliation(s)
- Vaishali P Waman
  - Institute of Structural and Molecular Biology, University College London, London, United Kingdom
- Nicola Bordin
  - Institute of Structural and Molecular Biology, University College London, London, United Kingdom
- Rachel Alcraft
  - Advanced Research Computing Centre, University College London, London, United Kingdom
- Robert Vickerstaff
  - Advanced Research Computing Centre, University College London, London, United Kingdom
- Clemens Rauer
  - Institute of Structural and Molecular Biology, University College London, London, United Kingdom
- Qian Chan
  - Institute of Structural and Molecular Biology, University College London, London, United Kingdom
- Ian Sillitoe
  - Institute of Structural and Molecular Biology, University College London, London, United Kingdom
- Hazuki Yamamori
  - Institute of Structural and Molecular Biology, University College London, London, United Kingdom
- Christine Orengo
  - Institute of Structural and Molecular Biology, University College London, London, United Kingdom.
45
Ashrafzadeh S, Golding GB, Ilie S, Ilie L. Scoring alignments by embedding vector similarity. Brief Bioinform 2024; 25:bbae178. [PMID: 38695119 PMCID: PMC11063651 DOI: 10.1093/bib/bbae178] [Received: 11/17/2023] [Revised: 03/20/2024] [Accepted: 03/31/2024] [Indexed: 05/05/2024] Open
Abstract
Sequence similarity is of paramount importance in biology, as similar sequences tend to have similar function and share common ancestry. Scoring matrices, such as PAM or BLOSUM, play a crucial role in all bioinformatics algorithms for identifying similarities, but have the drawback that they are fixed, independent of context. We propose a new scoring method for amino acid similarity that remedies this weakness, being contextually dependent. It relies on recent advances in deep learning architectures that employ self-supervised learning in order to leverage the power of enormous amounts of unlabelled data to generate contextual embeddings, which are vector representations for words. These ideas have been applied to protein sequences, producing embedding vectors for protein residues. We propose the E-score between two residues as the cosine similarity between their embedding vector representations. Thorough testing on a wide variety of reference multiple sequence alignments indicates that the alignments produced using the new E-score method, especially ProtT5-score, are significantly better than those obtained using BLOSUM matrices. The new method proposes to change the way alignments are computed, with far-reaching implications in all areas of textual data that use sequence similarity. The program to compute alignments based on various E-scores is available as a web server at e-score.csd.uwo.ca. The source code is freely available for download from github.com/lucian-ilie/E-score.
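The E-score definition in this abstract (cosine similarity between two residue embedding vectors) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `e_score` and the toy three-dimensional vectors are assumptions made for illustration, and real embeddings would come from a protein language model such as ProtT5.

```python
import math


def e_score(u, v):
    """Cosine similarity between two residue embedding vectors.

    `u` and `v` are plain sequences of floats; the name `e_score` is a
    hypothetical helper, not taken from the published E-score code.
    """
    if len(u) != len(v):
        raise ValueError("embedding vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors (made up): parallel vectors score close to 1, orthogonal ones 0.
parallel = e_score([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = e_score([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Unlike a fixed substitution matrix entry, such a score depends on the context-specific embeddings, so the same pair of amino acids can score differently at different alignment positions.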
Affiliation(s)
- Sepehr Ashrafzadeh
  - Department of Computer Science, University of Western Ontario, London, N6A 5B7, Ontario, Canada
- G Brian Golding
  - Department of Biology, McMaster University, Hamilton, L8S 4K1, Ontario, Canada
- Silvana Ilie
  - Department of Mathematics, Toronto Metropolitan University, Toronto, M5B 2K3, Ontario, Canada
- Lucian Ilie
  - Department of Computer Science, University of Western Ontario, London, N6A 5B7, Ontario, Canada
46
Li X, Qian Y, Hu Y, Chen J, Yue H, Deng L. MSF-PFP: A Novel Multisource Feature Fusion Model for Protein Function Prediction. J Chem Inf Model 2024; 64:1502-1511. [PMID: 38413369 DOI: 10.1021/acs.jcim.3c01794] [Indexed: 02/29/2024]
Abstract
Protein function prediction is essential for disease treatment and drug development; yet, traditional biological experimental methods are less efficient in annotating protein function, and existing automated methods fail to fully leverage protein multisource data. Here, we present MSF-PFP, a computational framework that fuses multisource data features to predict protein function with high accuracy. Our framework designs specific models for feature extraction based on the characteristics of various data sources, including a global-local-individual strategy for local location features. MSF-PFP then integrates extracted features through a multisource feature fusion model, ultimately categorizing protein functions. Experimental results demonstrate that MSF-PFP outperforms eight state-of-the-art models, achieving FMax scores of 0.542, 0.675, and 0.624 for the biological process (BP), molecular function (MF), and cellular component (CC), respectively. The source code and data set for MSF-PFP are available at https://swanhub.co/TianGua/MSF-PFP, facilitating further exploration and validation of the proposed framework. This study highlights the potential of multisource data fusion in enhancing protein function prediction, contributing to improved disease therapy and medication discovery strategies.
Affiliation(s)
- Xinhui Li
  - School of Software, Xinjiang University, Urumqi 830091, China
  - Key Laboratory of Signal Detection and Processing in Xinjiang Uygur Autonomous Region, Xinjiang University, Urumqi 830046, China
  - Key Laboratory of Software Engineering, Xinjiang University, Urumqi 830091, China
- Yurong Qian
  - School of Software, Xinjiang University, Urumqi 830091, China
  - Key Laboratory of Signal Detection and Processing in Xinjiang Uygur Autonomous Region, Xinjiang University, Urumqi 830046, China
  - Key Laboratory of Software Engineering, Xinjiang University, Urumqi 830091, China
- Yue Hu
  - School of Software, Xinjiang University, Urumqi 830091, China
  - Key Laboratory of Signal Detection and Processing in Xinjiang Uygur Autonomous Region, Xinjiang University, Urumqi 830046, China
  - Key Laboratory of Software Engineering, Xinjiang University, Urumqi 830091, China
- Jiaying Chen
  - School of Software, Xinjiang University, Urumqi 830091, China
  - Key Laboratory of Signal Detection and Processing in Xinjiang Uygur Autonomous Region, Xinjiang University, Urumqi 830046, China
  - Key Laboratory of Software Engineering, Xinjiang University, Urumqi 830091, China
- Haitao Yue
  - School of Future Technology, Xinjiang University, Urumqi 830017, China
  - Laboratory of Synthetic Biology, School of Life Science and Technology, Xinjiang University, Urumqi 830017, China
- Lei Deng
  - School of Software, Xinjiang University, Urumqi 830091, China
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
|
47
|
Miravet-Verde S, Mazzolini R, Segura-Morales C, Broto A, Lluch-Senar M, Serrano L. ProTInSeq: transposon insertion tracking by ultra-deep DNA sequencing to identify translated large and small ORFs. Nat Commun 2024; 15:2091. [PMID: 38453908 PMCID: PMC10920889 DOI: 10.1038/s41467-024-46112-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/18/2022] [Accepted: 02/14/2024] [Indexed: 03/09/2024] Open
Abstract
Identifying open reading frames (ORFs) being translated is not a trivial task. ProTInSeq is a technique designed to characterize proteomes by sequencing transposon insertions engineered to express a selection marker when they occur in-frame within a protein-coding gene. In the bacterium Mycoplasma pneumoniae, ProTInSeq identifies 83% of its annotated proteins, along with 5 proteins and 153 small ORF-encoded proteins (SEPs; ≤100 aa) that were not previously annotated. Moreover, ProTInSeq can be utilized for detecting translational noise, as well as for relative quantification and transmembrane topology estimation of fitness and non-essential proteins. By integrating various identification approaches, the number of initially annotated SEPs in this bacterium increases from 27 to 329, with a quarter of them predicted to possess antimicrobial potential. Herein, we describe a methodology complementary to Ribo-Seq and mass spectroscopy that can identify SEPs while providing other insights in a proteome with a flexible and cost-effective DNA ultra-deep sequencing approach.
Affiliation(s)
- Samuel Miravet-Verde
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr Aiguader 88, 08003, Barcelona, Spain
  - Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zurich, Zurich, Switzerland
- Carolina Segura-Morales
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr Aiguader 88, 08003, Barcelona, Spain
- Alicia Broto
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr Aiguader 88, 08003, Barcelona, Spain
- Maria Lluch-Senar
  - Pulmobiotics, Dr Aiguader 88, 08003, Barcelona, Spain
  - Institute of Biotechnology and Biomedicine "Vicent Villar Palasi" (IBB), Universitat Autònoma de Barcelona, Barcelona, Spain
- Luis Serrano
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr Aiguader 88, 08003, Barcelona, Spain
  - Universitat Pompeu Fabra (UPF), Barcelona, Spain
  - ICREA, Pg. Lluis Companys 23, 08010, Barcelona, Spain
|
48
|
Kohyama S, Frohn BP, Babl L, Schwille P. Machine learning-aided design and screening of an emergent protein function in synthetic cells. Nat Commun 2024; 15:2010. [PMID: 38443351 PMCID: PMC10914801 DOI: 10.1038/s41467-024-46203-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2023] [Accepted: 02/16/2024] [Indexed: 03/07/2024] Open
Abstract
Recently, utilization of Machine Learning (ML) has led to astonishing progress in computational protein design, bringing into reach the targeted engineering of proteins for industrial and biomedical applications. However, the design of proteins for emergent functions of core relevance to cells, such as the ability to spatiotemporally self-organize and thereby structure the cellular space, is still extremely challenging. While on the generative side conditional generative models and multi-state design are on the rise, for emergent functions there is a lack of tailored screening methods as typically needed in a protein design project, both computational and experimental. Here we describe a proof-of-principle of how such screening, in silico and in vitro, can be achieved for ML-generated variants of a protein that forms intracellular spatiotemporal patterns. For computational screening we use a structure-based divide-and-conquer approach to find the most promising candidates, while for the subsequent in vitro screening we use synthetic cell-mimics as established by Bottom-Up Synthetic Biology. We then show that the best screened candidate can indeed completely substitute the wildtype gene in Escherichia coli. These results raise great hopes for the next level of synthetic biology, where ML-designed synthetic proteins will be used to engineer cellular functions.
Affiliation(s)
- Shunshi Kohyama
  - Dept. Cellular and Molecular Biophysics, Max Planck Institute of Biochemistry, Martinsried, D-82152, Germany
- Béla P Frohn
  - Dept. Cellular and Molecular Biophysics, Max Planck Institute of Biochemistry, Martinsried, D-82152, Germany
- Leon Babl
  - Dept. Cellular and Molecular Biophysics, Max Planck Institute of Biochemistry, Martinsried, D-82152, Germany
- Petra Schwille
  - Dept. Cellular and Molecular Biophysics, Max Planck Institute of Biochemistry, Martinsried, D-82152, Germany
|
49
|
Sagendorf JM, Mitra R, Huang J, Chen XS, Rohs R. PNAbind: Structure-based prediction of protein-nucleic acid binding using graph neural networks. bioRxiv 2024:2024.02.27.582387. [PMID: 38529493 PMCID: PMC10962711 DOI: 10.1101/2024.02.27.582387] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/27/2024]
Abstract
The recognition and binding of nucleic acids (NAs) by proteins depends upon complementary chemical, electrostatic and geometric properties of the protein-NA binding interface. Structural models of protein-NA complexes provide insights into these properties but are scarce relative to models of unbound proteins. We present a deep learning approach for predicting protein-NA binding given the apo structure of a protein (PNAbind). Our method utilizes graph neural networks to encode spatial distributions of physicochemical and geometric properties of the protein molecular surface that are predictive of NA binding. Using global physicochemical encodings, our models predict the overall binding function of a protein and can discriminate between specificity for DNA or RNA binding. We show that such predictions made on protein structures modeled with AlphaFold2 can be used to gain mechanistic understanding of chemical and structural features that determine NA recognition. Using local encodings, our models predict the location of NA binding sites at the level of individual binding residues. Binding site predictions were validated against benchmark datasets, achieving AUROC scores in the range of 0.92-0.95. We applied our models to the HIV-1 restriction factor APOBEC3G and show that our predictions are consistent with experimental RNA binding data.
|
50
|
Borujeni PM, Salavati R. Functional domain annotation by structural similarity. NAR Genom Bioinform 2024; 6:lqae005. [PMID: 38298181 PMCID: PMC10830352 DOI: 10.1093/nargab/lqae005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Revised: 12/03/2023] [Accepted: 01/15/2024] [Indexed: 02/02/2024] Open
Abstract
Traditional automated in silico functional annotation uses tools like Pfam that rely on sequence similarity for domain annotation. However, structural conservation often exceeds sequence conservation, suggesting untapped potential for improved annotation through structural similarity. Before the introduction of AlphaFold2, this approach was overlooked because too few high-quality protein structures were available. Leveraging structural information holds particular promise for accurately annotating diverse proteins across phylogenetic distances. In our study, we evaluated the feasibility of annotating Pfam domains based on structural similarity. To this end, we created a database by segmenting full-length protein structures at their domain boundaries, representing the structures of Pfam seeds. We used Trypanosoma brucei, a phylogenetically distant protozoan parasite, as our model organism. Its structome was aligned with our database using Foldseek, an ultra-fast structural alignment tool, and the top non-overlapping hits were annotated as domains. Our method identified over 400 new domains in the T. brucei proteome, surpassing the benchmark set by the sequence-based tools Pfam and Pfam-N, with some predictions validated manually. We have also addressed limitations and suggested avenues for further enhancing structure-based domain annotation.
Affiliation(s)
- Reza Salavati
  - Institute of Parasitology, McGill University, Ste. Anne de Bellevue, Quebec H9X 3V9, Canada
  - Department of Biochemistry, McGill University, Montreal, Quebec H3G 1Y6, Canada
|