1
Liu J, Yang M, Yu Y, Xu H, Wang T, Li K, Zhou X. Advancing bioinformatics with large language models: components, applications and perspectives. ARXIV 2025:arXiv:2401.04155v2. [PMID: 38259343 PMCID: PMC10802675] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/24/2024]
Abstract
Large language models (LLMs) are a class of artificial intelligence models based on deep learning, which have great performance in various tasks, especially in natural language processing (NLP). Large language models typically consist of artificial neural networks with numerous parameters, trained on large amounts of unlabeled input using self-supervised or semi-supervised learning. However, their potential for solving bioinformatics problems may even exceed their proficiency in modeling human language. In this review, we will provide a comprehensive overview of the essential components of large language models (LLMs) in bioinformatics, spanning genomics, transcriptomics, proteomics, drug discovery, and single-cell analysis. Key aspects covered include tokenization methods for diverse data types, the architecture of transformer models, the core attention mechanism, and the pre-training processes underlying these models. Additionally, we will introduce currently available foundation models and highlight their downstream applications across various bioinformatics domains. Finally, drawing from our experience, we will offer practical guidance for both LLM users and developers, emphasizing strategies to optimize their use and foster further innovation in the field.
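The tokenization step mentioned in this abstract can be illustrated with a small sketch. It shows overlapping k-mer tokenization of a nucleotide sequence, one commonly used scheme for genomic language models; the abstract does not say which schemes the review covers in detail, so the choice of k-mers, the function names `kmer_tokenize` and `build_vocab`, and the special tokens `[PAD]` and `[UNK]` are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List


def kmer_tokenize(sequence: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a nucleotide sequence into k-mers taken every `stride` positions."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    # The last valid start index is len(seq) - k; shorter sequences yield no tokens.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(tokens: Iterable[str]) -> Dict[str, int]:
    """Assign integer ids to tokens in first-seen order (hypothetical special tokens first)."""
    vocab = {"[PAD]": 0, "[UNK]": 1}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


tokens = kmer_tokenize("ATGCGTA", k=3)
vocab = build_vocab(tokens)
# Unknown k-mers fall back to the [UNK] id.
ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
```

With `stride=1` the k-mers overlap, which keeps positional resolution at the cost of longer token sequences; setting `stride=k` gives non-overlapping tokens instead.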
2
Ji S, Wang F, Wu Y, Hu H, Xing Z, Zhu J, Xu S, Han T, Liu G, Wu Z, Fei C, Kong L, Chen J, Ding Z, Huang Z, Zhang J. Large-scale transcript variants dictate neoepitopes for cancer immunotherapy. SCIENCE ADVANCES 2025; 11:eado5600. [PMID: 39888994 PMCID: PMC11784853 DOI: 10.1126/sciadv.ado5600] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2024] [Accepted: 01/02/2025] [Indexed: 02/02/2025]
Abstract
Precise neoepitope discovery is crucial for effective cancer therapeutic vaccines. Conventional approaches struggle to build a repertoire with sufficient immunogenic epitopes. We developed a workflow leveraging full-length ribosome-nascent chain complex-bound mRNA sequencing (FL-RNC seq) and artificial intelligence-based predictive models to accurately identify the neoepitope landscape, especially large-scale transcript variants (LSTVs) missed by short-read sequencing. In the MC38 mouse model, we identified 22 LSTV-derived neoepitopes encoded by a synthesized mRNA lipid nanoparticle vaccine. As a standalone therapy and combined with anti-PD-1 immunotherapy, the vaccine curbed tumor progression, induced robust T cell-specific immunity, and modulated the tumor microenvironment. This underscores the multifaceted potentials of LSTV-derived vaccines. Our approach expands the neoepitope source repertoire, offering a method for discovering personalized cancer vaccines applicable to a broader tumor range. The results highlight the importance of comprehensive neoepitope identification and the promise of LSTV-based vaccines for cancer immunotherapy.
Affiliation(s)
- Shiliang Ji
  - State Key Laboratory of Pharmaceutical Biotechnology, School of Life Sciences, Nanjing University, Nanjing 210023, China
- Feifan Wang
  - State Key Laboratory of Pharmaceutical Biotechnology, School of Life Sciences, Nanjing University, Nanjing 210023, China
- Yongjie Wu
  - State Key Laboratory of Pharmaceutical Biotechnology, School of Life Sciences, Nanjing University, Nanjing 210023, China
- Haoran Hu
  - State Key Laboratory of Pharmaceutical Biotechnology, School of Life Sciences, Nanjing University, Nanjing 210023, China
- Zhen Xing
  - State Key Laboratory of Pharmaceutical Biotechnology, School of Life Sciences, Nanjing University, Nanjing 210023, China
- Jie Zhu
  - State Key Laboratory of Pharmaceutical Biotechnology, School of Life Sciences, Nanjing University, Nanjing 210023, China
- Shi Xu
  - Nanjing Chengshi Biomedical Technology Co. Ltd., Nanjing 210031, China
- Tiyun Han
  - Nanjing Chengshi Biomedical Technology Co. Ltd., Nanjing 210031, China
- Guilai Liu
  - Nanjing Chengshi Biomedical Technology Co. Ltd., Nanjing 210031, China
- Zengding Wu
  - Nanjing Chengshi Biomedical Technology Co. Ltd., Nanjing 210031, China
- Caiyi Fei
  - Nanjing Chengshi Biomedical Technology Co. Ltd., Nanjing 210031, China
- Lingming Kong
  - Nanjing Chengshi Biomedical Technology Co. Ltd., Nanjing 210031, China
- Jiangning Chen
  - State Key Laboratory of Pharmaceutical Biotechnology, School of Life Sciences, Nanjing University, Nanjing 210023, China
- Zhi Ding
  - State Key Laboratory of Pharmaceutical Biotechnology, School of Life Sciences, Nanjing University, Nanjing 210023, China
- Zhen Huang
  - State Key Laboratory of Pharmaceutical Biotechnology, School of Life Sciences, Nanjing University, Nanjing 210023, China
- Junfeng Zhang
  - State Key Laboratory of Pharmaceutical Biotechnology, School of Life Sciences, Nanjing University, Nanjing 210023, China
3
Floudas CS, Sarkizova S, Ceccarelli M, Zheng W. Leveraging mRNA technology for antigen based immuno-oncology therapies. J Immunother Cancer 2025; 13:e010569. [PMID: 39848687 PMCID: PMC11784169 DOI: 10.1136/jitc-2024-010569] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2024] [Accepted: 01/03/2025] [Indexed: 01/25/2025] Open
Abstract
The application of messenger RNA (mRNA) technology in antigen-based immuno-oncology therapies represents a significant advancement in cancer treatment. Cancer vaccines are an effective combinatorial partner to sensitize the host immune system to the tumor and boost the efficacy of immune therapies. Selecting suitable tumor antigens is the key step to devising effective vaccinations and amplifying the immune response. Tumor neoantigens are de novo epitopes derived from somatic mutations, avoiding T-cell central tolerance of self-epitopes and inducing immune responses to tumors. The identification and prioritization of patient-specific tumor neoantigens are based on advanced computational algorithms taking advantage of the profiling with next-generation sequencing considering factors involved in human leukocyte antigen (HLA)-peptide-T-cell receptor (TCR) complex formation, including peptide presentation, HLA-peptide affinity, and TCR recognition. This review discusses the development and clinical application of mRNA vaccines in oncology, with a particular focus on recent clinical trials and the computational workflows and methodologies for identifying both shared and individual antigens. While this review centers on therapeutic mRNA vaccines targeting existing tumors, it does not cover preventative vaccines. Preclinical experimental validations are crucial in cancer vaccine development, but we emphasize the computational approaches that facilitate neoantigen selection and design, highlighting their role in advancing mRNA vaccine development. The versatility and rapid development potential of mRNA make it an ideal platform for personalized neoantigen immunotherapy. We explore various strategies for antigen target identification, including tumor-associated and tumor-specific antigens and the computational tools used to predict epitopes capable of eliciting strong immune responses. 
We address key design considerations for enhancing the immunogenicity and stability of mRNA vaccines, as well as emerging trends and challenges in the field. This comprehensive overview highlights the therapeutic potential of mRNA-based cancer vaccines and underscores ongoing research efforts aimed at optimizing these therapies for improved clinical outcomes.
Affiliation(s)
- Charalampos S Floudas
  - Center for Immuno-Oncology, Center for Cancer Research, National Cancer Institute, Bethesda, Maryland, USA
- Michele Ceccarelli
  - Sylvester Comprehensive Cancer Center, Department of Public Health Sciences, Miller School of Medicine, University of Miami, Miami, Florida, USA
- Wei Zheng
  - Moderna, Inc, Cambridge, Massachusetts, USA
4
Asediya VS, Anjaria PA, Mathakiya RA, Koringa PG, Nayak JB, Bisht D, Fulmali D, Patel VA, Desai DN. Vaccine development using artificial intelligence and machine learning: A review. Int J Biol Macromol 2024; 282:136643. [PMID: 39426778 DOI: 10.1016/j.ijbiomac.2024.136643] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2024] [Revised: 09/30/2024] [Accepted: 10/15/2024] [Indexed: 10/21/2024]
Abstract
The COVID-19 pandemic has underscored the critical importance of effective vaccines, yet their development is a challenging and demanding process. It requires identifying antigens that elicit protective immunity, selecting adjuvants that enhance immunogenicity, and designing delivery systems that ensure optimal efficacy. Artificial intelligence (AI) can facilitate this process by using machine learning methods to analyze large and diverse datasets, suggest novel vaccine candidates, and refine their design and predict their performance. This review explores how AI can be applied to various aspects of vaccine development, such as predicting immune response from protein sequences, discovering adjuvants, optimizing vaccine doses, modeling vaccine supply chains, and predicting protein structures. We also address the challenges and ethical issues that emerge from the use of AI in vaccine development, such as data privacy, algorithmic bias, and health data sensitivity. We contend that AI has immense potential to accelerate vaccine development and respond to future pandemics, but it also requires careful attention to the quality and validity of the data and methods used.
Affiliation(s)
- Deepanker Bisht
  - Indian Veterinary Research Institute, Izatnagar, U.P., India
5
Tu Z, Wang Y, Liang J, Liu J. Helicobacter pylori-targeted AI-driven vaccines: a paradigm shift in gastric cancer prevention. Front Immunol 2024; 15:1500921. [PMID: 39669583 PMCID: PMC11634812 DOI: 10.3389/fimmu.2024.1500921] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/24/2024] [Accepted: 11/08/2024] [Indexed: 12/14/2024] Open
Abstract
Helicobacter pylori (H. pylori), a globally prevalent pathogen and Group I carcinogen, presents a formidable challenge in gastric cancer prevention due to its increasing antimicrobial resistance and strain diversity. This comprehensive review critically analyzes the limitations of conventional antibiotic-based therapies and explores cutting-edge approaches to combat H. pylori infections and associated gastric carcinogenesis. We emphasize the pressing need for innovative therapeutic strategies, with a particular focus on precision medicine and tailored vaccine development. Despite promising advancements in enhancing host immunity, current Helicobacter pylori vaccine clinical trials have yet to achieve long-term efficacy or gain regulatory approval. We propose a paradigm-shifting approach leveraging artificial intelligence (AI) to design precision-targeted, multiepitope vaccines tailored to multiple H. pylori subtypes. This AI-driven strategy has the potential to revolutionize antigen selection and optimize vaccine efficacy, addressing the critical need for personalized interventions in H. pylori eradication efforts. By leveraging AI in vaccine design, we propose a revolutionary approach to precision therapy that could significantly reduce H. pylori-associated gastric cancer burden.
Affiliation(s)
- Jinping Liu
  - State Key Laboratory of Oncology in South China, Guangdong Provincial Clinical Research Center for Cancer, Sun Yat-sen University Cancer Center, Guangzhou, China
6
Su L, Yan Y, Ma B, Zhao S, Cui Z. GIHP: Graph convolutional neural network based interpretable pan-specific HLA-peptide binding affinity prediction. Front Genet 2024; 15:1405032. [PMID: 39050251 PMCID: PMC11266168 DOI: 10.3389/fgene.2024.1405032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2024] [Accepted: 06/20/2024] [Indexed: 07/27/2024] Open
Abstract
Accurately predicting the binding affinities between Human Leukocyte Antigen (HLA) molecules and peptides is a crucial step in understanding the adaptive immune response. This knowledge can have important implications for the development of effective vaccines and the design of targeted immunotherapies. Existing sequence-based methods are insufficient to capture structural information. In addition, current methods lack model interpretability, which hinders revealing the key binding amino acids between the two molecules. To address these limitations, we proposed an interpretable graph convolutional neural network (GCNN) based prediction method named GIHP. Considering the size differences between HLA and short peptides, GIHP represents the HLA structure as an amino acid-level graph and the peptide SMILES string as an atom-level graph. For interpretation, we design a novel visual explanation method, gradient weighted activation mapping (Grad-WAM), for identifying key binding residues. GIHP achieved better prediction accuracy than state-of-the-art methods across various datasets. According to current research findings, mutations in key HLA-peptide binding residues directly impact immunotherapy efficacy. Therefore, we verified those highlighted key residues to see whether they can significantly distinguish immunotherapy patient groups. We have verified that the identified functional residues can successfully separate patient survival groups across breast, bladder, and pan-cancer datasets. Results demonstrate that GIHP improves the accuracy and interpretation capabilities of HLA-peptide prediction, and the findings of this study can be used to guide personalized cancer immunotherapy treatment. Codes and datasets are publicly accessible at: https://github.com/sdustSu/GIHP.
Affiliation(s)
- Lingtao Su
  - Shandong University of Science and Technology, Qingdao, China
- Yan Yan
  - Shandong Guohe Industrial Technology Research Institute Co. Ltd., Jinan, China
- Bo Ma
  - Qingdao UNIC Information Technology Co. Ltd., Qingdao, China
- Shiwei Zhao
  - Shandong University of Science and Technology, Qingdao, China
- Zhenyu Cui
  - Shandong University of Science and Technology, Qingdao, China
7
Machaca V, Goyzueta V, Cruz MG, Sejje E, Pilco LM, López J, Túpac Y. Transformers meets neoantigen detection: a systematic literature review. J Integr Bioinform 2024; 21:jib-2023-0043. [PMID: 38960869 PMCID: PMC11377031 DOI: 10.1515/jib-2023-0043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/24/2023] [Accepted: 03/20/2024] [Indexed: 07/05/2024] Open
Abstract
Cancer immunology offers a new alternative to traditional cancer treatments, such as radiotherapy and chemotherapy. One notable alternative is the development of personalized vaccines based on cancer neoantigens. Moreover, Transformers are considered a revolutionary development in artificial intelligence with a significant impact on natural language processing (NLP) tasks and have been utilized in proteomics studies in recent years. In this context, we conducted a systematic literature review to investigate how Transformers are applied in each stage of the neoantigen detection process. Additionally, we mapped current pipelines and examined the results of clinical trials involving cancer vaccines.
Affiliation(s)
- Erika Sejje
  - Universidad Nacional de San Agustín, Arequipa, Perú
- Yván Túpac
  - Universidad Católica San Pablo, Arequipa, Perú
8
Bulashevska A, Nacsa Z, Lang F, Braun M, Machyna M, Diken M, Childs L, König R. Artificial intelligence and neoantigens: paving the path for precision cancer immunotherapy. Front Immunol 2024; 15:1394003. [PMID: 38868767 PMCID: PMC11167095 DOI: 10.3389/fimmu.2024.1394003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/29/2024] [Accepted: 05/13/2024] [Indexed: 06/14/2024] Open
Abstract
Cancer immunotherapy has witnessed rapid advancement in recent years, with a particular focus on neoantigens as promising targets for personalized treatments. The convergence of immunogenomics, bioinformatics, and artificial intelligence (AI) has propelled the development of innovative neoantigen discovery tools and pipelines. These tools have revolutionized our ability to identify tumor-specific antigens, providing the foundation for precision cancer immunotherapy. AI-driven algorithms can process extensive amounts of data, identify patterns, and make predictions that were once challenging to achieve. However, the integration of AI comes with its own set of challenges, leaving space for further research. With particular focus on the computational approaches, in this article we have explored the current landscape of neoantigen prediction, the fundamental concepts behind it, the challenges, and their potential solutions, providing a comprehensive overview of this rapidly evolving field.
Affiliation(s)
- Alla Bulashevska
  - Host-Pathogen-Interactions, Paul-Ehrlich-Institut, Langen, Germany
- Zsófia Nacsa
  - Host-Pathogen-Interactions, Paul-Ehrlich-Institut, Langen, Germany
- Franziska Lang
  - TRON - Translational Oncology at the University Medical Center of the Johannes Gutenberg University gGmbH, Mainz, Germany
- Markus Braun
  - Host-Pathogen-Interactions, Paul-Ehrlich-Institut, Langen, Germany
- Martin Machyna
  - Host-Pathogen-Interactions, Paul-Ehrlich-Institut, Langen, Germany
- Mustafa Diken
  - TRON - Translational Oncology at the University Medical Center of the Johannes Gutenberg University gGmbH, Mainz, Germany
- Liam Childs
  - Host-Pathogen-Interactions, Paul-Ehrlich-Institut, Langen, Germany
- Renate König
  - Host-Pathogen-Interactions, Paul-Ehrlich-Institut, Langen, Germany
9
Omelchenko AA, Siwek JC, Chhibbar P, Arshad S, Nazarali I, Nazarali K, Rosengart A, Rahimikollu J, Tilstra J, Shlomchik MJ, Koes DR, Joglekar AV, Das J. Sliding Window INteraction Grammar (SWING): a generalized interaction language model for peptide and protein interactions. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.05.01.592062. [PMID: 38746274 PMCID: PMC11092674 DOI: 10.1101/2024.05.01.592062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/16/2024]
Abstract
The explosion of sequence data has allowed the rapid growth of protein language models (pLMs). pLMs have now been employed in many frameworks including variant-effect and peptide-specificity prediction. Traditionally, for protein-protein or peptide-protein interactions (PPIs), corresponding sequences are either co-embedded followed by post-hoc integration or the sequences are concatenated prior to embedding. Interestingly, no method utilizes a language representation of the interaction itself. We developed an interaction LM (iLM), which uses a novel language to represent interactions between protein/peptide sequences. Sliding Window Interaction Grammar (SWING) leverages differences in amino acid properties to generate an interaction vocabulary. This vocabulary is the input into a LM followed by a supervised prediction step where the LM's representations are used as features. SWING was first applied to predicting peptide:MHC (pMHC) interactions. SWING was not only successful at generating Class I and Class II models that have comparable prediction to state-of-the-art approaches, but the unique Mixed Class model was also successful at jointly predicting both classes. Further, the SWING model trained only on Class I alleles was predictive for Class II, a complex prediction task not attempted by any existing approach. For de novo data, using only Class I or Class II data, SWING also accurately predicted Class II pMHC interactions in murine models of SLE (MRL/lpr model) and T1D (NOD model), that were validated experimentally. To further evaluate SWING's generalizability, we tested its ability to predict the disruption of specific protein-protein interactions by missense mutations. Although modern methods like AlphaMissense and ESM1b can predict interfaces and variant effects/pathogenicity per mutation, they are unable to predict interaction-specific disruptions. 
SWING was successful at accurately predicting the impact of both Mendelian mutations and population variants on PPIs. This is the first generalizable approach that can accurately predict interaction-specific disruptions by missense mutations with only sequence information. Overall, SWING is a first-in-class generalizable zero-shot iLM that learns the language of PPIs.
Affiliation(s)
- Alisa A. Omelchenko
  - Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
  - Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
  - Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, PA, USA
  - The joint CMU-Pitt PhD program in computational biology, School of Medicine, University of Pittsburgh, PA, USA
- Jane C. Siwek
  - Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
  - Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
  - Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, PA, USA
  - The joint CMU-Pitt PhD program in computational biology, School of Medicine, University of Pittsburgh, PA, USA
- Prabal Chhibbar
  - Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
  - Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
  - Integrative systems biology PhD program, School of Medicine, University of Pittsburgh, PA, USA
- Sanya Arshad
  - Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
  - Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Iliyan Nazarali
  - Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Kiran Nazarali
  - Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- AnnaElaine Rosengart
  - Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Javad Rahimikollu
  - Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
  - Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
  - Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, PA, USA
  - The joint CMU-Pitt PhD program in computational biology, School of Medicine, University of Pittsburgh, PA, USA
- Jeremy Tilstra
  - Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
  - Division of Rheumatology and Clinical Immunology, Department of Medicine, School of Medicine, University of Pittsburgh, PA, USA
- Mark J. Shlomchik
  - Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- David R. Koes
  - Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, PA, USA
- Alok V. Joglekar
  - Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
  - Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
  - Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, PA, USA
- Jishnu Das
  - Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
  - Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
  - Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, PA, USA
10
Zhang L, Song W, Zhu T, Liu Y, Chen W, Cao Y. ConvNeXt-MHC: improving MHC-peptide affinity prediction by structure-derived degenerate coding and the ConvNeXt model. Brief Bioinform 2024; 25:bbae133. [PMID: 38561979 PMCID: PMC10985285 DOI: 10.1093/bib/bbae133] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2023] [Revised: 02/11/2024] [Accepted: 03/02/2024] [Indexed: 04/04/2024] Open
Abstract
Peptide binding to major histocompatibility complex (MHC) proteins plays a critical role in T-cell recognition and the specificity of the immune response. Experimental validation of such peptides is extremely resource-intensive. As a result, accurate computational prediction of binding peptides is highly important, particularly in the context of cancer immunotherapy applications, such as the identification of neoantigens. In recent years, there has been a significant need to continually improve the existing prediction methods to meet the demands of this field. We developed ConvNeXt-MHC, a method for predicting MHC-I-peptide binding affinity. It introduces a degenerate encoding approach to enhance well-established panspecific methods and integrates transfer learning and semi-supervised learning methods into the cutting-edge deep learning framework ConvNeXt. Comprehensive benchmark results demonstrate that ConvNeXt-MHC outperforms state-of-the-art methods in terms of accuracy. We expect that ConvNeXt-MHC will help us foster new discoveries in the field of immunoinformatics in the distant future. We constructed a user-friendly website at http://www.combio-lezhang.online/predict/, where users can access our data and application.
Affiliation(s)
- Le Zhang
  - College of Computer Science, Sichuan University, Chengdu 610065, China
- Wenkai Song
  - College of Computer Science, Sichuan University, Chengdu 610065, China
- Tinghao Zhu
  - College of Computer Science, Sichuan University, Chengdu 610065, China
  - Nuclear Power Institute of China, Chengdu 610213, China
- Yang Liu
  - Center of Growth, Metabolism and Aging, Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, No. 29 Wangjiang Road, Chengdu 610065, China
- Wei Chen
  - Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
- Yang Cao
  - Center of Growth, Metabolism and Aging, Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, No. 29 Wangjiang Road, Chengdu 610065, China
11
Kumar N, Srivastava R. Deep learning in structural bioinformatics: current applications and future perspectives. Brief Bioinform 2024; 25:bbae042. [PMID: 38701422 PMCID: PMC11066934 DOI: 10.1093/bib/bbae042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2023] [Revised: 01/05/2024] [Accepted: 01/18/2024] [Indexed: 05/05/2024] Open
Abstract
In this review article, we explore the transformative impact of deep learning (DL) on structural bioinformatics, emphasizing its pivotal role in a scientific revolution driven by extensive data, accessible toolkits and robust computing resources. As big data continue to advance, DL is poised to become an integral component in healthcare and biology, revolutionizing analytical processes. Our comprehensive review provides detailed insights into DL, featuring specific demonstrations of its notable applications in bioinformatics. We address challenges tailored for DL, spotlight recent successes in structural bioinformatics and present a clear exposition of DL, from basic shallow neural networks to advanced models such as convolution, recurrent, artificial and transformer neural networks. This paper discusses the emerging use of DL for understanding biomolecular structures, anticipating ongoing developments and applications in the realm of structural bioinformatics.
Affiliation(s)
- Niranjan Kumar
  - School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi, India
- Rakesh Srivastava
  - Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, India
12
Yang Y, Wei Z, Cia G, Song X, Pucci F, Rooman M, Xue F, Hou Q. MHCII-peptide presentation: an assessment of the state-of-the-art prediction methods. Front Immunol 2024; 15:1293706. [PMID: 38646540 PMCID: PMC11027168 DOI: 10.3389/fimmu.2024.1293706] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2023] [Accepted: 02/19/2024] [Indexed: 04/23/2024] Open
Abstract
Major histocompatibility complex Class II (MHCII) proteins initiate and regulate immune responses by presentation of antigenic peptides to CD4+ T-cells and self-restriction. The interactions between MHCII and peptides determine the specificity of the immune response and are crucial in immunotherapy and cancer vaccine design. With the ever-increasing amount of MHCII-peptide binding data available, many computational approaches have been developed for MHCII-peptide interaction prediction over the last decade. There is thus an urgent need to provide an up-to-date overview and assessment of these newly developed computational methods. To benchmark the prediction performance of these methods, we constructed an independent dataset containing binding and non-binding peptides to 20 human MHCII protein allotypes from the Immune Epitope Database, covering DP, DR and DQ alleles. After collecting 11 known predictors up to January 2022, we evaluated those available through a webserver or standalone packages on this independent dataset. The benchmarking results show that MixMHC2pred and NetMHCIIpan-4.1 achieve the best performance among all predictors. In general, newly developed methods perform better than older ones due to the rapid expansion of data on which they are trained and the development of deep learning algorithms. Our manuscript not only draws a full picture of the state of the art in MHCII-peptide binding prediction, but also guides researchers in the choice among the different predictors. More importantly, it will inspire biomedical researchers in both academia and industry for future developments in this field.
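The benchmarking setup described in this abstract, several predictors scored on one independent set of binding and non-binding peptides, can be sketched as follows. The abstract does not name the evaluation metric, so ROC AUC is assumed here as a common choice for binder versus non-binder classification. The predictor names `predictor_A` and `predictor_B`, the toy labels and the toy scores are hypothetical and are not results from the study.

```python
from typing import Dict, List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation; ties get average ranks."""
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one binder (1) and one non-binder (0)")

    order = sorted(range(len(scores)), key=lambda idx: scores[idx])
    ranks: List[float] = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Extend j over the run of items tied with position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg_rank
        i = j + 1

    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)


def benchmark(labels: Sequence[int],
              predictions: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Score every predictor on the same labelled peptide set."""
    return {name: roc_auc(labels, s) for name, s in predictions.items()}


# Toy data: 1 = binder, 0 = non-binder (hypothetical, for illustration only).
labels = [1, 1, 0, 0]
predictions = {
    "predictor_A": [0.9, 0.8, 0.2, 0.1],
    "predictor_B": [0.9, 0.2, 0.8, 0.1],
}
results = benchmark(labels, predictions)
```

Scoring all predictors on the same held-out peptides is what makes the AUC values comparable; in practice one would also report results per allele, since performance can differ between DP, DR and DQ.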
Affiliation(s)
- Yaqing Yang
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
  - National Institute of Health Data Science of China, Shandong University, Jinan, China
- Zhonghui Wei
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
  - National Institute of Health Data Science of China, Shandong University, Jinan, China
- Gabriel Cia
  - Computational Biology and Bioinformatics, Université Libre de Bruxelles, Brussels, Belgium
  - Interuniversity Institute of Bioinformatics in Brussels, Brussels, Belgium
- Xixi Song
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
  - National Institute of Health Data Science of China, Shandong University, Jinan, China
- Fabrizio Pucci
  - Computational Biology and Bioinformatics, Université Libre de Bruxelles, Brussels, Belgium
  - Interuniversity Institute of Bioinformatics in Brussels, Brussels, Belgium
- Marianne Rooman
  - Computational Biology and Bioinformatics, Université Libre de Bruxelles, Brussels, Belgium
  - Interuniversity Institute of Bioinformatics in Brussels, Brussels, Belgium
- Fuzhong Xue
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
  - National Institute of Health Data Science of China, Shandong University, Jinan, China
- Qingzhen Hou
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
  - National Institute of Health Data Science of China, Shandong University, Jinan, China
|
13
|
Fasoulis R, Rigo MM, Antunes DA, Paliouras G, Kavraki LE. Transfer learning improves pMHC kinetic stability and immunogenicity predictions. IMMUNOINFORMATICS (AMSTERDAM, NETHERLANDS) 2024; 13:100030. [PMID: 38577265 PMCID: PMC10994007 DOI: 10.1016/j.immuno.2023.100030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/06/2024]
Abstract
The cellular immune response comprises several processes, with the most notable ones being the binding of the peptide to the Major Histocompatibility Complex (MHC), the peptide-MHC (pMHC) presentation on the surface of the cell, and the recognition of the pMHC by the T-Cell Receptor. Identifying the most potent peptide targets for MHC binding, presentation and T-cell recognition is vital for developing peptide-based vaccines and T-cell-based immunotherapies. Data-driven tools that predict each of these steps have been developed, and the availability of mass spectrometry (MS) datasets has facilitated the development of accurate Machine Learning (ML) methods for class-I pMHC binding prediction. However, the accuracy of ML-based tools for pMHC kinetic stability prediction and peptide immunogenicity prediction is uncertain, as stability and immunogenicity datasets are not abundant. Here, we use transfer learning techniques to improve stability and immunogenicity predictions, by taking advantage of a large number of binding affinity and MS datasets. The resulting models, TLStab and TLImm, exhibit comparable or better performance than state-of-the-art approaches on different stability and immunogenicity test sets respectively. Our approach demonstrates the promise of learning from the task of peptide binding to improve predictions on downstream tasks. The source code of TLStab and TLImm is publicly available at https://github.com/KavrakiLab/TL-MHC.
Affiliation(s)
- Romanos Fasoulis
  - Department of Computer Science, Rice University, 6100 Main St, Houston, 77005, TX, United States
- Mauricio Menegatti Rigo
  - Department of Computer Science, Rice University, 6100 Main St, Houston, 77005, TX, United States
- Dinler Amaral Antunes
  - Department of Biology and Biochemistry, University of Houston, 4800 Calhoun Rd, Houston, 77004, TX, United States
- Georgios Paliouras
  - Institute of Informatics and Telecommunications, NCSR Demokritos, Patr. Gregoriou E and 27 Neapoleos St, Athens, 15341, Greece
- Lydia E. Kavraki
  - Department of Computer Science, Rice University, 6100 Main St, Houston, 77005, TX, United States
|
14
|
Yu Y, Zu L, Jiang J, Wu Y, Wang Y, Xu M, Liu Q. Structure-aware deep model for MHC-II peptide binding affinity prediction. BMC Genomics 2024; 25:127. [PMID: 38291350 PMCID: PMC10826266 DOI: 10.1186/s12864-023-09900-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/30/2023] [Accepted: 12/12/2023] [Indexed: 02/01/2024]
Abstract
The prediction of major histocompatibility complex (MHC)-peptide binding affinity is an important branch of immune bioinformatics, especially helpful in accelerating the design of disease vaccines and immunotherapy. Although deep learning-based solutions have yielded promising results on MHC-II molecules in recent years, these methods ignore the structural knowledge of each peptide when employing deep neural network models. Each peptide sequence has its own specific residue order, so it is worth adding the structural information of the peptide sequence to deep model training. In this work, we use positional encoding to represent the structural information of peptide sequences and effectively combine the positional encoding with existing models by different strategies. Experiments on three datasets show that introducing position-coding information can further improve the performance of the existing model. The idea of introducing positional encoding to this field can serve as a useful reference for optimizing deep network structures in the future.
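The abstract does not say which positional encoding or which combination strategy the authors used. As a minimal sketch only, the following assumes the sinusoidal encoding of the original Transformer and an element-wise sum with a one-hot residue encoding; the function names are hypothetical and not taken from the paper.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(peptide):
    """Encode each residue as a 20-dimensional one-hot vector."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in peptide:
        row = [0.0] * len(AMINO_ACIDS)
        row[index[aa]] = 1.0
        rows.append(row)
    return rows


def sinusoidal_encoding(length, dim):
    """Sinusoidal table: sine on even columns, cosine on odd columns."""
    table = []
    for pos in range(length):
        row = []
        for k in range(dim):
            angle = pos / (10000 ** ((2 * (k // 2)) / dim))
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def encode_with_position(peptide):
    """Assumed strategy: add the positional table to the one-hot matrix."""
    content = one_hot(peptide)
    position = sinusoidal_encoding(len(peptide), len(AMINO_ACIDS))
    return [[c + p for c, p in zip(c_row, p_row)]
            for c_row, p_row in zip(content, position)]


encoded = encode_with_position("AAK")
# The two alanines now differ because their positions differ.
print(encoded[0] != encoded[1])
```

With a plain one-hot encoding the two alanines in "AAK" would be identical; the added positional term is what lets a downstream model tell them apart.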
Affiliation(s)
- Ying Yu
  - School of Health Science and Engineering, University of Shanghai for Science and Technology, Shanghai, 200093, China
- Lipeng Zu
  - Department of Computer Science, Florida State University, Tallahassee, 32306, USA
- Jiaye Jiang
  - School of Health Science and Engineering, University of Shanghai for Science and Technology, Shanghai, 200093, China
- Yafang Wu
  - School of Health Science and Engineering, University of Shanghai for Science and Technology, Shanghai, 200093, China
- Yinglin Wang
  - School of Health Science and Engineering, University of Shanghai for Science and Technology, Shanghai, 200093, China
- Midie Xu
  - Department of Pathology, Fudan University, Shanghai Cancer Center, Shanghai, 200032, China
  - Department of Medical Oncology, Shanghai Medical College, Fudan University, Shanghai, 200032, China
  - Institute of Pathology, Fudan University, Shanghai, 200032, China
- Qing Liu
  - School of Health Science and Engineering, University of Shanghai for Science and Technology, Shanghai, 200093, China
|
15
|
Wang X, Wu T, Jiang Y, Chen T, Pan D, Jin Z, Xie J, Quan L, Lyu Q. RPEMHC: improved prediction of MHC-peptide binding affinity by a deep learning approach based on residue-residue pair encoding. Bioinformatics 2024; 40:btad785. [PMID: 38175759 PMCID: PMC10796178 DOI: 10.1093/bioinformatics/btad785] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2023] [Revised: 12/20/2023] [Accepted: 12/28/2023] [Indexed: 01/06/2024]
Abstract
MOTIVATION: Binding of peptides to major histocompatibility complex (MHC) molecules plays a crucial role in triggering T cell recognition mechanisms essential for immune response. Accurate prediction of MHC-peptide binding is vital for the development of cancer therapeutic vaccines. While recent deep learning-based methods have achieved significant performance in predicting MHC-peptide binding affinity, most of them separately encode MHC molecules and peptides as inputs, potentially overlooking critical interaction information between the two. RESULTS: In this work, we propose RPEMHC, a new deep learning approach based on residue-residue pair encoding to predict the binding affinity between peptides and MHC, which encodes an MHC molecule and a peptide as a residue-residue pair map. We evaluate the performance of RPEMHC on various MHC-II-related datasets for MHC-peptide binding prediction, demonstrating that RPEMHC achieves better or comparable performance against other state-of-the-art baselines. Moreover, we conduct further experiments on MHC-I-related datasets, and the results demonstrate that our method works on both MHC classes. These extensive validations show that RPEMHC is an effective tool for studying MHC-peptide interactions and can potentially facilitate vaccine development. AVAILABILITY: The source code of the method along with trained models is freely available at https://github.com/lennylv/RPEMHC.
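To illustrate what a residue-residue pair map could look like, the sketch below builds a matrix with one row per peptide residue and one column per MHC residue, where each cell holds an integer identifying the ordered residue pair. The pair-index formula, the function name and the toy sequences are assumptions for illustration; the actual RPEMHC encoding is defined in the authors' repository and may differ.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def pair_map(peptide, mhc):
    """Return a len(peptide) x len(mhc) matrix of residue-pair identifiers.

    Each cell is 20 * index(peptide residue) + index(MHC residue), so every
    one of the 400 ordered pairs gets its own integer (assumed scheme).
    """
    n = len(AMINO_ACIDS)
    return [[AA_INDEX[p] * n + AA_INDEX[m] for m in mhc] for p in peptide]


# Toy sequences, not real peptide or MHC data.
for row in pair_map("AC", "AY"):
    print(row)
```

Encoding the pair jointly, rather than the peptide and the MHC separately, is what lets a model see which residues face each other from the first layer onward.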
Affiliation(s)
- Xuejiao Wang
  - School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
- Tingfang Wu
  - School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
  - Province Key Lab for Information Processing Technologies, Soochow University, Suzhou, Jiangsu 215006, China
  - Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing, Jiangsu 210000, China
- Yelu Jiang
  - School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
- Taoning Chen
  - School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
- Deng Pan
  - School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
- Zhi Jin
  - School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
- Jingxin Xie
  - School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
- Lijun Quan
  - School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
  - Province Key Lab for Information Processing Technologies, Soochow University, Suzhou, Jiangsu 215006, China
  - Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing, Jiangsu 210000, China
- Qiang Lyu
  - School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
  - Province Key Lab for Information Processing Technologies, Soochow University, Suzhou, Jiangsu 215006, China
  - Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing, Jiangsu 210000, China
|
16
|
Zankov D, Madzhidov T, Varnek A, Polishchuk P. Chemical complexity challenge: Is multi-instance machine learning a solution? WIRES COMPUTATIONAL MOLECULAR SCIENCE 2024; 14. [DOI: 10.1002/wcms.1698] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 01/08/2023] [Accepted: 11/07/2023] [Indexed: 01/03/2025]
Abstract
Molecules are complex dynamic objects that can exist in different molecular forms (conformations, tautomers, stereoisomers, protonation states, etc.) and often it is not known which molecular form is responsible for observed physicochemical and biological properties of a given molecule. This raises the problem of the selection of the correct molecular form for machine learning modeling of target properties. The same problem is common to biological molecules (RNA, DNA, proteins): long sequences where only key segments, which often cannot be located precisely, are involved in biological functions. Multi-instance machine learning (MIL) is an efficient approach for solving problems where objects under study cannot be uniquely represented by a single instance, but rather by a set of multiple alternative instances. Multi-instance learning was formalized in 1997 and motivated by the problem of conformation selection in drug activity prediction tasks. Since then MIL has found a lot of applications in various domains, such as information retrieval, computer vision, signal processing, bankruptcy prediction, and so on. In the given review we describe the MIL framework and its applications to the tasks associated with ambiguity in the representation of small and biological molecules in chemoinformatics and bioinformatics. We have collected examples that demonstrate the advantages of MIL over the traditional single-instance learning (SIL) approach. Special attention was paid to the ability of MIL models to identify key instances responsible for a modeling property. This article is categorized under: Data Science > Chemoinformatics; Data Science > Artificial Intelligence/Machine Learning.
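The multi-instance idea can be sketched in a few lines under the classic assumption that a bag (for example, a molecule) is as active as its most active instance (for example, a conformer). The max-pooling rule, the function names and the scores below are illustrative assumptions and do not come from the review, which covers many other aggregation schemes.

```python
def bag_score(instance_scores):
    """Max-pooling aggregation: the bag scores as high as its best instance."""
    return max(instance_scores)


def key_instance(instance_scores):
    """Index of the instance that determines the bag score."""
    return max(range(len(instance_scores)), key=lambda i: instance_scores[i])


# Made-up per-conformer activity scores for a single molecule.
conformer_scores = [0.12, 0.81, 0.33]
print(bag_score(conformer_scores))     # 0.81
print(key_instance(conformer_scores))  # 1
```

Returning the index alongside the score reflects the point made in the abstract: a MIL model can indicate which instance is responsible for the predicted property, not only the prediction itself.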
Affiliation(s)
- Alexandre Varnek
  - ICReDD, Hokkaido University, Sapporo, Japan
  - Laboratory of Chemoinformatics, University of Strasbourg, Strasbourg, France
- Pavel Polishchuk
  - Institute of Molecular and Translational Medicine, Faculty of Medicine and Dentistry, Palacky University Olomouc, Olomouc, Czech Republic
|
17
|
Le NQK. Leveraging transformers-based language models in proteome bioinformatics. Proteomics 2023; 23:e2300011. [PMID: 37381841 DOI: 10.1002/pmic.202300011] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 05/14/2023] [Revised: 06/13/2023] [Accepted: 06/13/2023] [Indexed: 06/30/2023]
Abstract
In recent years, the rapid growth of biological data has increased interest in using bioinformatics to analyze and interpret this data. Proteomics, which studies the structure, function, and interactions of proteins, is a crucial area of bioinformatics. Using natural language processing (NLP) techniques in proteomics is an emerging field that combines machine learning and text mining to analyze biological data. Recently, transformer-based NLP models have gained significant attention for their ability to process variable-length input sequences in parallel, using self-attention mechanisms to capture long-range dependencies. In this review paper, we discuss the recent advancements in transformer-based NLP models in proteome bioinformatics and examine their advantages, limitations, and potential applications to improve the accuracy and efficiency of various tasks. Additionally, we highlight the challenges and future directions of using these models in proteome bioinformatics research. Overall, this review provides valuable insights into the potential of transformer-based NLP models to revolutionize proteome bioinformatics.
Affiliation(s)
- Nguyen Quoc Khanh Le
  - Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan
  - AIBioMed Research Group, Taipei Medical University, Taipei, Taiwan
  - Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei, Taiwan
  - Translational Imaging Research Center, Taipei Medical University Hospital, Taipei, Taiwan
|
18
|
Hartout P, Počuča B, Méndez-García C, Schleberger C. Investigating the human and nonobese diabetic mouse MHC class II immunopeptidome using protein language modeling. Bioinformatics 2023; 39:btad469. [PMID: 37527005 PMCID: PMC10421966 DOI: 10.1093/bioinformatics/btad469] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2022] [Revised: 06/17/2023] [Accepted: 07/31/2023] [Indexed: 08/03/2023]
Abstract
MOTIVATION: Identifying peptides associated with the major histocompatibility complex class II (MHCII) is a central task in the evaluation of the immunoregulatory function of therapeutics and drug prototypes. MHCII-peptide presentation prediction has multiple biopharmaceutical applications, including the safety assessment of biologics and engineered derivatives in silico, or the fast progression of antigen-specific immunomodulatory drug discovery programs in immune disease and cancer. This has resulted in the collection of large-scale datasets on adaptive immune receptor antigenic responses and MHC-associated peptide proteomics. In parallel, recent deep learning algorithmic advances in protein language modeling have shown potential in leveraging large collections of sequence data to improve MHC presentation prediction. RESULTS: Here, we train a compact transformer model (AEGIS) on human and mouse MHCII immunopeptidome data, including a preclinical murine model, and evaluate its performance on the peptide presentation prediction task. We show that the transformer performs on par with existing deep learning algorithms and that combining datasets from multiple organisms increases model performance. We trained variants of the model with and without MHCII information. In both alternatives, the inclusion of peptides presented by the I-Ag7 MHC class II molecule expressed by nonobese diabetic mice enabled, for the first time, the accurate in silico prediction of presented peptides in a preclinical type 1 diabetes model organism, which has promising therapeutic applications. AVAILABILITY AND IMPLEMENTATION: The source code is available at https://github.com/Novartis/AEGIS.
Affiliation(s)
- Philip Hartout
  - Discovery Sciences, Novartis Institutes for Biomedical Research, Basel 4056, Switzerland
- Bojana Počuča
  - NIBR Research Informatics, Novartis Institutes for Biomedical Research, Basel 4056, Switzerland
- Celia Méndez-García
  - Discovery Sciences, Novartis Institutes for Biomedical Research, Basel 4056, Switzerland
- Christian Schleberger
  - Discovery Sciences, Novartis Institutes for Biomedical Research, Basel 4056, Switzerland
|
19
|
Liu R, Hu YF, Huang JD, Fan X. A Bayesian approach to estimate MHC-peptide binding threshold. Brief Bioinform 2023; 24:bbad208. [PMID: 37279464 DOI: 10.1093/bib/bbad208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2023] [Revised: 05/08/2023] [Accepted: 05/16/2023] [Indexed: 06/08/2023]
Abstract
Major histocompatibility complex (MHC)-peptide binding is a critical step in enabling a peptide to serve as an antigen for T-cell recognition. Accurate prediction of this binding can facilitate various applications in immunotherapy. While many existing methods offer good predictive power for the binding affinity of a peptide to a specific MHC, few models attempt to infer the binding threshold that distinguishes binding sequences from non-binding ones. These models often rely on experience-based ad hoc criteria, such as 500 or 1000 nM. However, different MHCs may have different binding thresholds. As such, there is a need for an automatic, data-driven method to determine an accurate binding threshold. In this study, we proposed a Bayesian model that jointly infers core locations (binding sites), the binding affinity and the binding threshold. Our model provided the posterior distribution of the binding threshold, enabling accurate determination of an appropriate threshold for each MHC. To evaluate the performance of our method under different scenarios, we conducted simulation studies with varying dominant levels of motif distributions and proportions of random sequences. These simulation studies showed desirable estimation accuracy and robustness of our model. Additionally, when applied to real data, our results outperformed commonly used thresholds.
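The practical difference between a single ad hoc cutoff and per-MHC thresholds can be shown with a small sketch. This is not the Bayesian model from the paper; it only applies thresholds that are assumed to have been estimated elsewhere. The allele labels, peptide labels, affinity values and the 200 nM threshold are all made up for illustration.

```python
DEFAULT_THRESHOLD_NM = 500.0  # the conventional ad hoc cutoff named in the abstract


def call_binders(measurements, allele_thresholds=None):
    """Label each (allele, peptide, affinity_nm) record as binder or non-binder.

    A peptide is called a binder when its affinity value in nM is below the
    threshold for its allele; alleles without a specific threshold fall back
    to DEFAULT_THRESHOLD_NM.
    """
    thresholds = allele_thresholds or {}
    calls = []
    for allele, peptide, affinity_nm in measurements:
        cutoff = thresholds.get(allele, DEFAULT_THRESHOLD_NM)
        calls.append((allele, peptide, affinity_nm < cutoff))
    return calls


# Hypothetical records: same affinity, different alleles.
records = [("allele_A", "pep1", 300.0), ("allele_B", "pep2", 300.0)]

print(call_binders(records))                       # both called binders at 500 nM
print(call_binders(records, {"allele_B": 200.0}))  # pep2 is no longer a binder
```

The same 300 nM measurement flips from binder to non-binder once an allele-specific threshold is supplied, which is the situation the abstract argues a fixed cutoff cannot handle.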
Affiliation(s)
- Ran Liu
  - Department of Statistics, The Chinese University of Hong Kong, Hong Kong SAR, China
- Ye-Fan Hu
  - School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, 3/F, Laboratory Block, 21 Sassoon Road, Hong Kong SAR, China
  - Department of Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, 4/F Professional Block, Queen Mary Hospital, 102 Pokfulam Road, Hong Kong SAR, China
  - BayVax Biotech Limited, Hong Kong Science Park, Pak Shek Kok, New Territories, Hong Kong SAR, China
- Jian-Dong Huang
  - School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, 3/F, Laboratory Block, 21 Sassoon Road, Hong Kong SAR, China
  - CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
  - Clinical Oncology Center, Shenzhen Key Laboratory for Cancer Metastasis and Personalized Therapy, The University of Hong Kong-Shenzhen Hospital, Shenzhen 518053, China
  - Guangdong-Hong Kong Joint Laboratory for RNA Medicine, Sun Yat-Sen University, Guangzhou 510120, China
  - State Key Laboratory of Cognitive and Brain Research, The University of Hong Kong, Hong Kong SAR, China
- Xiaodan Fan
  - Department of Statistics, The Chinese University of Hong Kong, Hong Kong SAR, China
|
20
|
Yu X, Negron C, Huang L, Veldman G. TransMHCII: a novel MHC-II binding prediction model built using a protein language model and an image classifier. Antib Ther 2023; 6:137-146. [PMID: 37342671 PMCID: PMC10278228 DOI: 10.1093/abt/tbad011] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/28/2023] [Revised: 04/18/2023] [Accepted: 05/09/2023] [Indexed: 06/23/2023]
Abstract
The emergence of deep learning models such as AlphaFold2 has revolutionized the structure prediction of proteins. Nevertheless, much remains unexplored, especially on how we utilize structure models to predict biological properties. Herein, we present a method using features extracted from protein language models (PLMs) to predict the major histocompatibility complex class II (MHC-II) binding affinity of peptides. Specifically, we evaluated a novel transfer learning approach where the backbone of our model was interchanged with architectures designed for image classification tasks. Features extracted from several PLMs (ESM1b, ProtXLNet or ProtT5-XL-UniRef) were passed into image models (EfficientNet v2b0, EfficientNet v2m or ViT-16). The optimal pairing of the PLM and image classifier resulted in the final model TransMHCII, outperforming NetMHCIIpan 3.2 and NetMHCIIpan 4.0-BA on the receiver operating characteristic area under the curve, balanced accuracy and Jaccard scores. The architecture innovation may facilitate the development of other deep learning models for biological problems.
Affiliation(s)
- Xin Yu
  - Biotherapeutics Discovery, AbbVie Bioresearch Center, 100 Research Drive, Worcester, MA 01605, USA
- Christopher Negron
  - Biotherapeutics Discovery, AbbVie Bioresearch Center, 100 Research Drive, Worcester, MA 01605, USA
- Lili Huang
  - Biotherapeutics Discovery, AbbVie Bioresearch Center, 100 Research Drive, Worcester, MA 01605, USA
- Geertruida Veldman
  - Biotherapeutics Discovery, AbbVie Bioresearch Center, 100 Research Drive, Worcester, MA 01605, USA
|
21
|
Chandra A, Tünnermann L, Löfstedt T, Gratz R. Transformer-based deep learning for predicting protein properties in the life sciences. eLife 2023; 12:e82819. [PMID: 36651724 PMCID: PMC9848389 DOI: 10.7554/elife.82819] [Citation(s) in RCA: 43] [Impact Index Per Article: 21.5] [Received: 08/22/2022] [Accepted: 01/06/2023] [Indexed: 01/19/2023]
Abstract
Recent developments in deep learning, coupled with an increasing number of sequenced proteins, have led to a breakthrough in life science applications, in particular in protein property prediction. There is hope that deep learning can close the gap between the number of sequenced proteins and proteins with known properties based on lab experiments. Language models from the field of natural language processing have gained popularity for protein property predictions and have led to a new computational revolution in biology, where old prediction results are being improved regularly. Such models can learn useful multipurpose representations of proteins from large open repositories of protein sequences and can be used, for instance, to predict protein properties. The field of natural language processing is growing quickly because of developments in a class of models based on one particular model, the Transformer. We review recent developments and the use of large-scale Transformer models in applications for predicting protein characteristics and how such models can be used to predict, for example, post-translational modifications. We review shortcomings of other deep learning models and explain how the Transformer models have quickly proven to be a very promising way to unravel information hidden in the sequences of amino acids.
Affiliation(s)
- Abel Chandra
  - Department of Computing Science, Umeå University, Umeå, Sweden
- Laura Tünnermann
  - Umeå Plant Science Centre (UPSC), Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences, Umeå, Sweden
- Tommy Löfstedt
  - Department of Computing Science, Umeå University, Umeå, Sweden
- Regina Gratz
  - Umeå Plant Science Centre (UPSC), Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences, Umeå, Sweden
  - Department of Forest Ecology and Management, Swedish University of Agricultural Sciences, Umeå, Sweden
|
22
|
Grazioli F, Machart P, Mösch A, Li K, Castorina LV, Pfeifer N, Min MR. Attentive Variational Information Bottleneck for TCR-peptide interaction prediction. Bioinformatics 2022; 39:6960920. [PMID: 36571499 PMCID: PMC9825246 DOI: 10.1093/bioinformatics/btac820] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/19/2022] [Revised: 11/18/2022] [Accepted: 12/23/2022] [Indexed: 12/27/2022]
Abstract
MOTIVATION: We present a multi-sequence generalization of Variational Information Bottleneck and call the resulting model Attentive Variational Information Bottleneck (AVIB). Our AVIB model leverages multi-head self-attention to implicitly approximate a posterior distribution over latent encodings conditioned on multiple input sequences. We apply AVIB to a fundamental immuno-oncology problem: predicting the interactions between T-cell receptors (TCRs) and peptides. RESULTS: Experimental results on various datasets show that AVIB significantly outperforms state-of-the-art methods for TCR-peptide interaction prediction. Additionally, we show that the latent posterior distribution learned by AVIB is particularly effective for the unsupervised detection of out-of-distribution amino acid sequences. AVAILABILITY AND IMPLEMENTATION: The code and the data used for this study are publicly available at: https://github.com/nec-research/vibtcr. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Pierre Machart
  - Biomedical AI Group, NEC Laboratories Europe, Heidelberg 69115, Germany
- Anja Mösch
  - Biomedical AI Group, NEC Laboratories Europe, Heidelberg 69115, Germany
- Kai Li
  - Machine Learning Department, NEC Laboratories America, Princeton, NJ 08540, USA
- Nico Pfeifer
  - Methods in Medical Informatics, Department of Computer Science, University of Tübingen, Tübingen 72076, Germany
|
23
|
Fang Y, Liu X, Liu H. Attention-aware contrastive learning for predicting T cell receptor-antigen binding specificity. Brief Bioinform 2022; 23:6696141. [PMID: 36094087 DOI: 10.1093/bib/bbac378] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/23/2022] [Revised: 07/06/2022] [Accepted: 08/09/2022] [Indexed: 12/14/2022]
Abstract
MOTIVATION: It has been proven that only a small fraction of the neoantigens presented by major histocompatibility complex (MHC) class I molecules on the cell surface can elicit T cells. This restriction can be attributed to the binding specificity of T cell receptor (TCR) and peptide-MHC complex (pMHC). Computational prediction of T cells binding to neoantigens is a challenging and unresolved task. RESULTS: In this paper, we proposed an attention-aware contrastive learning model, ATMTCR, to infer the TCR-pMHC binding specificity. For each TCR sequence, we used a transformer encoder to transform it into a latent representation, and then masked a percentage of amino acids guided by attention weights to generate its contrastive view. Compared to a fully supervised baseline model, we verified that contrastive learning-based pretraining on large-scale TCR sequences significantly improved the prediction performance of downstream tasks. Interestingly, masking a percentage of amino acids with low attention weights yielded the best performance compared to other masking strategies. Comparison experiments on two independent datasets demonstrated that our method achieved better performance than other existing algorithms. Moreover, we identified important amino acids and their positional preference through attention weights, which indicated the potential interpretability of our proposed model.
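The attention-guided masking step can be sketched as follows, assuming one attention weight per residue is already available. The function name, the mask token "X", the masking fraction and the weights are hypothetical; the abstract does not give the actual ATMTCR settings.

```python
def mask_low_attention(sequence, weights, fraction=0.2, mask_token="X"):
    """Mask the given fraction of residues that have the lowest attention weights.

    The number of masked positions is rounded down; ties are broken by
    position because sorted() is stable.
    """
    if len(sequence) != len(weights):
        raise ValueError("sequence and weights must have the same length")
    n_mask = int(len(sequence) * fraction)
    order = sorted(range(len(sequence)), key=lambda i: weights[i])
    to_mask = set(order[:n_mask])
    return "".join(mask_token if i in to_mask else aa
                   for i, aa in enumerate(sequence))


# Toy TCR fragment with made-up attention weights.
view = mask_low_attention("CASSL", [0.4, 0.1, 0.3, 0.05, 0.15], fraction=0.4)
print(view)  # CXSXL
```

Masking the low-attention residues keeps the positions the encoder considers most informative intact, so the contrastive view stays close to the original sequence in the ways that matter for binding.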
Affiliation(s)
- Yiming Fang
  - School of Computer Science and Technology, Nanjing Tech University, 211816, Nanjing, China
- Xuejun Liu
  - School of Computer Science and Technology, Nanjing Tech University, 211816, Nanjing, China
- Hui Liu
  - School of Computer Science and Technology, Nanjing Tech University, 211816, Nanjing, China
|
24
|
Neoantigens in precision cancer immunotherapy: from identification to clinical applications. Chin Med J (Engl) 2022; 135:1285-1298. [PMID: 35838545 PMCID: PMC9433083 DOI: 10.1097/cm9.0000000000002181] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Indexed: 12/20/2022]
Abstract
Immunotherapies targeting cancer neoantigens are safe, effective, and precise. Neoantigens can be identified mainly by genomic techniques such as next-generation sequencing and high-throughput single-cell sequencing; proteomic techniques such as mass spectrometry; and bioinformatics tools based on high-throughput sequencing data, mass spectrometry data, and biological databases. Neoantigen-related therapies are widely used in clinical practice and include neoantigen vaccines, neoantigen-specific CD8+ and CD4+ T cells, and neoantigen-pulsed dendritic cells. In addition, neoantigens can be used as biomarkers to assess immunotherapy response, resistance, and prognosis. Therapies based on neoantigens are an important and promising branch of cancer immunotherapy. Unremitting efforts are needed to unravel the comprehensive role of neoantigens in anti-tumor immunity and to extend their clinical application. This review aimed to summarize the progress in neoantigen research and to discuss its opportunities and challenges in precision cancer immunotherapy.
|
25
|
Katayama Y, Yokota R, Akiyama T, Kobayashi TJ. Machine Learning Approaches to TCR Repertoire Analysis. Front Immunol 2022; 13:858057. [PMID: 35911778 PMCID: PMC9334875 DOI: 10.3389/fimmu.2022.858057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2022] [Accepted: 06/07/2022] [Indexed: 11/13/2022]
Abstract
Sparked by the development of genome sequencing technology, the quantity and quality of data handled in immunological research have been changing dramatically. Various data and database platforms are now driving the rapid progress of machine learning for immunological data analysis. Among the various topics in immunology, T cell receptor repertoire analysis is one of the most important targets of machine learning for assessing the state and abnormalities of immune systems. In this paper, we review recent repertoire analysis methods based on machine learning and deep learning and discuss their prospects.
Affiliation(s)
- Yotaro Katayama
  - Graduate School of Engineering, The University of Tokyo, Tokyo, Japan
- Ryo Yokota
  - National Research Institute of Police Science, Kashiwa, Chiba, Japan
- Taishin Akiyama
  - Laboratory for Immune Homeostasis, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
  - Graduate School of Medical Life Science, Yokohama City University, Yokohama, Japan
- Tetsuya J. Kobayashi
  - Graduate School of Engineering, The University of Tokyo, Tokyo, Japan
  - Institute of Industrial Science, The University of Tokyo, Tokyo, Japan
|
26
|
Xu S, Wang X, Fei C. A Highly Effective System for Predicting MHC-II Epitopes With Immunogenicity. Front Oncol 2022; 12:888556. [PMID: 35785204 PMCID: PMC9246415 DOI: 10.3389/fonc.2022.888556] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 03/03/2022] [Accepted: 04/27/2022] [Indexed: 12/30/2022]
Abstract
In the past decade, the substantial achievements of therapeutic cancer vaccines have shed new light on cancer immunotherapy. The major challenge in designing potent therapeutic cancer vaccines is to identify neoantigens capable of inducing sufficient immune responses, especially those involving major histocompatibility complex (MHC)-II epitopes. However, most previous studies on T-cell epitopes have focused on either ligand binding or antigen presentation by MHC rather than the immunogenicity of T-cell epitopes. To better facilitate therapeutic vaccine design, in this study we propose a revolutionary new tool: a convolutional neural network model named FIONA (Flexible Immunogenicity Optimization Neural-network Architecture) trained on IEDB datasets. FIONA can accurately predict the epitopes presented by specific MHC-II subtypes, as well as their immunogenicity. By leveraging the human leukocyte antigen allele hierarchical encoding model together with peptide dense embedding fusion encoding, FIONA (with AUC = 0.94) outperforms several other tools in predicting epitopes presented by MHC-II subtypes in a head-to-head comparison; moreover, FIONA incorporates, for the first time, the capacity to predict the immunogenicity of epitopes with MHC-II subtype specificity. We have thus developed a reliable pipeline to effectively predict CD4+ T-cell immune responses against cancer and infectious diseases.
Affiliation(s)
- Caiyi Fei
  - Department of AI and Bioinformatics, Nanjing Chengshi BioTech (TheraRNA) Co., Ltd., Nanjing, China
|