1
|
Peng Y, Wu J, Sun Y, Zhang Y, Wang Q, Shao S. Contrastive-learning of language embedding and biological features for cross modality encoding and effector prediction. Nat Commun 2025; 16:1299. [PMID: 39900608 PMCID: PMC11791096 DOI: 10.1038/s41467-025-56526-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/11/2024] [Accepted: 01/15/2025] [Indexed: 02/05/2025] Open
Abstract
Identifying and characterizing virulence proteins secreted by Gram-negative bacteria are fundamental for deciphering microbial pathogenicity as well as aiding the development of therapeutic strategies. Effector predictors utilizing pre-trained protein language models (PLMs) have shown sound performance by leveraging extensive evolutionary and sequential protein features. However, the accuracy and sensitivity of effector prediction remain challenging. Here, we introduce a model named Contrastive-learning of Language Embedding and Biological Features (CLEF) leveraging contrastive learning to integrate PLM representations with supplementary biological features. Biologically information is captured in learned contextualized embeddings to yield meaningful representations. With cross-modality biological features, CLEF outperforms state-of-the-art (SOTA) models in predicting type III, type IV, and type VI secreted effectors (T3SEs/T4SEs/T6SEs) in enteric pathogens. All experimentally verified effectors in Enterohemorrhagic Escherichia coli and 41 of 43 experimentally verified T3SEs of Salmonella Typhimurium are recognized. Moreover, 12 predicted T3SEs and 11 predicted T6SEs are validated by extensive experiments in Edwardsiella piscicida. Furthermore, integrating omics data via CLEF framework enhances protein representations to illustrate effector-effector interactions and determine in vivo colonization-essential genes. Collectively, CLEF provides a blueprint to bridge the gap between in silico PLM's capacity and experimental biological information to fulfill complicated tasks.
Collapse
Affiliation(s)
- Yue Peng
- State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology, Shanghai, China
| | - Junze Wu
- State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology, Shanghai, China
| | - Yi Sun
- State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology, Shanghai, China
| | - Yuanxing Zhang
- Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), 519000, Zhuhai, China
| | - Qiyao Wang
- State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology, Shanghai, China
- Shanghai Engineering Research Center of Maricultured Animal Vaccines, Shanghai, China
- Laboratory of Aquatic Animal Diseases of MOA, Shanghai, China
| | - Shuai Shao
- State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology, Shanghai, China.
- Shanghai Engineering Research Center of Maricultured Animal Vaccines, Shanghai, China.
- Laboratory of Aquatic Animal Diseases of MOA, Shanghai, China.
| |
Collapse
|
2
|
Hu Y, Wang Y, Hu X, Chao H, Li S, Ni Q, Zhu Y, Hu Y, Zhao Z, Chen M. T4SEpp: A pipeline integrating protein language models to predict bacterial type IV secreted effectors. Comput Struct Biotechnol J 2024; 23:801-812. [PMID: 38328004 PMCID: PMC10847861 DOI: 10.1016/j.csbj.2024.01.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2023] [Revised: 01/20/2024] [Accepted: 01/20/2024] [Indexed: 02/09/2024] Open
Abstract
Many pathogenic bacteria use type IV secretion systems (T4SSs) to deliver effectors (T4SEs) into the cytoplasm of eukaryotic cells, causing diseases. The identification of effectors is a crucial step in understanding the mechanisms of bacterial pathogenicity, but this remains a major challenge. In this study, we used the full-length embedding features generated by six pre-trained protein language models to train classifiers predicting T4SEs and compared their performance. We integrated three modules into a model called T4SEpp. The first module searched for full-length homologs of known T4SEs, signal sequences, and effector domains; the second module fine-tuned a machine learning model using data for a signal sequence feature; and the third module used the three best-performing pre-trained protein language models. T4SEpp outperformed other state-of-the-art (SOTA) software tools, achieving ∼0.98 accuracy at a high specificity of ∼0.99, based on the assessment of an independent validation dataset. T4SEpp predicted 13 T4SEs from Helicobacter pylori, including the well-known CagA and 12 other potential ones, among which eleven could potentially interact with human proteins. This suggests that these potential T4SEs may be associated with the pathogenicity of H. pylori. Overall, T4SEpp provides a better solution to assist in the identification of bacterial T4SEs and facilitates studies of bacterial pathogenicity. T4SEpp is freely accessible at https://bis.zju.edu.cn/T4SEpp.
Collapse
Affiliation(s)
- Yueming Hu
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, China
| | - Yejun Wang
- Youth Innovation Team of Medical Bioinformatics, Shenzhen University Medical School, Shenzhen, China
- Department of Cell Biology and Genetics, College of Basic Medicine, Shenzhen University Medical School, Shenzhen, China
| | - Xiaotian Hu
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, China
| | - Haoyu Chao
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, China
| | - Sida Li
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, China
| | - Qinyang Ni
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, China
| | - Yanyan Zhu
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, China
| | - Yixue Hu
- Youth Innovation Team of Medical Bioinformatics, Shenzhen University Medical School, Shenzhen, China
| | - Ziyi Zhao
- Youth Innovation Team of Medical Bioinformatics, Shenzhen University Medical School, Shenzhen, China
| | - Ming Chen
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, China
- Institute of Hematology, Zhejiang University School of Medicine, The First Affiliated Hospital, Zhejiang University, Hangzhou 310058, China
| |
Collapse
|
3
|
Dotan E, Jaschek G, Pupko T, Belinkov Y. Effect of tokenization on transformers for biological sequences. Bioinformatics 2024; 40:btae196. [PMID: 38608190 PMCID: PMC11055402 DOI: 10.1093/bioinformatics/btae196] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2023] [Revised: 02/20/2024] [Accepted: 04/11/2024] [Indexed: 04/14/2024] Open
Abstract
MOTIVATION Deep-learning models are transforming biological research, including many bioinformatics and comparative genomics algorithms, such as sequence alignments, phylogenetic tree inference, and automatic classification of protein functions. Among these deep-learning algorithms, models for processing natural languages, developed in the natural language processing (NLP) community, were recently applied to biological sequences. However, biological sequences are different from natural languages, such as English, and French, in which segmentation of the text to separate words is relatively straightforward. Moreover, biological sequences are characterized by extremely long sentences, which hamper their processing by current machine-learning models, notably the transformer architecture. In NLP, one of the first processing steps is to transform the raw text to a list of tokens. Deep-learning applications to biological sequence data mostly segment proteins and DNA to single characters. In this work, we study the effect of alternative tokenization algorithms on eight different tasks in biology, from predicting the function of proteins and their stability, through nucleotide sequence alignment, to classifying proteins to specific families. RESULTS We demonstrate that applying alternative tokenization algorithms can increase accuracy and at the same time, substantially reduce the input length compared to the trivial tokenizer in which each character is a token. Furthermore, applying these tokenization algorithms allows interpreting trained models, taking into account dependencies among positions. Finally, we trained these tokenizers on a large dataset of protein sequences containing more than 400 billion amino acids, which resulted in over a 3-fold decrease in the number of tokens. We then tested these tokenizers trained on large-scale data on the above specific tasks and showed that for some tasks it is highly beneficial to train database-specific tokenizers. Our study suggests that tokenizers are likely to be a critical component in future deep-network analysis of biological sequence data. AVAILABILITY AND IMPLEMENTATION Code, data, and trained tokenizers are available on https://github.com/technion-cs-nlp/BiologicalTokenizers.
Collapse
Affiliation(s)
- Edo Dotan
- The Henry and Marilyn Taub Faculty of Computer Science, Technion – Israel Institute of Technology, Haifa 3200003, Israel
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
| | - Gal Jaschek
- Department of Genetics, Yale University School of Medicine, New Haven, CT 06510, United States
| | - Tal Pupko
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
| | - Yonatan Belinkov
- The Henry and Marilyn Taub Faculty of Computer Science, Technion – Israel Institute of Technology, Haifa 3200003, Israel
| |
Collapse
|
4
|
Tang X, Luo L, Wang S. TSE-ARF: An adaptive prediction method of effectors across secretion system types. Anal Biochem 2024; 686:115407. [PMID: 38030053 DOI: 10.1016/j.ab.2023.115407] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2023] [Revised: 11/12/2023] [Accepted: 11/20/2023] [Indexed: 12/01/2023]
Abstract
Bacterial effector proteins are secreted by a variety of protein secretion systems and play an important role in the interaction between the host and pathogenic bacteria. Therefore, it is important to find a fast and inexpensive method to discover bacterial effectors. In this study, we propose a multi-type secretion effector adaptive random forest (TSE-ARF) to adaptively identify secretion effectors across T1SE-T4SE and T6SE based only on protein sequences. First, we proposed two new feature descriptors by considering some characteristic protein information and fused them with some universal features to form a 290-dimensional feature vector with good versatility. Then, the TSE-ARF model was used to make classification predictions by parameter adaptation of different secretion effectors integrating Shuffled Frog Leaping Algorithm and random forest. The perfect performance in TSE-ARF under different data sets and settings shows its considerable generalization ability, with which more candidate effectors were screened in the whole genome. Source code is available at https://github.com/AIMOVE/TSE-ARF.
Collapse
Affiliation(s)
- Xianjun Tang
- Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming, 650504, Yunnan, China
| | - Longfei Luo
- Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming, 650504, Yunnan, China
| | - Shunfang Wang
- Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming, 650504, Yunnan, China; Yunnan Key Laboratory of Intelligent Systems and Computing, Yunnan University, Kunming, Yunnan, China.
| |
Collapse
|
5
|
Zhao Z, Hu Y, Hu Y, White AP, Wang Y. Features and algorithms: facilitating investigation of secreted effectors in Gram-negative bacteria. Trends Microbiol 2023; 31:1162-1178. [PMID: 37349207 DOI: 10.1016/j.tim.2023.05.011] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2023] [Revised: 05/22/2023] [Accepted: 05/22/2023] [Indexed: 06/24/2023]
Abstract
Gram-negative bacteria deliver effector proteins through type III, IV, or VI secretion systems (T3SSs, T4SSs, and T6SSs) into host cells, causing infections and diseases. In general, effector proteins for each of these distinct secretion systems lack homology and are difficult to identify. Sequence analysis has disclosed many common features, helping us to understand the evolution, function, and secretion mechanisms of the effectors. In combination with various algorithms, the known common features have facilitated accurate prediction of new effectors. Ensemblers or integrated pipelines achieve a better prediction of performance, which combines multiple computational models or modules with multidimensional features. Natural language processing (NLP) models also show the merits, which could enable discovery of novel features and, in turn, facilitate more precise effector prediction, extending our knowledge about each secretion mechanism.
Collapse
Affiliation(s)
- Ziyi Zhao
- Youth Innovation Team of Medical Bioinformatics, Shenzhen University Medical School, Shenzhen 518060, China
| | - Yixue Hu
- Youth Innovation Team of Medical Bioinformatics, Shenzhen University Medical School, Shenzhen 518060, China
| | - Yueming Hu
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, China
| | - Aaron P White
- Vaccine and Infectious Disease Organization (VIDO), University of Saskatchewan, Saskatoon, Saskatchewan, Canada
| | - Yejun Wang
- Youth Innovation Team of Medical Bioinformatics, Shenzhen University Medical School, Shenzhen 518060, China; Department of Cell Biology and Genetics, College of Basic Medicine, Shenzhen University Medical School, Shenzhen 518060, China.
| |
Collapse
|
6
|
Wagner N, Ben-Meir D, Teper D, Pupko T. Complete genome sequence of an Israeli isolate of Xanthomonas hortorum pv. pelargonii strain 305 and novel type III effectors identified in Xanthomonas. FRONTIERS IN PLANT SCIENCE 2023; 14:1155341. [PMID: 37332699 PMCID: PMC10275491 DOI: 10.3389/fpls.2023.1155341] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 01/31/2023] [Accepted: 05/10/2023] [Indexed: 06/20/2023]
Abstract
Xanthomonas hortorum pv. pelargonii is the causative agent of bacterial blight in geranium ornamental plants, the most threatening bacterial disease of this plant worldwide. Xanthomonas fragariae is the causative agent of angular leaf spot in strawberries, where it poses a significant threat to the strawberry industry. Both pathogens rely on the type III secretion system and the translocation of effector proteins into the plant cells for their pathogenicity. Effectidor is a freely available web server we have previously developed for the prediction of type III effectors in bacterial genomes. Following a complete genome sequencing and assembly of an Israeli isolate of Xanthomonas hortorum pv. pelargonii - strain 305, we used Effectidor to predict effector encoding genes both in this newly sequenced genome, and in X. fragariae strain Fap21, and validated its predictions experimentally. Four and two genes in X. hortorum and X. fragariae, respectively, contained an active translocation signal that allowed the translocation of the reporter AvrBs2 that induced the hypersensitive response in pepper leaves, and are thus considered validated novel effectors. These newly validated effectors are XopBB, XopBC, XopBD, XopBE, XopBF, and XopBG.
Collapse
Affiliation(s)
- Naama Wagner
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
| | - Daniella Ben-Meir
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
| | - Doron Teper
- Department of Plant Pathology and Weed Research, Institute of Plant Protection Agricultural Research Organization (ARO), Volcani Institute, Rishon LeZion, Israel
| | - Tal Pupko
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
| |
Collapse
|