1
Xie S, Xie X, Zhao X, Liu F, Wang Y, Ping J, Ji Z. HNSPPI: a hybrid computational model combing network and sequence information for predicting protein-protein interaction. Brief Bioinform 2023; 24:bbad261. [PMID: 37480553 DOI: 10.1093/bib/bbad261] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2023] [Revised: 06/24/2023] [Accepted: 06/26/2023] [Indexed: 07/24/2023]
Abstract
Most life activities in organisms are regulated through protein complexes, which are mainly controlled via Protein-Protein Interactions (PPIs). Discovering new interactions between proteins and revealing their biological functions are of great significance for understanding the molecular mechanisms of biological processes and identifying potential targets in drug discovery. Current experimental methods capture only stable protein interactions, which leads to limited coverage. They are also expensive and time-consuming. In recent years, various computational methods have been developed for predicting PPIs based only on protein homology, primary protein sequences, or gene ontology information. Computational efficiency and data complexity remain the main bottlenecks for generalizing these algorithms. In this study, we propose a novel computational framework, HNSPPI, to predict PPIs. As a hybrid supervised learning model, HNSPPI characterizes the intrinsic relationship between two proteins by integrating amino acid sequence information and the connection properties of the PPI network. The experimental results show that HNSPPI performs well on six benchmark datasets. Moreover, the comparison analysis showed that our model significantly outperforms five other existing algorithms. Finally, we used the HNSPPI model to explore the SARS-CoV-2-Human interaction system and found several potential regulatory relationships. In summary, HNSPPI is a promising model for predicting new protein interactions from known PPI data.
Affiliation(s)
- Shijie Xie
  - College of Artificial Intelligence, Nanjing Agricultural University, No. 1 Weigang Rd, Nanjing, Jiangsu 210095, China
- Xiaojun Xie
  - College of Artificial Intelligence, Nanjing Agricultural University, No. 1 Weigang Rd, Nanjing, Jiangsu 210095, China
- Xin Zhao
  - Department of Hepatobiliary Surgery, Beijing Chaoyang Hospital affiliated to Capital Medical University, Beijing 100020, China
- Fei Liu
  - Joint International Research Laboratory of Animal Health and Food Safety of Ministry of Education & Single Molecule Nanometry Laboratory (Sinmolab), Nanjing Agricultural University, Nanjing, Jiangsu 210095, China
- Yiming Wang
  - Key Laboratory of Biological Interactions and Crop Health, Department of Plant Pathology, Nanjing Agricultural University, 210095, Nanjing, China
- Jihui Ping
  - MOE International Joint Collaborative Research Laboratory for Animal Health and Food Safety & Jiangsu Engineering Laboratory of Animal Immunology, College of Veterinary Medicine, Nanjing Agricultural University, Nanjing, Jiangsu 210095, China
- Zhiwei Ji
  - College of Artificial Intelligence, Nanjing Agricultural University, No. 1 Weigang Rd, Nanjing, Jiangsu 210095, China
2
Li H, Liu B. BioSeq-Diabolo: Biological sequence similarity analysis using Diabolo. PLoS Comput Biol 2023; 19:e1011214. [PMID: 37339155 DOI: 10.1371/journal.pcbi.1011214] [Citation(s) in RCA: 18] [Impact Index Per Article: 18.0] [Received: 10/03/2022] [Accepted: 05/24/2023] [Indexed: 06/22/2023]
Abstract
As the key to biological sequence structure and function prediction, disease diagnosis, and treatment, biological sequence similarity analysis has attracted increasing attention. However, the existing computational methods fail to accurately analyse biological sequence similarities because of the various data types (DNA, RNA, protein, disease, etc.) and their low sequence similarities (remote homology). Therefore, new concepts and techniques are needed to solve this challenging problem. Biological sequences (DNA, RNA and protein sequences) can be considered as the sentences of "the book of life", and their similarities can be considered as the biological language semantics (BLS). In this study, we apply semantics analysis techniques derived from natural language processing (NLP) to comprehensively and accurately analyse biological sequence similarities. We introduce 27 semantics analysis methods derived from NLP, bringing new concepts and techniques to biological sequence similarity analysis. Experimental results show that these semantics analysis methods facilitate the development of protein remote homology detection, circRNA-disease association identification and protein function annotation, achieving better performance than the other state-of-the-art predictors in the related fields. Based on these semantics analysis methods, a platform called BioSeq-Diabolo has been constructed, named after a popular traditional sport in China. Users only need to input the embeddings of the biological sequence data. BioSeq-Diabolo will intelligently identify the task and then accurately analyse the biological sequence similarities based on biological language semantics. BioSeq-Diabolo will integrate different biological sequence similarities in a supervised manner by using Learning to Rank (LTR), and the performance of the constructed methods will be evaluated and analysed so as to recommend the best methods for the users. The web server and stand-alone package of BioSeq-Diabolo can be accessed at http://bliulab.net/BioSeq-Diabolo/server/.
Affiliation(s)
- Hongliang Li
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
- Bin Liu
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
  - Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
3
Klein-Seetharaman J. Gobind's last graduate student. Biophys Rev 2023; 15:75-88. [PMID: 36909953 PMCID: PMC9995623 DOI: 10.1007/s12551-023-01047-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/01/2022] [Accepted: 01/23/2023] [Indexed: 02/21/2023]
Abstract
Written on the occasion of H. Gobind Khorana's 100th birthday, this is a personal account of my time as a graduate student with the Nobel laureate at the Massachusetts Institute of Technology from 1996 to 2000.
Affiliation(s)
- Judith Klein-Seetharaman
  - School of Molecular Sciences & College of Health Solutions, Arizona State University, Phoenix, AZ 85004 USA
4
Pang Y, Liu B. SelfAT-Fold: Protein Fold Recognition Based on Residue-Based and Motif-Based Self-Attention Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:1861-1869. [PMID: 33090951 DOI: 10.1109/tcbb.2020.3031888] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 06/11/2023]
Abstract
Protein fold recognition is a fundamental and crucial step of tertiary structure determination. In this regard, several computational predictors have been proposed. Recently, predictive performance has been clearly improved by fold-specific features generated by deep learning techniques. However, these methods fail to measure the global associations among residues or motifs along the protein sequences. Furthermore, these deep learning techniques are often treated as black boxes without interpretability. Inspired by the similarities between protein sequences and natural language sentences, we applied the self-attention mechanism from the field of natural language processing (NLP) to protein fold recognition. The motif-based self-attention network (MSAN) and the residue-based self-attention network (RSAN) were constructed based on a training set to capture the global associations among the structure motifs and residues along the protein sequences, respectively. The fold-specific attention features trained and generated from the training set were then combined with Support Vector Machines (SVMs) to predict the samples in the widely used LE benchmark dataset, which is fully independent from the training set. Experimental results showed that the two proposed SelfAT-Fold predictors outperformed 34 existing state-of-the-art computational predictors. The two SelfAT-Fold predictors were further tested on an independent dataset, SCOP_TEST, where they achieved stable performance. Furthermore, the fold-specific attention features can be used to analyse the characteristics of protein folds. The trained models and data of SelfAT-Fold can be downloaded from http://bliulab.net/selfAT_fold/.
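To make the self-attention idea in this abstract concrete, here is a minimal sketch of generic scaled dot-product self-attention over a sequence of residue (or motif) feature vectors. It is not the MSAN or RSAN architecture, whose details are not given in the abstract; the function names, the projection matrices `wq`, `wk`, `wv`, and the feature dimensions are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Generic scaled dot-product self-attention (illustrative, not SelfAT-Fold).

    x          : (L, d) array, one feature vector per residue or motif
    wq, wk, wv : (d, h) hypothetical projection matrices
    Returns the (L, h) attended features and the (L, L) attention weights.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    # Every position scores every other position, so associations are global.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Toy input: 7 positions with 4 features each, random projections.
rng = np.random.default_rng(0)
x = rng.normal(size=(7, 4))
wq, wk, wv = (rng.normal(size=(4, 3)) for _ in range(3))
features, attention = self_attention(x, wq, wk, wv)
print(features.shape, attention.shape)
```

Each row of the attention matrix sums to one and shows how strongly one position draws on every other position, which is the kind of quantity one could inspect when looking for interpretable associations along a sequence.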
5
Kermanidis KL. Identifying Latent Semantics in Action Games for Player Modeling. International Journal of Gaming and Computer-Mediated Simulations 2019. [DOI: 10.4018/ijgcms.2019040101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
Machine learning approaches to player modeling traditionally employ a high-level game-knowledge-based feature for representing game sessions, and often player behavioral features as well. The present work makes use of generic low-level features and latent semantic analysis for unsupervised player modeling, but mostly for revealing underlying hidden information regarding game semantics that is not easily detectable beforehand.
6
Dong Q, Wang K, Liu X. Identifying the missing proteins in human proteome by biological language model. BMC Systems Biology 2016; 10:113. [PMID: 28155671 PMCID: PMC5259966 DOI: 10.1186/s12918-016-0352-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 12/16/2022]
Abstract
BACKGROUND With the rapid development of high-throughput sequencing technology, proteomics research has become an active field in the post-genomic era. It is necessary to identify all the native-encoding protein sequences for further function and pathway analysis. Toward that end, the Human Proteome Organization launched the Human Proteome Project in 2011. However, many proteins are hard to detect by experimental methods, which has become one of the bottlenecks in the Human Proteome Project. Given the difficulty of detecting these missing proteins with wet-lab approaches, here we use a bioinformatics method to pre-filter the missing proteins. RESULTS Because biological sequences are analogous to natural language, n-gram models from the field of Natural Language Processing were used to filter the missing proteins. The dataset used in this study contains 616 missing proteins from the "uncertain" category of the neXtProt database. The n-gram model identified 102 proteins that have a high probability of being native human proteins. We perform a detailed analysis of the predicted structure and function of these missing proteins and also compare the high-probability proteins with other mass spectrometry datasets. The evaluation shows that the results reported here are in good agreement with those from other well-established databases. CONCLUSION The analysis shows that 102 proteins may be native gene-coding proteins and that some of the missing proteins are membrane or natively disordered proteins, which are hard to detect by experimental methods.
Affiliation(s)
- Qiwen Dong
  - Institute for Data Science and Engineering, East China Normal University, Shanghai, 200062, People's Republic of China
  - Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, 518055, People's Republic of China
- Kai Wang
  - College of Animal Science and Technology, Jilin Agricultural University, Changchun, 130118, People's Republic of China
- Xuan Liu
  - College of Engineering, Shanghai Ocean University, Shanghai, 201303, People's Republic of China
7
Ganapathiraju MK, Mitchell AD, Thahir M, Motwani K, Ananthasubramanian S. Suite of tools for statistical N-gram language modeling for pattern mining in whole genome sequences. J Bioinform Comput Biol 2012; 10:1250016. [PMID: 22817111 DOI: 10.1142/s0219720012500163] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
Genome sequences contain a number of patterns that have biomedical significance. Repetitive sequences of various kinds are a primary component of most genomic sequence patterns. We extended the suffix-array-based Biological Language Modeling Toolkit to compute n-gram frequencies, as well as n-gram language-model-based perplexity, in windows over the whole genome sequence to find biologically relevant patterns. We present the suite of tools and its application to the analysis of the whole human genome sequence.
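The windowed n-gram statistics described in this abstract can be illustrated with a small sketch. It counts n-grams with a plain `Counter` rather than the suffix array the toolkit uses, and it assumes an add-one-smoothed model estimated from the whole sequence; the function names, window size, and smoothing choice are illustrative assumptions, not the toolkit's interface.

```python
import math
from collections import Counter

def ngram_counts(seq, n):
    """Count every overlapping n-gram in seq."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

def window_perplexity(seq, n=3, window=12, step=6, alphabet="ACGT"):
    """Return (start, perplexity) for each sliding window over seq.

    Assumes window >= n. Probabilities come from an add-one-smoothed
    n-gram model estimated on the whole sequence (an assumption made
    for this sketch).
    """
    grams = ngram_counts(seq, n)
    # Context count = how often each (n-1)-gram is followed by any symbol.
    contexts = Counter()
    for gram, count in grams.items():
        contexts[gram[:-1]] += count
    vocab_size = len(alphabet)
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        log_prob, total = 0.0, 0
        for i in range(len(chunk) - n + 1):
            gram = chunk[i:i + n]
            p = (grams[gram] + 1) / (contexts[gram[:-1]] + vocab_size)
            log_prob += math.log2(p)
            total += 1
        results.append((start, 2 ** (-log_prob / total)))
    return results

# A repetitive stretch followed by a more varied stretch.
sequence = "A" * 24 + "ACGTGCTAGCTA"
for start, ppl in window_perplexity(sequence, n=3, window=12, step=12):
    print(start, round(ppl, 3))
```

Windows covering the repetitive stretch get lower perplexity than the varied one, which is the signal one would scan for when looking for repeat-rich regions.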
Affiliation(s)
- Madhavi K Ganapathiraju
  - Department of Biomedical Informatics, University of Pittsburgh, 5607 Baum Boulevard, Suite BAUM 423, Pittsburgh, PA 15206-3701, USA
8
9
Dong QW, Wang XL, Lin L. Application of latent semantic analysis to protein remote homology detection. Bioinformatics 2005; 22:285-90. [PMID: 16317074 DOI: 10.1093/bioinformatics/bti801] [Citation(s) in RCA: 73] [Impact Index Per Article: 3.8] [Indexed: 11/12/2022]
Abstract
MOTIVATION Remote homology detection between protein sequences is a central problem in computational biology. Discriminative methods such as the support vector machine (SVM) are among the most effective. Many of the SVM-based methods focus on finding useful representations of protein sequences, using either explicit feature vector representations or kernel functions. Such representations may suffer from the peaking phenomenon seen in many machine-learning methods, because the feature space is usually very large and noisy data may be introduced. Based on these observations, this research focuses on feature extraction and efficient representation of protein vectors for SVM protein classification. RESULTS In this study, a latent semantic analysis (LSA) model, which is an efficient feature extraction technique from natural language processing, has been introduced in protein remote homology detection. Several basic building blocks of protein sequences have been investigated as the 'words' of the 'protein sequence language', including N-grams, patterns and motifs. Each protein sequence is taken as a 'document' that is composed of a bag of words. The word-document matrix is constructed first. LSA is then performed on the matrix to produce the latent semantic representation vectors of protein sequences, leading to noise removal and a smart description of protein sequences. The latent semantic representation vectors are then evaluated by SVM. The method is tested on the SCOP 1.53 database. The results show that the LSA model significantly improves the performance of remote homology detection in comparison with the basic formalisms. Furthermore, the performance of this method is comparable with that of complex kernel methods such as SVM-LA and better than that of other sequence-based methods such as PSI-BLAST and SVM-pairwise.
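The pipeline in this abstract (N-gram 'words', a word-document matrix, then LSA) can be sketched as follows. The sketch assumes raw N-gram counts as matrix entries and a truncated singular value decomposition as the LSA step, which is the conventional formulation; the weighting used in the paper, the SVM stage, and the SCOP data are not reproduced, and the function names and toy sequences are hypothetical.

```python
import numpy as np

def word_document_matrix(seqs, n=2):
    """Rows are N-gram 'words', columns are protein 'documents' (raw counts)."""
    vocab = sorted({s[i:i + n] for s in seqs for i in range(len(s) - n + 1)})
    index = {word: row for row, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(seqs)))
    for col, s in enumerate(seqs):
        for i in range(len(s) - n + 1):
            matrix[index[s[i:i + n]], col] += 1
    return matrix

def lsa_embed(matrix, k):
    """Project each document onto the top-k latent dimensions via SVD."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Scale the right singular vectors by their singular values; one row per document.
    return (np.diag(s[:k]) @ vt[:k]).T

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy sequences (hypothetical): the first two share most of their bigrams.
seqs = ["ACDEFG", "ACDEFH", "WYWYWY"]
vectors = lsa_embed(word_document_matrix(seqs, n=2), k=2)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

In a full pipeline, vectors like these would be passed to a classifier such as an SVM; dropping the small singular values is what gives LSA its noise-reduction effect.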
Affiliation(s)
- Qi-Wen Dong
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China