1
|
Shao J, Zhang Q, Yan K, Liu B. PreHom-PCLM: protein remote homology detection by combing motifs and protein cubic language model. Brief Bioinform 2023; 24:bbad347. [PMID: 37833837 DOI: 10.1093/bib/bbad347] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2023] [Revised: 08/14/2023] [Accepted: 09/14/2023] [Indexed: 10/15/2023] Open
Abstract
Protein remote homology detection is essential for structure prediction, function prediction, disease mechanism understanding, etc. The remote homology relationship depends on multiple protein properties, such as structural information and local sequence patterns. Previous studies have shown the challenges for predicting remote homology relationship by protein features at sequence level (e.g. position-specific score matrix). Protein motifs have been used in structure and function analysis due to their unique sequence patterns and implied structural information. Therefore, designing a usable architecture to fuse multiple protein properties based on motifs is urgently needed to improve protein remote homology detection performance. To make full use of the characteristics of motifs, we employed the language model called the protein cubic language model (PCLM). It combines multiple properties by constructing a motif-based neural network. Based on the PCLM, we proposed a predictor called PreHom-PCLM by extracting and fusing multiple motif features for protein remote homology detection. PreHom-PCLM outperforms the other state-of-the-art methods on the test set and independent test set. Experimental results further prove the effectiveness of multiple features fused by PreHom-PCLM for remote homology detection. Furthermore, the protein features derived from the PreHom-PCLM show strong discriminative power for proteins from different structural classes in the high-dimensional space. Availability and Implementation: http://bliulab.net/PreHom-PCLM.
Collapse
Affiliation(s)
- Jiangyi Shao
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
| | - Qi Zhang
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
| | - Ke Yan
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
- Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing 100081, China
| | - Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
- Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing 100081, China
| |
Collapse
|
2
|
Shao J, Yan K, Liu B. FoldRec-C2C: protein fold recognition by combining cluster-to-cluster model and protein similarity network. Brief Bioinform 2021; 22:5873289. [PMID: 32685972 PMCID: PMC7454262 DOI: 10.1093/bib/bbaa144] [Citation(s) in RCA: 44] [Impact Index Per Article: 14.7] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/07/2020] [Revised: 05/26/2020] [Accepted: 06/11/2020] [Indexed: 12/27/2022] Open
Abstract
As a key for studying the protein structures, protein fold recognition is playing an important role in predicting the protein structures associated with COVID-19 and other important structures. However, the existing computational predictors only focus on the protein pairwise similarity or the similarity between two groups of proteins from 2-folds. However, the homology relationship among proteins is in a hierarchical structure. The global protein similarity network will contribute to the performance improvement. In this study, we proposed a predictor called FoldRec-C2C to globally incorporate the interactions among proteins into the prediction. For the FoldRec-C2C predictor, protein fold recognition problem is treated as an information retrieval task in nature language processing. The initial ranking results were generated by a surprised ranking algorithm Learning to Rank, and then three re-ranking algorithms were performed on the ranking lists to adjust the results globally based on the protein similarity network, including seq-to-seq model, seq-to-cluster model and cluster-to-cluster model (C2C). When tested on a widely used and rigorous benchmark dataset LINDAHL dataset, FoldRec-C2C outperforms other 34 state-of-the-art methods in this field. The source code and data of FoldRec-C2C can be downloaded from http://bliulab.net/FoldRec-C2C/download.
Collapse
Affiliation(s)
- Jiangyi Shao
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
| | - Ke Yan
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China
| | - Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
| |
Collapse
|
3
|
Shao J, Liu B. ProtFold-DFG: protein fold recognition by combining Directed Fusion Graph and PageRank algorithm. Brief Bioinform 2020; 22:5901980. [PMID: 32892224 DOI: 10.1093/bib/bbaa192] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/25/2020] [Revised: 07/16/2020] [Accepted: 07/28/2020] [Indexed: 12/27/2022] Open
Abstract
As one of the most important tasks in protein structure prediction, protein fold recognition has attracted more and more attention. In this regard, some computational predictors have been proposed with the development of machine learning and artificial intelligence techniques. However, these existing computational methods are still suffering from some disadvantages. In this regard, we propose a new network-based predictor called ProtFold-DFG for protein fold recognition. We propose the Directed Fusion Graph (DFG) to fuse the ranking lists generated by different methods, which employs the transitive closure to incorporate more relationships among proteins and uses the KL divergence to calculate the relationship between two proteins so as to improve its generalization ability. Finally, the PageRank algorithm is performed on the DFG to accurately recognize the protein folds by considering the global interactions among proteins in the DFG. Tested on a widely used and rigorous benchmark data set, LINDAHL dataset, experimental results show that the ProtFold-DFG outperforms the other 35 competing methods, indicating that ProtFold-DFG will be a useful method for protein fold recognition. The source code and data of ProtFold-DFG can be downloaded from http://bliulab.net/ProtFold-DFG/download.
Collapse
Affiliation(s)
- Jiangyi Shao
- School of Computer Science and Technology, Beijing Institute of Technology, China
| | - Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
| |
Collapse
|
4
|
Zhang J, Kwong S, Wong KC. ToBio: Global Pathway Similarity Search Based on Topological and Biological Features. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:336-349. [PMID: 29990160 DOI: 10.1109/tcbb.2017.2769642] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Pathway similarity search plays a vital role in the post-genomics era. Unfortunately, pathway similarity search involves the graph isomorphism problem which is NP-complete. Therefore, efficient search algorithms are desirable. In this work, we propose a novel global pathway similarity search approach named ToBio, which considers both topological and biological features for effective global pathway similarity search. Specifically, as motivated from nature, various topological and biological features including subgraph signature similarities, sequence similarities, and gene ontology similarities are considered in ToBio. Since different features carry different functional importance and dependences, we report three schemes of ToBio using different sets of features. In addition, to enhance the existing search algorithms for rigorous comparisons, post-processing pipelines are also proposed to investigate how different features can contribute to the search performance. ToBio and other state-of-the-art methods are benchmarked on the gold-standard pathway datasets from three species. The results demonstrate the competitive edges of ToBio over the state-of-the-arts ranging from the topological aspects to the biological aspects. Case studies have been conducted to reveal mechanistic insights into the unique search performance of ToBio.
Collapse
|
5
|
Wang A, Lim H, Cheng SY, Xie L. ANTENNA, a Multi-Rank, Multi-Layered Recommender System for Inferring Reliable Drug-Gene-Disease Associations: Repurposing Diazoxide as a Targeted Anti-Cancer Therapy. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:1960-1967. [PMID: 29993812 PMCID: PMC6139288 DOI: 10.1109/tcbb.2018.2812189] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Existing drug discovery processes follow a reductionist model of "one-drug-one-gene-one-disease," which is inadequate to tackle complex diseases involving multiple malfunctioned genes. The availability of big omics data offers opportunities to transform drug discovery process into a new paradigm of systems pharmacology that focuses on designing drugs to target molecular interaction networks instead of a single gene. Here, we develop a reliable multi-rank, multi-layered recommender system, ANTENNA, to mine large-scale chemical genomics and disease association data for prediction of novel drug-gene-disease associations. ANTENNA integrates a novel tri-factorization based dual-regularized weighted and imputed One Class Collaborative Filtering (OCCF) algorithm, tREMAP, with a statistical framework based on Random Walk with Restart and assess the reliability of specific predictions. In the benchmark, tREMAP clearly outperforms the single-rank OCCF. We apply ANTENNA to a real-world problem: repurposing old drugs for new clinical indications without effective treatments. We discover that FDA-approved drug diazoxide can inhibit multiple kinase genes responsible for many diseases including cancer and kill triple negative breast cancer (TNBC) cells efficiently [Formula: see text]. TNBC is a deadly disease without effective targeted therapies. Our finding demonstrates the power of big data analytics in drug discovery and developing a targeted therapy for TNBC.
Collapse
|
6
|
Lim H, Gray P, Xie L, Poleksic A. Improved genome-scale multi-target virtual screening via a novel collaborative filtering approach to cold-start problem. Sci Rep 2016; 6:38860. [PMID: 27958331 PMCID: PMC5153628 DOI: 10.1038/srep38860] [Citation(s) in RCA: 35] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/22/2016] [Accepted: 11/15/2016] [Indexed: 12/18/2022] Open
Abstract
Conventional one-drug-one-gene approach has been of limited success in modern drug discovery. Polypharmacology, which focuses on searching for multi-targeted drugs to perturb disease-causing networks instead of designing selective ligands to target individual proteins, has emerged as a new drug discovery paradigm. Although many methods for single-target virtual screening have been developed to improve the efficiency of drug discovery, few of these algorithms are designed for polypharmacology. Here, we present a novel theoretical framework and a corresponding algorithm for genome-scale multi-target virtual screening based on the one-class collaborative filtering technique. Our method overcomes the sparseness of the protein-chemical interaction data by means of interaction matrix weighting and dual regularization from both chemicals and proteins. While the statistical foundation behind our method is general enough to encompass genome-wide drug off-target prediction, the program is specifically tailored to find protein targets for new chemicals with little to no available interaction data. We extensively evaluate our method using a number of the most widely accepted gene-specific and cross-gene family benchmarks and demonstrate that our method outperforms other state-of-the-art algorithms for predicting the interaction of new chemicals with multiple proteins. Thus, the proposed algorithm may provide a powerful tool for multi-target drug design.
Collapse
Affiliation(s)
- Hansaim Lim
- Department of Computer Science, Hunter College, The City University of New York, New York, New York 10065, United States
| | - Paul Gray
- Department of Computer Science, University of Northern Iowa, Cedar Falls, Iowa 50614, United States
| | - Lei Xie
- Department of Computer Science, Hunter College, The City University of New York, New York, New York 10065, United States.,Ph.D. Program in Computer Science, Biochemistry and Biology, The Graduate Center, The City University of New York, New York, New York 10065, United States
| | - Aleksandar Poleksic
- Department of Computer Science, University of Northern Iowa, Cedar Falls, Iowa 50614, United States
| |
Collapse
|
7
|
Cui X, Lu Z, Wang S, Jing-Yan Wang J, Gao X. CMsearch: simultaneous exploration of protein sequence space and structure space improves not only protein homology detection but also protein structure prediction. Bioinformatics 2016; 32:i332-i340. [PMID: 27307635 PMCID: PMC4908355 DOI: 10.1093/bioinformatics/btw271] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022] Open
Abstract
MOTIVATION Protein homology detection, a fundamental problem in computational biology, is an indispensable step toward predicting protein structures and understanding protein functions. Despite the advances in recent decades on sequence alignment, threading and alignment-free methods, protein homology detection remains a challenging open problem. Recently, network methods that try to find transitive paths in the protein structure space demonstrate the importance of incorporating network information of the structure space. Yet, current methods merge the sequence space and the structure space into a single space, and thus introduce inconsistency in combining different sources of information. METHOD We present a novel network-based protein homology detection method, CMsearch, based on cross-modal learning. Instead of exploring a single network built from the mixture of sequence and structure space information, CMsearch builds two separate networks to represent the sequence space and the structure space. It then learns sequence-structure correlation by simultaneously taking sequence information, structure information, sequence space information and structure space information into consideration. RESULTS We tested CMsearch on two challenging tasks, protein homology detection and protein structure prediction, by querying all 8332 PDB40 proteins. Our results demonstrate that CMsearch is insensitive to the similarity metrics used to define the sequence and the structure spaces. By using HMM-HMM alignment as the sequence similarity metric, CMsearch clearly outperforms state-of-the-art homology detection methods and the CASP-winning template-based protein structure prediction methods. AVAILABILITY AND IMPLEMENTATION Our program is freely available for download from http://sfb.kaust.edu.sa/Pages/Software.aspx CONTACT : xin.gao@kaust.edu.sa SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Xuefeng Cui
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955-6900, Saudi Arabia
| | - Zhiwu Lu
- Beijing Key Laboratory of Big Data Management and Analysis Methods, School of Information, Renmin University of China, Beijing 100872, China
| | - Sheng Wang
- Toyota Technological Institute at Chicago, 6045 Kenwood Avenue, Chicago, IL 60637, USA Department of Human Genetics, University of Chicago, E. 58th St, Chicago, IL 60637, USA
| | - Jim Jing-Yan Wang
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955-6900, Saudi Arabia
| | - Xin Gao
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955-6900, Saudi Arabia
| |
Collapse
|
8
|
Lhota J, Xie L. Protein-fold recognition using an improved single-source K diverse shortest paths algorithm. Proteins 2016; 84:467-72. [PMID: 26800480 DOI: 10.1002/prot.24993] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2015] [Revised: 01/10/2016] [Accepted: 01/12/2016] [Indexed: 11/11/2022]
Abstract
Protein structure prediction, when construed as a fold recognition problem, is one of the most important applications of similarity search in bioinformatics. A new protein-fold recognition method is reported which combines a single-source K diverse shortest path (SSKDSP) algorithm with Enrichment of Network Topological Similarity (ENTS) algorithm to search a graphic feature space generated using sequence similarity and structural similarity metrics. A modified, more efficient SSKDSP algorithm is developed to improve the performance of graph searching. The new implementation of the SSKDSP algorithm empirically requires 82% less memory and 61% less time than the current implementation, allowing for the analysis of larger, denser graphs. Furthermore, the statistical significance of fold ranking generated from SSKDSP is assessed using ENTS. The reported ENTS-SSKDSP algorithm outperforms original ENTS that uses random walk with restart for the graph search as well as other state-of-the-art protein structure prediction algorithms HHSearch and Sparks-X, as evaluated by a benchmark of 600 query proteins. The reported methods may easily be extended to other similarity search problems in bioinformatics and chemoinformatics. The SSKDSP software is available at http://compsci.hunter.cuny.edu/~leixie/sskdsp.html.
Collapse
Affiliation(s)
| | - Lei Xie
- Department of Computer Science, Hunter College, the Graduate Center, the City University of New York, New York
| |
Collapse
|
9
|
Hart T, Xie L. Providing data science support for systems pharmacology and its implications to drug discovery. Expert Opin Drug Discov 2016; 11:241-56. [PMID: 26689499 DOI: 10.1517/17460441.2016.1135126] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022]
Abstract
INTRODUCTION The conventional one-drug-one-target-one-disease drug discovery process has been less successful in tracking multi-genic, multi-faceted complex diseases. Systems pharmacology has emerged as a new discipline to tackle the current challenges in drug discovery. The goal of systems pharmacology is to transform huge, heterogeneous, and dynamic biological and clinical data into interpretable and actionable mechanistic models for decision making in drug discovery and patient treatment. Thus, big data technology and data science will play an essential role in systems pharmacology. AREAS COVERED This paper critically reviews the impact of three fundamental concepts of data science on systems pharmacology: similarity inference, overfitting avoidance, and disentangling causality from correlation. The authors then discuss recent advances and future directions in applying the three concepts of data science to drug discovery, with a focus on proteome-wide context-specific quantitative drug target deconvolution and personalized adverse drug reaction prediction. EXPERT OPINION Data science will facilitate reducing the complexity of systems pharmacology modeling, detecting hidden correlations between complex data sets, and distinguishing causation from correlation. The power of data science can only be fully realized when integrated with mechanism-based multi-scale modeling that explicitly takes into account the hierarchical organization of biological systems from nucleic acid to proteins, to molecular interaction networks, to cells, to tissues, to patients, and to populations.
Collapse
Affiliation(s)
- Thomas Hart
- a The Rockefeller University , New York , NY , USA.,b Department of Biological Sciences, Hunter College , The City University of New York , New York , NY , USA
| | - Lei Xie
- c Department of Computer Science, Hunter College , The City University of New York , New York , NY , USA.,d The Graduate Center , The City University of New York , New York , NY , USA
| |
Collapse
|
10
|
Scaiewicz A, Levitt M. The language of the protein universe. Curr Opin Genet Dev 2015; 35:50-6. [PMID: 26451980 DOI: 10.1016/j.gde.2015.08.010] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2015] [Revised: 08/20/2015] [Accepted: 08/25/2015] [Indexed: 11/17/2022]
Abstract
Proteins, the main cell machinery which play a major role in nearly every cellular process, have always been a central focus in biology. We live in the post-genomic era, and inferring information from massive data sets is a steadily growing universal challenge. The increasing availability of fully sequenced genomes can be regarded as the 'Rosetta Stone' of the protein universe, allowing the understanding of genomes and their evolution, just as the original Rosetta Stone allowed Champollion to decipher the ancient Egyptian hieroglyphics. In this review, we consider aspects of the protein domain architectures repertoire that are closely related to those of human languages and aim to provide some insights about the language of proteins.
Collapse
Affiliation(s)
- Andrea Scaiewicz
- Department of Structural Biology, Stanford University, Stanford, CA 94305-5126, United States
| | - Michael Levitt
- Department of Structural Biology, Stanford University, Stanford, CA 94305-5126, United States.
| |
Collapse
|