1
Nayar G, Altman RB. Heterogeneous network approaches to protein pathway prediction. Comput Struct Biotechnol J 2024; 23:2727-2739. [PMID: 39035835 PMCID: PMC11260399 DOI: 10.1016/j.csbj.2024.06.022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2024] [Revised: 06/17/2024] [Accepted: 06/18/2024] [Indexed: 07/23/2024] Open
Abstract
Understanding protein-protein interactions (PPIs) and the pathways they comprise is essential for comprehending cellular functions and their links to specific phenotypes. Despite the prevalence of molecular data generated by high-throughput sequencing technologies, a significant gap remains in translating these data into functional information regarding the series of interactions that underlie phenotypic differences. In this review, we present an in-depth analysis of heterogeneous network methodologies for modeling protein pathways, highlighting the critical role of integrating multifaceted biological data. We outline the process of constructing these networks, from data representation to machine learning-driven predictions and evaluations. The work underscores the potential of heterogeneous networks in capturing the complexity of proteomic interactions, thereby offering enhanced accuracy in pathway prediction. This approach not only deepens our understanding of cellular processes but also opens up new possibilities in disease treatment and drug discovery by leveraging the predictive power of comprehensive proteomic data analysis.
Affiliation(s)
- Gowri Nayar
  - Department of Biomedical Data Science, Stanford University, United States
- Russ B. Altman
  - Department of Biomedical Data Science, Stanford University, United States
  - Department of Genetics, Stanford University, United States
  - Department of Medicine, Stanford University, United States
  - Department of Bioengineering, Stanford University, United States

2
Sahoo TR, Patra S, Vipsita S. Decision tree classifier based on topological characteristics of subgraph for the mining of protein complexes from large scale PPI networks. Comput Biol Chem 2023; 106:107935. [PMID: 37536230 DOI: 10.1016/j.compbiolchem.2023.107935] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/27/2022] [Revised: 06/11/2023] [Accepted: 07/23/2023] [Indexed: 08/05/2023]
Abstract
The growing accessibility of large-scale protein interaction data demands extensive research to understand cell organization and its functioning at the network level. Bioinformatics and data mining researchers have extensively studied network clustering to examine the structural and operational features of protein-protein interaction (PPI) networks. Clustering PPI networks has proven useful in numerous studies over the past two decades for identifying functional modules, understanding the roles of previously unknown proteins, and other purposes. Protein complexes represent one of the essential cellular components for creating biological activities. Inferring protein complexes has been made more accessible by experimental approaches. We offer a novel method that integrates the classification model with local topological data, making it more reliable and efficient. This article describes a decision tree classifier based on topological characteristics of the subgraph for mining protein complexes. The proposed graph-based algorithm is an effective and efficient way to identify protein complexes from large-scale PPI networks. The performance of the proposed algorithm is evaluated on protein-protein interaction networks of yeast and human from the Database of Interacting Proteins (DIP) and the Biological General Repository for Interaction Datasets (BioGRID), using widely accepted benchmark protein complexes from the comprehensive resource of mammalian protein complexes (CORUM) and the comprehensive catalogue of yeast protein complexes (CYC2008). The outcomes demonstrate that our method can outperform the best-performing supervised, semi-supervised, and unsupervised approaches to detecting protein complexes.
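To make "topological characteristics of the subgraph" concrete, the sketch below computes a few simple features for a candidate subgraph of a toy PPI network. The feature set (size, density, mean and maximum degree) and the name `subgraph_features` are assumptions chosen for illustration; the abstract does not list the features the authors actually use.

```python
def subgraph_features(edges, nodes):
    """Return simple topological features of the subgraph induced by `nodes`.

    Hypothetical feature set for illustration only.
    """
    nodes = set(nodes)
    # Keep only edges with both endpoints inside the candidate subgraph;
    # frozenset removes duplicate and reversed pairs, and self-loops are skipped.
    inner = {
        frozenset((u, v))
        for u, v in edges
        if u in nodes and v in nodes and u != v
    }
    n = len(nodes)
    m = len(inner)
    degree = {v: 0 for v in nodes}
    for edge in inner:
        for v in edge:
            degree[v] += 1
    max_edges = n * (n - 1) / 2
    return {
        "size": n,
        "density": m / max_edges if max_edges else 0.0,
        "mean_degree": 2 * m / n if n else 0.0,
        "max_degree": max(degree.values(), default=0),
    }


# Toy PPI network: a triangle (A, B, C) with one pendant protein D.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
triangle = subgraph_features(toy_edges, ["A", "B", "C"])
whole = subgraph_features(toy_edges, ["A", "B", "C", "D"])
```

Feature dictionaries like these, computed for known complexes and for non-complex subgraphs, could then serve as training rows for a decision tree or any other classifier.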
Affiliation(s)
- Tushar Ranjan Sahoo
  - Bioinformatics Lab, Department of Computer Science, IIIT, Bhubaneswar, India.
- Sabyasachi Patra
  - Bioinformatics Lab, Department of Computer Science, IIIT, Bhubaneswar, India.
- Swati Vipsita
  - Bioinformatics Lab, Department of Computer Science, IIIT, Bhubaneswar, India.

3
Palukuri MV, Patil RS, Marcotte EM. Molecular complex detection in protein interaction networks through reinforcement learning. BMC Bioinformatics 2023; 24:306. [PMID: 37532987 PMCID: PMC10394916 DOI: 10.1186/s12859-023-05425-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2023] [Accepted: 07/20/2023] [Indexed: 08/04/2023] Open
Abstract
BACKGROUND Proteins often assemble into higher-order complexes to perform their biological functions. Such protein-protein interactions (PPI) are often experimentally measured for pairs of proteins and summarized in a weighted PPI network, to which community detection algorithms can be applied to define the various higher-order protein complexes. Current methods include unsupervised and supervised approaches, often assuming that protein complexes manifest only as dense subgraphs. Utilizing supervised approaches, the focus is not on how to find them in a network, but only on learning which subgraphs correspond to complexes, currently solved using heuristics. However, learning to walk trajectories on a network to identify protein complexes leads naturally to a reinforcement learning (RL) approach, a strategy not extensively explored for community detection. Here, we develop and evaluate a reinforcement learning pipeline for community detection on weighted protein-protein interaction networks to detect new protein complexes. The algorithm is trained to calculate the value of different subgraphs encountered while walking on the network to reconstruct known complexes. A distributed prediction algorithm then scales the RL pipeline to search for novel protein complexes on large PPI networks. RESULTS The reinforcement learning pipeline is applied to a human PPI network consisting of 8k proteins and 60k PPI, which results in 1,157 protein complexes. The method demonstrated competitive accuracy with improved speed compared to previous algorithms. We highlight protein complexes such as C4orf19, C18orf21, and KIAA1522 which are currently minimally characterized. Additionally, the results suggest TMC04 be a putative additional subunit of the KICSTOR complex and confirm the involvement of C15orf41 in a higher-order complex with HIRA, CDAN1, ASF1A, and by 3D structural modeling. 
CONCLUSIONS Reinforcement learning offers several distinct advantages for community detection, including scalability and knowledge of the walk trajectories defining those communities. Applied to currently available human protein interaction networks, this method had comparable accuracy with other algorithms and notable savings in computational time, and in turn, led to clear predictions of protein function and interactions for several uncharacterized human proteins.
Affiliation(s)
- Meghana V Palukuri
  - Department of Molecular Biosciences, Center for Systems and Synthetic Biology, University of Texas, Austin, TX, 78712, USA.
  - Oden Institute for Computational Engineering and Sciences, University of Texas, Austin, TX, 78712, USA.
- Ridhi S Patil
  - Department of Biomedical Engineering, University of Texas, Austin, TX, 78712, USA.
- Edward M Marcotte
  - Department of Molecular Biosciences, Center for Systems and Synthetic Biology, University of Texas, Austin, TX, 78712, USA.
  - Oden Institute for Computational Engineering and Sciences, University of Texas, Austin, TX, 78712, USA.

4
Manipur I, Giordano M, Piccirillo M, Parashuraman S, Maddalena L. Community Detection in Protein-Protein Interaction Networks and Applications. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:217-237. [PMID: 34951849 DOI: 10.1109/tcbb.2021.3138142] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 05/04/2023]
Abstract
The ability to identify and characterize not only the protein-protein interactions but also their internal modular organization through network analysis is fundamental for understanding the mechanisms of biological processes at the molecular level. Indeed, the detection of the network communities can enhance our understanding of the molecular basis of disease pathology, and promote drug discovery and disease treatment in personalized medicine. This work gives an overview of recent computational methods for the detection of protein complexes and functional modules in protein-protein interaction networks, also providing a focus on some of its applications. We propose a systematic reformulation of frequently adopted taxonomies for these methods, also proposing new categories to keep up with the most recent research. We review the literature of the last five years (2017-2021) and provide links to existing data and software resources. Finally, we survey recent works exploiting module identification and analysis, in the context of a variety of disease processes for biomarker identification and therapeutic target detection. Our review provides the interested reader with an up-to-date and self-contained view of the existing research, with links to state-of-the-art literature and resources, as well as hints on open issues and future research directions in complex detection and its applications.
5
Wang R, Ma H, Wang C. An Ensemble Learning Framework for Detecting Protein Complexes From PPI Networks. Front Genet 2022; 13:839949. [PMID: 35281831 PMCID: PMC8908451 DOI: 10.3389/fgene.2022.839949] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/20/2021] [Accepted: 01/31/2022] [Indexed: 11/14/2022] Open
Abstract
Detecting protein complexes is one of the keys to understanding the principles of cellular organization and processes. With the development of high-throughput experiments and computing science, it has become possible to detect protein complexes by computational methods. However, most computational methods are based on either unsupervised learning or supervised learning. Unsupervised learning-based methods do not need training datasets, but they can only detect protein complexes with one or a few topological structures. Supervised learning-based methods can detect protein complexes with different topological structures. However, they are usually based on a single type of training model, and the generalization of a single model is poor. Therefore, we propose an Ensemble Learning Framework for Detecting Protein Complexes (ELF-DPC) within protein-protein interaction (PPI) networks to address these challenges. ELF-DPC first constructs the weighted PPI network by combining topological and biological information. Second, it mines protein complex cores using the protein complex core mining strategy we designed. Third, it obtains an ensemble learning model by integrating structural modularity and a trained voting regressor model. Finally, it extends the protein complex cores and forms protein complexes by a graph heuristic search strategy. The experimental results demonstrate that ELF-DPC performs better than twelve state-of-the-art approaches. Moreover, functional enrichment analysis illustrated that ELF-DPC could detect biologically meaningful protein complexes. The code/dataset is available for free download from https://github.com/RongquanWang/ELF-DPC.
Affiliation(s)
- Rongquan Wang
  - School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China
- Huimin Ma
  - School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China
  - *Correspondence: Huimin Ma,
- Caixia Wang
  - School of International Economics, China Foreign Affairs University, Beijing, China

6
Palukuri MV, Marcotte EM. Super.Complex: A supervised machine learning pipeline for molecular complex detection in protein-interaction networks. PLoS One 2022; 16:e0262056. [PMID: 34972161 PMCID: PMC8719692 DOI: 10.1371/journal.pone.0262056] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 10/13/2021] [Accepted: 12/15/2021] [Indexed: 12/12/2022] Open
Abstract
Characterization of protein complexes, i.e. sets of proteins assembling into a single larger physical entity, is important, as such assemblies play many essential roles in cells such as gene regulation. From networks of protein-protein interactions, potential protein complexes can be identified computationally through the application of community detection methods, which flag groups of entities interacting with each other in certain patterns. Most community detection algorithms tend to be unsupervised and assume that communities are dense network subgraphs, which is not always true, as protein complexes can exhibit diverse network topologies. The few existing supervised machine learning methods are serial and can potentially be improved in terms of accuracy and scalability by using better-suited machine learning models and parallel algorithms. Here, we present Super.Complex, a distributed, supervised AutoML-based pipeline for overlapping community detection in weighted networks. We also propose three new evaluation measures for the outstanding issue of comparing sets of learned and known communities satisfactorily. Super.Complex learns a community fitness function from known communities using an AutoML method and applies this fitness function to detect new communities. A heuristic local search algorithm finds maximally scoring communities, and a parallel implementation can be run on a computer cluster for scaling to large networks. On a yeast protein-interaction network, Super.Complex outperforms 6 other supervised and 4 unsupervised methods. Application of Super.Complex to a human protein-interaction network with ~8k nodes and ~60k edges yields 1,028 protein complexes, with 234 complexes linked to SARS-CoV-2, the COVID-19 virus, with 111 uncharacterized proteins present in 103 learned complexes. Super.Complex is generalizable with the ability to improve results by incorporating domain-specific features. 
Learned community characteristics can also be transferred from existing applications to detect communities in a new application with no known communities. Code and interactive visualizations of learned human protein complexes are freely available at: https://sites.google.com/view/supercomplex/super-complex-v3-0.
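The idea of scoring candidate communities with a fitness function and growing them by heuristic local search can be sketched as follows. This is a minimal illustration, not the Super.Complex implementation: the fitness here is plain edge density standing in for the learned AutoML model, the network is unweighted, and `build_adjacency`, `density`, and `grow_community` are hypothetical names.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def density(adj, members):
    """Stand-in fitness: fraction of possible edges present among `members`."""
    n = len(members)
    if n < 2:
        return 0.0
    # Each internal edge is seen from both endpoints, hence the division by 2.
    m = sum(1 for u in members for v in adj[u] if v in members) / 2
    return m / (n * (n - 1) / 2)


def grow_community(adj, seed, fitness=density, max_size=10):
    """Greedily add the neighbor that maximizes `fitness`.

    Growth stops when no neighbors remain, when `max_size` is reached,
    or when the best candidate would lower the score.
    """
    members = set(seed)
    score = fitness(adj, members)
    while len(members) < max_size:
        frontier = {v for u in members for v in adj[u]} - members
        if not frontier:
            break
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(frontier), key=lambda v: fitness(adj, members | {v}))
        best_score = fitness(adj, members | {best})
        if best_score < score:
            break
        members.add(best)
        score = best_score
    return members, score


# Toy network: a 4-clique (A, B, C, D) with a tail D-E-F.
toy_edges = [
    ("A", "B"), ("A", "C"), ("A", "D"),
    ("B", "C"), ("B", "D"), ("C", "D"),
    ("D", "E"), ("E", "F"),
]
adj = build_adjacency(toy_edges)
community, score = grow_community(adj, {"A", "B"})
```

Because the fitness function is passed in as a parameter, a trained model's scoring function could replace `density` without changing the search loop, and independent seeds could be grown in parallel.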
Affiliation(s)
- Meghana Venkata Palukuri
  - Oden Institute for Computational Engineering and Sciences, The University of Texas at Austin, Austin, Texas, United States of America
  - * E-mail: (MVP); (EMM)
- Edward M. Marcotte
  - Department of Molecular Biosciences, Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, Texas, United States of America
  - * E-mail: (MVP); (EMM)

7
Palukuri MV, Marcotte EM. Super.Complex: A supervised machine learning pipeline for molecular complex detection in protein-interaction networks. bioRxiv 2021. [PMID: 34189530 PMCID: PMC8240683 DOI: 10.1101/2021.06.22.449395] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
Abstract
Characterization of protein complexes, i.e. sets of proteins assembling into a single larger physical entity, is important, as such assemblies play many essential roles in cells such as gene regulation. From networks of protein-protein interactions, potential protein complexes can be identified computationally through the application of community detection methods, which flag groups of entities interacting with each other in certain patterns. Most community detection algorithms tend to be unsupervised and assume that communities are dense network subgraphs, which is not always true, as protein complexes can exhibit diverse network topologies. The few existing supervised machine learning methods are serial and can potentially be improved in terms of accuracy and scalability by using better-suited machine learning models and parallel algorithms. Here, we present Super.Complex, a distributed, supervised AutoML-based pipeline for overlapping community detection in weighted networks. We also propose three new evaluation measures for the outstanding issue of comparing sets of learned and known communities satisfactorily. Super.Complex learns a community fitness function from known communities using an AutoML method and applies this fitness function to detect new communities. A heuristic local search algorithm finds maximally scoring communities, and a parallel implementation can be run on a computer cluster for scaling to large networks. On a yeast protein-interaction network, Super.Complex outperforms 6 other supervised and 4 unsupervised methods. Application of Super.Complex to a human protein-interaction network with ~8k nodes and ~60k edges yields 1,028 protein complexes, with 234 complexes linked to SARS-CoV-2, the COVID-19 virus, with 111 uncharacterized proteins present in 103 learned complexes. Super.Complex is generalizable with the ability to improve results by incorporating domain-specific features. 
Learned community characteristics can also be transferred from existing applications to detect communities in a new application with no known communities. Code and interactive visualizations of learned human protein complexes are freely available at: https://sites.google.com/view/supercomplex/super-complex-v3-0.
8
Younis H, Anwar MW, Khan MUG, Sikandar A, Bajwa UI. A New Sequential Forward Feature Selection (SFFS) Algorithm for Mining Best Topological and Biological Features to Predict Protein Complexes from Protein-Protein Interaction Networks (PPINs). Interdiscip Sci 2021; 13:371-388. [PMID: 33959851 DOI: 10.1007/s12539-021-00433-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/20/2020] [Revised: 04/09/2021] [Accepted: 04/15/2021] [Indexed: 10/21/2022]
Abstract
Protein-protein interaction plays an important role in the understanding of biological processes in the body. A network of dynamic protein complexes within a cell that regulates most biological processes is known as a protein-protein interaction network (PPIN). Complex prediction from PPINs is a challenging task. Most previous computational approaches mine cliques, stars, linear and hybrid structures as complexes from PPINs by considering topological features, and fewer of them focus on the important biological information contained within the protein amino acid sequence. In this study, we have computed a wide variety of topological features and integrated them with biological features computed from the protein amino acid sequence, such as bag of words, physicochemical and spectral domain features. We propose a new Sequential Forward Feature Selection (SFFS) algorithm, i.e., random forest-based Boruta feature selection, for selecting the best features from the large computed feature set. Decision tree, linear discriminant analysis and gradient boosting classifiers are used as learners. We have conducted experiments by considering two reference protein complex datasets of yeast, i.e., CYC2008 and MIPS. Human and mouse complex information is taken from the CORUM 3.0 dataset. Protein interaction information is extracted from the Database of Interacting Proteins (DIP). Our proposed SFFS, i.e., random forest-based Boruta feature selection in combination with decision tree, linear discriminant analysis and gradient boosting classifiers, outperforms other state-of-the-art algorithms by achieving precision, recall and F-measure rates of 94.58%, 94.92% and 94.45% for MIPS; 96.31%, 93.55% and 96.02% for CYC2008; 98.84%, 98.00% and 98.87% for CORUM human; and 96.60%, 96.70% and 96.32% for CORUM mouse dataset complexes, respectively.
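For readers unfamiliar with sequential forward selection, the sketch below shows the generic greedy form of the technique: start from an empty set and repeatedly add the single feature that most improves a score. It is not the authors' random forest-based Boruta variant; the scoring function, the feature names, and their weights are hypothetical stand-ins for a cross-validated classifier score.

```python
def forward_select(features, score, min_gain=1e-9):
    """Generic greedy sequential forward selection.

    `score` maps a list of feature names to a number (higher is better).
    Selection stops when no remaining feature improves the score by
    more than `min_gain`.
    """
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining:
        # Score every one-feature extension of the current selection.
        trials = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(trials, key=lambda t: t[0])
        if top_score - best <= min_gain:
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


# Hypothetical per-feature contributions; in practice `score` would train
# and validate a classifier on the chosen feature subset.
toy_weights = {"density": 0.5, "degree": 0.3, "noise": -0.1}


def toy_score(subset):
    return sum(toy_weights[f] for f in subset)


chosen, final = forward_select(["noise", "degree", "density"], toy_score)
```

On this toy input the loop picks `density`, then `degree`, and stops before adding `noise`, since that feature would lower the score.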
Affiliation(s)
- Haseeb Younis
  - School of Professional Advancement, University of Management and Technology, Lahore, Pakistan
  - Department of Computer Science, COMSATS University Islamabad, Lahore, Pakistan
- Muhammad Usman Ghani Khan
  - Department of Computer Science and Engineering, University of Engineering and Technology, Lahore, Pakistan
- Aisha Sikandar
  - Govt. Girls Post Graduate College No.1 Abbottabad, Abbottabad, Pakistan
- Usama Ijaz Bajwa
  - Department of Computer Science, COMSATS University Islamabad, Lahore, Pakistan

9
Grbić M, Matić D, Kartelj A, Vračević S, Filipović V. A three-phase method for identifying functionally related protein groups in weighted PPI networks. Comput Biol Chem 2020; 86:107246. [PMID: 32339914 DOI: 10.1016/j.compbiolchem.2020.107246] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/11/2019] [Revised: 01/27/2020] [Accepted: 03/03/2020] [Indexed: 01/17/2023]
Abstract
Identifying significant protein groups is of great importance for further understanding protein functions. This paper introduces a novel three-phase heuristic method for identifying such groups in weighted PPI networks. In the first phase a variable neighborhood search (VNS) algorithm is applied on a weighted PPI network, in order to support protein complexes by adding a minimum number of new PPIs. In the second phase proteins from different complexes are merged into larger protein groups. In the third phase these groups are expanded by a number of 2-level neighbor proteins, favoring proteins that have higher average gene co-expression with the base group proteins. Experimental results show that: (i) the proposed VNS algorithm outperforms the existing approach described in literature and (ii) the above-mentioned three-phase method identifies protein groups with very high statistical significance.
Affiliation(s)
- Milana Grbić
  - University of Banjaluka, Faculty of Natural Sciences and Mathematics, Mladena Stojanovića 2, 78000 Banjaluka, Bosnia and Herzegovina.
- Dragan Matić
  - University of Banjaluka, Faculty of Natural Sciences and Mathematics, Mladena Stojanovića 2, 78000 Banjaluka, Bosnia and Herzegovina.
- Aleksandar Kartelj
  - University of Belgrade, Faculty of Mathematics, Studentski trg 16/IV 11 000, Belgrade, Serbia.
- Savka Vračević
  - University of Banjaluka, Faculty of Natural Sciences and Mathematics, Mladena Stojanovića 2, 78000 Banjaluka, Bosnia and Herzegovina.
- Vladimir Filipović
  - University of Belgrade, Faculty of Mathematics, Studentski trg 16/IV 11 000, Belgrade, Serbia.

10
Borgmann-Winter KE, Wang K, Bandyopadhyay S, Torshizi AD, Blair IA, Hahn CG. The proteome and its dynamics: A missing piece for integrative multi-omics in schizophrenia. Schizophr Res 2020; 217:148-161. [PMID: 31416743 PMCID: PMC7500806 DOI: 10.1016/j.schres.2019.07.025] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 04/09/2019] [Revised: 07/10/2019] [Accepted: 07/13/2019] [Indexed: 01/08/2023]
Abstract
The complex and heterogeneous pathophysiology of schizophrenia can be deconstructed by integration of large-scale datasets encompassing genes through behavioral phenotypes. Genome-wide datasets are now available for genetic, epigenetic and transcriptomic variations in schizophrenia, which are then analyzed by newly devised systems biology algorithms. A missing piece, however, is the inclusion of information on the proteome and its dynamics in schizophrenia. Proteomics has lagged behind omics of the genome, transcriptome and epigenome since analytic platforms were relatively less robust for proteins. There has been remarkable progress, however, in the instrumentation of liquid chromatography (LC) and mass spectrometry (MS) (LC-MS), experimental paradigms and bioinformatics of the proteome. Here, we present a summary of methodological innovations of recent years in MS-based proteomics and the power of new-generation proteomics, review proteomics studies that have been conducted in schizophrenia to date, and propose how such data can be analyzed and integrated with other omics results. The function of a protein is determined by multiple molecular properties, i.e., subcellular localization, posttranslational modifications (PTMs) and protein-protein interactions (PPIs). Incorporation of these properties poses additional challenges in proteomics and their integration with other omics, yet is a critical next step to close the loop of multi-omics integration. In sum, the recent advent of high-throughput proteome characterization technologies and novel mathematical approaches enables us to incorporate functional properties of the proteome to offer a comprehensive multi-omics based understanding of schizophrenia pathophysiology.
Affiliation(s)
- Karin E Borgmann-Winter
  - Department of Psychiatry, University of Pennsylvania, Philadelphia, PA 19104-3403, United States of America
  - Department of Child and Adolescent Psychiatry and Behavioral Sciences, Children's Hospital of Philadelphia, Philadelphia, PA 19104, United States of America
- Kai Wang
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, United States of America
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, United States of America
- Sabyasachi Bandyopadhyay
  - Department of Psychiatry, University of Pennsylvania, Philadelphia, PA 19104-3403, United States of America
- Abolfazl Doostparast Torshizi
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, United States of America
- Ian A Blair
  - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, United States of America
- Chang-Gyu Hahn
  - Department of Psychiatry, University of Pennsylvania, Philadelphia, PA 19104-3403, United States of America.