1
|
Chen S, Mira A, Onnela JP. Flexible model selection for mechanistic network models. JOURNAL OF COMPLEX NETWORKS 2020; 8:cnz024. [PMID: 32765880 PMCID: PMC7391990 DOI: 10.1093/comnet/cnz024] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/25/2019] [Accepted: 06/24/2019] [Indexed: 05/25/2023]
Abstract
Network models are applied across many domains where data can be represented as a network. Two prominent paradigms for modelling networks are statistical models (probabilistic models for the observed network) and mechanistic models (models for network growth and/or evolution). Mechanistic models are better suited for incorporating domain knowledge, to study effects of interventions (such as changes to specific mechanisms) and to forward simulate, but they typically have intractable likelihoods. As such, and in a stark contrast to statistical models, there is a relative dearth of research on model selection for such models despite the otherwise large body of extant work. In this article, we propose a simulator-based procedure for mechanistic network model selection that borrows aspects from Approximate Bayesian Computation along with a means to quantify the uncertainty in the selected model. To select the most suitable network model, we consider and assess the performance of several learning algorithms, most notably the so-called Super Learner, which makes our framework less sensitive to the choice of a particular learning algorithm. Our approach takes advantage of the ease to forward simulate from mechanistic network models to circumvent their intractable likelihoods. The overall process is flexible and widely applicable. Our simulation results demonstrate the approach's ability to accurately discriminate between competing mechanistic models. Finally, we showcase our approach with a protein-protein interaction network model from the literature for yeast (Saccharomyces cerevisiae).
Collapse
Affiliation(s)
- Sixing Chen
- Department of Biostatistics, T.H. Chan School of Public Health, Harvard University 655 Huntington Avenue, Building 2, 4th Floor, Boston, MA 02115, USA
| | - Antonietta Mira
- Data Science Lab, Institute of Computational Science, Università della Svizzera italiana Via Buffi 6, 6900 Lugano, Switzerland and Dipartimento di Scienza e Alta Tecnologia, Università degli Studi dell'Insubria Via Valleggio, 11 - 22100 Como, Italy
| | | |
Collapse
|
2
|
Boyen P, Neven F, van Dyck D, Valentim FL, van Dijk ADJ. Mining minimal motif pair sets maximally covering interactions in a protein-protein interaction network. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:73-86. [PMID: 23702545 DOI: 10.1109/tcbb.2012.165] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/02/2023]
Abstract
Correlated motif covering (CMC) is the problem of finding a set of motif pairs, i.e., pairs of patterns, in the sequences of proteins from a protein-protein interaction network (PPI-network) that describe the interactions in the network as concisely as possible. In other words, a perfect solution for CMC would be a minimal set of motif pairs that describes the interaction behavior perfectly in the sense that two proteins from the network interact if and only if their sequences match a motif pair in the minimal set. In this paper, we introduce and formally define CMC and show that it is closely related to the red-blue set cover (RBSC) problem and its weighted version (WRBSC)--both well-known NP-hard problems for that there exist several algorithms with known approximation factor guarantees. We prove the hardness of approximation of CMC by providing an approximation factor preserving reduction from RBSC to CMC. We show the existence of a theoretical approximation algorithm for CMC by providing an approximation factor preserving reduction from CMC to WRBSC. We adapt the latter algorithm into a functional heuristic for CMC, called CMC-approx, and experimentally assess its performance and biological relevance. The implementation in Java can be found at >http://bioinformatics.uhasselt.be.
Collapse
Affiliation(s)
- Peter Boyen
- Hasselt University and Transnational University of Limburg, Agoralaan, Diepenbeek, Belgium.
| | | | | | | | | |
Collapse
|
3
|
Abstract
Background Identifying the location of binding sites on proteins is of fundamental importance for a wide range of applications including molecular docking, de novo drug design, structure identification and comparison of functional sites. Structural genomic projects are beginning to produce protein structures with unknown functions. Therefore, efficient methods are required if all these structures are to be properly annotated. Lots of methods for finding binding sites involve 3D structure comparison. Here we design a method to find protein binding sites by direct comparison of protein 3D structures. Results We have developed an efficient heuristic approach for finding similar binding sites from the surface of given proteins. Our approach consists of three steps: local sequence alignment, protein surface detection, and 3D structures comparison. We implement the algorithm and produce a software package that works well in practice. When comparing a complete protein with all complete protein structures in the PDB database, experiments show that the average recall value of our approach is 82% and the average precision value of our approach is also significantly better than the existing approaches. Conclusions Our program has much higher recall values than those existing programs. Experiments show that all the existing approaches have recall values less than 50%. This implies that more than 50% of real binding sites cannot be reported by those existing approaches. The software package is available at http://sites.google.com/site/guofeics/bsfinder.
Collapse
Affiliation(s)
- Fei Guo
- Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong
| | | |
Collapse
|
4
|
Chang WC, Vakati S, Krause R, Eulenstein O. Exploring biological interaction networks with tailored weighted quasi-bicliques. BMC Bioinformatics 2012; 13 Suppl 10:S16. [PMID: 22759421 PMCID: PMC3314588 DOI: 10.1186/1471-2105-13-s10-s16] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND Biological networks provide fundamental insights into the functional characterization of genes and their products, the characterization of DNA-protein interactions, the identification of regulatory mechanisms, and other biological tasks. Due to the experimental and biological complexity, their computational exploitation faces many algorithmic challenges. RESULTS We introduce novel weighted quasi-biclique problems to identify functional modules in biological networks when represented by bipartite graphs. In difference to previous quasi-biclique problems, we include biological interaction levels by using edge-weighted quasi-bicliques. While we prove that our problems are NP-hard, we also describe IP formulations to compute exact solutions for moderately sized networks. CONCLUSIONS We verify the effectiveness of our IP solutions using both simulation and empirical data. The simulation shows high quasi-biclique recall rates, and the empirical data corroborate the abilities of our weighted quasi-bicliques in extracting features and recovering missing interactions from biological networks.
Collapse
Affiliation(s)
- Wen-Chieh Chang
- Department of Computer Science, Iowa State University, Ames, IA, 50011, USA
| | - Sudheer Vakati
- Department of Computer Science, Iowa State University, Ames, IA, 50011, USA
| | - Roland Krause
- Department of Computer Science, Free University of Berlin, 14195 Berlin, Germany
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany
- Current address: Luxembourg Centre for Systems Biology, University of Luxembourg, L-4362 Esch-sur-Alzette, Luxembourg
| | - Oliver Eulenstein
- Department of Computer Science, Iowa State University, Ames, IA, 50011, USA
| |
Collapse
|
5
|
Goh WWB, Lee YH, Chung M, Wong L. How advancement in biological network analysis methods empowers proteomics. Proteomics 2012; 12:550-63. [PMID: 22247042 DOI: 10.1002/pmic.201100321] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2011] [Revised: 09/05/2011] [Accepted: 09/13/2011] [Indexed: 12/23/2022]
Abstract
Proteomics provides important information--that may not be inferable from indirect sources such as RNA or DNA--on key players in biological systems or disease states. However, it suffers from coverage and consistency problems. The advent of network-based analysis methods can help in overcoming these problems but requires careful application and interpretation. This review considers briefly current trends in proteomics technologies and understanding the causes of critical issues that need to be addressed--i.e., incomplete data coverage and inter-sample inconsistency. On the coverage issue, we argue that holistic analysis based on biological networks provides a suitable background on which more robust models and interpretations can be built upon; and we introduce some recently developed approaches. On consistency, group-based approaches based on identified clusters, as well as on properly integrated pathway databases, are particularly useful. Despite that protein interactions and pathway networks are still largely incomplete, given proper quality checks, applications and reasonably sized data sets, they yield valuable insights that greatly complement data generated from quantitative proteomics.
Collapse
|
6
|
Wong L. Using Biological Networks in Protein Function Prediction and Gene Expression Analysis. ACTA ACUST UNITED AC 2011. [DOI: 10.1080/15427951.2011.604561] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/15/2022]
|
7
|
Guo F, Li SC, Wang L. Protein-protein binding sites prediction by 3D structural similarities. J Chem Inf Model 2011; 51:3287-94. [PMID: 22077765 DOI: 10.1021/ci200206n] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
Identifying the location of binding sites on proteins is of fundamental importance for a wide range of applications including molecular docking, de novo drug design, structure identification, and comparison of functional sites. In this paper, we develop an efficient approach for finding binding sites between proteins. Our approach consists of four steps: local sequence alignment, protein surface detection, 3D structure comparison, and candidate binding site selection. A comparison of our method with the LSA algorithm shows that the binding sites predicted by our method are somewhat closer to the actual binding sites in the protein-protein complexes. The software package is available at http://sites.google.com/site/guofeics/pro-bs for noncommercial use.
Collapse
Affiliation(s)
- Fei Guo
- School of Computer Science and Technology, Shandong University, Jinan 250101, Shandong, China
| | | | | |
Collapse
|
8
|
LEUNG HENRYCHIMING, SIU MANHUNG, YIU SIUMING, CHIN FRANCISYUKLUN, SUNG KENWINGKIN. CLUSTERING-BASED APPROACH FOR PREDICTING MOTIF PAIRS FROM PROTEIN INTERACTION DATA. J Bioinform Comput Biol 2011; 7:701-16. [DOI: 10.1142/s0219720009004266] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2008] [Revised: 12/14/2008] [Accepted: 02/22/2009] [Indexed: 11/18/2022]
Abstract
Predicting motif pairs from a set of protein sequences based on the protein–protein interaction data is an important, but difficult computational problem. Tan et al. proposed a solution to this problem. However, the scoring function (using χ2 testing) used in their approach is not adequate and their approach is also not scalable. It may take days to process a set of 5000 protein sequences with about 20,000 interactions. Later, Leung et al. proposed an improved scoring function and faster algorithms for solving the same problem. But, the model used in Leung et al. is complicated. The exact value of the scoring function is not easy to compute and an estimated value is used in practice. In this paper, we derive a better model to capture the significance of a given motif pair based on a clustering notion. We develop a fast heuristic algorithm to solve the problem. The algorithm is able to locate the correct motif pair in the yeast data set in about 45 minutes for 5000 protein sequences and 20,000 interactions. Moreover, we derive a lower bound result for the p-value of a motif pair in order for it to be distinguishable from random motif pairs. The lower bound result has been verified using simulated data sets. Availability:
Collapse
Affiliation(s)
- HENRY CHI-MING LEUNG
- Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong, China
| | - MAN-HUNG SIU
- Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong, China
| | - SIU-MING YIU
- Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong, China
| | - FRANCIS YUK-LUN CHIN
- Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong, China
| | - KEN WING-KIN SUNG
- Department of Computer Science, National University of Singapore, Singapore
| |
Collapse
|
9
|
Schweiger R, Linial M, Linial N. Generative probabilistic models for protein-protein interaction networks--the biclique perspective. ACTA ACUST UNITED AC 2011; 27:i142-8. [PMID: 21685063 PMCID: PMC3117378 DOI: 10.1093/bioinformatics/btr201] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
Motivation: Much of the large-scale molecular data from living cells can be represented in terms of networks. Such networks occupy a central position in cellular systems biology. In the protein–protein interaction (PPI) network, nodes represent proteins and edges represent connections between them, based on experimental evidence. As PPI networks are rich and complex, a mathematical model is sought to capture their properties and shed light on PPI evolution. The mathematical literature contains various generative models of random graphs. It is a major, still largely open question, which of these models (if any) can properly reproduce various biologically interesting networks. Here, we consider this problem where the graph at hand is the PPI network of Saccharomyces cerevisiae. We are trying to distinguishing between a model family which performs a process of copying neighbors, represented by the duplication–divergence (DD) model, and models which do not copy neighbors, with the Barabási–Albert (BA) preferential attachment model as a leading example. Results: The observed property of the network is the distribution of maximal bicliques in the graph. This is a novel criterion to distinguish between models in this area. It is particularly appropriate for this purpose, since it reflects the graph's growth pattern under either model. This test clearly favors the DD model. In particular, for the BA model, the vast majority (92.9%) of the bicliques with both sides ≥4 must be already embedded in the model's seed graph, whereas the corresponding figure for the DD model is only 5.1%. Our results, based on the biclique perspective, conclusively show that a naïve unmodified DD model can capture a key aspect of PPI networks. Contact:regevs01@cs.huji.ac.il; michall@cc.huji.ac.il; nati@cs.huji.ac.il Supplementary information:Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Regev Schweiger
- School of Computer Science and Engineering, Department of Biological Chemistry, The Alexander Silberman Institute of Life Sciences and The Sudarsky Center for Computational Biology, The Hebrew University, Jerusalem, 91904 Israel.
| | | | | |
Collapse
|
10
|
Boyen P, Van Dyck D, Neven F, van Ham RCHJ, van Dijk ADJ. SLIDER: a generic metaheuristic for the discovery of correlated motifs in protein-protein interaction networks. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 8:1344-1357. [PMID: 21282865 DOI: 10.1109/tcbb.2011.17] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/30/2023]
Abstract
Correlated motif mining (cmm) is the problem of finding overrepresented pairs of patterns, called motifs, in sequences of interacting proteins. Algorithmic solutions for cmm thereby provide a computational method for predicting binding sites for protein interaction. In this paper, we adopt a motif-driven approach where the support of candidate motif pairs is evaluated in the network. We experimentally establish the superiority of the Chi-square-based support measure over other support measures. Furthermore, we obtain that cmm is an np-hard problem for a large class of support measures (including Chi-square) and reformulate the search for correlated motifs as a combinatorial optimization problem. We then present the generic metaheuristic slider which uses steepest ascent with a neighborhood function based on sliding motifs and employs the Chi-square-based support measure. We show that slider outperforms existing motif-driven cmm methods and scales to large protein-protein interaction networks. The slider-implementation and the data used in the experiments are available on http://bioinformatics.uhasselt.be.
Collapse
Affiliation(s)
- Peter Boyen
- Hasselt University, Agoralaan Gebouw D, B-3590 Diepenbeek, Belgium.
| | | | | | | | | |
Collapse
|
11
|
Sarmady M, Dampier W, Tozeren A. HIV protein sequence hotspots for crosstalk with host hub proteins. PLoS One 2011; 6:e23293. [PMID: 21858059 PMCID: PMC3156123 DOI: 10.1371/journal.pone.0023293] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/19/2011] [Accepted: 07/12/2011] [Indexed: 11/18/2022] Open
Abstract
HIV proteins target host hub proteins for transient binding interactions. The presence of viral proteins in the infected cell results in out-competition of host proteins in their interaction with hub proteins, drastically affecting cell physiology. Functional genomics and interactome datasets can be used to quantify the sequence hotspots on the HIV proteome mediating interactions with host hub proteins. In this study, we used the HIV and human interactome databases to identify HIV targeted host hub proteins and their host binding partners (H2). We developed a high throughput computational procedure utilizing motif discovery algorithms on sets of protein sequences, including sequences of HIV and H2 proteins. We identified as HIV sequence hotspots those linear motifs that are highly conserved on HIV sequences and at the same time have a statistically enriched presence on the sequences of H2 proteins. The HIV protein motifs discovered in this study are expressed by subsets of H2 host proteins potentially outcompeted by HIV proteins. A large subset of these motifs is involved in cleavage, nuclear localization, phosphorylation, and transcription factor binding events. Many such motifs are clustered on an HIV sequence in the form of hotspots. The sequential positions of these hotspots are consistent with the curated literature on phenotype altering residue mutations, as well as with existing binding site data. The hotspot map produced in this study is the first global portrayal of HIV motifs involved in altering the host protein network at highly connected hub nodes.
Collapse
MESH Headings
- Amino Acid Motifs/genetics
- Amino Acid Sequence
- Binding Sites/genetics
- CREB-Binding Protein/metabolism
- Calcium-Calmodulin-Dependent Protein Kinases/metabolism
- Calmodulin/metabolism
- Casein Kinase II/metabolism
- Databases, Protein
- Human Immunodeficiency Virus Proteins/chemistry
- Human Immunodeficiency Virus Proteins/genetics
- Human Immunodeficiency Virus Proteins/metabolism
- Humans
- Hydrophobic and Hydrophilic Interactions
- Mitogen-Activated Protein Kinase 1/metabolism
- Models, Molecular
- Protein Binding
- Protein Interaction Mapping/methods
- Protein Structure, Secondary
- Protein Structure, Tertiary
- Proteins/genetics
- Proteins/metabolism
- env Gene Products, Human Immunodeficiency Virus/chemistry
- env Gene Products, Human Immunodeficiency Virus/genetics
- env Gene Products, Human Immunodeficiency Virus/metabolism
- gag Gene Products, Human Immunodeficiency Virus/chemistry
- gag Gene Products, Human Immunodeficiency Virus/genetics
- gag Gene Products, Human Immunodeficiency Virus/metabolism
- nef Gene Products, Human Immunodeficiency Virus/chemistry
- nef Gene Products, Human Immunodeficiency Virus/genetics
- nef Gene Products, Human Immunodeficiency Virus/metabolism
- rev Gene Products, Human Immunodeficiency Virus/chemistry
- rev Gene Products, Human Immunodeficiency Virus/genetics
- rev Gene Products, Human Immunodeficiency Virus/metabolism
- tat Gene Products, Human Immunodeficiency Virus/chemistry
- tat Gene Products, Human Immunodeficiency Virus/genetics
- tat Gene Products, Human Immunodeficiency Virus/metabolism
Collapse
Affiliation(s)
- Mahdi Sarmady
- Center for Integrated Bioinformatics, Drexel University, Philadelphia, Pennsylvania, United States of America
| | - William Dampier
- Center for Integrated Bioinformatics, Drexel University, Philadelphia, Pennsylvania, United States of America
| | - Aydin Tozeren
- Center for Integrated Bioinformatics, Drexel University, Philadelphia, Pennsylvania, United States of America
- * E-mail:
| |
Collapse
|
12
|
Sequence- and interactome-based prediction of viral protein hotspots targeting host proteins: a case study for HIV Nef. PLoS One 2011; 6:e20735. [PMID: 21738584 PMCID: PMC3125164 DOI: 10.1371/journal.pone.0020735] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/19/2011] [Accepted: 05/08/2011] [Indexed: 01/03/2023] Open
Abstract
Virus proteins alter protein pathways of the host toward the synthesis of viral particles by breaking and making edges via binding to host proteins. In this study, we developed a computational approach to predict viral sequence hotspots for binding to host proteins based on sequences of viral and host proteins and literature-curated virus-host protein interactome data. We use a motif discovery algorithm repeatedly on collections of sequences of viral proteins and immediate binding partners of their host targets and choose only those motifs that are conserved on viral sequences and highly statistically enriched among binding partners of virus protein targeted host proteins. Our results match experimental data on binding sites of Nef to host proteins such as MAPK1, VAV1, LCK, HCK, HLA-A, CD4, FYN, and GNB2L1 with high statistical significance but is a poor predictor of Nef binding sites on highly flexible, hoop-like regions. Predicted hotspots recapture CD8 cell epitopes of HIV Nef highlighting their importance in modulating virus-host interactions. Host proteins potentially targeted or outcompeted by Nef appear crowding the T cell receptor, natural killer cell mediated cytotoxicity, and neurotrophin signaling pathways. Scanning of HIV Nef motifs on multiple alignments of hepatitis C protein NS5A produces results consistent with literature, indicating the potential value of the hotspot discovery in advancing our understanding of virus-host crosstalk.
Collapse
|
13
|
Liu X, Li J, Wang L. Modeling protein interacting groups by quasi-bicliques: complexity, algorithm, and application. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2010; 7:354-364. [PMID: 20431154 DOI: 10.1109/tcbb.2008.61] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/29/2023]
Abstract
UNLABELLED Protein-protein interactions (PPIs) are one of the most important mechanisms in cellular processes. To model protein interaction sites, recent studies have suggested to find interacting protein group pairs from large PPI networks at the first step and then to search conserved motifs within the protein groups to form interacting motif pairs. To consider the noise effect and the incompleteness of biological data, we propose to use quasi-bicliques for finding interacting protein group pairs. We investigate two new problems that arise from finding interacting protein group pairs: the maximum vertex quasi-biclique problem and the maximum balanced quasi-biclique problem. We prove that both problems are NP-hard. This is a surprising result as the widely known maximum vertex biclique problem is polynomial time solvable [1]. We then propose a heuristic algorithm that uses the greedy method to find the quasi-bicliques from PPI networks. Our experiment results on real data show that this algorithm has a better performance than a benchmark algorithm for identifying highly matched BLOCKS and PRINTS motifs. We also report results of two case studies on interacting motif pairs that map well with two interacting domain pairs in iPfam. AVAILABILITY The software and supplementary information are available at http://www.cs.cityu.edu.hk/~lwang/software/ppi/index.html.
Collapse
Affiliation(s)
- Xiaowen Liu
- Department of Computer Science, City University of Hong Kong and Department of Computer Science and Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA.
| | | | | |
Collapse
|
14
|
Hugo W, Song F, Aung Z, Ng SK, Sung WK. SLiM on Diet: finding short linear motifs on domain interaction interfaces in Protein Data Bank. Bioinformatics 2010; 26:1036-42. [DOI: 10.1093/bioinformatics/btq065] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
|
15
|
Park Y. Critical assessment of sequence-based protein-protein interaction prediction methods that do not require homologous protein sequences. BMC Bioinformatics 2009; 10:419. [PMID: 20003442 PMCID: PMC2803199 DOI: 10.1186/1471-2105-10-419] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2009] [Accepted: 12/14/2009] [Indexed: 11/10/2022] Open
Abstract
Background Protein-protein interactions underlie many important biological processes. Computational prediction methods can nicely complement experimental approaches for identifying protein-protein interactions. Recently, a unique category of sequence-based prediction methods has been put forward - unique in the sense that it does not require homologous protein sequences. This enables it to be universally applicable to all protein sequences unlike many of previous sequence-based prediction methods. If effective as claimed, these new sequence-based, universally applicable prediction methods would have far-reaching utilities in many areas of biology research. Results Upon close survey, I realized that many of these new methods were ill-tested. In addition, newer methods were often published without performance comparison with previous ones. Thus, it is not clear how good they are and whether there are significant performance differences among them. In this study, I have implemented and thoroughly tested 4 different methods on large-scale, non-redundant data sets. It reveals several important points. First, significant performance differences are noted among different methods. Second, data sets typically used for training prediction methods appear significantly biased, limiting the general applicability of prediction methods trained with them. Third, there is still ample room for further developments. In addition, my analysis illustrates the importance of complementary performance measures coupled with right-sized data sets for meaningful benchmark tests. Conclusions The current study reveals the potentials and limits of the new category of sequence-based protein-protein interaction prediction methods, which in turn provides a firm ground for future endeavours in this important area of contemporary bioinformatics.
Collapse
Affiliation(s)
- Yungki Park
- Institute of Cellular and Molecular Biology (MBB 3 210B), Center for Systems and Synthetic Biology, University of Texas at Austin, 2500 Speedway, Austin, Texas, USA.
| |
Collapse
|
16
|
Sim K, Li J, Gopalkrishnan V, Liu G. Mining maximal quasi-bicliques: Novel algorithm and applications in the stock market and protein networks. Stat Anal Data Min 2009. [DOI: 10.1002/sam.10051] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
|
17
|
Andreopoulos B, Winter C, Labudde D, Schroeder M. Triangle network motifs predict complexes by complementing high-error interactomes with structural information. BMC Bioinformatics 2009; 10:196. [PMID: 19558694 PMCID: PMC2714575 DOI: 10.1186/1471-2105-10-196] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2009] [Accepted: 06/27/2009] [Indexed: 11/30/2022] Open
Abstract
Background A lot of high-throughput studies produce protein-protein interaction networks (PPINs) with many errors and missing information. Even for genome-wide approaches, there is often a low overlap between PPINs produced by different studies. Second-level neighbors separated by two protein-protein interactions (PPIs) were previously used for predicting protein function and finding complexes in high-error PPINs. We retrieve second level neighbors in PPINs, and complement these with structural domain-domain interactions (SDDIs) representing binding evidence on proteins, forming PPI-SDDI-PPI triangles. Results We find low overlap between PPINs, SDDIs and known complexes, all well below 10%. We evaluate the overlap of PPI-SDDI-PPI triangles with known complexes from Munich Information center for Protein Sequences (MIPS). PPI-SDDI-PPI triangles have ~20 times higher overlap with MIPS complexes than using second-level neighbors in PPINs without SDDIs. The biological interpretation for triangles is that a SDDI causes two proteins to be observed with common interaction partners in high-throughput experiments. The relatively few SDDIs overlapping with PPINs are part of highly connected SDDI components, and are more likely to be detected in experimental studies. We demonstrate the utility of PPI-SDDI-PPI triangles by reconstructing myosin-actin processes in the nucleus, cytoplasm, and cytoskeleton, which were not obvious in the original PPIN. Using other complementary datatypes in place of SDDIs to form triangles, such as PubMed co-occurrences or threading information, results in a similar ability to find protein complexes. Conclusion Given high-error PPINs with missing information, triangles of mixed datatypes are a promising direction for finding protein complexes. Integrating PPINs with SDDIs improves finding complexes. Structural SDDIs partially explain the high functional similarity of second-level neighbors in PPINs. We estimate that relatively little structural information would be sufficient for finding complexes involving most of the proteins and interactions in a typical PPIN.
Collapse
Affiliation(s)
- Bill Andreopoulos
- Biotechnology Center (BIOTEC), Technische Universität Dresden, 01307 Dresden, Germany.
| | | | | | | |
Collapse
|
18
|
Chua HN, Hugo W, Liu G, Li X, Wong L, Ng SK. A Probabilistic Graph-Theoretic Approach to Integrate Multiple Predictions for the Protein-Protein Subnetwork Prediction Challenge. Ann N Y Acad Sci 2009; 1158:224-33. [DOI: 10.1111/j.1749-6632.2008.03760.x] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
|
19
|
Li J, Liu Q. 'Double water exclusion': a hypothesis refining the O-ring theory for the hot spots at protein interfaces. Bioinformatics 2009; 25:743-50. [PMID: 19179356 PMCID: PMC2654803 DOI: 10.1093/bioinformatics/btp058] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open
Abstract
Motivation: The O-ring theory reveals that the binding hot spot at a protein interface is surrounded by a ring of residues that are energetically less important than the residues in the hot spot. As this ring of residues is served to occlude water molecules from the hot spot, the O-ring theory is also called ‘water exclusion’ hypothesis. We propose a ‘double water exclusion’ hypothesis to refine the O-ring theory by assuming the hot spot itself is water-free. To computationally model a water-free hot spot, we use a biclique pattern that is defined as two maximal groups of residues from two chains in a protein complex holding the property that every residue contacts with all residues in the other group. Methods and Results: Given a chain pair A and B of a protein complex from the Protein Data Bank (PDB), we calculate the interatomic distance of all possible pairs of atoms between A and B. We then represent A and B as a bipartite graph based on these distance information. Maximal biclique subgraphs are subsequently identified from all of the bipartite graphs to locate biclique patterns at the interfaces. We address two properties of biclique patterns: a non-redundant occurrence in PDB, and a correspondence with hot spots when the solvent-accessible surface area (SASA) of a biclique pattern in the complex form is small. A total of 1293 biclique patterns are discovered which have a non-redundant occurrence of at least five, and which each have a minimum two and four residues at the two sides. Through extensive queries to the HotSprint and ASEdb databases, we verified that biclique patterns are rich of true hot residues. Our algorithm and results provide a new way to identify hot spots by examining proteins' structural data. Availability: The biclique mining algorithm is available at http://www.ntu.edu.sg/home/jyli/dwe.html. Contact:jyli@ntu.edu.sg Supplementary information:Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Jinyan Li
- Bioinformatics Research Center, School of Computer Engineering, Nanyang Technological University, Singapore 639798.
| | | |
Collapse
|
20
|
Chua HN, Wong L. Increasing the reliability of protein interactomes. Drug Discov Today 2008; 13:652-8. [DOI: 10.1016/j.drudis.2008.05.004] [Citation(s) in RCA: 54] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2008] [Revised: 05/09/2008] [Accepted: 05/13/2008] [Indexed: 11/28/2022]
|
21
|
Royer L, Reimann M, Andreopoulos B, Schroeder M. Unraveling protein networks with power graph analysis. PLoS Comput Biol 2008; 4:e1000108. [PMID: 18617988 PMCID: PMC2424176 DOI: 10.1371/journal.pcbi.1000108] [Citation(s) in RCA: 96] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2007] [Accepted: 05/29/2008] [Indexed: 11/28/2022] Open
Abstract
Networks play a crucial role in computational biology, yet their analysis and representation is still an open problem. Power Graph Analysis is a lossless transformation of biological networks into a compact, less redundant representation, exploiting the abundance of cliques and bicliques as elementary topological motifs. We demonstrate with five examples the advantages of Power Graph Analysis. Investigating protein-protein interaction networks, we show how the catalytic subunits of the casein kinase II complex are distinguishable from the regulatory subunits, how interaction profiles and sequence phylogeny of SH3 domains correlate, and how false positive interactions among high-throughput interactions are spotted. Additionally, we demonstrate the generality of Power Graph Analysis by applying it to two other types of networks. We show how power graphs induce a clustering of both transcription factors and target genes in bipartite transcription networks, and how the erosion of a phosphatase domain in type 22 non-receptor tyrosine phosphatases is detected. We apply Power Graph Analysis to high-throughput protein interaction networks and show that up to 85% (56% on average) of the information is redundant. Experimental networks are more compressible than rewired ones of same degree distribution, indicating that experimental networks are rich in cliques and bicliques. Power Graphs are a novel representation of networks, which reduces network complexity by explicitly representing re-occurring network motifs. Power Graphs compress up to 85% of the edges in protein interaction networks and are applicable to all types of networks such as protein interactions, regulatory networks, or homology networks. Networks play a crucial role in biology and are often used as a way to represent experimental results. Yet, their analysis and representation is still an open problem. Recent experimental and computational progress yields networks of increased size and complexity. There are, for example, small- and large-scale interaction networks, regulatory networks, genetic networks, protein-ligand interaction networks, and homology networks analyzed and published regularly. A common way to access the information in a network is though direct visualization, but this fails as it often just results in “fur balls” from which little insight can be gathered. On the other hand, clustering techniques manage to avoid the problems caused by the large number of nodes and even larger number of edges by coarse-graining the networks and thus abstracting details. But these also fail, since, in fact, much of the biology lies in the details. This work presents a novel methodology for analyzing and representing networks. Power Graphs are a lossless representation of networks, which reduces network complexity by explicitly representing re-occurring network motifs. Moreover, power graphs can be clearly visualized: they compress up to 90% of the edges in biological networks and are applicable to all types of networks such as protein interaction, regulatory networks, or homology networks.
Collapse
Affiliation(s)
- Loïc Royer
- Biotechnology Center, Technische Universität Dresden, Dresden, Germany
| | | | | | | |
Collapse
|
22
|
Guo J, Wu X, Zhang DY, Lin K. Genome-wide inference of protein interaction sites: lessons from the yeast high-quality negative protein-protein interaction dataset. Nucleic Acids Res 2008; 36:2002-11. [PMID: 18281313 PMCID: PMC2346601 DOI: 10.1093/nar/gkn016] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
High-throughput studies of protein interactions may have produced, experimentally and computationally, the most comprehensive protein–protein interaction datasets in the completely sequenced genomes. It provides us an opportunity on a proteome scale, to discover the underlying protein interaction patterns. Here, we propose an approach to discovering motif pairs at interaction sites (often 3–8 residues) that are essential for understanding protein functions and helpful for the rational design of protein engineering and folding experiments. A gold standard positive (interacting) dataset and a gold standard negative (non-interacting) dataset were mined to infer the interacting motif pairs that are significantly overrepresented in the positive dataset compared to the negative dataset. Four negative datasets assembled by different strategies were evaluated and the one with the best performance was used as the gold standard negatives for further analysis. Meanwhile, to assess the efficiency of our method in detecting potential interacting motif pairs, other approaches developed previously were compared, and we found that our method achieved the highest prediction accuracy. In addition, many uncharacterized motif pairs of interest were found to be functional with experimental evidence in other species. This investigation demonstrates the important effects of a high-quality negative dataset on the performance of such statistical inference.
Collapse
Affiliation(s)
- Jie Guo
- MOE Key Laboratory for Biodiversity Science and Ecological Engineering and College of Life Sciences, Beijing Normal University, Beijing 100875, China
| | | | | | | |
Collapse
|
23
|
van Dijk ADJ, ter Braak CJF, Immink RG, Angenent GC, van Ham RCHJ. Predicting and understanding transcription factor interactions based on sequence level determinants of combinatorial control. ACTA ACUST UNITED AC 2007; 24:26-33. [PMID: 18024974 DOI: 10.1093/bioinformatics/btm539] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Abstract
MOTIVATION Transcription factor interactions are the cornerstone of combinatorial control, which is a crucial aspect of the gene regulatory system. Understanding and predicting transcription factor interactions based on their sequence alone is difficult since they are often part of families of factors sharing high sequence identity. Given the scarcity of experimental data on interactions compared to available sequence data, however, it would be most useful to have accurate methods for the prediction of such interactions. RESULTS We present a method consisting of a Random Forest-based feature-selection procedure that selects relevant motifs out of a set found using a correlated motif search algorithm. Prediction accuracy for several transcription factor families (bZIP, MADS, homeobox and forkhead) reaches 60-90%. In addition, we identified those parts of the sequence that are important for the interaction specificity, and show that these are in agreement with available data. We also used the predictors to perform genome-wide scans for interaction partners and recovered both known and putative new interaction partners.
Collapse
Affiliation(s)
- A D J van Dijk
- Applied Bioinformatics, PRI, Wageningen UR, Droevendaalsesteeg 1, Wageningen, The Netherlands
| | | | | | | | | |
Collapse
|
24
|
Devos D, Russell RB. A more complete, complexed and structured interactome. Curr Opin Struct Biol 2007; 17:370-7. [PMID: 17574831 DOI: 10.1016/j.sbi.2007.05.011] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2007] [Revised: 04/18/2007] [Accepted: 05/31/2007] [Indexed: 11/16/2022]
Abstract
Multiprotein complexes are key players in virtually all important cellular processes. The past year has seen the publication of several papers that have illuminated what we know about the number and composition of these molecular machines, using high-throughput purification methods. Other studies have illuminated structural and functional aspects of protein interactions, networks and molecular assemblies. As a result, we have a more complete view of how many complexes are in living systems, what they look like and the roles they play in the cell.
Collapse
Affiliation(s)
- Damien Devos
- EMBL, Meyerhofstrasse 1, 69117 Heidelberg, Germany
| | | |
Collapse
|
25
|
Andreopoulos B, An A, Wang X, Faloutsos M, Schroeder M. Clustering by common friends finds locally significant proteins mediating modules. Bioinformatics 2007; 23:1124-31. [PMID: 17314122 DOI: 10.1093/bioinformatics/btm064] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Much research has been dedicated to large-scale protein interaction networks including the analysis of scale-free topologies, network modules and the relation of domain-domain to protein-protein interaction networks. Identifying locally significant proteins that mediate the function of modules is still an open problem. METHOD We use a layered clustering algorithm for interaction networks, which groups proteins by the similarity of their direct neighborhoods. We identify locally significant proteins, called mediators, which link different clusters. We apply the algorithm to a yeast network. RESULTS Clusters and mediators are organized in hierarchies, where clusters are mediated by and act as mediators for other clusters. We compare the clusters and mediators to known yeast complexes and find agreement with precision of 71% and recall of 61%. We analyzed the functions, processes and locations of mediators and clusters. We found that 55% of mediators to a cluster are enriched with a set of diverse processes and locations, often related to translocation of biomolecules. Additionally, 82% of clusters are enriched with one or more functions. The important role of mediators is further corroborated by a comparatively higher degree of conservation across genomes. We illustrate the above findings with an example of membrane protein translocation from the cytoplasm to the inner nuclear membrane. AVAILABILITY All software is freely available under Supplementary information.
Collapse
|
26
|
Sim K, Li J, Gopalkrishnan V, Liu G. Mining Maximal Quasi-Bicliques to Co-Cluster Stocks and Financial Ratios for Value Investment. ACTA ACUST UNITED AC 2006. [DOI: 10.1109/icdm.2006.111] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
|