51
Zhao J, Hu X, He T, Li P, Zhang M, Shen X. An edge-based protein complex identification algorithm with gene co-expression data (PCIA-GeCo). IEEE Trans Nanobioscience 2014; 13:80-8. [PMID: 24803023 DOI: 10.1109/tnb.2014.2317519] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 02/02/2023]
Abstract
Recent studies have shown that a protein complex is composed of core proteins and attachment proteins, and that proteins inside the core are highly co-expressed. Based on this concept, we reconstruct a weighted PPI network using gene expression data and develop a novel edge-based protein complex identification algorithm (PCIA-GeCo). First, we select edges with high co-expression coefficients as seeds to form the preliminary cores. Then, the preliminary cores are filtered according to the weighted density of the complex core to obtain unique cores. Finally, the protein complexes are generated by identifying attachment proteins for each core. We compare our method with three existing algorithms (HUNTER, COACH and CORE) in terms of F-measure, coverage rate and P-value by matching the predicted complexes against benchmark complexes. The evaluation results show that PCIA-GeCo is effective and identifies protein complexes more accurately.
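Illustrative sketch (not taken from the paper): the first two steps of the abstract, weighting PPI edges by co-expression and scoring a candidate core by weighted density, might look roughly like the following. The use of Pearson correlation as the co-expression coefficient, the density formula, and all function names are assumptions made for illustration only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant profile has no defined correlation; treat it as 0 here.
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0


def weight_edges(edges, expression):
    """Map each PPI edge to the co-expression of its two proteins (assumed weighting)."""
    return {frozenset(e): pearson(expression[e[0]], expression[e[1]]) for e in edges}


def weighted_density(nodes, weights):
    """Sum of edge weights inside `nodes` divided by the number of possible pairs (assumed definition)."""
    nodes = set(nodes)
    k = len(nodes)
    if k < 2:
        return 0.0
    inside = sum(w for edge, w in weights.items() if edge <= nodes)
    return 2 * inside / (k * (k - 1))


# Toy data: A and B rise together, C falls while B rises.
expression = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 3, 2, 1]}
edges = [("A", "B"), ("B", "C")]
weights = weight_edges(edges, expression)
# Candidate seed edges, most co-expressed first.
seeds = sorted(weights, key=weights.get, reverse=True)
```

In this toy example the edge A-B would be chosen as the first seed because its two profiles are perfectly correlated.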
52
Paul S, Maji P. City block distance and rough-fuzzy clustering for identification of co-expressed microRNAs. MOLECULAR BIOSYSTEMS 2014; 10:1509-23. [PMID: 24682049 DOI: 10.1039/c4mb00101j] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Indexed: 01/07/2023]
Abstract
The microRNAs or miRNAs are short, endogenous RNAs having ability to regulate mRNA expression at the post-transcriptional level. Various studies have revealed that miRNAs tend to cluster on chromosomes. The members of a cluster that are in close proximity on chromosomes are highly likely to be processed as co-transcribed units. Therefore, a large proportion of miRNAs are co-expressed. Expression profiling of miRNAs generates a huge volume of data. Complicated networks of miRNA-mRNA interaction increase the challenges of comprehending and interpreting the resulting mass of data. In this regard, this paper presents a clustering algorithm in order to extract meaningful information from miRNA expression data. It judiciously integrates the merits of rough sets, fuzzy sets, the c-means algorithm, and the normalized range-normalized city block distance to discover co-expressed miRNA clusters. While the membership functions of fuzzy sets enable efficient handling of overlapping partitions in a noisy environment, the concept of lower and upper approximations of rough sets deals with uncertainty, vagueness, and incompleteness in cluster definition. The city block distance is used to compute the membership functions of fuzzy sets and to find initial partition of a data set, and therefore helps to handle minute differences between two miRNA expression profiles. The effectiveness of the proposed approach, along with a comparison with other related methods, is demonstrated for several miRNA expression data sets using different cluster validity indices. Moreover, the gene ontology is used to analyze the functional consistency and biological significance of generated miRNA clusters.
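Illustrative sketch (not taken from the paper): one common way to range-normalize a city block (Manhattan) distance is to divide each per-feature absolute difference by that feature's range over the data set and then average. The exact normalization used by the authors is not given in the abstract, so the formula and the function names below are assumptions.

```python
def feature_ranges(profiles):
    """Range (max - min) of every feature across all expression profiles."""
    return [max(col) - min(col) for col in zip(*profiles)]


def city_block(x, y, ranges):
    """Mean of per-feature absolute differences, each scaled by the feature range.

    Features with zero range carry no information and are skipped.
    """
    terms = [abs(a - b) / r for a, b, r in zip(x, y, ranges) if r > 0]
    return sum(terms) / len(terms) if terms else 0.0


# Three toy expression profiles measured under two conditions.
profiles = [[0, 10], [1, 20], [2, 30]]
ranges = feature_ranges(profiles)            # [2, 20]
near = city_block(profiles[0], profiles[1], ranges)
far = city_block(profiles[0], profiles[2], ranges)
```

Scaling by the range keeps a feature with large raw values (the second column here) from dominating the distance.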
Affiliation(s)
- Sushmita Paul
- Biomedical Imaging and Bioinformatics Lab, and Machine Intelligence Unit, Indian Statistical Institute, Kolkata, 700 108, India.
53
Menéndez HD, Barrero DF, Camacho D. A genetic graph-based approach for partitional clustering. Int J Neural Syst 2014; 24:1430008. [DOI: 10.1142/s0129065714300083] [Citation(s) in RCA: 58] [Impact Index Per Article: 5.8] [Indexed: 11/18/2022]
Abstract
Clustering is one of the most versatile tools for data analysis. In the recent years, clustering that seeks the continuity of data (in opposition to classical centroid-based approaches) has attracted an increasing research interest. It is a challenging problem with a remarkable practical interest. The most popular continuity clustering method is the spectral clustering (SC) algorithm, which is based on graph cut: It initially generates a similarity graph using a distance measure and then studies its graph spectrum to find the best cut. This approach is sensitive to the parameters of the metric, and a correct parameter choice is critical to the quality of the cluster. This work proposes a new algorithm, inspired by SC, that reduces the parameter dependency while maintaining the quality of the solution. The new algorithm, named genetic graph-based clustering (GGC), takes an evolutionary approach introducing a genetic algorithm (GA) to cluster the similarity graph. The experimental validation shows that GGC increases robustness of SC and has competitive performance in comparison with classical clustering methods, at least, in the synthetic and real dataset used in the experiments.
Affiliation(s)
- Héctor D. Menéndez
- Computer Science Department, Universidad Autónoma de Madrid, 28049, Madrid, Spain
- David F. Barrero
- Departamento de Automática, Universidad de Alcalá, 28801, Alcalá de Henares, Madrid, Spain
- David Camacho
- Computer Science Department, Universidad Autónoma de Madrid, 28049, Madrid, Spain
54
Abstract
BACKGROUND Recently, large data sets of protein-protein interactions (PPI), which can be modeled as PPI networks, have been generated through high-throughput methods. Locally dense regions in PPI networks are very likely to be protein complexes. Since protein complexes play a key role in many biological processes, detecting protein complexes in PPI networks is one of the important tasks in the post-genomic era. However, PPI networks are often incomplete and noisy, which hinders the mining of protein complexes. RESULTS We propose a new and effective robustness-based algorithm to detect overlapping clusters as protein complexes in PPI networks. To improve the accuracy of the resulting clusters, the algorithm tries to reduce the adverse effects of noise in PPI networks. Each new cluster begins from a seed and is expanded by adding qualified nodes from the cluster's neighbourhood. We also propose a new measure of the distance between a cluster K and a node among the neighbours of K. The performance of the algorithm is evaluated on two PPI networks, the Gavin network and the Database of Interacting Proteins (DIP). The results show that our algorithm outperforms the Markov clustering algorithm (MCL), the Clique Percolation method (CPM) and the core-attachment based method (CoAch) in terms of F-measure, co-localization and Gene Ontology (GO) semantic similarity. CONCLUSIONS Our algorithm detects locally dense regions or clusters as protein complexes. The results show that protein complexes generated by our algorithm are of better quality than those generated by some previous classic methods. Therefore, our algorithm is effective and useful.
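Illustrative sketch (not taken from the paper): the seed-and-expand scheme described in the abstract can be pictured as a greedy loop that repeatedly adds the best-connected neighbour to the cluster. The abstract does not define the authors' distance between a cluster K and a neighbouring node, so the score used here (the fraction of cluster members adjacent to the candidate), the threshold, and the function name are all assumptions.

```python
def expand_from_seed(adj, seed, threshold=0.5):
    """Grow a cluster from `seed` by greedily adding well-connected neighbours.

    adj: dict mapping each node to the set of its neighbours.
    A candidate's score is the fraction of current cluster members it touches
    (a stand-in for the paper's cluster-to-node distance, which is not specified here).
    """
    cluster = {seed}
    while True:
        # All nodes adjacent to the cluster but not yet in it.
        frontier = set().union(*(adj[v] for v in cluster)) - cluster
        best, best_score = None, 0.0
        for node in sorted(frontier):  # sorted for deterministic tie-breaking
            score = len(adj[node] & cluster) / len(cluster)
            if score > best_score:
                best, best_score = node, score
        if best is None or best_score < threshold:
            return cluster
        cluster.add(best)


# Toy network: a triangle a-b-c with a tail c-d-e.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
core = expand_from_seed(adj, "a", threshold=0.6)
```

With this threshold the expansion stops at the triangle, because node d touches only one of the three cluster members.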
Affiliation(s)
- Shuliang Wang
- School of Software, Beijing Institute of Technology, 5 South Zhongguancun Street, Haidian District, Beijing 100081, China
- Fang Wu
- The Integrated Information System Research Center, Institute of Automation, Chinese Academy of Sciences, 95 Zhongguancun East Road, Haidian District, Beijing 100190, China
55
Moschopoulos C, Beligiannis G, Likothanassis S, Kossida S. Using a Genetic Algorithm and Markov Clustering on Protein–Protein Interaction Graphs. Bioinformatics 2013. [DOI: 10.4018/978-1-4666-3604-0.ch043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022] Open
Abstract
In this paper, a Genetic Algorithm is applied on the filter of the Enhanced Markov Clustering algorithm to optimize the selection of clusters having a high probability to represent protein complexes. The filter was applied on the results (obtained by experiments made on five different yeast datasets) of three different algorithms known for their efficiency on protein complex detection through protein interaction graphs. The results are compared with three popular clustering algorithms, proving the efficiency of the proposed method according to metrics such as successful prediction rate and geometrical accuracy.
Affiliation(s)
- Sophia Kossida
- Biomedical Research Foundation of the Academy of Athens, Greece
56
Wang J, Peng X, Xiao Q, Li M, Pan Y. An effective method for refining predicted protein complexes based on protein activity and the mechanism of protein complex formation. BMC SYSTEMS BIOLOGY 2013; 7:28. [PMID: 23537347 PMCID: PMC3648373 DOI: 10.1186/1752-0509-7-28] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 10/09/2012] [Accepted: 03/14/2013] [Indexed: 11/10/2022]
Abstract
BACKGROUND Identifying protein complexes from protein-protein interaction networks is fundamental for understanding the mechanisms of cellular components and protein function. At present, many methods for identifying protein complexes are based mainly on topological characteristics or functional similarity features, neglecting the facts that proteins must be in their active forms to interact with others and that the formation of a protein complex follows a just-in-time mechanism. RESULTS This paper first presents a protein complex formation model based on the just-in-time mechanism. By investigating known protein complexes combined with gene expression data, we find that most protein complexes can be formed in continuous time points, and that the average overlapping rate of the known complexes during formation is large. A method is proposed to refine the protein complexes predicted by clustering algorithms based on the protein complex formation model and the properties of known protein complexes. After refinement, the number of known complexes matched by predicted complexes, sensitivity, specificity and F-measure are significantly improved compared with those of the original predicted complexes. CONCLUSION The refining method can discard spurious proteins based on protein activity and generate new complexes via the just-in-time assembly mechanism, which enhances the ability to predict complexes.
Affiliation(s)
- Jianxin Wang
- School of Information Science and Engineering, Central South University, Changsha 410083, China.
57
Maji P, Paul S. Rough-fuzzy clustering for grouping functionally similar genes from microarray data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:286-299. [PMID: 22848138 DOI: 10.1109/tcbb.2012.103] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Indexed: 06/01/2023]
Abstract
Gene expression data clustering is one of the important tasks of functional genomics as it provides a powerful tool for studying functional relationships of genes in a biological process. Identifying coexpressed groups of genes represents the basic challenge in gene clustering problem. In this regard, a gene clustering algorithm, termed as robust rough-fuzzy c-means, is proposed judiciously integrating the merits of rough sets and fuzzy sets. While the concept of lower and upper approximations of rough sets deals with uncertainty, vagueness, and incompleteness in cluster definition, the integration of probabilistic and possibilistic memberships of fuzzy sets enables efficient handling of overlapping partitions in noisy environment. The concept of possibilistic lower bound and probabilistic boundary of a cluster, introduced in robust rough-fuzzy c-means, enables efficient selection of gene clusters. An efficient method is proposed to select initial prototypes of different gene clusters, which enables the proposed c-means algorithm to converge to an optimum or near optimum solutions and helps to discover coexpressed gene clusters. The effectiveness of the algorithm, along with a comparison with other algorithms, is demonstrated both qualitatively and quantitatively on 14 yeast microarray data sets.
Affiliation(s)
- Pradipta Maji
- Machine Intelligence Unit, Indian Statistical Institute, 203 BT Road, Kolkata 700108, West Bengal, India.
58
Hayes W, Sun K, Pržulj N. Graphlet-based measures are suitable for biological network comparison. Bioinformatics 2013; 29:483-91. [PMID: 23349212 DOI: 10.1093/bioinformatics/bts729] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.0] [Indexed: 02/06/2023] Open
Abstract
MOTIVATION Large amounts of biological network data exist for many species. Analogous to sequence comparison, network comparison aims to provide biological insight. Graphlet-based methods are proving to be useful in this respect. Recently some doubt has arisen concerning the applicability of graphlet-based measures to low edge density networks-in particular that the methods are 'unstable'-and further that no existing network model matches the structure found in real biological networks. RESULTS We demonstrate that it is the model networks themselves that are 'unstable' at low edge density and that graphlet-based measures correctly reflect this instability. Furthermore, while model network topology is unstable at low edge density, biological network topology is stable. In particular, one must distinguish between average density and local density. While model networks of low average edge densities also have low local edge density, that is not the case with protein-protein interaction (PPI) networks: real PPI networks have low average edge density, but high local edge densities, and hence, they (and thus graphlet-based measures) are stable on these networks. Finally, we use a recently devised non-parametric statistical test to demonstrate that PPI networks of many species are well-fit by several models not previously tested. In addition, we model several viral PPI networks for the first time and demonstrate an exceptionally good fit between the data and theoretical models.
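Illustrative sketch (not taken from the paper): the distinction between average and local edge density can be made concrete on a toy graph. Here "local density" is approximated by the local clustering coefficient (the density of edges among a node's neighbours), which is an assumption; the authors may use a different definition.

```python
from itertools import combinations


def average_density(adj):
    """Fraction of all possible node pairs that are connected."""
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) / 2  # each edge is counted twice
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def local_density(adj, v):
    """Edge density among the neighbours of v (local clustering coefficient)."""
    nb = adj[v]
    k = len(nb)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nb, 2) if b in adj[a])
    return 2 * links / (k * (k - 1))


# Two disjoint triangles plus four isolated nodes: sparse overall, dense locally.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
    "d": {"e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
    "g": set(), "h": set(), "i": set(), "j": set(),
}
overall = average_density(adj)
around_a = local_density(adj, "a")
```

The toy graph has a low average density (6 edges out of 45 possible pairs) while every triangle node has the maximum local density, which mirrors the low-average, high-local pattern the abstract attributes to real PPI networks.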
Affiliation(s)
- Wayne Hayes
- Department of Computer Science, University of California, Irvine, CA 92697-3435, USA
59
Moschopoulos C, Fytros M, Alatsathianos S, Likothanassis S, Kossida S. GAppi: identifying important protein modules through protein-protein interaction graphs. INT J ARTIF INTELL T 2013. [DOI: 10.1142/s0218213012500273] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/18/2022]
Abstract
In this paper, a new genetic algorithm called GAppi is proposed, which performs clustering in protein-protein interaction networks to identify protein complexes. The algorithm has been tested exhaustively on experimental datasets from online protein interaction databases and individual experiments, and it has been compared with five other effective techniques to demonstrate its efficiency and superior performance. Results showed that GAppi produces feasible and very efficient solutions compared to other techniques. In addition, owing to its adaptive behavior, it can satisfy different constraints each time it is used, thus meeting the different needs of each user. Furthermore, a user-friendly interface has been implemented that hosts the proposed algorithmic strategy.
Affiliation(s)
- Charalampos Moschopoulos
- Bioinformatics & Medical Informatics Team, Biomedical Research Foundation of the Academy of Athens, Soranou Efesiou 4, Athens GR-11527, Greece
- Marios Fytros
- Computer System Department, Technological Institution of Piraeus, Petrou Ralli & Thivon 250, Aigaleo, Piraeus GR-11244, Greece
- Stamatis Alatsathianos
- Computer System Department, Technological Institution of Piraeus, Petrou Ralli & Thivon 250, Aigaleo, Piraeus GR-11244, Greece
- Spiridon Likothanassis
- Department of Computer Engineering & Informatics, University of Patras, Rio GR-26500, Greece
- Sophia Kossida
- Bioinformatics & Medical Informatics Team, Biomedical Research Foundation of the Academy of Athens, Soranou Efesiou 4, Athens GR-11527, Greece
60
Cai B, Wang H, Zheng H, Wang H. Detection of protein complexes from affinity purification/mass spectrometry data. BMC SYSTEMS BIOLOGY 2012; 6 Suppl 3:S4. [PMID: 23282282 PMCID: PMC3524315 DOI: 10.1186/1752-0509-6-s3-s4] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 12/05/2022]
Abstract
Background Recent advances in molecular biology have led to the accumulation of large amounts of data on protein-protein interaction networks in different species. An important challenge in the analysis of these data is to extract functional modules, such as protein complexes and biological processes, from networks that are characterised by the presence of a significant number of false positives. Various computational techniques have been applied in recent years. However, most of them treat protein interactions as binary. Co-complex relations derived from affinity purification/mass spectrometry (AP-MS) experiments have been largely ignored. Methods This paper presents a new algorithm for detecting protein complexes from AP-MS data. The algorithm aims to detect groups of prey proteins that are significantly co-associated with the same set of bait proteins. We first represent AP-MS data as a bipartite network, where one set of nodes consists of bait proteins and the other set is composed of prey proteins. We then calculate pair-wise similarities of bait proteins based on the number of their commonly shared neighbours. A hierarchical clustering algorithm is employed to cluster bait proteins based on the similarities, and thus a set of 'seed' clusters is obtained. Starting from these 'seed' clusters, an expansion process is developed to identify prey proteins which are significantly associated with the same set of bait proteins. Then, a set of complete protein complexes is derived. In application to two real AP-MS datasets, we validate the biological significance of predicted protein complexes by using curated protein complexes and well-characterized cellular component annotation from Gene Ontology (GO). Several statistical metrics have been applied for evaluation. Results Experimental results show that the proposed algorithm achieves significant improvement in detecting protein complexes from AP-MS data. In comparison to the well-known MCL algorithm, our algorithm improves the accuracy rate by about 20% in detecting protein complexes in both networks and increases the F-Measure value by about 50% in the Krogan_2006 network. Greater precision and better accuracy have been achieved and the identified complexes are demonstrated to match well with existing curated protein complexes. Conclusions Our study highlights the significance of taking co-complex relations into account when extracting protein complexes from AP-MS data. The algorithm proposed in this paper can be easily extended to the analysis of other biological networks which can be conveniently represented by bipartite graphs, such as drug-target networks.
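Illustrative sketch (not taken from the paper): the bait-similarity step can be pictured as counting, for every pair of bait proteins, the prey proteins they share in the bipartite AP-MS network. The data layout (a dict from bait to its set of prey) and the function name are assumptions; the subsequent hierarchical clustering and statistical expansion steps are not shown.

```python
from itertools import combinations


def shared_prey_similarity(bait_to_prey):
    """Number of prey proteins shared by each pair of baits.

    bait_to_prey: dict mapping a bait protein to the set of prey it pulled down.
    Returns a dict keyed by alphabetically ordered bait pairs.
    """
    similarity = {}
    for b1, b2 in combinations(sorted(bait_to_prey), 2):
        similarity[(b1, b2)] = len(bait_to_prey[b1] & bait_to_prey[b2])
    return similarity


# Toy pull-down data: B1 and B2 share two prey, B3 shares none.
pulldowns = {
    "B1": {"p1", "p2", "p3"},
    "B2": {"p2", "p3", "p4"},
    "B3": {"p9"},
}
similarity = shared_prey_similarity(pulldowns)
```

Bait pairs with a high shared-prey count would be the natural candidates to merge first when building the 'seed' clusters.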
Affiliation(s)
- Bingjing Cai
- School of Computing and Mathematics, Computer Sciences Research Institute, University of Ulster, N. Ireland, BT37 0QB, UK
61
Wang J, Chen G, Liu B, Li M, Pan Y. Identifying Protein Complexes From Interactome Based on Essential Proteins and Local Fitness Method. IEEE Trans Nanobioscience 2012; 11:324-35. [DOI: 10.1109/tnb.2012.2197863] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Indexed: 11/09/2022]
62
Osoba O, Kosko B. Noise-enhanced clustering and competitive learning algorithms. Neural Netw 2012; 37:132-40. [PMID: 23137615 DOI: 10.1016/j.neunet.2012.09.012] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Received: 06/11/2012] [Revised: 09/20/2012] [Accepted: 09/20/2012] [Indexed: 11/30/2022]
Abstract
Noise can provably speed up convergence in many centroid-based clustering algorithms. This includes the popular k-means clustering algorithm. The clustering noise benefit follows from the general noise benefit for the expectation-maximization algorithm because many clustering algorithms are special cases of the expectation-maximization algorithm. Simulations show that noise also speeds up convergence in stochastic unsupervised competitive learning, supervised competitive learning, and differential competitive learning.
Affiliation(s)
- Osonde Osoba
- Department of Electrical Engineering, Signal and Image Processing Institute, University of Southern California, Los Angeles, CA 90089-2564, USA
63
Bello-Orgaz G, Menéndez HD, Camacho D. Adaptive k-means algorithm for overlapped graph clustering. Int J Neural Syst 2012; 22:1250018. [DOI: 10.1142/s0129065712500189] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.4] [Indexed: 11/18/2022]
Abstract
The graph clustering problem has become highly relevant due to the growing interest of several research communities in social networks and their possible applications. Overlapped graph clustering algorithms try to find subsets of nodes that can belong to different clusters. In social network-based applications it is quite usual for a node of the network to belong to different groups, or communities, in the graph. Therefore, algorithms trying to discover, or analyze, the behavior of these networks needed to handle this feature, detecting and identifying the overlapped nodes. This paper shows a soft clustering approach based on a genetic algorithm where a new encoding is designed to achieve two main goals: first, the automatic adaptation of the number of communities that can be detected and second, the definition of several fitness functions that guide the searching process using some measures extracted from graph theory. Finally, our approach has been experimentally tested using the Eurovision contest dataset, a well-known social-based data network, to show how overlapped communities can be found using our method.
Affiliation(s)
- Gema Bello-Orgaz
- Computer Science Department, Escuela Politecnica Superior, Universidad Autónoma de Madrid, 28049, Madrid, Spain
- Héctor D. Menéndez
- Computer Science Department, Escuela Politecnica Superior, Universidad Autónoma de Madrid, 28049, Madrid, Spain
- David Camacho
- Computer Science Department, Escuela Politecnica Superior, Universidad Autónoma de Madrid, 28049, Madrid, Spain
64
Le T, Tran D, Nguyen P, Ma W, Sharma D. Proximity multi-sphere support vector clustering. Neural Comput Appl 2012. [DOI: 10.1007/s00521-012-1001-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 11/28/2022]
65
Kuhl C, Tautenhahn R, Böttcher C, Larson TR, Neumann S. CAMERA: an integrated strategy for compound spectra extraction and annotation of liquid chromatography/mass spectrometry data sets. Anal Chem 2011; 84:283-9. [PMID: 22111785 DOI: 10.1021/ac202450g] [Citation(s) in RCA: 752] [Impact Index Per Article: 57.8] [Indexed: 02/08/2023]
Abstract
Liquid chromatography coupled to mass spectrometry is routinely used for metabolomics experiments. In contrast to the fairly routine and automated data acquisition steps, subsequent compound annotation and identification require extensive manual analysis and thus form a major bottleneck in data interpretation. Here we present CAMERA, a Bioconductor package integrating algorithms to extract compound spectra, annotate isotope and adduct peaks, and propose the accurate compound mass even in highly complex data. To evaluate the algorithms, we compared the annotation of CAMERA against a manually defined annotation for a mixture of known compounds spiked into a complex matrix at different concentrations. CAMERA successfully extracted accurate masses for 89.7% and 90.3% of the annotatable compounds in positive and negative ion modes, respectively. Furthermore, we present a novel annotation approach that combines spectral information of data acquired in opposite ion modes to further improve the annotation rate. We demonstrate the utility of CAMERA in two different, easily adoptable plant metabolomics experiments, where the application of CAMERA drastically reduced the amount of manual analysis.
Affiliation(s)
- Carsten Kuhl
- Department of Stress and Developmental Biology, Leibniz Institute of Plant Biochemistry, Weinberg 3, 06120 Halle (Saale), Germany.
66
Parker BJ, Moltke I, Roth A, Washietl S, Wen J, Kellis M, Breaker R, Pedersen JS. New families of human regulatory RNA structures identified by comparative analysis of vertebrate genomes. Genome Res 2011; 21:1929-43. [PMID: 21994249 DOI: 10.1101/gr.112516.110] [Citation(s) in RCA: 80] [Impact Index Per Article: 6.2] [Indexed: 01/22/2023]
Abstract
Regulatory RNA structures are often members of families with multiple paralogous instances across the genome. Family members share functional and structural properties, which allow them to be studied as a whole, facilitating both bioinformatic and experimental characterization. We have developed a comparative method, EvoFam, for genome-wide identification of families of regulatory RNA structures, based on primary sequence and secondary structure similarity. We apply EvoFam to a 41-way genomic vertebrate alignment. Genome-wide, we identify 220 human, high-confidence families outside protein-coding regions comprising 725 individual structures, including 48 families with known structural RNA elements. Known families identified include both noncoding RNAs, e.g., miRNAs and the recently identified MALAT1/MEN β lincRNA family; and cis-regulatory structures, e.g., iron-responsive elements. We also identify tens of new families supported by strong evolutionary evidence and other statistical evidence, such as GO term enrichments. For some of these, detailed analysis has led to the formulation of specific functional hypotheses. Examples include two hypothesized auto-regulatory feedback mechanisms: one involving six long hairpins in the 3'-UTR of MAT2A, a key metabolic gene that produces the primary human methyl donor S-adenosylmethionine; the other involving a tRNA-like structure in the intron of the tRNA maturation gene POP1. We experimentally validate the predicted MAT2A structures. Finally, we identify potential new regulatory networks, including large families of short hairpins enriched in immunity-related genes, e.g., TNF, FOS, and CTLA4, which include known transcript destabilizing elements. Our findings exemplify the diversity of post-transcriptional regulation and provide a resource for further characterization of new regulatory mechanisms and families of noncoding RNAs.
Affiliation(s)
- Brian J Parker
- The Bioinformatics Centre, Department of Biology, University of Copenhagen, DK-2200 Copenhagen, Denmark.
67
Gaussian kernel width exploration and cone cluster labeling for support vector clustering. Pattern Anal Appl 2011. [DOI: 10.1007/s10044-011-0244-8] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 11/27/2022]
68
YU L, GAO L, SUN PG. Research on Algorithms for Complexes and Functional Modules Prediction in Protein-Protein Interaction Networks. ACTA ACUST UNITED AC 2011. [DOI: 10.3724/sp.j.1016.2011.01239] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/25/2022]
69
Wang J, Li M, Chen J, Pan Y. A fast hierarchical clustering algorithm for functional modules discovery in protein interaction networks. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 8:607-620. [PMID: 20733244 DOI: 10.1109/tcbb.2010.75] [Citation(s) in RCA: 104] [Impact Index Per Article: 8.0] [Indexed: 05/29/2023]
Abstract
With advances in the technologies for predicting protein interactions, huge data sets portrayed as networks have become available. Identification of functional modules from such networks is crucial for understanding the principles of cellular organization and function. However, protein interaction data produced by high-throughput experiments generally contain many false positives, which makes it difficult to identify functional modules accurately. In this paper, we propose a fast hierarchical clustering algorithm, HC-PIN, based on the local metric of edge clustering value, which can be used in both unweighted and weighted networks. HC-PIN is applied to the yeast protein interaction network, and the identified modules are validated by all three types of Gene Ontology (GO) terms: Biological Process, Molecular Function, and Cellular Component. The experimental results show that HC-PIN is not only robust to false positives but can also discover functional modules with low density. The identified modules are statistically significant in terms of the three types of GO annotations. Moreover, HC-PIN can uncover the hierarchical organization of functional modules as its parameter value varies, which approximately corresponds to the hierarchical structure of GO annotations. Compared to previous competing algorithms, HC-PIN is faster and more accurate.
Affiliation(s)
- Jianxin Wang
- Department of Computer Science, School of Information Science and Engineering, Central South University, Changsha 410083, China.
70
71
Abstract
The author explores the application of graph colouring to biological networks, specifically protein-protein interaction (PPI) networks. First, the author finds that given similar conditions (i.e. graph size, degree distribution and clustering), fewer colours are needed to colour disassortative than assortative networks. Fewer colours create fewer independent sets, which in turn imply higher concurrency potential for a network. Since PPI networks tend to be disassortative, the author suggests that in addition to functional specificity and stability proposed previously by Maslov and Sneppen (Science, 296, 2002), the disassortative nature of PPI networks may promote the ability of cells to perform multiple, crucial and functionally diverse tasks concurrently. Second, because graph colouring is closely related to the presence of cliques in a graph, the significance of node colouring information to the problem of identifying protein complexes (dense subgraphs in PPI networks) is investigated. The author finds that for PPI networks where 1-11% of nodes participate in at least one identified protein complex, such as that of H. sapiens, DSATUR (a well-known complete graph colouring algorithm) node colouring information can improve the quality (homogeneity and separation) of initial candidate complexes. This finding may help improve existing protein complex detection methods, and/or suggest new methods. [Includes supplementary material].
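Illustrative sketch (not taken from the paper): DSATUR is a greedy colouring heuristic that always colours next the node whose neighbours already use the most distinct colours. The version below is a simplified reading of that idea; it breaks ties by total degree and then by node name, which is an assumption made to keep the example deterministic, and the function name is hypothetical.

```python
def dsatur(adj):
    """Greedy DSATUR-style colouring of an undirected graph.

    adj: dict mapping each node to the set of its neighbours.
    Returns a dict mapping each node to a colour index (0, 1, 2, ...).
    """
    colour = {}
    uncoloured = set(adj)
    while uncoloured:
        def priority(v):
            # Saturation: number of distinct colours already used by neighbours.
            saturation = len({colour[u] for u in adj[v] if u in colour})
            return (saturation, len(adj[v]), str(v))

        v = max(uncoloured, key=priority)
        used = {colour[u] for u in adj[v] if u in colour}
        c = 0
        while c in used:  # smallest colour not used by any neighbour
            c += 1
        colour[v] = c
        uncoloured.remove(v)
    return colour


# Toy graph: a triangle a-b-c (a 3-clique) with a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
colouring = dsatur(adj)
n_colours = len(set(colouring.values()))
```

Because every member of a clique must receive a different colour, the triangle forces three colours here, which hints at why colouring information can flag dense candidate complexes.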
|
72
|
Koyutürk M. Algorithmic and analytical methods in network biology. WILEY INTERDISCIPLINARY REVIEWS. SYSTEMS BIOLOGY AND MEDICINE 2010; 2:277-292. [PMID: 20836029 PMCID: PMC3087298 DOI: 10.1002/wsbm.61] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.9] [Indexed: 12/21/2022]
Abstract
During the genomic revolution, algorithmic and analytical methods for organizing, integrating, analyzing, and querying biological sequence data proved invaluable. Today, increasing availability of high-throughput data pertaining to functional states of biomolecules, as well as their interactions, enables genome-scale studies of the cell from a systems perspective. The past decade witnessed significant efforts on the development of computational infrastructure for large-scale modeling and analysis of biological systems, commonly using network models. Such efforts lead to novel insights into the complexity of living systems, through development of sophisticated abstractions, algorithms, and analytical techniques that address a broad range of problems, including the following: (1) inference and reconstruction of complex cellular networks; (2) identification of common and coherent patterns in cellular networks, with a view to understanding the organizing principles and building blocks of cellular signaling, regulation, and metabolism; and (3) characterization of cellular mechanisms that underlie the differences between living systems, in terms of evolutionary diversity, development and differentiation, and complex phenotypes, including human disease. These problems pose significant algorithmic and analytical challenges because of the inherent complexity of the systems being studied; limitations of data in terms of availability, scope, and scale; intractability of resulting computational problems; and limitations of reference models for reliable statistical inference. This article provides a broad overview of existing algorithmic and analytical approaches to these problems, highlights key biological insights provided by these approaches, and outlines emerging opportunities and challenges in computational systems biology.
Affiliation(s)
- Mehmet Koyutürk
- Department of Electrical Engineering & Computer Science, Case Western Reserve University, Cleveland, OH 44106, USA
- Center for Proteomics and Bioinformatics, Case Western Reserve University, Cleveland, OH 44106, USA
|
73
|
|
74
|
|
75
|
Moschopoulos CN, Pavlopoulos GA, Schneider R, Likothanassis SD, Kossida S. GIBA: a clustering tool for detecting protein complexes. BMC Bioinformatics 2009; 10 Suppl 6:S11. [PMID: 19534736 PMCID: PMC2697634 DOI: 10.1186/1471-2105-10-s6-s11] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 03/24/2023] Open
Abstract
BACKGROUND In recent years, high-throughput experimental methods have been developed that generate large datasets of protein-protein interactions (PPIs). However, because of the experimental methodologies, these datasets contain errors, mainly false positives, which reduce the quality of any derived information. Typically these datasets can be modeled as graphs, where vertices represent proteins and edges the pairwise PPIs, making it easy to apply automated clustering methods to detect protein complexes or other biologically significant functional groupings. METHODS In this paper, a clustering tool called GIBA (named after the first characters of its developers' nicknames) is presented. GIBA applies a two-step procedure to a given dataset of protein-protein interaction data. First, a clustering algorithm is applied to the interaction data; this is followed by a filtering step that generates the final candidate list of predicted complexes. RESULTS The efficiency of GIBA is demonstrated through the analysis of 6 different yeast protein interaction datasets in comparison with four other available algorithms. We compared the results of the different methods using five different performance metrics. Moreover, we examined how the parameters of the methods that constitute the filter affect the final results. CONCLUSION GIBA is an effective and easy-to-use tool for the detection of protein complexes in experimentally measured protein-protein interaction networks. The results show that GIBA has better prediction accuracy than previously published methods.
Affiliation(s)
- Charalampos N Moschopoulos
- Pattern Recognition Lab, Department of Computer Engineering & Informatics, University of Patras, Patra, Rio, GR-26500, Greece.
|
76
|
Gao L, Sun PG, Song J. Clustering algorithms for detecting functional modules in protein interaction networks. J Bioinform Comput Biol 2009; 7:217-42. [PMID: 19226668 DOI: 10.1142/s0219720009004023] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.8] [Received: 09/12/2008] [Revised: 10/21/2008] [Accepted: 10/21/2008] [Indexed: 01/21/2023]
Abstract
Protein-Protein Interaction (PPI) networks are believed to be important sources of information related to biological processes and complex metabolic functions of the cell. When studying the workings of a biological cell, it is useful to be able to detect known and predict still undiscovered protein complexes within the cell's PPI networks. Such predictions may be used as an inexpensive tool to direct biological experiments. The increasing amount of available PPI data necessitates a fast, accurate approach to biological complex identification. Because of its importance in the study of protein interaction networks, different models and algorithms exist for identifying functional modules in PPI networks. In this paper, we review some representative algorithms, focusing on the algorithms underlying the approaches and how the algorithms relate to each other. In particular, a comparison is given based on the properties of the algorithms. Since the PPI network is noisy and still incomplete, some methods that consider additional properties for preprocessing and purifying PPI data are presented. We also discuss the functional annotation and validation of protein complexes. Finally, new progress and future research directions are discussed from the computational viewpoint.
Affiliation(s)
- Lin Gao
- School of Computer Science and Technology, Xidian University, Xi'an 710071, China.
|
77
|
Multiobjective evolutionary clustering of Web user sessions: a case study in Web page recommendation. Soft Comput 2009. [DOI: 10.1007/s00500-009-0428-y] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Indexed: 10/20/2022]
|
78
|
Maulik U, Mukhopadhyay A, Bandyopadhyay S. Finding multiple coherent biclusters in microarray data using variable string length multiobjective genetic algorithm. IEEE Trans Inf Technol Biomed 2009; 13:969-75. [PMID: 19304489 DOI: 10.1109/titb.2009.2017527] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 11/08/2022]
Abstract
Microarray technology enables the simultaneous monitoring of the expression pattern of a huge number of genes across different experimental conditions. Biclustering in microarray data is an important technique that discovers a group of genes that are coregulated in a subset of conditions. Biclustering algorithms are required to identify coherent and nontrivial biclusters, i.e., the biclusters should have low mean squared residue and high row variance. A multiobjective genetic biclustering technique is proposed here that optimizes these objectives simultaneously. A novel encoding scheme that uses variable chromosome length is developed. Moreover, a new quantitative measure to evaluate the goodness of the biclusters is proposed. The performance of the proposed algorithm has been evaluated on both simulated and real-life gene expression datasets, and compared with some other well-known biclustering techniques.
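The two objectives named in the abstract can be made concrete with a small sketch. It assumes the standard Cheng and Church form of the mean squared residue and a row variance averaged over all cells of the bicluster; whether the paper uses exactly these normalisations is an assumption, and the function name and toy matrix are illustrative only.

```python
def msr_and_row_variance(matrix):
    """Return (mean squared residue, row variance) for a bicluster.

    matrix is a list of equal-length rows (genes) over the selected conditions.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [
        sum(matrix[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)
    ]
    overall = sum(row_means) / n_rows
    cells = n_rows * n_cols

    # Residue of a cell: what remains after removing row, column and overall effects.
    msr = sum(
        (matrix[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    ) / cells

    # Row variance: how much each gene actually fluctuates across the conditions.
    row_var = sum(
        (matrix[i][j] - row_means[i]) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    ) / cells
    return msr, row_var


# Perfectly additive pattern: every row is the first row plus a constant shift.
coherent = [[1, 2, 3], [2, 3, 4], [4, 5, 6]]
msr, row_var = msr_and_row_variance(coherent)
```

A perfectly additive bicluster has zero residue while still showing nonzero row variance, which is why the two objectives are optimised together: low residue alone would also reward flat, uninformative rows.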
Affiliation(s)
- Ujjwal Maulik
- Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India.
|
79
|
|
80
|
Effective Pruning Techniques for Mining Quasi-Cliques. MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES 2008. [DOI: 10.1007/978-3-540-87481-2_3] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.3] [Indexed: 12/12/2022]
|
81
|
|
82
|
Madi A, Friedman Y, Roth D, Regev T, Bransburg-Zabary S, Jacob EB. Genome holography: deciphering function-form motifs from gene expression data. PLoS One 2008; 3:e2708. [PMID: 18628959 PMCID: PMC2444029 DOI: 10.1371/journal.pone.0002708] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Received: 03/06/2008] [Accepted: 06/19/2008] [Indexed: 12/28/2022] Open
Abstract
Background DNA chips allow simultaneous measurements of genome-wide response of thousands of genes, i.e. system level monitoring of the gene-network activity. Advanced analysis methods have been developed to extract meaningful information from the vast amount of raw gene-expression data obtained from the microarray measurements. These methods usually aimed to distinguish between groups of subjects (e.g., cancer patients vs. healthy subjects) or identifying marker genes that help to distinguish between those groups. We assumed that motifs related to the internal structure of operons and gene-networks regulation are also embedded in microarray and can be deciphered by using proper analysis. Methodology/Principal Findings The analysis presented here is based on investigating the gene-gene correlations. We analyze a database of gene expression of Bacillus subtilis exposed to sub-lethal levels of 37 different antibiotics. Using unsupervised analysis (dendrogram) of the matrix of normalized gene-gene correlations, we identified the operons as they form distinct clusters of genes in the sorted correlation matrix. Applying dimension-reduction algorithm (Principal Component Analysis, PCA) to the matrices of normalized correlations reveals functional motifs. The genes are placed in a reduced 3-dimensional space of the three leading PCA eigen-vectors according to their corresponding eigen-values. We found that the organization of the genes in the reduced PCA space recovers motifs of the operon internal structure, such as the order of the genes along the genome, gene separation by non-coding segments, and translational start and end regions. In addition to the intra-operon structure, it is also possible to predict inter-operon relationships, operons sharing functional regulation factors, and more. In particular, we demonstrate the above in the context of the competence and sporulation pathways. 
Conclusions/Significance We demonstrated that by analyzing gene-gene correlation from gene-expression data it is possible to identify operons and to predict unknown internal structure of operons and gene-networks regulation.
Affiliation(s)
- Asaf Madi
- School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel
- Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Yonatan Friedman
- School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel
- Computational and Systems Biology, Massachusetts Institute of Technology (MIT), Boston, Massachusetts, United States of America
| | - Dalit Roth
- Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Tamar Regev
- School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel
| | - Sharron Bransburg-Zabary
- School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel
- Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Eshel Ben Jacob
- School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel
- The Center for Theoretical and Biological Physics, University of California San Diego, La Jolla, California, United States of America
|
83
|
Zhao XM, Chen L, Aihara K. Protein function prediction with high-throughput data. Amino Acids 2008; 35:517-30. [PMID: 18427717 DOI: 10.1007/s00726-008-0077-y] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Received: 01/31/2008] [Accepted: 03/13/2008] [Indexed: 12/12/2022]
Abstract
Protein function prediction is one of the main challenges in post-genomic era. The availability of large amounts of high-throughput data provides an alternative approach to handling this problem from the computational viewpoint. In this review, we provide a comprehensive description of the computational methods that are currently applicable to protein function prediction, especially from the perspective of machine learning. Machine learning techniques can generally be classified as supervised learning, semi-supervised learning and unsupervised learning. By classifying the existing computational methods for protein annotation into these three groups, we are able to present a comprehensive framework on protein annotation based on machine learning techniques. In addition to describing recently developed theoretical methodologies, we also cover representative databases and software tools that are widely utilized in the prediction of protein function.
Affiliation(s)
- Xing-Ming Zhao
- ERATO Aihara Complexity Modelling Project, JST, Tokyo, 151-0064, Japan
|
84
|
Milenković T, Pržulj N. Uncovering Biological Network Function via Graphlet Degree Signatures. Cancer Inform 2008. [DOI: 10.4137/cin.s680] [Citation(s) in RCA: 166] [Impact Index Per Article: 10.4] [Indexed: 11/05/2022] Open
Abstract
Motivation Proteins are essential macromolecules of life and thus understanding their function is of great importance. The number of functionally unclassified proteins is large even for simple and well studied organisms such as baker's yeast. Methods for determining protein function have shifted their focus from targeting specific proteins based solely on sequence homology to analyses of the entire proteome based on protein-protein interaction (PPI) networks. Since proteins interact to perform a certain function, analyzing structural properties of PPI networks may provide useful clues about the biological function of individual proteins, protein complexes they participate in, and even larger subcellular machines. Results We design a sensitive graph theoretic method for comparing local structures of node neighborhoods that demonstrates that in PPI networks, biological function of a node and its local network structure are closely related. The method summarizes a protein's local topology in a PPI network into the vector of graphlet degrees called the signature of the protein and computes the signature similarities between all protein pairs. We group topologically similar proteins under this measure in a PPI network and show that these protein groups belong to the same protein complexes, perform the same biological functions, are localized in the same subcellular compartments, and have the same tissue expressions. Moreover, we apply our technique on a proteome-scale network data and infer biological function of yet unclassified proteins demonstrating that our method can provide valuable guidelines for future experimental research such as disease protein prediction. Availability Data is available upon request.
Affiliation(s)
- Tijana Milenković
- Department of Computer Science, University of California, Irvine, CA 92697-3435, U.S.A
| | - Nataša Pržulj
- Department of Computer Science, University of California, Irvine, CA 92697-3435, U.S.A
|
85
|
Hwang W, Cho YR, Zhang A, Ramanathan M. CASCADE: a novel quasi all paths-based network analysis algorithm for clustering biological interactions. BMC Bioinformatics 2008; 9:64. [PMID: 18230159 PMCID: PMC2253513 DOI: 10.1186/1471-2105-9-64] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Received: 02/09/2007] [Accepted: 01/29/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Quantitative characterization of the topological characteristics of protein-protein interaction (PPI) networks can enable the elucidation of biological functional modules. Here, we present a novel clustering methodology for PPI networks wherein the biological and topological influence of each protein on other proteins is modeled using the probability distribution that the series of interactions necessary to link a pair of distant proteins in the network occur within a time constant (the occurrence probability). RESULTS CASCADE selects representative nodes for each cluster and iteratively refines clusters based on a combination of the occurrence probability and graph topology between every protein pair. The CASCADE approach is compared to nine competing approaches. The clusters obtained by each technique are compared for enrichment of biological function. CASCADE generates larger clusters and the clusters identified have p-values for biological function that are approximately 1000-fold better than the other methods on the yeast PPI network dataset. An important strength of CASCADE is that the percentage of proteins that are discarded to create clusters is much lower than the other approaches which have an average discard rate of 45% on the yeast protein-protein interaction network. CONCLUSION CASCADE is effective at detecting biologically relevant clusters of interactions.
Affiliation(s)
- Woochang Hwang
- Department of Computer Science and Engineering, State University of New York, Buffalo, NY 14260, USA.
|
86
|
Santos JM, Marques de Sa J, Alexandre LA. LEGClust- a clustering algorithm based on layered entropic subgraphs. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2008; 30:62-75. [PMID: 18000325 DOI: 10.1109/tpami.2007.1142] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 05/25/2023]
Abstract
Hierarchical clustering is a stepwise clustering method usually based on proximity measures between objects or sets of objects from a given data set. The most common proximity measures are distance measures. The derived proximity matrices can be used to build graphs, which provide the basic structure for some clustering methods. We present here a new proximity matrix based on an entropic measure and also a clustering algorithm (LEGClust) that builds layers of subgraphs based on this matrix, and uses them and a hierarchical agglomerative clustering technique to form the clusters. Our approach capitalizes on both a graph structure and a hierarchical construction. Moreover, by using entropy as a proximity measure we are able, with no assumption about the cluster shapes, to capture the local structure of the data, forcing the clustering method to reflect this structure. We present several experiments on artificial and real data sets that provide evidence on the superior performance of this new algorithm when compared with competing ones.
Affiliation(s)
- Jorge M Santos
- Department of Mathematics, ISEP- Polytechnic, School of Engineering, Porto, Portugal.
|
87
|
Yan X, Mehan MR, Huang Y, Waterman MS, Yu PS, Zhou XJ. A graph-based approach to systematically reconstruct human transcriptional regulatory modules. Bioinformatics 2007; 23:i577-86. [PMID: 17646346 DOI: 10.1093/bioinformatics/btm227] [Citation(s) in RCA: 63] [Impact Index Per Article: 3.7] [Indexed: 12/13/2022]
Abstract
MOTIVATION A major challenge in studying gene regulation is to systematically reconstruct transcription regulatory modules, which are defined as sets of genes that are regulated by a common set of transcription factors. A commonly used approach for transcription module reconstruction is to derive coexpression clusters from a microarray dataset. However, such results often contain false positives because genes from many transcription modules may be simultaneously perturbed under a given type of condition. In this study, we propose and validate that genes which form a coexpression cluster in multiple microarray datasets across diverse conditions are more likely to form a transcription module. However, identifying genes coexpressed in a subset of many microarray datasets is not a trivial computational problem. RESULTS We propose a graph-based data-mining approach to efficiently and systematically identify frequent coexpression clusters. Given m microarray datasets, we model each microarray dataset as a coexpression graph, and search for vertex sets which are frequently densely connected across [θm] datasets (0 ≤ θ ≤ 1). For this novel graph-mining problem, we designed two techniques to narrow down the search space: (1) partition the input graphs into (overlapping) groups sharing common properties; (2) summarize the vertex neighbor information from the partitioned datasets onto 'Neighbor Association Summary Graphs' for effective mining. We applied our method to 105 human microarray datasets, and identified a large number of potential transcription modules, activated under different subsets of conditions. Validation by ChIP-chip data demonstrated that the likelihood of a coexpression cluster being a transcription module increases significantly with its recurrence. Our method opens a new way to exploit the vast amount of existing microarray data for gene regulation study. Furthermore, the algorithm is applicable to other biological networks for approximate network module mining. AVAILABILITY http://zhoulab.usc.edu/NeMo/.
Affiliation(s)
- Xifeng Yan
- IBM T. J. Watson Research Center, Hawthorne, NY, USA
|
88
|
Koyutürk M, Szpankowski W, Grama A. Assessing Significance of Connectivity and Conservation in Protein Interaction Networks. J Comput Biol 2007; 14:747-64. [PMID: 17691892 DOI: 10.1089/cmb.2007.r014] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.6] [Indexed: 11/12/2022] Open
Abstract
Comparative analyses of cellular interaction networks enable understanding of the cell's modular organization through identification of functional modules and complexes. These techniques often rely on topological features such as connectedness and density, based on the premise that functionally related proteins are likely to interact densely and that these interactions follow similar evolutionary trajectories. Significant recent work has focused on efficient algorithms for identification of such functional modules and their conservation. In spite of algorithmic advances, development of a comprehensive infrastructure for interaction databases is in relative infancy compared to corresponding sequence analysis tools. One critical, and as yet unresolved aspect of this infrastructure is a measure of the statistical significance of a match, or a dense subcomponent. In the absence of analytical measures, conventional methods rely on computationally expensive simulations based on ad-hoc models for quantifying significance. In this paper, we present techniques for analytically quantifying statistical significance of dense components in reference model graphs. We consider two reference models--a G(n, p) model in which each pair of nodes in a graph has an identical likelihood, p, of sharing an edge, and a two-level G(n, p) model, which accounts for high-degree hub nodes generally observed in interaction networks. Experiments performed on a rich collection of protein interaction (PPI) networks show that the proposed model provides a reliable means of evaluating statistical significance of dense patterns in these networks. We also adapt existing state-of-the-art network clustering algorithms by using our statistical significance measure as an optimization criterion. 
Comparison of the resulting module identification algorithm, SIDES, with existing methods shows that SIDES outperforms existing algorithms in terms of sensitivity and specificity of identified clusters with respect to available GO annotations.
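To give a feel for the G(n, p) reference model, the sketch below computes a much simpler quantity than the one analysed in the paper: the probability that one fixed set of k nodes contains at least m edges when every pair is linked independently with probability p. The paper's measure concerns dense subgraphs found anywhere in the graph, which this does not capture; the function name and parameters are hypothetical.

```python
from math import comb


def fixed_subset_tail(k, m, p):
    """P(at least m edges among k fixed nodes) under G(n, p).

    Each of the C(k, 2) node pairs is an independent Bernoulli(p) trial,
    so the edge count within the set is binomially distributed.
    """
    pairs = comb(k, 2)
    return sum(
        comb(pairs, e) * p**e * (1 - p) ** (pairs - e)
        for e in range(m, pairs + 1)
    )


# Probability that three fixed nodes form a triangle when p = 0.5.
triangle_prob = fixed_subset_tail(3, 3, 0.5)
```

Because a clustering algorithm searches over many candidate sets, the tail for a single fixed set understates how often dense patterns appear by chance, which is the gap that an analytical treatment of the largest dense subgraph is meant to close.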
Affiliation(s)
- Mehmet Koyutürk
- Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA.
|
89
|
Li W, Liu Y, Huang HC, Peng Y, Lin Y, Ng WK, Ong KL. Dynamical systems for discovering protein complexes and functional modules from biological networks. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2007; 4:233-50. [PMID: 17473317 DOI: 10.1109/tcbb.2007.070210] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Indexed: 05/15/2023]
Abstract
Recent advances in high throughput experiments and annotations via published literature have provided a wealth of interaction maps of several biomolecular networks, including metabolic, protein-protein, and protein-DNA interaction networks. The architecture of these molecular networks reveals important principles of cellular organization and molecular functions. Analyzing such networks, i.e., discovering dense regions in the network, is an important way to identify protein complexes and functional modules. This task has been formulated as the problem of finding heavy subgraphs, the Heaviest k-Subgraph Problem (k-HSP), which itself is NP-hard. However, any method based on the k-HSP requires the parameter k and an exact solution of k-HSP may still end up as a "spurious" heavy subgraph, thus reducing its practicability in analyzing large scale biological networks. We proposed a new formulation, called the rank-HSP, and two dynamical systems to approximate its results. In addition, a novel metric, called the Standard deviation and Mean Ratio (SMR), is proposed for use in "spurious" heavy subgraphs to automate the discovery by setting a fixed threshold. Empirical results on both the simulated graphs and biological networks have demonstrated the efficiency and effectiveness of our proposal.
Affiliation(s)
- Wenyuan Li
- Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083, USA.
|
90
|
Belacel N, Wang Q, Cuperlovic-Culf M. Clustering methods for microarray gene expression data. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY 2007; 10:507-31. [PMID: 17233561 DOI: 10.1089/omi.2006.10.507] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.4] [Indexed: 11/13/2022]
Abstract
Within the field of genomics, microarray technologies have become a powerful technique for simultaneously monitoring the expression patterns of thousands of genes under different sets of conditions. A main task now is to propose analytical methods to identify groups of genes that manifest similar expression patterns and are activated by similar conditions. The corresponding analysis problem is to cluster multi-condition gene expression data. The purpose of this paper is to present a general view of clustering techniques used in microarray gene expression data analysis.
Affiliation(s)
- Nabil Belacel
- National Research Council Canada, Institute for Information Technology, Scientific Park, Moncton, New Brunswick, Canada.
|
91
|
Sharan R, Ulitsky I, Shamir R. Network-based prediction of protein function. Mol Syst Biol 2007; 3:88. [PMID: 17353930 PMCID: PMC1847944 DOI: 10.1038/msb4100129] [Citation(s) in RCA: 620] [Impact Index Per Article: 36.5] [Received: 09/20/2006] [Accepted: 01/09/2007] [Indexed: 12/22/2022] Open
Abstract
Functional annotation of proteins is a fundamental problem in the post-genomic era. The recent availability of protein interaction networks for many model species has spurred on the development of computational methods for interpreting such data in order to elucidate protein function. In this review, we describe the current computational approaches for the task, including direct methods, which propagate functional information through the network, and module-assisted methods, which infer functional modules within the network and use those for the annotation task. Although a broad variety of interesting approaches has been developed, further progress in the field will depend on systematic evaluation of the methods and their dissemination in the biological community.
Affiliation(s)
- Roded Sharan
- School of Computer Science, Tel Aviv University, Tel Aviv, Israel
| | - Igor Ulitsky
- School of Computer Science, Tel Aviv University, Tel Aviv, Israel
| | - Ron Shamir
- School of Computer Science, Tel Aviv University, Tel Aviv, Israel
- School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel. Tel.: +972 3 6405383; Fax: +972 3 6405384;
|
92
|
Di Giacomo E, Didimo W, Grilli L, Liotta G. Graph visualization techniques for web clustering engines. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 2007; 13:294-304. [PMID: 17218746 DOI: 10.1109/tvcg.2007.40] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 05/13/2023]
Abstract
One of the most challenging issues in mining information from the World Wide Web is the design of systems that present the data to the end user by clustering them into meaningful semantic categories. We show that the analysis of the results of a clustering engine can significantly take advantage of enhanced graph drawing and visualization techniques. We propose a graph-based user interface for Web clustering engines that makes it possible for the user to explore and visualize the different semantic categories and their relationships at the desired level of detail.
Affiliation(s)
- Emilio Di Giacomo
- Dipartimento di Ingegneria Elettronica e dell'Informazione, Università degli Studi di Perugia, Italy.
|
93
|
Hwang W, Cho YR, Zhang A, Ramanathan M. A novel functional module detection algorithm for protein-protein interaction networks. Algorithms Mol Biol 2006; 1:24. [PMID: 17147822 PMCID: PMC1764415 DOI: 10.1186/1748-7188-1-24] [Citation(s) in RCA: 69] [Impact Index Per Article: 3.8] [Received: 07/24/2006] [Accepted: 12/05/2006] [Indexed: 11/29/2022] Open
Abstract
Background The sparse connectivity of protein-protein interaction data sets makes identification of functional modules challenging. The purpose of this study is to critically evaluate a novel clustering technique for clustering and detecting functional modules in protein-protein interaction networks, termed STM. Results STM selects representative proteins for each cluster and iteratively refines clusters based on a combination of the signal transduced and graph topology. STM is found to be effective at detecting clusters with a diverse range of interaction structures that are significant on measures of biological relevance. The STM approach is compared to six competing approaches including the maximum clique, quasi-clique, minimum cut, betweenness cut and Markov Clustering (MCL) algorithms. The clusters obtained by each technique are compared for enrichment of biological function. STM generates larger clusters and the clusters identified have p-values that are approximately 125-fold better than the other methods on biological function. An important strength of STM is that the percentage of proteins that are discarded to create clusters is much lower than the other approaches. Conclusion STM outperforms competing approaches and is capable of effectively detecting both densely and sparsely connected, biologically relevant functional modules with fewer discards.
Affiliation(s)
- Woochang Hwang
- Department of Computer Science and Engineering, State University of New York at Buffalo, USA
- Young-Rae Cho
- Department of Computer Science and Engineering, State University of New York at Buffalo, USA
- Aidong Zhang
- Department of Computer Science and Engineering, State University of New York at Buffalo, USA
- Murali Ramanathan
- Department of Pharmaceutical Sciences, State University of New York at Buffalo, USA
|
94
|
Chen X, Chen M, Ning K. BNArray: an R package for constructing gene regulatory networks from microarray data by using Bayesian network. Bioinformatics 2006; 22:2952-4. [PMID: 17005537 DOI: 10.1093/bioinformatics/btl491] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.4] [Indexed: 11/13/2022] Open
Abstract
BNArray is a systemized tool developed in R. It facilitates the construction of gene regulatory networks from DNA microarray data using Bayesian networks. Significant sub-modules of regulatory networks with high confidence are reconstructed using our extended sub-network mining algorithm for directed graphs. BNArray can handle microarray datasets with missing data. To evaluate the statistical features of the generated Bayesian networks, re-sampling procedures are used to yield collections of candidate 1st-order network sets for mining dense coherent sub-networks. Availability: The R package and the supplementary documentation are available at http://www.cls.zju.edu.cn/binfo/BNArray/.
Affiliation(s)
- Xiaohui Chen
- Department of Bioinformatics, Zhejiang University, Hangzhou 310058, China
|
95
|
Jiang D, Pei J, Ramanathan M, Lin C, Tang C, Zhang A. Mining gene–sample–time microarray data: a coherent gene cluster discovery approach. Knowl Inf Syst 2006. [DOI: 10.1007/s10115-006-0031-9] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 11/29/2022]
|
96
|
|
97
|
Haynes T, Knisley D, Seier E, Zou Y. A quantitative analysis of secondary RNA structure using domination based parameters on trees. BMC Bioinformatics 2006; 7:108. [PMID: 16515683 PMCID: PMC1420337 DOI: 10.1186/1471-2105-7-108] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.2] [Received: 10/20/2005] [Accepted: 03/03/2006] [Indexed: 11/30/2022] Open
Abstract
Background: It has become increasingly apparent that a comprehensive database of RNA motifs is essential in order to achieve new goals in genomic and proteomic research. Secondary RNA structures have frequently been represented by various modeling methods as graph-theoretic trees. Using graph theory as a modeling tool allows the vast resources of graphical invariants to be utilized to numerically identify secondary RNA motifs. The domination number of a graph is a graphical invariant that is sensitive to even a slight change in the structure of a tree. The invariants selected in this study are variations of the domination number of a graph. These graphical invariants are partitioned into two classes, and we define two parameters based on each of these classes. These parameters are calculated for all small order trees, and a statistical analysis of the resulting data is conducted to determine if the values of these parameters can be utilized to identify which trees of orders seven and eight are RNA-like in structure. Results: The statistical analysis shows that the domination based parameters correctly distinguish between the trees that represent native structures and those that are not likely candidates to represent RNA. Some of the trees previously identified as candidate structures are found to be "very" RNA-like, while others are not, thereby refining the space of structures likely to be found as representing secondary RNA structure. Conclusion: Search algorithms are available that mine nucleotide sequence databases. However, the number of motifs identified can be quite large, making a further search for similar motifs computationally difficult. Much of the work in the bioinformatics arena is toward the development of better algorithms to address the computational problem. This work, on the other hand, uses mathematical descriptors to more clearly characterize the RNA motifs and thereby reduce the corresponding search space. These preliminary findings demonstrate that graph-theoretic quantifiers utilized in fields such as computer network design hold significant promise as an added tool for genomics and proteomics.
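As an illustration of the invariant this abstract builds on, the following is a minimal sketch of computing the plain domination number of a small tree by exhaustive search. It is not the authors' code and does not reproduce the two domination-based parameters they define; the function name `domination_number` and the edge-list representation are assumptions made for the example. Brute force is only reasonable for the small orders (seven and eight vertices) mentioned in the abstract.

```python
from itertools import combinations

def domination_number(n, edges):
    """Return the size of a smallest dominating set of a graph on vertices 0..n-1.

    A set S is dominating if every vertex is in S or adjacent to a vertex in S.
    Hypothetical helper for illustration; exhaustive search, small graphs only.
    """
    # Closed neighbourhood of each vertex: the vertex itself plus its neighbours.
    closed = [{v} for v in range(n)]
    for u, v in edges:
        closed[u].add(v)
        closed[v].add(u)

    # Try candidate sets in order of increasing size; the first hit is minimal.
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            covered = set()
            for v in subset:
                covered |= closed[v]
            if len(covered) == n:
                return k
    return 0  # graph with no vertices

# Two trees of order seven with very different shapes.
path7 = [(i, i + 1) for i in range(6)]   # a path
star7 = [(0, i) for i in range(1, 7)]    # a star centred on vertex 0
print(domination_number(7, path7), domination_number(7, star7))
```

The path and the star have the same number of vertices and edges but different domination numbers, which is the kind of structural sensitivity the abstract refers to.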
Affiliation(s)
- Teresa Haynes
- Mathematics and Statistics Department, Box 70663, East Tennessee State University, Johnson City, TN, USA
- Debra Knisley
- Mathematics and Statistics Department, Box 70663, East Tennessee State University, Johnson City, TN, USA
- Edith Seier
- Mathematics and Statistics Department, Box 70663, East Tennessee State University, Johnson City, TN, USA
- Yue Zou
- Department of Biochemistry and Molecular Biology, Quillen College of Medicine, East Tennessee State University, Johnson City, TN, USA
|
98
|
Sauleau EA, Paumier JP, Buemi A. Medical record linkage in health information systems by approximate string matching and clustering. BMC Med Inform Decis Mak 2005; 5:32. [PMID: 16219102 PMCID: PMC1274322 DOI: 10.1186/1472-6947-5-32] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.5] [Received: 05/06/2005] [Accepted: 10/11/2005] [Indexed: 11/21/2022] Open
Abstract
Background: Multiplication of data sources within heterogeneous healthcare information systems always results in redundant information, split among multiple databases. Our objective is to detect exact and approximate duplicates within identity records, in order to attain a better quality of information and to permit cross-linkage among stand-alone and clustered databases. Furthermore, we need to assist human decision making by computing a value reflecting identity proximity. Methods: The proposed method has three steps. The first step is to standardise and index elementary identity fields, using blocking variables, in order to speed up information analysis. The second is to match similar record pairs, relying on a global similarity value taken from the Porter-Jaro-Winkler algorithm. The third is to create clusters of coherent related records, using graph drawing, agglomerative clustering methods and partitioning methods. Results: The batch analysis of 300,000 "supposedly" distinct identities isolates 240,000 true unique records, 24,000 duplicates (clusters composed of 2 records) and 3,000 clusters whose size is greater than or equal to 3 records. Conclusion: Duplicate-free databases, used in conjunction with relevant indexes and similarity values, allow immediate (i.e. real-time) proximity detection when inserting a new identity.
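The three steps described in this abstract (standardise and block, score pairs, cluster) can be sketched roughly as follows. This is an illustrative outline, not the authors' implementation: the record fields, the blocking key (first letter of surname plus birth year), the 0.85 threshold and the function name `link_records` are all assumptions, and Python's `difflib.SequenceMatcher` ratio is swapped in for the Porter-Jaro-Winkler similarity named in the abstract. Clustering here is a simple connected-components pass with union-find rather than the agglomerative and partitioning methods the paper mentions.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def link_records(records, threshold=0.85):
    """Group likely duplicate identities; returns clusters as lists of record indices.

    records: list of (surname, given_name, birth_year) tuples (assumed layout).
    """
    # Step 1: standardise fields and index records by a blocking key,
    # so that only records sharing a key are ever compared.
    std = [(s.strip().lower(), g.strip().lower(), y) for s, g, y in records]
    blocks = defaultdict(list)
    for idx, (surname, _given, year) in enumerate(std):
        blocks[(surname[:1], year)].append(idx)

    # Union-find structure used to merge matched records into clusters.
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Step 2: score candidate pairs inside each block with a string similarity
    # (stand-in for Jaro-Winkler) and link pairs above the threshold.
    for members in blocks.values():
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                i, j = members[a], members[b]
                name_i = std[i][0] + " " + std[i][1]
                name_j = std[j][0] + " " + std[j][1]
                if SequenceMatcher(None, name_i, name_j).ratio() >= threshold:
                    parent[find(i)] = find(j)

    # Step 3: collect connected components as clusters of related records.
    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())

identities = [
    ("Dupont", "Jean", 1970),
    ("dupont ", "Jean", 1970),   # exact duplicate after standardisation
    ("Dupond", "Jean", 1970),    # approximate duplicate
    ("Martin", "Luc", 1982),
]
print(link_records(identities))
```

Blocking keeps the number of pairwise comparisons manageable on large batches, at the cost of missing duplicates whose blocking fields themselves differ; the choice of key is therefore a trade-off between speed and recall.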
Affiliation(s)
- Erik A Sauleau
- Service des études et applications de l'information médicale (SEAIM), Hospital, 87 Ave d'Altkirch, F68051 Mulhouse, France
- Jean-Philippe Paumier
- Service des études et applications de l'information médicale (SEAIM), Hospital, 87 Ave d'Altkirch, F68051 Mulhouse, France
- Antoine Buemi
- Service des études et applications de l'information médicale (SEAIM), Hospital, 87 Ave d'Altkirch, F68051 Mulhouse, France
|
99
|
Abstract
Data analysis plays an indispensable role in understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several tightly related topics, such as proximity measures and cluster validation, are also discussed.
Affiliation(s)
- Rui Xu
- Department of Electrical and Computer Engineering, University of Missouri-Rolla, Rolla, MO 65409, USA.
|
100
|
Pan HY, Zhu J, Han DF. Clustering gene expression data based on predicted differential effects of GV interaction. GENOMICS, PROTEOMICS & BIOINFORMATICS 2005; 3:36-41. [PMID: 16144520 PMCID: PMC5172465 DOI: 10.1016/s1672-0229(05)03005-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/04/2022]
Abstract
Microarray has become a popular biotechnology in biological and medical research. However, systematic and stochastic variabilities in microarray data are expected and unavoidable, so the raw measurements carry inherent "noise" within microarray experiments. Currently, logarithmic ratios are usually analyzed directly by various clustering methods, which may introduce biased interpretation in identifying groups of genes or samples. In this paper, a statistical method based on mixed model approaches is proposed for microarray data cluster analysis. The underlying rationale of this method is to partition the observed total gene expression level into variations caused by different factors using an ANOVA model, and to predict the differential effects of GV (gene by variety) interaction using the adjusted unbiased prediction (AUP) method. The predicted GV interaction effects can then be used as the inputs of cluster analysis. We illustrate the application of our method with a gene expression dataset and demonstrate the utility of our approach using an external validation.
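To make the idea of clustering on interaction effects rather than raw ratios concrete, here is a simplified sketch. It assumes a balanced gene-by-variety table of mean log expression values and uses the ordinary fixed-effects two-way ANOVA decomposition (cell value minus gene mean minus variety mean plus grand mean). This is a stand-in only: it is not the mixed-model AUP predictor described in the abstract, and the function name `interaction_effects` is hypothetical.

```python
def interaction_effects(table):
    """Estimate gene-by-variety interaction effects from a balanced table.

    table[i][j] is the mean log expression of gene i in variety j (assumed layout).
    Returns a table of the same shape whose rows and columns each sum to zero.
    """
    n_genes = len(table)
    n_vars = len(table[0])

    grand = sum(sum(row) for row in table) / (n_genes * n_vars)
    gene_means = [sum(row) / n_vars for row in table]
    var_means = [sum(table[i][j] for i in range(n_genes)) / n_genes
                 for j in range(n_vars)]

    # Remove the gene main effect and the variety main effect; what remains
    # is the part of each cell not explained by an additive model.
    return [[table[i][j] - gene_means[i] - var_means[j] + grand
             for j in range(n_vars)]
            for i in range(n_genes)]

# Gene 0 and gene 1 respond additively; gene 2 responds differently in variety 1.
expr = [
    [1.0, 2.0],
    [3.0, 4.0],
    [2.0, 6.0],
]
for row in interaction_effects(expr):
    print([round(x, 3) for x in row])
```

The resulting rows, rather than the raw values, would then be passed to whatever clustering method is in use, so that genes are grouped by how their response differs across varieties instead of by overall expression level.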
Affiliation(s)
- Hai-Yan Pan
- Institute of Bioinformatics, Zhejiang University, Hangzhou 310029, China.
|