1
Li R, Yi H, Ma S. A Selective Review of Network Analysis Methods for Gene Expression Data. Methods Mol Biol 2025; 2880:293-307. [PMID: 39900765 DOI: 10.1007/978-1-0716-4276-4_14] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/05/2025]
Abstract
With the development of high-throughput profiling techniques, gene expressions have drawn significant attention due to their important biological implications, widespread data availability, and promising biological findings. The complex interactions and regulations among genes naturally lead to a network structure, which can provide a global view of molecular mechanisms and biological processes. This chapter provides a selective overview of constructing gene expression networks and utilizing them in downstream analysis. It also includes a demonstrating example.
Affiliation(s)
- Rong Li
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
- Huangdi Yi
  - Servier Pharmaceuticals, Boston, MA, USA
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
2
Lu X, Chen G, Li J, Hu X, Sun F. MAGCN: A Multiple Attention Graph Convolution Networks for Predicting Synthetic Lethality. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:2681-2689. [PMID: 36374879 DOI: 10.1109/tcbb.2022.3221736] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/16/2023]
Abstract
Synthetic lethality (SL) is a promising strategy for cancer therapy and drug discovery. Computational approaches to identifying synthetic lethal genes have become an effective complement to wet-lab experiments, which are time-consuming and costly. Graph convolutional networks (GCNs) have been applied to this prediction task because they are good at capturing neighborhood dependency in a graph. However, a mechanism for aggregating complementary neighborhood information from multiple heterogeneous graphs is still lacking. Here, we propose Multiple Attention Graph Convolution Networks for predicting synthetic lethality (MAGCN). First, we obtain functional similarity features and topological structure features of genes from different data sources, such as Gene Ontology data and protein-protein interaction data, respectively. Then, a graph convolutional network is used to accumulate knowledge from neighboring nodes according to synthetic lethal associations. Meanwhile, we propose a multiple-graph attention model and construct a multiple-graph attention network that learns the contribution of each graph and generates an embedded representation by aggregating these graphs. Finally, the generated feature matrix is decoded to predict potential synthetic lethal interactions. Experimental results show that MAGCN is superior to the baseline methods. A case study demonstrates the ability of MAGCN to predict human SL gene pairs.
3
Xie X, Du H, Chen J, Aslam M, Wang W, Chen W, Li P, Du H, Liu X. Global Profiling of N-Glycoproteins and N-Glycans in the Diatom Phaeodactylum tricornutum. Frontiers in Plant Science 2021; 12:779307. [PMID: 34925422 PMCID: PMC8678454 DOI: 10.3389/fpls.2021.779307] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 09/18/2021] [Accepted: 11/05/2021] [Indexed: 05/04/2023]
Abstract
N-glycosylation is an important posttranslational modification in all eukaryotes, but little is known about the N-glycoproteins and N-glycans in microalgae. Here, N-glycoproteomic and N-glycomic approaches were used to unveil the N-glycoproteins and N-glycans in the model diatom Phaeodactylum tricornutum. In total, 863 different N-glycopeptides corresponding to 639 N-glycoproteins were identified from P. tricornutum. These N-glycoproteins participated in a variety of important metabolic pathways in P. tricornutum. Twelve proteins participating in the N-glycosylation pathway were identified as N-glycoproteins, indicating that the N-glycosylation of these proteins might be important for the protein N-glycosylation pathway. Subsequently, 69 N-glycans corresponding to 59 N-glycoproteins were identified and classified into high mannose and hybrid type N-glycans. High mannose type N-glycans contained four different classes, such as Man-5, Man-7, Man-9, and Man-10 with a terminal glucose residue. Hybrid type N-glycans harbored Man-4 with a terminal GlcNAc residue. The identification of N-glycosylation on nascent proteins expanded our understanding of this modification at an N-glycoproteomic scale, and the analysis of N-glycan structures updated the N-glycan database for microalgae. The results of this study facilitate the elucidation of the precise functions of these N-glycoproteins and will benefit the future engineering of the microalga to produce functional humanized biopharmaceutical N-glycoproteins for clinical therapeutics.
Affiliation(s)
- Xihui Xie
  - Guangdong Provincial Key Laboratory of Marine Biotechnology, STU-UNIVPM Joint Algal Research Center, College of Sciences, Institute of Marine Sciences, Shantou University, Shantou, China
  - Southern Marine Science and Engineering Guangdong Laboratory, Guangzhou, China
- Hong Du
  - Guangdong Provincial Key Laboratory of Marine Biotechnology, STU-UNIVPM Joint Algal Research Center, College of Sciences, Institute of Marine Sciences, Shantou University, Shantou, China
  - Southern Marine Science and Engineering Guangdong Laboratory, Guangzhou, China
- Jichen Chen
  - Guangdong Provincial Key Laboratory of Marine Biotechnology, STU-UNIVPM Joint Algal Research Center, College of Sciences, Institute of Marine Sciences, Shantou University, Shantou, China
  - Southern Marine Science and Engineering Guangdong Laboratory, Guangzhou, China
- Muhammad Aslam
  - Guangdong Provincial Key Laboratory of Marine Biotechnology, STU-UNIVPM Joint Algal Research Center, College of Sciences, Institute of Marine Sciences, Shantou University, Shantou, China
  - Southern Marine Science and Engineering Guangdong Laboratory, Guangzhou, China
  - Faculty of Marine Sciences, Lasbela University of Agriculture, Water & Marine Sciences, Uthal, Pakistan
- Wanna Wang
  - Guangdong Provincial Key Laboratory of Marine Biotechnology, STU-UNIVPM Joint Algal Research Center, College of Sciences, Institute of Marine Sciences, Shantou University, Shantou, China
  - Southern Marine Science and Engineering Guangdong Laboratory, Guangzhou, China
- Weizhou Chen
  - Guangdong Provincial Key Laboratory of Marine Biotechnology, STU-UNIVPM Joint Algal Research Center, College of Sciences, Institute of Marine Sciences, Shantou University, Shantou, China
  - Southern Marine Science and Engineering Guangdong Laboratory, Guangzhou, China
- Ping Li
  - Guangdong Provincial Key Laboratory of Marine Biotechnology, STU-UNIVPM Joint Algal Research Center, College of Sciences, Institute of Marine Sciences, Shantou University, Shantou, China
- Hua Du
  - Guangdong Provincial Key Laboratory of Marine Biotechnology, STU-UNIVPM Joint Algal Research Center, College of Sciences, Institute of Marine Sciences, Shantou University, Shantou, China
  - Southern Marine Science and Engineering Guangdong Laboratory, Guangzhou, China
- Xiaojuan Liu
  - Guangdong Provincial Key Laboratory of Marine Biotechnology, STU-UNIVPM Joint Algal Research Center, College of Sciences, Institute of Marine Sciences, Shantou University, Shantou, China
  - Southern Marine Science and Engineering Guangdong Laboratory, Guangzhou, China
4
Ou-Yang L, Lu F, Zhang ZC, Wu M. Matrix factorization for biomedical link prediction and scRNA-seq data imputation: an empirical survey. Brief Bioinform 2021; 23:6447434. [PMID: 34864871 DOI: 10.1093/bib/bbab479] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 08/09/2021] [Revised: 09/25/2021] [Accepted: 10/18/2021] [Indexed: 02/02/2023] Open
Abstract
Advances in high-throughput experimental technologies have promoted the accumulation of vast amounts of biomedical data. Biomedical link prediction and single-cell RNA-sequencing (scRNA-seq) data imputation are two essential tasks in biomedical data analysis, which can facilitate various downstream studies and provide insights into the mechanisms of complex diseases. Both tasks can be transformed into matrix completion problems. For a variety of matrix completion tasks, matrix factorization has shown promising performance. However, the sparseness and high dimensionality of biomedical networks and scRNA-seq data have raised new challenges. To resolve these issues, various matrix factorization methods have emerged recently. In this paper, we present a comprehensive review of such matrix factorization methods and their usage in biomedical link prediction and scRNA-seq data imputation. Moreover, we select representative matrix factorization methods and conduct a systematic empirical comparison on 15 real data sets to evaluate their performance under different scenarios. By summarizing the experimental results, we provide general guidelines for selecting matrix factorization methods for different biomedical matrix completion tasks and point out some future directions to further improve the performance of biomedical link prediction and scRNA-seq data imputation.
Affiliation(s)
- Le Ou-Yang
  - Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen Key Laboratory of Media Security, and Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), College of Electronics and Information Engineering, Shenzhen University, Shenzhen, 518060, China
  - Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen, 518172, China
- Fan Lu
  - Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen Key Laboratory of Media Security, and Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), College of Electronics and Information Engineering, Shenzhen University, Shenzhen, 518060, China
- Zi-Chao Zhang
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, 200433, China
- Min Wu
  - Institute for Infocomm Research (I2R), A*STAR, 138632, Singapore
5
Zambrana C, Xenos A, Böttcher R, Malod-Dognin N, Pržulj N. Network neighbors of viral targets and differentially expressed genes in COVID-19 are drug target candidates. Sci Rep 2021; 11:18985. [PMID: 34556735 PMCID: PMC8460804 DOI: 10.1038/s41598-021-98289-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/05/2021] [Accepted: 08/23/2021] [Indexed: 12/12/2022] Open
Abstract
The COVID-19 pandemic is raging. It revealed the importance of rapid scientific advancement towards understanding and treating new diseases. To address this challenge, we adapt an explainable artificial intelligence algorithm for data fusion and utilize it on new omics data on viral-host interactions, human protein interactions, and drugs to better understand SARS-CoV-2 infection mechanisms and predict new drug-target interactions for COVID-19. We discover that in the human interactome, the human proteins targeted by SARS-CoV-2 proteins and the genes that are differentially expressed after the infection have common neighbors central in the interactome that may be key to the disease mechanisms. We uncover 185 new drug-target interactions targeting 49 of these key genes and suggest re-purposing of 149 FDA-approved drugs, including drugs targeting VEGF and nitric oxide signaling, whose pathways coincide with the observed COVID-19 symptoms. Our integrative methodology is universal and can enable insight into this and other serious diseases.
Affiliation(s)
- Noël Malod-Dognin
  - Barcelona Supercomputing Center, Barcelona, Spain
  - Department of Computer Science, University College London, London, WC1E 6BT, UK
- Nataša Pržulj
  - Barcelona Supercomputing Center, Barcelona, Spain
  - Department of Computer Science, University College London, London, WC1E 6BT, UK
  - ICREA, Pg. Lluís Companys 23, Barcelona, Spain
6
Hong Z, Liu J, Chen Y. An interpretable machine learning method for homo-trimeric protein interface residue-residue interaction prediction. Biophys Chem 2021; 278:106666. [PMID: 34418678 DOI: 10.1016/j.bpc.2021.106666] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/25/2021] [Revised: 08/09/2021] [Accepted: 08/09/2021] [Indexed: 12/29/2022]
Abstract
Protein-protein interactions play an important role in life activities. A more fine-grained analysis, at the residue and atom level, helps us better understand the mechanisms of inter-protein interaction and supports drug design. Ongoing work in the field includes developing efficient computational methods that reduce trial and error and assist experimental researchers in determining complex structures. Trimeric protein interfaces, especially those of homotrimers, have rarely been studied. In this paper, we propose an interpretable machine learning method for predicting homo-trimeric protein interface residue pairs. Structure, sequence, and physicochemical information are integrated as input features for model training. A graph model represents intra-protein spatial information. Matrix factorization captures interactions between different features. A kernel function is designed to automatically acquire information about the residues adjacent to the target residue pairs. The accuracy reaches 54.5% on an independent test set. Sequence and structure alignments exhibit the model's self-learning ability. Our model indicates the biological significance of the relationship between sequence and structure, and could help reduce trial and error in protein complex determination and protein-protein docking. SIGNIFICANCE: Protein complex structures are significant for understanding protein function and for designing functional proteins. As data have accumulated, computational tools have been developed for protein complex residue contact prediction, one of the most significant steps in complex structure prediction. For homo-trimeric proteins, however, sequence-based deep learning predictors are infeasible for homologous sequences, and their black-box nature prevents us from understanding each step of the operation. We therefore propose an interpretable machine learning method for homo-trimeric protein interface residue-residue interaction prediction, and the predictor performs well. Our work provides a computational aid for determining homo-trimeric protein interface residue pairs, which will be further verified by wet-lab experiments, and supports downstream work such as protein-protein docking, protein complex structure prediction, and drug design.
Affiliation(s)
- Zhonghua Hong
  - Jiaxing Hospital of Traditional Chinese Medicine, Jiaxing University, Jiaxing 314001, PR China
- Jiale Liu
  - Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, PR China
- Yinggao Chen
  - Shantou Central Hospital, Shantou 515041, PR China
7
Du B, Tang L, Liu L, Zhou W. Predicting LncRNA-Disease Association Based on Generative Adversarial Network. Curr Gene Ther 2021; 22:144-151. [PMID: 33998988 DOI: 10.2174/1566523221666210506131055] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 09/30/2020] [Revised: 02/19/2021] [Accepted: 02/24/2021] [Indexed: 11/22/2022]
Abstract
BACKGROUND Increasing research reveals that long non-coding RNAs (lncRNAs) play an important role in various biological processes of human diseases. Nonetheless, only a handful of lncRNA-disease associations have been experimentally verified. Computational models for lncRNA-disease association prediction provide a preliminary basis for biological experiments and thereby reduce the high cost of wet-lab experiments. OBJECTIVE This study aims to learn the real distribution of lncRNA-disease associations from a limited number of known associations. It proposes a new lncRNA-disease association prediction model, LDA-GAN, based on a generative adversarial network (GAN). METHOD To address the slow convergence, training instability, and inability to handle discrete data of traditional GANs, LDA-GAN uses the Gumbel-softmax technique to construct a differentiable process that simulates discrete sampling. Meanwhile, the generator and the discriminator of LDA-GAN are integrated to establish an overall optimization goal based on a pairwise loss function. RESULTS Experiments on standard datasets demonstrate that LDA-GAN not only achieves high stability and efficiency during adversarial learning but also exploits the semi-supervised learning advantage of the generative adversarial framework for unlabeled data, which further improves the prediction accuracy of lncRNA-disease associations. In addition, case studies show that LDA-GAN can accurately generate potential diseases for several lncRNAs.
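The differentiable stand-in for discrete sampling named in the METHOD part can be sketched as below. This is a generic Gumbel-softmax sampler in plain Python, not the LDA-GAN implementation; the function name `gumbel_softmax`, the temperature value, and the example scores are assumptions made for illustration.

```python
import math
import random


def gumbel_softmax(logits, tau=0.5, rng=None):
    """Return a relaxed one-hot sample over the categories scored by `logits`."""
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(u)) with u ~ Uniform(0, 1).
        # u is clamped away from 0 so the inner log is always defined.
        u = max(rng.random(), 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical generator scores for four candidate diseases of one lncRNA.
sample = gumbel_softmax([2.0, 0.5, -1.0, 0.1], tau=0.5, rng=random.Random(0))
```

A lower `tau` pushes the output closer to a one-hot vector, while a higher `tau` makes it smoother; in a trained model the same expression would be written with an automatic-differentiation library so that gradients can flow through the sample.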
Affiliation(s)
- Biao Du
  - School of Information, Yunnan Normal University, Kunming, China
- Lin Tang
  - Key Laboratory of Educational Informatization for Nationalities Ministry of Education, Yunnan Normal University, Kunming, China
- Lin Liu
  - School of Information, Yunnan Normal University, Kunming, China
- Wei Zhou
  - School of Software, Yunnan University, Kunming, China
8
Pei F, Shi Q, Zhang H, Bahar I. Predicting Protein-Protein Interactions Using Symmetric Logistic Matrix Factorization. J Chem Inf Model 2021; 61:1670-1682. [PMID: 33831302 DOI: 10.1021/acs.jcim.1c00173] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 12/18/2022]
Abstract
Accurate assessment of protein-protein interactions (PPIs) is critical to deciphering disease mechanisms and developing novel drugs, and with rapidly growing PPI data, the need for more efficient predictive methods is emerging. We propose here a symmetric logistic matrix factorization (symLMF)-based approach to predict PPIs, especially useful for large PPI networks. Benchmarked against two widely used datasets (Saccharomyces cerevisiae and Homo sapiens benchmarks) and their extended versions, the symLMF-based method proves to outperform most of the state-of-the-art data-driven methods applied to human PPIs, and it shows a performance comparable to those of deep learning methods despite its conceptual and technical simplicity and efficiency. Tests performed on humans, yeast, and tissue (brain and liver)- and disease (neurodegenerative and metabolic disorders)-specific datasets further demonstrate the high capability to capture the hidden interactions. Notably, many "de novo predictions" made by symLMF are verified to exist in PPI databases other than those used for training/testing the method, indicating that the method could be of broad utility as a simple, yet efficient and accurate, tool applicable to PPI datasets.
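The general idea behind a symmetric logistic matrix factorization can be sketched as follows. The abstract names the model but does not give its objective, so this sketch assumes the simplest form: one latent vector per protein, an interaction probability of sigmoid(u_i · u_j), and gradient ascent on an L2-regularized Bernoulli log-likelihood. The function names, hyperparameters, and the six-protein toy network are all hypothetical, and the published symLMF method may differ in its weighting, regularization, and training details.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_sym_lmf(n, edges, rank=2, lr=0.1, reg=0.01, epochs=500, seed=0):
    """Fit one latent vector per protein so sigmoid(u_i . u_j) matches the edges."""
    rng = random.Random(seed)
    U = [[rng.gauss(0.0, 0.1) for _ in range(rank)] for _ in range(n)]
    linked = {frozenset(e) for e in edges}
    for _ in range(epochs):
        grads = [[0.0] * rank for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                # Unobserved pairs are treated as negatives in this toy setup.
                y = 1.0 if frozenset((i, j)) in linked else 0.0
                p = sigmoid(sum(a * b for a, b in zip(U[i], U[j])))
                for k in range(rank):
                    grads[i][k] += (y - p) * U[j][k]
        # Full-batch gradient ascent step with L2 shrinkage.
        for i in range(n):
            for k in range(rank):
                U[i][k] += lr * (grads[i][k] - reg * U[i][k])
    return U


def predict(U, i, j):
    """Predicted interaction probability; symmetric in i and j by construction."""
    return sigmoid(sum(a * b for a, b in zip(U[i], U[j])))


# Toy network: two fully connected groups of three proteins, no cross links.
toy_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]
U = train_sym_lmf(6, toy_edges)
```

Because both proteins share the same factor matrix, the score for (i, j) always equals the score for (j, i), which is the property that distinguishes this setup from a general two-sided factorization. In a real evaluation, candidate pairs would be held out of training rather than scored after being fitted.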
Affiliation(s)
- Qingya Shi
  - School of Medicine, Tsinghua University, Beijing 100084, China
9
Li P, Chen C, Li P, Dong Y. A comprehensive examination of the lysine acetylation targets in paper mulberry based on proteomics analyses. PLoS One 2021; 16:e0240947. [PMID: 33705403 PMCID: PMC7951917 DOI: 10.1371/journal.pone.0240947] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 10/05/2020] [Accepted: 02/12/2021] [Indexed: 11/19/2022] Open
Abstract
Rocky desertification is a bottleneck that reduces ecological and environmental security in karst areas. Paper mulberry, a unique deciduous tree, shows good performance in rocky desertification areas. Its resistance mechanisms are therefore of high interest. In this study, a lysine acetylation proteomics analysis of paper mulberry seedling leaves was conducted in combination with the purification of acetylated protein by high-precision nano LC-MS/MS. We identified a total of 7130 acetylation sites in 3179 proteins. Analysis of the modified sites showed a predominance of nine motifs. Six residues, the positively charged lysine (K), arginine (R), and histidine (H) together with serine (S), threonine (T), and tyrosine (Y), occurred most frequently at the +1 position; phenylalanine (F) was detected both upstream and downstream of the acetylated lysines; and the sequence logos showed a strong preference for lysine and arginine around acetylated lysines. Functional annotation revealed that the identified enzymes were mainly involved in translation, transcription, ribosomal structure and biological processes, showing that lysine acetylation can regulate various aspects of primary carbon and nitrogen metabolism and secondary metabolism. Acetylated proteins were enriched in the chloroplast, cytoplasm, and nucleus, and many stress response-related proteins were also discovered to be acetylated, including PAL, HSP70, and ERF. HSP70, an important protein involved in plant abiotic and disease stress responses, was identified in paper mulberry, although it is rarely found in woody plants. This may be further examined in research in other plants and could explain the good adaptation of paper mulberry to the karst environment. However, these hypotheses require further verification. Our data can provide a new starting point for the further analysis of the acetylation function in paper mulberry and other plants.
Affiliation(s)
- Ping Li
  - College of Animal Science, Guizhou University, Guiyang, Guizhou, China
- Chao Chen
  - College of Animal Science, Guizhou University, Guiyang, Guizhou, China
- Ping Li
  - Institute of Grassland Research, Sichuan Academy of Grassland Science, Chengdu, Sichuan, China
- Yibo Dong
  - College of Animal Science, Guizhou University, Guiyang, Guizhou, China
10
Cai R, Chen X, Fang Y, Wu M, Hao Y. Dual-dropout graph convolutional network for predicting synthetic lethality in human cancers. Bioinformatics 2021; 36:4458-4465. [PMID: 32221609 DOI: 10.1093/bioinformatics/btaa211] [Citation(s) in RCA: 44] [Impact Index Per Article: 11.0] [Received: 11/18/2019] [Revised: 03/02/2020] [Accepted: 03/25/2020] [Indexed: 12/20/2022] Open
Abstract
MOTIVATION Synthetic lethality (SL) is a promising form of gene interaction for cancer therapy, as it is able to identify specific genes to target at cancer cells without disrupting normal cells. As high-throughput wet-lab settings are often costly and face various challenges, computational approaches have become a practical complement. In particular, predicting SLs can be formulated as a link prediction task on a graph of interacting genes. Although matrix factorization techniques have been widely adopted in link prediction, they focus on mapping genes to latent representations in isolation, without aggregating information from neighboring genes. Graph convolutional networks (GCN) can capture such neighborhood dependency in a graph. However, it is still challenging to apply GCN for SL prediction as SL interactions are extremely sparse, which is more likely to cause overfitting. RESULTS In this article, we propose a novel dual-dropout GCN (DDGCN) for learning more robust gene representations for SL prediction. We employ both coarse-grained node dropout and fine-grained edge dropout to address the issue that standard dropout in vanilla GCN is often inadequate in reducing overfitting on sparse graphs. In particular, coarse-grained node dropout can efficiently and systematically enforce dropout at the node (gene) level, while fine-grained edge dropout can further fine-tune the dropout at the interaction (edge) level. We further present a theoretical framework to justify our model architecture. Finally, we conduct extensive experiments on human SL datasets and the results demonstrate the superior performance of our model in comparison with state-of-the-art methods. AVAILABILITY AND IMPLEMENTATION DDGCN is implemented in Python 3.7, open-source and freely available at https://github.com/CXX1113/Dual-DropoutGCN. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
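The two dropout granularities described in the RESULTS part can be pictured with a small sketch on an adjacency matrix. This is an illustration of the idea only, not the DDGCN code from the linked repository: the function `dual_dropout`, its parameters `p_node` and `p_edge`, and the toy graph are assumptions, and the paper may apply the two dropouts at a different place in the network or rescale the surviving entries.

```python
import random


def dual_dropout(adj, p_node=0.2, p_edge=0.2, rng=None):
    """Return a copy of the symmetric 0/1 matrix `adj` after node and edge dropout."""
    rng = rng or random.Random()
    n = len(adj)
    # Coarse-grained step: a dropped node (gene) loses all of its edges at once.
    kept = [rng.random() >= p_node for _ in range(n)]
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if not (adj[i][j] and kept[i] and kept[j]):
                continue
            # Fine-grained step: each surviving edge is dropped independently.
            if rng.random() >= p_edge:
                out[i][j] = out[j][i] = 1
    return out


# Hypothetical SL graph on four genes.
toy_adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
thinned = dual_dropout(toy_adj, p_node=0.3, p_edge=0.3, rng=random.Random(2))
```

During training, a fresh thinned graph would be drawn at every step and the full graph used at prediction time, as with ordinary dropout.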
Affiliation(s)
- Ruichu Cai
  - School of Computer Science, Guangdong University of Technology, Guangzhou 510006, China
- Xuexin Chen
  - School of Computer Science, Guangdong University of Technology, Guangzhou 510006, China
- Yuan Fang
  - School of Information Systems, Singapore Management University, 178902 Singapore
- Min Wu
  - Institute for Infocomm Research (I2R), A*STAR, 138632 Singapore
- Yuexing Hao
  - Computer Science Department, Rutgers University New Brunswick, New Brunswick, NJ 08854, USA
11
Ye J, Li J. First proteomic analysis of the role of lysine acetylation in extensive functions in Solenopsis invicta. PLoS One 2020; 15:e0243787. [PMID: 33326466 PMCID: PMC7743978 DOI: 10.1371/journal.pone.0243787] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 02/14/2020] [Accepted: 11/25/2020] [Indexed: 12/17/2022] Open
Abstract
Lysine acetylation (Kac) plays a critical role in the regulation of many important cellular processes. However, little is known about Kac in Solenopsis invicta, which is among the 100 most dangerous invasive species in the world. Kac in S. invicta was evaluated for the first time in this study. Altogether, 2387 Kac sites were tested in 992 proteins. The prediction of subcellular localization indicated that most identified proteins were located in the cytoplasm, mitochondria, and nucleus. Venom allergens Sol i 2, Sol i 3, and Sol i 4 were found to be located in the extracellular space. The enriched Kac site motifs included Kac H, Kac Y, Kac G, Kac F, Kac T, and Kac W. H, Y, F, and W frequently occurred at the +1 position, whereas G, Y, and T frequently occurred at the -1 position. In the cellular component category, acetylated proteins were enriched in the cytoplasmic part, mitochondrial matrix, and cytosolic ribosome. Furthermore, 25 pathways showed significant enrichment. Interestingly, arginine and proline metabolism, as well as phagosome, which are related to immunity, involved several Kac proteins. Sequence alignment analyses demonstrated that the acetylated lysine residues of V-type proton ATPase subunit G, tubulin alpha chain, and arginine kinase were evolutionarily conserved among different ant species. In the investigation of the interaction network, diverse interactions were modulated by Kac. The results indicated that Kac may play an important role in the sensitization, cellular energy metabolism, immune response, nerve signal transduction, and response to biotic and abiotic stress of S. invicta. It may be useful to confirm the functions of Kac target proteins for the design of specific and effective drugs to prevent and control this dangerous invasive species.
Affiliation(s)
- Jingwen Ye
  - Guangdong Key Laboratory of Animal Conservation and Resource Utilization, Guangdong Public Laboratory of Wild Animal Conservation and Utilization, Institute of Zoology, Guangdong Academy of Science, Guangzhou, Guangdong Province, The People’s Republic of China
- Jun Li
  - Guangdong Key Laboratory of Animal Conservation and Resource Utilization, Guangdong Public Laboratory of Wild Animal Conservation and Utilization, Institute of Zoology, Guangdong Academy of Science, Guangzhou, Guangdong Province, The People’s Republic of China
12
Ata SK, Wu M, Fang Y, Ou-Yang L, Kwoh CK, Li XL. Recent advances in network-based methods for disease gene prediction. Brief Bioinform 2020; 22:6023077. [PMID: 33276376 DOI: 10.1093/bib/bbaa303] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Received: 07/20/2020] [Revised: 09/29/2020] [Accepted: 10/10/2020] [Indexed: 01/28/2023] Open
Abstract
Disease-gene association through genome-wide association study (GWAS) is an arduous task for researchers. Investigating single nucleotide polymorphisms that correlate with specific diseases needs statistical analysis of associations. Considering the huge number of possible mutations, in addition to its high cost, another important drawback of GWAS analysis is the large number of false positives. Thus, researchers search for more evidence to cross-check their results through different sources. To provide the researchers with alternative and complementary low-cost disease-gene association evidence, computational approaches come into play. Since molecular networks are able to capture complex interplay among molecules in diseases, they become one of the most extensively used data for disease-gene association prediction. In this survey, we aim to provide a comprehensive and up-to-date review of network-based methods for disease gene prediction. We also conduct an empirical analysis on 14 state-of-the-art methods. To summarize, we first elucidate the task definition for disease gene prediction. Secondly, we categorize existing network-based efforts into network diffusion methods, traditional machine learning methods with handcrafted graph features and graph representation learning methods. Thirdly, an empirical analysis is conducted to evaluate the performance of the selected methods across seven diseases. We also provide distinguishing findings about the discussed methods based on our empirical analysis. Finally, we highlight potential research directions for future studies on disease gene prediction.
Affiliation(s)
- Sezin Kircali Ata
  - School of Computer Science and Engineering, Nanyang Technological University (NTU)
- Min Wu
  - Institute for Infocomm Research (I2R), A*STAR, Singapore
- Yuan Fang
  - School of Information Systems, Singapore Management University, Singapore
- Le Ou-Yang
  - College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
- Xiao-Li Li
  - Department head and principal scientist at I2R, A*STAR, Singapore
13
Yang L, Han Y, Zhang H, Li W, Dai Y. Prediction of Protein-Protein Interactions with Local Weight-Sharing Mechanism in Deep Learning. BioMed Research International 2020; 2020:5072520. [PMID: 32626745 PMCID: PMC7312734 DOI: 10.1155/2020/5072520] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 12/19/2019] [Revised: 03/04/2020] [Accepted: 05/21/2020] [Indexed: 12/30/2022]
Abstract
Protein-protein interactions (PPIs) are important for almost all cellular processes, including metabolic cycles, DNA transcription and replication, and signaling cascades. The experimental methods for identifying PPIs are time-consuming and expensive. Therefore, it is important to develop computational approaches for predicting PPIs. In this paper, an improved model is proposed to use a machine learning method in the study of protein-protein interactions. Considering the factors that affect the prediction of PPIs, a method of feature extraction and fusion is proposed to improve the variety of the features used in the prediction. In addition, because the input order of the two proteins can affect the result, we propose a "Y-type" Bi-RNN model and train the network using a method that requires both backward and forward training. To limit the additional training time incurred by the extra backward or forward pass, this paper proposes a weight-sharing policy that minimizes the number of parameters in training. The experimental results show that the proposed method can achieve an accuracy of 99.57%, recall of 99.36%, sensitivity of 99.76%, precision of 99.74%, MCC of 99.14%, and AUC of 99.56% on the benchmark dataset.
Affiliation(s)
- Lei Yang
- College of Computer Science and Engineering, Northeastern University, Shenyang, China
- Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education, Northeastern University, China
| | - Yukun Han
- College of Computer Science and Engineering, Northeastern University, Shenyang, China
| | - Huixue Zhang
- College of Computer Science and Engineering, Northeastern University, Shenyang, China
| | - Wenlong Li
- College of Software, Northeastern University, Shenyang, China
| | - Yu Dai
- College of Software, Northeastern University, Shenyang, China
| |
|
14
|
Nicholson DN, Greene CS. Constructing knowledge graphs and their biomedical applications. Comput Struct Biotechnol J 2020; 18:1414-1428. [PMID: 32637040 PMCID: PMC7327409 DOI: 10.1016/j.csbj.2020.05.017] [Citation(s) in RCA: 76] [Impact Index Per Article: 15.2] [Received: 03/12/2020] [Revised: 05/22/2020] [Accepted: 05/23/2020] [Indexed: 12/31/2022] Open
Abstract
Knowledge graphs can support many biomedical applications. These graphs represent biomedical concepts and relationships in the form of nodes and edges. In this review, we discuss how these graphs are constructed and applied with a particular focus on how machine learning approaches are changing these processes. Biomedical knowledge graphs have often been constructed by integrating databases that were populated by experts via manual curation, but we are now seeing a more robust use of automated systems. A number of techniques are used to represent knowledge graphs, but often machine learning methods are used to construct a low-dimensional representation that can support many different applications. This representation is designed to preserve a knowledge graph's local and/or global structure. Additional machine learning methods can be applied to this representation to make predictions within genomic, pharmaceutical, and clinical domains. We frame our discussion first around knowledge graph construction and then around unifying representational learning techniques and unifying applications. Advances in machine learning for biomedicine are creating new opportunities across many domains, and we note potential avenues for future work with knowledge graphs that appear particularly promising.
Affiliation(s)
- David N. Nicholson
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, United States
| | - Casey S. Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Childhood Cancer Data Lab, Alex’s Lemonade Stand Foundation, United States
| |
|
15
|
Liu Y, Wu M, Liu C, Li XL, Zheng J. SL²MF: Predicting Synthetic Lethality in Human Cancers via Logistic Matrix Factorization. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:748-757. [PMID: 30969932 DOI: 10.1109/tcbb.2019.2909908] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Indexed: 06/09/2023]
Abstract
Synthetic lethality (SL) is a promising concept for novel discovery of anti-cancer drug targets. However, wet-lab experiments for detecting SLs are faced with various challenges, such as high cost and low consistency across platforms or cell lines. Therefore, computational prediction methods are needed to address these issues. This paper proposes a novel SL prediction method, named SL²MF, which employs logistic matrix factorization to learn latent representations of genes from the observed SL data. The probability that two genes are likely to form SL is modeled by the linear combination of gene latent vectors. As known SL pairs are more trustworthy than unknown pairs, we design importance weighting schemes to assign higher importance weights for known SL pairs and lower importance weights for unknown pairs in SL²MF. Moreover, we also incorporate biological knowledge about genes from protein-protein interaction (PPI) data and Gene Ontology (GO). In particular, we calculate the similarity between genes based on their GO annotations and topological properties in the PPI network. Extensive experiments on the SL interaction data from the SynLethDB database have been conducted to demonstrate the effectiveness of SL²MF.
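As an illustration of the weighted logistic matrix factorization idea described in this abstract, the following is a minimal sketch and not the authors' implementation. It assumes a symmetric inner-product score between gene latent vectors, a single constant weight for known pairs (`known_weight`, a hypothetical parameter name), plain gradient ascent, and no PPI or GO similarity terms.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_logistic_mf(Y, rank=2, known_weight=5.0, reg=0.1,
                    lr=0.02, n_iter=2000, seed=0):
    """Weighted logistic matrix factorization on a symmetric 0/1 matrix Y.

    Y[i, j] = 1 marks a known SL pair; 0 marks an unknown pair.
    Returns the matrix of predicted pair probabilities.
    All parameter names and defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    U = 0.1 * rng.standard_normal((n, rank))   # gene latent vectors
    # Known pairs get a higher importance weight than unknown pairs.
    W = np.where(Y > 0, known_weight, 1.0)
    np.fill_diagonal(W, 0.0)                   # ignore self-pairs
    for _ in range(n_iter):
        P = sigmoid(U @ U.T)                   # predicted probabilities
        G = W * (Y - P)                        # d(log-likelihood)/d(score)
        # G is symmetric, so the gradient w.r.t. U is 2 * G @ U;
        # the L2 penalty contributes -reg * U.
        U += lr * (2.0 * G @ U - reg * U)
    return sigmoid(U @ U.T)

# Toy example: genes 0-2 and genes 3-5 form two groups of known SL pairs.
Y = np.zeros((6, 6))
for group in ([0, 1, 2], [3, 4, 5]):
    for i in group:
        for j in group:
            if i != j:
                Y[i, j] = 1.0
P = fit_logistic_mf(Y)
print(np.round(P, 2))
```

In this toy setting the fitted probability for a known pair such as (0, 1) ends up higher than for an unobserved cross-group pair such as (0, 3).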
|
16
|
Wang Y, Yu G, Wang J, Fu G, Guo M, Domeniconi C. Weighted matrix factorization on multi-relational data for LncRNA-disease association prediction. Methods 2020; 173:32-43. [PMID: 31226302 DOI: 10.1016/j.ymeth.2019.06.015] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Received: 02/28/2019] [Revised: 06/01/2019] [Accepted: 06/13/2019] [Indexed: 02/07/2023] Open
|
17
|
Tang L, Liang Y, Jin X, Liu L, Zhou W. Hierarchical Extension Based on the Boolean Matrix for LncRNA-Disease Association Prediction. Curr Mol Med 2019; 20:452-460. [PMID: 31746295 DOI: 10.2174/1566524019666191119104212] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/10/2019] [Revised: 10/30/2019] [Accepted: 10/31/2019] [Indexed: 11/22/2022]
Abstract
BACKGROUND Accumulating experimental studies have demonstrated that long non-coding RNAs (LncRNAs) play crucial roles in the occurrence and development of various complex human diseases. Nonetheless, only a small portion of LncRNA-disease associations have been experimentally verified at present. Automatically predicting LncRNA-disease associations based on computational models can save the huge cost of wet-lab experiments. METHODS AND RESULT To develop effective computational models to integrate various heterogeneous biological data for the identification of potential disease-LncRNA associations, we propose a hierarchical extension based on the Boolean matrix for LncRNA-disease association prediction model (HEBLDA). HEBLDA discovers the intrinsic hierarchical correlation based on the property of the Boolean matrix from various relational sources. Then, HEBLDA integrates these hierarchical associated matrices by fusion weights. Finally, HEBLDA uses the hierarchical associated matrix to reconstruct the LncRNA-disease association matrix by hierarchical extending. HEBLDA is able to work for potential diseases or LncRNAs without known association data. In 5-fold cross-validation experiments, HEBLDA obtained an area under the receiver operating characteristic curve (AUC) of 0.8913, improving on previous classical methods. Besides, case studies show that HEBLDA can accurately predict candidate diseases for several LncRNAs. CONCLUSION Based on its ability to discover the richer correlated structure of various data sources, we can anticipate that HEBLDA is a potential method that can obtain more comprehensive association prediction in a broad field.
Affiliation(s)
- Lin Tang
- Key Laboratory of Educational Informatization for Nationalities Ministry of Education, Yunnan Normal University, Kunming, Yunnan, China
| | - Yu Liang
- School of Software, Yunnan University, Kunming, Yunnan, China
| | - Xin Jin
- School of Software, Yunnan University, Kunming, Yunnan, China
| | - Lin Liu
- School of Information, Yunnan Normal University, Kunming, Yunnan, China
| | - Wei Zhou
- School of Software, Yunnan University, Kunming, Yunnan, China
| |
|
18
|
Wani N, Raza K. Integrative approaches to reconstruct regulatory networks from multi-omics data: A review of state-of-the-art methods. Comput Biol Chem 2019; 83:107120. [PMID: 31499298 DOI: 10.1016/j.compbiolchem.2019.107120] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 08/03/2018] [Revised: 02/22/2019] [Accepted: 08/27/2019] [Indexed: 02/06/2023]
Abstract
Data generation using high throughput technologies has led to the accumulation of diverse types of molecular data. These data have different types (discrete, real, string, etc.) and occur in various formats and sizes. Datasets including gene expression, miRNA expression, protein-DNA binding data (ChIP-Seq/ChIP-ChIP), mutation data (copy number variation, single nucleotide polymorphisms), annotations, interactions, and association data are some of the commonly used biological datasets to study various cellular mechanisms of living organisms. Each of them provides a unique, complementary and partly independent view of the genome and hence embed essential information about the regulatory mechanisms of genes and their products. Therefore, integrating these data and inferring regulatory interactions from them offer a system level of biological insight in predicting gene functions and their phenotypic outcomes. To study genome functionality through regulatory networks, different methods have been proposed for collective mining of information from an integrated dataset. We survey here integration methods that reconstruct regulatory networks using state-of-the-art techniques to handle multi-omics (i.e., genomic, transcriptomic, proteomic) and other biological datasets.
Affiliation(s)
- Nisar Wani
- Govt. Degree College Baramulla, J & K, India; Department of Computer Science, Jamia Millia Islamia, New Delhi, India
| | - Khalid Raza
- Department of Computer Science, Jamia Millia Islamia, New Delhi, India.
| |
|
19
|
Guala D, Ogris C, Müller N, Sonnhammer ELL. Genome-wide functional association networks: background, data & state-of-the-art resources. Brief Bioinform 2019; 21:1224-1237. [PMID: 31281921 PMCID: PMC7373183 DOI: 10.1093/bib/bbz064] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 02/05/2019] [Revised: 04/29/2019] [Accepted: 05/04/2019] [Indexed: 02/06/2023] Open
Abstract
The vast amount of experimental data from recent advances in the field of high-throughput biology begs for integration into more complex data structures such as genome-wide functional association networks. Such networks have been used for elucidation of the interplay of intra-cellular molecules to make advances ranging from the basic science understanding of evolutionary processes to the more translational field of precision medicine. The allure of the field has resulted in rapid growth of the number of available network resources, each with unique attributes exploitable to answer different biological questions. Unfortunately, the high volume of network resources makes it impossible for the intended user to select an appropriate tool for their particular research question. The aim of this paper is to provide an overview of the underlying data and representative network resources as well as to mention methods of integration, allowing a customized approach to resource selection. Additionally, this report will provide a primer for researchers venturing into the field of network integration.
Affiliation(s)
- Dimitri Guala
- Science for Life Laboratory, Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Box 1031, 17121 Solna, Sweden
| | - Christoph Ogris
- Computational Cell Maps, Institute of Computational Biology, Helmholtz Center Munich, Ingolstädter Landstr. 1, 85764 Neuherberg, Germany
| | - Nikola Müller
- Computational Cell Maps, Institute of Computational Biology, Helmholtz Center Munich, Ingolstädter Landstr. 1, 85764 Neuherberg, Germany
| | - Erik L L Sonnhammer
- Science for Life Laboratory, Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Box 1031, 17121 Solna, Sweden
| |
|
20
|
Fu G, Wang J, Domeniconi C, Yu G. Matrix factorization-based data fusion for the prediction of lncRNA-disease associations. Bioinformatics 2019; 34:1529-1537. [PMID: 29228285 DOI: 10.1093/bioinformatics/btx794] [Citation(s) in RCA: 134] [Impact Index Per Article: 22.3] [Received: 09/02/2017] [Accepted: 12/05/2017] [Indexed: 12/21/2022] Open
Abstract
Motivation Long non-coding RNAs (lncRNAs) play crucial roles in complex disease diagnosis, prognosis, prevention and treatment, but only a small portion of lncRNA-disease associations have been experimentally verified. Various computational models have been proposed to identify lncRNA-disease associations by integrating heterogeneous data sources. However, existing models generally ignore the intrinsic structure of data sources or treat them as equally relevant, while they may not be. Results To accurately identify lncRNA-disease associations, we propose a Matrix Factorization based LncRNA-Disease Association prediction model (MFLDA in short). MFLDA decomposes data matrices of heterogeneous data sources into low-rank matrices via matrix tri-factorization to explore and exploit their intrinsic and shared structure. MFLDA can select and integrate the data sources by assigning different weights to them. An iterative solution is further introduced to simultaneously optimize the weights and low-rank matrices. Next, MFLDA uses the optimized low-rank matrices to reconstruct the lncRNA-disease association matrix and thus to identify potential associations. In 5-fold cross validation experiments to identify verified lncRNA-disease associations, MFLDA achieves an area under the receiver operating characteristic curve (AUC) of 0.7408, at least 3% higher than those given by state-of-the-art data fusion based computational models. An empirical study on identifying masked lncRNA-disease associations again shows that MFLDA can identify potential associations more accurately than competing models. A case study on identifying lncRNAs associated with breast, lung and stomach cancers show that 38 out of 45 (84%) associations predicted by MFLDA are supported by recent biomedical literature and further proves the capability of MFLDA in identifying novel lncRNA-disease associations. 
MFLDA is a general data fusion framework, and as such it can be adopted to predict associations between other biological entities. Availability and implementation The source code for MFLDA is available at: http://mlda.swu.edu.cn/codes.php?name=MFLDA. Contact gxyu@swu.edu.cn. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Guangyuan Fu
- College of Computer and Information Science, Southwest University, Chongqing 400715, China
| | - Jun Wang
- College of Computer and Information Science, Southwest University, Chongqing 400715, China
| | - Carlotta Domeniconi
- Department of Computer Science, George Mason University, Fairfax, VA 22030, USA
| | - Guoxian Yu
- College of Computer and Information Science, Southwest University, Chongqing 400715, China
| |
|
21
|
Yang X, Huang K, Zhang R, Hussain A. Learning Latent Features With Infinite Nonnegative Binary Matrix Trifactorization. IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE 2018. [DOI: 10.1109/tetci.2018.2806934] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Indexed: 11/09/2022]
|
22
|
Hui M, Cheng J, Sha Z. First comprehensive analysis of lysine acetylation in Alvinocaris longirostris from the deep-sea hydrothermal vents. BMC Genomics 2018; 19:352. [PMID: 29747590 PMCID: PMC5946511 DOI: 10.1186/s12864-018-4745-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 12/14/2017] [Accepted: 04/30/2018] [Indexed: 11/27/2022] Open
Abstract
Background Deep-sea hydrothermal vents are unique chemoautotrophic ecosystems with harsh conditions. Alvinocaris longirostris is one of the dominant crustacean species inhabiting these extreme environments. It is important to clarify the mechanisms of their adaptation to the vents. Lysine acetylation is known to play critical roles in the regulation of many cellular processes. However, its function in A. longirostris and even marine invertebrates remains elusive. Our study is the first, to our knowledge, to comprehensively investigate the lysine acetylome in A. longirostris. Results In total, 501 unique acetylation sites from 206 proteins were identified by a combination of affinity enrichment and high-sensitivity mass spectrometry. It was revealed that Arg, His and Lys occurred most frequently at the +1 position downstream of the acetylation sites, which were all alkaline amino acids and positively charged. Functional analysis revealed that protein acetylation was involved in diverse cellular processes, such as biosynthesis of amino acids, citrate cycle, fatty acid degradation and oxidative phosphorylation. Acetylated proteins were found enriched in mitochondrion and peroxisome, and many stress response related proteins were also discovered to be acetylated, like arginine kinases, heat shock protein 70, and hemocyanins. In the two hemocyanins, nine acetylation sites were identified, among which one acetylation site was unique in A. longirostris when compared with other shallow water shrimps. Further studies are warranted to verify its function. Conclusion The lysine acetylome of A. longirostris is investigated for the first time and brings new insights into the regulation function of the lysine acetylation. The results supply abundant resources for exploring the functions of acetylation in A. longirostris and other shrimps.
Affiliation(s)
- Min Hui
- Laboratory of Marine Organism Taxonomy and Phylogeny, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China; University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Jiao Cheng
- Laboratory of Marine Organism Taxonomy and Phylogeny, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China; University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Zhongli Sha
- Laboratory of Marine Organism Taxonomy and Phylogeny, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, 266071, China; University of Chinese Academy of Sciences, Beijing, 100049, China; Center of Deep Sea Research, Institute of Oceanology, Chinese Academy of Sciences, 7 Nanhai Road, Qingdao, 266071, China.
| |
|
23
|
Huang L, Liao L, Wu CH. Completing sparse and disconnected protein-protein network by deep learning. BMC Bioinformatics 2018; 19:103. [PMID: 29566671 PMCID: PMC5863833 DOI: 10.1186/s12859-018-2112-7] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 11/16/2017] [Accepted: 03/12/2018] [Indexed: 12/01/2022] Open
Abstract
Background Protein-protein interaction (PPI) prediction remains a central task in systems biology to achieve a better and holistic understanding of cellular and intracellular processes. Recently, an increasing number of computational methods have shifted from pair-wise prediction to network level prediction. Many of the existing network level methods predict PPIs under the assumption that the training network should be connected. However, this assumption greatly affects the prediction power and limits the application area because the current golden standard PPI networks are usually very sparse and disconnected. Therefore, how to effectively predict PPIs based on a training network that is sparse and disconnected remains a challenge. Results In this work, we developed a novel PPI prediction method based on deep learning neural network and regularized Laplacian kernel. We use a neural network with an autoencoder-like architecture to implicitly simulate the evolutionary processes of a PPI network. Neurons of the output layer correspond to proteins and are labeled with values (1 for interaction and 0 for otherwise) from the adjacency matrix of a sparse disconnected training PPI network. Unlike autoencoder, neurons at the input layer are given all zero input, reflecting an assumption of no a priori knowledge about PPIs, and hidden layers of smaller sizes mimic ancient interactome at different times during evolution. After the training step, an evolved PPI network whose rows are outputs of the neural network can be obtained. We then predict PPIs by applying the regularized Laplacian kernel to the transition matrix that is built upon the evolved PPI network. The results from cross-validation experiments show that the PPI prediction accuracies for yeast data and human data measured as AUC are increased by up to 8.4 and 14.9% respectively, as compared to the baseline. 
Moreover, the evolved PPI network can also help us leverage complementary information from the disconnected training network and multiple heterogeneous data sources. Tested by the yeast data with six heterogeneous feature kernels, the results show our method can further improve the prediction performance by up to 2%, which is very close to an upper bound that is obtained by an Approximate Bayesian Computation based sampling method. Conclusions The proposed evolution deep neural network, coupled with regularized Laplacian kernel, is an effective tool in completing sparse and disconnected PPI networks and in facilitating integration of heterogeneous data sources.
Affiliation(s)
- Lei Huang
- Department of Computer and Information Sciences, University of Delaware, 18 Amstel Avenue, Newark, 19716, Delaware, USA
| | - Li Liao
- Department of Computer and Information Sciences, University of Delaware, 18 Amstel Avenue, Newark, 19716, Delaware, USA.
| | - Cathy H Wu
- Department of Computer and Information Sciences, University of Delaware, 18 Amstel Avenue, Newark, 19716, Delaware, USA; Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, 19711, Delaware, USA
| |
|
24
|
Havugimana PC, Hu P, Emili A. Protein complexes, big data, machine learning and integrative proteomics: lessons learned over a decade of systematic analysis of protein interaction networks. Expert Rev Proteomics 2017; 14:845-855. [PMID: 28918672 DOI: 10.1080/14789450.2017.1374179] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Indexed: 01/18/2023]
Abstract
OVERVIEW Elucidation of the networks of physical (functional) interactions present in cells and tissues is fundamental for understanding the molecular organization of biological systems, the mechanistic basis of essential and disease-related processes, and for functional annotation of previously uncharacterized proteins (via guilt-by-association or -correlation). After a decade in the field, we felt it timely to document our own experiences in the systematic analysis of protein interaction networks. Areas covered: Researchers worldwide have contributed innovative experimental and computational approaches that have driven the rapidly evolving field of 'functional proteomics'. These include mass spectrometry-based methods to characterize macromolecular complexes on a global-scale and sophisticated data analysis tools - most notably machine learning - that allow for the generation of high-quality protein association maps. Expert commentary: Here, we recount some key lessons learned, with an emphasis on successful workflows, and challenges, arising from our own and other groups' ongoing efforts to generate, interpret and report proteome-scale interaction networks in increasingly diverse biological contexts.
Affiliation(s)
- Pierre C Havugimana
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON, Canada; Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
| | - Pingzhao Hu
- Department of Biochemistry and Medical Genetics, University of Manitoba, Winnipeg, MB, Canada
| | - Andrew Emili
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON, Canada; Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
| |
|
25
|
Yin C, Yau SST. A coevolution analysis for identifying protein-protein interactions by Fourier transform. PLoS One 2017; 12:e0174862. [PMID: 28430779 PMCID: PMC5400233 DOI: 10.1371/journal.pone.0174862] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Received: 11/23/2016] [Accepted: 03/16/2017] [Indexed: 12/29/2022] Open
Abstract
Protein-protein interactions (PPIs) play key roles in life processes such as signal transduction, transcription regulation, and immune response. Identification of PPIs enables better understanding of the functional networks within a cell. Common experimental methods for identifying PPIs are time consuming and expensive. However, recent developments in computational approaches for inferring PPIs from protein sequences based on coevolution theory avoid these problems. In the coevolution theory model, interacting proteins may show coevolutionary mutations and have similar phylogenetic trees. The existing coevolution methods depend on multiple sequence alignments (MSA); however, the MSA-based coevolution methods often produce many false positive interactions. In this paper, we present a computational method using an alignment-free approach to accurately detect PPIs and reduce false positives. In the method, protein sequences are numerically represented by biochemical properties of amino acids, which reflect the structural and functional differences of proteins. Fourier transform is applied to the numerical representation of protein sequences to capture the dissimilarities of protein sequences in biophysical context. The method is assessed for predicting PPIs in Ebola virus. The results indicate strong coevolution between the protein pairs (NP-VP24, NP-VP30, NP-VP40, VP24-VP30, VP24-VP40, and VP30-VP40). The method is also validated for PPIs in influenza and E. coli genomes. Since our method can reduce false positives and increase the specificity of PPI prediction, it offers an effective tool to understand mechanisms of disease pathogens and find potential targets for drug design. The Python programs in this study are publicly available at https://github.com/cyinbox/PPI.
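To make the alignment-free idea in this abstract concrete, here is a rough sketch, not the code from the authors' repository. It assumes the Kyte-Doolittle hydropathy index as the single biochemical property, zero-padding to bring two sequences to a common length, and Euclidean distance between power spectra; the abstract does not specify any of these choices, nor how such distances are turned into coevolution scores. The function names are hypothetical.

```python
import numpy as np

# Kyte-Doolittle hydropathy index, used here as one example property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def power_spectrum(seq, n):
    """Map a protein sequence to numbers and return its DFT power spectrum.

    The signal is mean-centred and zero-padded to length n (an assumption
    made here so that sequences of different lengths can be compared).
    """
    x = np.array([HYDROPATHY[aa] for aa in seq], dtype=float)
    x -= x.mean()
    return np.abs(np.fft.fft(x, n=n)) ** 2

def spectral_distance(seq_a, seq_b):
    """Euclidean distance between the power spectra of two sequences."""
    n = max(len(seq_a), len(seq_b))
    return float(np.linalg.norm(power_spectrum(seq_a, n) - power_spectrum(seq_b, n)))

# Toy sequences (made up for illustration, not real proteins).
print(spectral_distance("MKTAYIAKQR", "MKTAYIAKQR"))
print(spectral_distance("MKTAYIAKQR", "GGSGGSGGSG"))
```

No alignment is computed at any point: each sequence is compared only through the frequency content of its numerical profile.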
Affiliation(s)
- Changchuan Yin
- Department of Mathematics, Statistics and Computer Science, The University of Illinois at Chicago, Chicago, IL 60607-7045, United States of America
| | - Stephen S.-T. Yau
- Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
| |
|
26
|
Zhang K, Yun X, Zhang XY, Zhu X, Li C, Wang S. Weighted hierarchical geographic information description model for social relation estimation. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2016.08.030] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/21/2022]
|
27
|
Huang L, Liao L, Wu CH. Protein-protein interaction prediction based on multiple kernels and partial network with linear programming. BMC SYSTEMS BIOLOGY 2016. [PMCID: PMC4977483 DOI: 10.1186/s12918-016-0296-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 01/09/2025]
Abstract
Background Prediction of de novo protein-protein interaction is a critical step toward reconstructing PPI networks, which is a central task in systems biology. Recent computational approaches have shifted from making PPI prediction based on individual pairs and single data source to leveraging complementary information from multiple heterogeneous data sources and partial network structure. However, how to quickly learn weights for heterogeneous data sources remains a challenge. In this work, we developed a method to infer de novo PPIs by combining multiple data sources represented in kernel format and obtaining optimal weights based on random walk over the existing partial networks. Results Our proposed method utilizes the Barker algorithm and the training data to construct a transition matrix which constrains how a random walk would traverse the partial network. Multiple heterogeneous features for the proteins in the network are then combined into the form of weighted kernel fusion, which provides a new "adjacency matrix" for the whole network that may consist of disconnected components but is required to comply with the transition matrix on the training subnetwork. This requirement is met by adjusting the weights to minimize the element-wise difference between the transition matrix and the weighted kernels. The minimization problem is solved by linear programming. The weighted kernel fusion is then transformed to a regularized Laplacian (RL) kernel to infer missing or new edges in the PPI network, which can potentially connect the previously disconnected components. Conclusions The results on synthetic data demonstrated the soundness and robustness of the proposed algorithms under various conditions. The results on real data show that the accuracies of PPI prediction for yeast data and human data measured as AUC are increased by up to 19% and 11% respectively, as compared to a control method without using optimal weights.
Moreover, the weights learned by our method, Weight Optimization by Linear Programming (WOLP), are very consistent with those learned by sampling, and can provide insights into the relations between PPIs and various feature kernels, thereby improving PPI prediction even for disconnected PPI networks.
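The last two steps described in this abstract, weighted kernel fusion followed by a regularized Laplacian (RL) kernel, can be sketched as follows. This is an illustrative sketch only: it uses the standard closed form K = (I + αL)^(-1) with L = D − A, takes the fusion weights as given rather than learning them by linear programming, and the function names and the value of `alpha` are assumptions.

```python
import numpy as np

def fuse_kernels(kernels, weights):
    """Weighted sum of feature kernels, used as a new 'adjacency matrix'."""
    return sum(w * np.asarray(K, dtype=float) for w, K in zip(weights, kernels))

def regularized_laplacian_kernel(A, alpha=0.5):
    """RL kernel K = (I + alpha * L)^(-1), where L = D - A is the graph Laplacian."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.inv(np.eye(A.shape[0]) + alpha * L)

# Toy network: a path 0 - 1 - 2 plus an isolated node 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 0]], dtype=float)
K = regularized_laplacian_kernel(A)
print(np.round(K, 3))

# A second (made-up) feature kernel that links node 3 to node 2.
F = np.zeros((4, 4))
F[2, 3] = F[3, 2] = 1.0
K_fused = regularized_laplacian_kernel(fuse_kernels([A, F], [1.0, 0.5]))
print(np.round(K_fused, 3))
```

On the path alone, nodes 0 and 2 receive a nonzero score even though they share no edge, which is what makes the kernel usable for ranking candidate interactions. Node 3 scores zero against every other node until the fused kernel connects it, mirroring the abstract's point about linking previously disconnected components.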
|
28
|
Huang L, Liao L, Wu CH. Inference of protein-protein interaction networks from multiple heterogeneous data. EURASIP JOURNAL ON BIOINFORMATICS & SYSTEMS BIOLOGY 2016; 2016:8. [PMID: 26941784 PMCID: PMC4761017 DOI: 10.1186/s13637-016-0040-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 08/06/2015] [Accepted: 02/09/2016] [Indexed: 11/29/2022]
Abstract
Protein-protein interaction (PPI) prediction is a central task in achieving a better understanding of cellular and intracellular processes. Because high-throughput experimental methods are both expensive and time-consuming, and are also known to suffer from incompleteness and noise, many computational methods have been developed, with varying degrees of success. However, the inference of PPI networks from multiple heterogeneous data sources remains a great challenge. In this work, we developed a novel method based on approximate Bayesian computation and modified differential evolution sampling (ABC-DEP) and the regularized Laplacian (RL) kernel. The method enables inference of PPI networks from topological properties and multiple heterogeneous features, including gene expression and Pfam domain profiles, in the form of weighted kernels. The optimal weights are obtained by ABC-DEP, and the kernel fusion built on the optimal weights serves as input to RL to infer missing or new edges in the PPI network. Detailed comparisons with control methods have been made, and the results show that the accuracy of PPI prediction measured by AUC is increased by up to 23%, as compared to a baseline without using optimal weights. The method can provide insights into the relations between PPIs and various feature kernels and demonstrates a strong capability of predicting faraway interactions that cannot be well detected by the traditional RL method.
Affiliation(s)
- Lei Huang
- Department of Computer and Information Sciences, University of Delaware, 18 Amstel Avenue, Newark, 19716 DE USA
| | - Li Liao
- Department of Computer and Information Sciences, University of Delaware, 18 Amstel Avenue, Newark, 19716 DE USA
| | - Cathy H Wu
- Department of Computer and Information Sciences, University of Delaware, 18 Amstel Avenue, Newark, 19716 DE USA; Center for Bioinformatics and Computational Biology, University of Delaware, 15 Innovation Way, Newark, 19711 DE USA
| |
|
29
|
Stražar M, Žitnik M, Zupan B, Ule J, Curk T. Orthogonal matrix factorization enables integrative analysis of multiple RNA binding proteins. Bioinformatics 2016; 32:1527-35. [PMID: 26787667 PMCID: PMC4894278 DOI: 10.1093/bioinformatics/btw003] [Citation(s) in RCA: 77] [Impact Index Per Article: 8.6] [Received: 07/02/2015] [Accepted: 01/01/2016] [Indexed: 12/15/2022]
Abstract
Motivation: RNA binding proteins (RBPs) play important roles in post-transcriptional control of gene expression, including splicing, transport, polyadenylation and RNA stability. To model protein–RNA interactions by considering all available sources of information, it is necessary to integrate the rapidly growing RBP experimental data with the latest genome annotation, gene function, RNA sequence and structure. Such integration is possible by matrix factorization, where current approaches have an undesired tendency to identify only a small number of the strongest patterns with overlapping features. Because protein–RNA interactions are orchestrated by multiple factors, methods that identify discriminative patterns of varying strengths are needed. Results: We have developed an integrative orthogonality-regularized nonnegative matrix factorization (iONMF) to integrate multiple data sources and discover non-overlapping, class-specific RNA binding patterns of varying strengths. The orthogonality constraint halves the effective size of the factor model and outperforms other NMF models in predicting RBP interaction sites on RNA. We have integrated the largest data compendium to date, which includes 31 CLIP experiments on 19 RBPs involved in splicing (such as hnRNPs, U2AF2, ELAVL1, TDP-43 and FUS) and processing of 3’UTR (Ago, IGF2BP). We show that the integration of multiple data sources improves the predictive accuracy of retrieval of RNA binding sites. In our study the key predictive factors of protein–RNA interactions were the position of RNA structure and sequence motifs, RBP co-binding and gene region type. We report on a number of protein-specific patterns, many of which are consistent with experimentally determined properties of RBPs. Availability and implementation: The iONMF implementation and example datasets are available at https://github.com/mstrazar/ionmf. 
Contact: tomaz.curk@fri.uni-lj.si Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Martin Stražar
- University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, SI 1000, Slovenia
| | - Marinka Žitnik
- University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, SI 1000, Slovenia
| | - Blaž Zupan
- University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, SI 1000, Slovenia; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Jernej Ule
- Department of Molecular Neuroscience, UCL Institute of Neurology, Queen Square, London WC1N 3BG, UK
| | - Tomaž Curk
- University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, SI 1000, Slovenia
| |
|
30
|
Gligorijević V, Malod-Dognin N, Pržulj N. Fuse: multiple network alignment via data fusion. Bioinformatics 2015; 32:1195-203. [DOI: 10.1093/bioinformatics/btv731] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.6] [Received: 12/22/2014] [Accepted: 10/09/2015] [Indexed: 02/07/2023]
|
31
|
Gligorijević V, Pržulj N. Methods for biological data integration: perspectives and challenges. J R Soc Interface 2015; 12:20150571. [PMID: 26490630 PMCID: PMC4685837 DOI: 10.1098/rsif.2015.0571] [Citation(s) in RCA: 136] [Impact Index Per Article: 13.6] [Received: 06/26/2015] [Accepted: 09/25/2015] [Indexed: 12/17/2022]
Abstract
Rapid technological advances have led to the production of different types of biological data and enabled construction of complex networks with various types of interactions between diverse biological entities. Standard network data analysis methods were shown to be limited in dealing with such heterogeneous networked data and consequently, new methods for integrative data analyses have been proposed. The integrative methods can collectively mine multiple types of biological data and produce more holistic, systems-level biological insights. We survey recent methods for collective mining (integration) of various types of networked biological data. We compare different state-of-the-art methods for data integration and highlight their advantages and disadvantages in addressing important biological problems. We identify the important computational challenges of these methods and provide a general guideline for which methods are suited for specific biological problems, or specific data types. Moreover, we propose that recent non-negative matrix factorization-based approaches may become the integration methodology of choice, as they are well suited and accurate in dealing with heterogeneous data and have many opportunities for further development.
Affiliation(s)
| | - Nataša Pržulj
- Department of Computing, Imperial College London, London SW7 2AZ, UK
| |
|
32
|
You ZH, Chan KCC, Hu P. Predicting protein-protein interactions from primary protein sequences using a novel multi-scale local feature representation scheme and the random forest. PLoS One 2015; 10:e0125811. [PMID: 25946106 PMCID: PMC4422660 DOI: 10.1371/journal.pone.0125811] [Citation(s) in RCA: 98] [Impact Index Per Article: 9.8] [Received: 07/14/2014] [Accepted: 03/04/2015] [Indexed: 11/18/2022]
Abstract
The study of protein-protein interactions (PPIs) can be very important for the understanding of biological cellular functions. However, detecting PPIs in the laboratory is both time-consuming and expensive. For this reason, there has been much recent effort to develop techniques for computational prediction of PPIs, as this can complement laboratory procedures and provide an inexpensive way of predicting the most likely set of interactions at the entire proteome scale. Although much progress has already been achieved in this direction, the problem is still far from being solved. More effective approaches are still required to overcome the limitations of the current ones. In this study, a novel Multi-scale Local Descriptor (MLD) feature representation scheme is proposed to extract features from a protein sequence. This scheme can capture multi-scale local information by varying the length of protein-sequence segments. Based on the MLD, an ensemble learning method, the Random Forest (RF) method, is used as the classifier. The MLD feature representation scheme facilitates the mining of interaction information from multi-scale continuous amino acid segments, making it easier to capture multiple overlapping continuous binding patterns within a protein sequence. When the proposed method is tested with the PPI data of Saccharomyces cerevisiae, it achieves a prediction accuracy of 94.72% with 94.34% sensitivity at a precision of 98.91%. Extensive experiments are performed to compare our method with existing sequence-based methods. Experimental results show that the performance of our predictor is also better than that of several other state-of-the-art predictors on the H. pylori dataset. These good results can largely be credited to the learning capabilities of the RF model and the novel MLD feature representation scheme. The experimental results show that the proposed approach is very promising for predicting PPIs and can be a useful tool for future proteomic studies.
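The idea of multi-scale continuous segments can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the number of regions or the per-segment descriptors, so the split into four equal regions, the use of plain amino acid composition, and all function names here are hypothetical stand-ins and not the authors' MLD scheme.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def continuous_segments(sequence, n_regions=4):
    """Split a sequence into n_regions equal parts and return every run of
    consecutive parts, from single regions up to the whole sequence."""
    size = len(sequence) / n_regions
    bounds = [round(i * size) for i in range(n_regions + 1)]
    segments = []
    for length in range(1, n_regions + 1):           # scale: regions per segment
        for start in range(n_regions - length + 1):  # sliding start region
            segments.append(sequence[bounds[start]:bounds[start + length]])
    return segments

def composition(segment):
    """Fraction of each of the 20 standard amino acids in a segment."""
    counts = Counter(segment)
    total = max(len(segment), 1)
    return [counts.get(aa, 0) / total for aa in AMINO_ACIDS]

def multiscale_features(sequence, n_regions=4):
    """Concatenate the composition vectors of all multi-scale segments."""
    features = []
    for segment in continuous_segments(sequence, n_regions):
        features.extend(composition(segment))
    return features

segments = continuous_segments("ACDEFGHIKLMN")
features = multiscale_features("ACDEFGHIKLMN")
print(segments)
print(len(features))
```

With four regions there are 4 + 3 + 2 + 1 = 10 segments, so each protein maps to a 200-dimensional vector in this sketch. A protein pair could then be represented by concatenating the two vectors before passing them to a random forest classifier.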
Affiliation(s)
- Zhu-Hong You
- Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China; School of Electronics and Information Engineering, Tongji University, Shanghai, China
| | - Keith C C Chan
- Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China
| | - Pengwei Hu
- Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China
| |
|
33
|
Žitnik M, Zupan B. Data Imputation in Epistatic MAPs by Network-Guided Matrix Completion. J Comput Biol 2015; 22:595-608. [PMID: 25658751 DOI: 10.1089/cmb.2014.0158] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Indexed: 01/05/2023]
Abstract
Epistatic miniarray profile (E-MAP) is a popular large-scale genetic interaction discovery platform. E-MAPs benefit from quantitative output, which makes it possible to detect subtle interactions with greater precision. However, due to the limits of biotechnology, E-MAP studies fail to measure genetic interactions for up to 40% of gene pairs in an assay. Missing measurements can be recovered by computational techniques for data imputation, in this way completing the interaction profiles and enabling downstream analysis algorithms that could otherwise be sensitive to missing data values. We introduce a new interaction data imputation method called network-guided matrix completion (NG-MC). The core part of NG-MC is low-rank probabilistic matrix completion that incorporates prior knowledge presented as a collection of gene networks. NG-MC assumes that interactions are transitive, such that latent gene interaction profiles inferred by NG-MC depend on the profiles of their direct neighbors in gene networks. As the NG-MC inference algorithm progresses, it propagates latent interaction profiles through each of the networks and updates gene network weights toward improved prediction. In a study with four different E-MAP data assays that considered protein-protein interaction and gene ontology similarity networks, NG-MC significantly surpassed existing alternative techniques. Inclusion of information from gene networks also allowed NG-MC to predict interactions for genes that were not included in the original E-MAP assays, a task that could not be considered by current imputation approaches.
Affiliation(s)
- Marinka Žitnik
- Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
| | - Blaž Zupan
- Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas
| |
|
34
|
Abstract
Motivation: Recently, a shift was made from using Gene Ontology (GO) to evaluate molecular network data to using these data to construct and evaluate GO. Dutkowski et al. provide the first evidence that a large part of GO can be reconstructed solely from topologies of molecular networks. Motivated by this work, we develop a novel data integration framework that integrates multiple types of molecular network data to reconstruct and update GO. We ask how much of GO can be recovered by integrating various molecular interaction data. Results: We introduce a computational framework for integration of various biological networks using penalized non-negative matrix tri-factorization (PNMTF). It takes all network data in matrix form and performs simultaneous clustering of genes and GO terms, inducing new relations between genes and GO terms (annotations) and between GO terms themselves. To improve the accuracy of our predicted relations, we extend the integration methodology to include additional topological information represented as the similarity in wiring around non-interacting genes. Surprisingly, by integrating topologies of baker's yeast protein–protein interaction, genetic interaction (GI) and co-expression networks, our method reports as related 96% of GO terms that are directly related in GO. The inclusion of the wiring similarity of non-interacting genes contributes 6% to this large GO term association capture. Furthermore, we use our method to infer new relationships between GO terms solely from the topologies of these networks and validate 44% of our predictions in the literature. In addition, our integration method reproduces 48% of cellular component, 41% of molecular function and 41% of biological process GO terms, outperforming the previous method in the former two domains of GO. Finally, we predict new GO annotations of yeast genes and validate our predictions through GI profiling.
Availability and implementation: Supplementary Tables of new GO term associations and predicted gene annotations are available at http://bio-nets.doc.ic.ac.uk/GO-Reconstruction/. Contact: natasha@imperial.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
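The core factorization behind this kind of simultaneous clustering can be sketched in a plain, unpenalized form. The code below factorizes a non-negative relation matrix R as G1 S G2^T using standard multiplicative updates. It is an assumption-laden illustration: the penalty terms and extra network constraints that make the method "penalized" are omitted, and the function name, the toy matrix, and the chosen ranks are hypothetical.

```python
import numpy as np

def nmtf(R, k1, k2, n_iter=200, eps=1e-9, seed=0):
    """Plain non-negative matrix tri-factorization R ~ G1 @ S @ G2.T
    via multiplicative updates (no penalty or constraint terms)."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    G1 = rng.random((n, k1)) + eps   # row (e.g. gene) cluster indicators
    S = rng.random((k1, k2)) + eps   # compressed cluster-to-cluster relations
    G2 = rng.random((m, k2)) + eps   # column (e.g. GO term) cluster indicators
    errors = []
    for _ in range(n_iter):
        G1 *= (R @ G2 @ S.T) / (G1 @ S @ G2.T @ G2 @ S.T + eps)
        G2 *= (R.T @ G1 @ S) / (G2 @ S.T @ G1.T @ G1 @ S + eps)
        S *= (G1.T @ R @ G2) / (G1.T @ G1 @ S @ G2.T @ G2 + eps)
        errors.append(np.linalg.norm(R - G1 @ S @ G2.T))
    return G1, S, G2, errors

# Toy relation matrix: 6 hypothetical genes (rows) x 4 hypothetical terms (columns).
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float)

G1, S, G2, errors = nmtf(R, k1=2, k2=2)
print("reconstruction error, first vs last iteration:", errors[0], errors[-1])
print("row cluster assignment:", G1.argmax(axis=1))
print("column cluster assignment:", G2.argmax(axis=1))
```

Reading cluster membership from the largest entry in each row of G1 and G2 is one simple convention. Entries of the reconstructed matrix G1 S G2^T that are large where R is zero are the kind of induced relations the abstract refers to.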
Affiliation(s)
| | - Vuk Janjić
- Department of Computing, Imperial College London SW7 2AZ, UK
| | - Nataša Pržulj
- Department of Computing, Imperial College London SW7 2AZ, UK
| |
|
35
|
Lukman S, Aung Z, Sim K. Multiple Structural Clustering of Bromodomains of the Bromo and Extra Terminal (BET) Proteins Highlights Subtle Differences in Their Structural Dynamics and Acetylated Leucine Binding Pocket. Procedia Computer Science 2015. [DOI: 10.1016/j.procs.2015.05.192] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 02/03/2023]
|
36
|
Saha S, Chatterjee P, Basu S, Kundu M, Nasipuri M. FunPred-1: protein function prediction from a protein interaction network using neighborhood analysis. Cell Mol Biol Lett 2014; 19:675-91. [PMID: 25424913 PMCID: PMC6275854 DOI: 10.2478/s11658-014-0221-5] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 07/25/2014] [Accepted: 11/20/2014] [Indexed: 01/05/2023]
Abstract
Proteins are responsible for all biological activities in living organisms. Thanks to genome sequencing projects, large amounts of DNA and protein sequence data are now available, but the biological functions of many proteins are still not annotated. The unknown function of such non-annotated proteins may be inferred from their neighbors in a protein interaction network. In this paper, we propose two new methods to predict protein functions based on network neighborhood properties. FunPred 1.1 uses a combination of three simple-yet-effective scoring techniques: the neighborhood ratio, the protein path connectivity and the relative functional similarity. FunPred 1.2 applies a heuristic approach using the edge clustering coefficient to reduce the search space by identifying densely connected neighborhood regions. The overall accuracy achieved with FunPred 1.2 over 8 functional groups involving hetero-interactions in 650 yeast proteins is around 87%, which is higher than the accuracy with FunPred 1.1. It is also higher than the accuracy of many of the state-of-the-art protein function prediction methods described in the literature. The test datasets and the complete source code of the developed software are now freely available at http://code.google.com/p/cmaterbioinfo/.
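Two of the neighborhood quantities named above can be illustrated with a short sketch. The definitions used here are assumptions: the edge clustering coefficient is taken as the number of common neighbors divided by min(k_u - 1, k_v - 1), one of several variants in the literature, and the neighbor ratio is taken as the fraction of a protein's neighbors carrying each function. The abstract does not give the exact FunPred formulas, and the toy graph and annotations are hypothetical.

```python
from collections import Counter

def edge_clustering_coefficient(graph, u, v):
    """Common neighbors of u and v divided by min(deg(u) - 1, deg(v) - 1).
    Returns 0.0 when either endpoint has no other neighbor."""
    common = len(graph[u] & graph[v])
    denom = min(len(graph[u]) - 1, len(graph[v]) - 1)
    return common / denom if denom > 0 else 0.0

def neighbor_function_ratios(graph, annotations, protein):
    """Fraction of a protein's neighbors annotated with each function."""
    neighbors = graph[protein]
    if not neighbors:
        return {}
    counts = Counter(annotations[n] for n in neighbors if n in annotations)
    return {function: count / len(neighbors) for function, count in counts.items()}

# Toy undirected interaction network stored as adjacency sets.
graph = {
    "A": {"B", "C", "X"},
    "B": {"A", "C", "X"},
    "C": {"A", "B", "D"},
    "D": {"C"},
    "X": {"A", "B"},
}

# Known functions; "C" and "X" are treated as non-annotated proteins.
annotations = {"A": "transport", "B": "transport", "D": "signaling"}

ecc_ab = edge_clustering_coefficient(graph, "A", "B")
ecc_cd = edge_clustering_coefficient(graph, "C", "D")
ratios_c = neighbor_function_ratios(graph, annotations, "C")

print("ECC(A, B):", ecc_ab)
print("ECC(C, D):", ecc_cd)
print("function ratios among neighbors of C:", ratios_c)
```

In this toy network the edge A-B lies in a dense region (both of its possible triangles are present), while C-D is a pendant edge, so a threshold on the coefficient would keep the former and prune the latter. The highest-ratio function among the neighbors of C would then be a natural candidate annotation.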
Affiliation(s)
- Sovan Saha
- Department of Computer Science and Engineering, Dr. Sudhir Chandra Sur Degree Engineering College, Dumdum, Kolkata 700074 India
| | - Piyali Chatterjee
- Department of Computer Science and Engineering, Netaji Subhash Engineering College, Garia, Kolkata 700152 India
| | - Subhadip Basu
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, 700032 India
| | - Mahantapas Kundu
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, 700032 India
| | - Mita Nasipuri
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, 700032 India
| |
|