1
Shi W, Zhang Y, Sun Y, Lin Z. Function-Genes and Disease-Genes Prediction Based on Network Embedding and One-Class Classification. Interdiscip Sci 2024:10.1007/s12539-024-00638-7. [PMID: 39230798 DOI: 10.1007/s12539-024-00638-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2023] [Revised: 05/14/2024] [Accepted: 05/21/2024] [Indexed: 09/05/2024]
Abstract
Genes that have been experimentally validated for diseases (or functions) can be used to develop machine learning methods that predict new disease/function genes. However, the prediction of both function genes and disease genes faces the same problem: only positive examples are certain, and there are no negative examples. To solve this problem, we proposed VGAEMCD, a function/disease-gene prediction algorithm based on network embedding (Variational Graph Auto-Encoders, VGAE) and one-class classification (Fast Minimum Covariance Determinant, Fast-MCD). First, we constructed a protein-protein interaction (PPI) network centered on experimentally validated genes; then VGAE was used to obtain the embeddings of the nodes (genes) in the network; finally, the embeddings were fed into an improved deep learning one-class classifier based on Fast-MCD to predict function/disease genes. VGAEMCD predicts function genes and disease genes in a unified way and requires only the experimentally verified genes as input (no expression profiles are needed). VGAEMCD outperforms classical one-class classification algorithms in Recall, Precision, F-measure, Specificity, and Accuracy. Further experiments show that seven metrics of VGAEMCD are higher than those of state-of-the-art function/disease-gene prediction algorithms. These results indicate that VGAEMCD learns the distribution characteristics of positive examples well and accurately identifies function/disease genes.
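The one-class idea in this abstract (fit a distribution to positive examples only, then flag points that fall far from it) can be illustrated with a small sketch. This is not the authors' VGAEMCD: it uses the ordinary sample mean and covariance rather than the robust Fast-MCD estimate, and it assumes hypothetical 2-D embeddings in place of VGAE output. All function names and data points are made up for illustration.

```python
import math

def fit_gaussian_2d(points):
    """Return the mean and inverse covariance of 2-D points (ordinary, non-robust estimates)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    inv = ((syy / det, -sxy / det), (-sxy / det, sxx / det))
    return (mx, my), inv

def mahalanobis(point, mean, inv):
    """Mahalanobis distance of a 2-D point from the fitted distribution."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    d2 = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(d2)

def one_class_predict(candidates, positives, quantile=0.95):
    """Label a candidate True if it lies within the distance quantile of the positives."""
    mean, inv = fit_gaussian_2d(positives)
    train_d = sorted(mahalanobis(p, mean, inv) for p in positives)
    idx = min(len(train_d) - 1, int(quantile * len(train_d)))
    threshold = train_d[idx]
    return [mahalanobis(c, mean, inv) <= threshold for c in candidates]

# Hypothetical embeddings of validated (positive) genes and two candidate genes.
positives = [(0.0, 0.0), (1.0, 0.2), (0.2, 1.0), (1.0, 1.0), (0.5, 0.4), (0.4, 0.6)]
candidates = [(0.5, 0.5), (5.0, -4.0)]
labels = one_class_predict(candidates, positives)
```

A robust estimator such as MCD would replace `fit_gaussian_2d` so that a few contaminated positives do not distort the mean and covariance; the thresholding step would stay the same.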
Affiliation(s)
- Weiyu Shi
  - College of Maritime Economics and Management, Dalian Maritime University, Dalian, 116026, China
- Yan Zhang
  - Institute of Environmental Systems Biology, College of Environmental Science and Engineering, Dalian Maritime University, Dalian, 116026, China
- Yeqing Sun
  - Institute of Environmental Systems Biology, College of Environmental Science and Engineering, Dalian Maritime University, Dalian, 116026, China
- Zhengkui Lin
  - College of Maritime Economics and Management, Dalian Maritime University, Dalian, 116026, China
2
Zhapa-Camacho F, Tang Z, Kulmanov M, Hoehndorf R. Predicting protein functions using positive-unlabeled ranking with ontology-based priors. Bioinformatics 2024; 40:i401-i409. [PMID: 38940168 PMCID: PMC11211813 DOI: 10.1093/bioinformatics/btae237] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024] [Open access]
Abstract
Automated protein function prediction is a crucial and widely studied problem in bioinformatics. Computationally, protein function prediction is a multilabel classification problem where only positive samples are defined and there is a large number of unlabeled annotations. Most existing methods rely on the assumption that the unlabeled protein function annotations are negatives, which induces the false-negative issue: potential positive samples are trained as negatives. We introduce a novel approach named PU-GO, wherein we address function prediction as a positive-unlabeled ranking problem. We apply empirical risk minimization, i.e. we minimize the classification risk of a classifier where class priors are obtained from the Gene Ontology hierarchical structure. We show that our approach is more robust than other state-of-the-art methods on similarity-based and time-based benchmark datasets. Availability and implementation: Data and code are available at https://github.com/bio-ontology-research-group/PU-GO.
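To make "minimizing a classification risk with a class prior" concrete, the sketch below computes the widely used non-negative positive-unlabeled risk estimate from classifier scores. It is an illustration under assumptions, not PU-GO's actual ranking objective: the sigmoid surrogate loss, the function names, and the fixed prior value are all chosen here for the example, whereas the abstract states that PU-GO derives its priors from the Gene Ontology hierarchy.

```python
import math

def sigmoid_loss(score, label):
    """Sigmoid surrogate loss l(z, y) = 1 / (1 + exp(y * z)) for y in {+1, -1}."""
    return 1.0 / (1.0 + math.exp(label * score))

def nn_pu_risk(pos_scores, unl_scores, prior):
    """Non-negative PU risk: prior * R_p(+) + max(0, R_u(-) - prior * R_p(-))."""
    r_p_pos = sum(sigmoid_loss(s, +1) for s in pos_scores) / len(pos_scores)
    r_p_neg = sum(sigmoid_loss(s, -1) for s in pos_scores) / len(pos_scores)
    r_u_neg = sum(sigmoid_loss(s, -1) for s in unl_scores) / len(unl_scores)
    # The max(0, ...) clamp keeps the estimated negative-class risk from going below zero.
    return prior * r_p_pos + max(0.0, r_u_neg - prior * r_p_neg)

# Hypothetical scores: a scorer that ranks positives high, and the same scorer inverted.
good = nn_pu_risk(pos_scores=[3.0, 4.0], unl_scores=[-3.0, -4.0, 3.0], prior=0.3)
bad = nn_pu_risk(pos_scores=[-3.0, -4.0], unl_scores=[3.0, 4.0, -3.0], prior=0.3)
```

The unlabeled term treats unlabeled samples as negatives and then subtracts the share of that loss attributable to hidden positives, which is why the class prior matters.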
Affiliation(s)
- Fernando Zhapa-Camacho
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
  - Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
- Zhenwei Tang
  - Department of Computer Science, University of Toronto, Toronto, ON M5S 1A1, Canada
- Maxat Kulmanov
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
  - Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
  - SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
- Robert Hoehndorf
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
  - Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
  - SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
3
Ansari M, White AD. Learning peptide properties with positive examples only. DIGITAL DISCOVERY 2024; 3:977-986. [PMID: 38756224 PMCID: PMC11094695 DOI: 10.1039/d3dd00218g] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/05/2023] [Accepted: 03/30/2024] [Indexed: 05/18/2024]
Abstract
Deep learning can create accurate predictive models by exploiting existing large-scale experimental data, and guide the design of molecules. However, a major barrier is the requirement of both positive and negative examples in classical supervised learning frameworks. Notably, most peptide databases come with missing information and a low number of observations on negative examples, as such sequences are hard to obtain using high-throughput screening methods. To address this challenge, we exploit solely the limited known positive examples in a semi-supervised setting, and discover peptide sequences that are likely to map to certain antimicrobial properties via positive-unlabeled (PU) learning. In particular, we use two learning strategies, adapting the base classifier and reliable negative identification, to build deep learning models for inferring solubility, hemolysis, binding against SHP-2, and non-fouling activity of peptides, given their sequence. We evaluate the predictive performance of our PU learning method and show that, using only the positive data, it can achieve competitive performance when compared with the classical positive-negative (PN) classification approach, where there is access to both positive and negative examples.
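The "reliable negative identification" strategy mentioned in this abstract is commonly implemented as a two-step procedure, which the following sketch illustrates. It is a toy version under stated assumptions: the features are hypothetical 2-D vectors rather than peptide sequence encodings, and a nearest-centroid rule stands in for the deep learning models the authors describe. Function names and the `fraction` parameter are invented for this example.

```python
def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def reliable_negatives(positives, unlabeled, fraction=0.3):
    """Step 1: treat the unlabeled samples farthest from the positive centroid as negatives."""
    c = centroid(positives)
    ranked = sorted(unlabeled, key=lambda u: sq_dist(u, c), reverse=True)
    k = max(1, int(fraction * len(unlabeled)))
    return ranked[:k]

def two_step_pu(positives, unlabeled, fraction=0.3):
    """Step 2: fit a nearest-centroid classifier on positives vs. reliable negatives."""
    negatives = reliable_negatives(positives, unlabeled, fraction)
    c_pos, c_neg = centroid(positives), centroid(negatives)

    def predict(x):
        return sq_dist(x, c_pos) < sq_dist(x, c_neg)

    return predict

# Hypothetical data: the unlabeled set hides three positives near (1, 1).
positives = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]
unlabeled = [[1.1, 1.0], [5.0, 5.0], [5.2, 4.8], [0.8, 1.2], [4.9, 5.1], [1.0, 0.8]]
predict = two_step_pu(positives, unlabeled)
hidden_positive = predict([1.1, 1.0])
likely_negative = predict([5.0, 5.0])
```

Keeping `fraction` small is the conservative choice: only the most confidently dissimilar unlabeled samples are used as negatives, so hidden positives are less likely to be mislabeled.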
Affiliation(s)
- Mehrad Ansari
  - Department of Chemical Engineering, University of Rochester, Rochester, NY 14627, USA
- Andrew D White
  - Department of Chemical Engineering, University of Rochester, Rochester, NY 14627, USA
4
Yu G, Yang Y, Yan Y, Guo M, Zhang X, Wang J. DeepIDA: Predicting Isoform-Disease Associations by Data Fusion and Deep Neural Networks. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2166-2176. [PMID: 33571094 DOI: 10.1109/tcbb.2021.3058801] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 06/12/2023]
Abstract
Alternative splicing produces different isoforms from the same gene locus; it is an important mechanism for regulating gene expression and proteome diversity. Although the prediction of gene(ncRNA)-disease associations has been extensively studied, few (or no) computational solutions have been proposed for the prediction of isoform-disease associations (IDAs) at a large scale, mainly due to the lack of disease annotations of isoforms. However, increasing evidence confirms the associations between diseases and isoforms, which can more precisely uncover the pathology of complex diseases. Therefore, it is highly desirable to predict IDAs. To bridge this gap, we propose a deep neural network-based solution (DeepIDA) that fuses multi-type genomics and transcriptomics data to predict IDAs. In particular, DeepIDA uses gene-isoform relations to dispatch gene-disease associations to isoforms. In addition, it utilizes two DNN sub-networks with different structures to capture nucleotide and expression features of isoforms, and Gene Ontology data and miRNA target data, respectively. After that, these two sub-networks are merged in a dense layer to predict IDAs. The experimental results on public datasets show that DeepIDA can effectively predict IDAs with an AUPRC (area under the precision-recall curve) of 0.9141, macro F-measure of 0.9155, G-mean of 0.9278 and balanced accuracy of 0.9303 across 732 diseases, which are much higher than those of competitive methods. Further study of sixteen isoform-disease association cases again corroborates the superiority of DeepIDA. The code of DeepIDA is available at http://mlda.swu.edu.cn/codes.php?name=DeepIDA.
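For readers unfamiliar with the imbalance-aware metrics reported above, the sketch below computes F-measure, G-mean and balanced accuracy for a single binary confusion matrix, using their standard definitions. The counts are hypothetical, and the paper's macro-averaging across 732 diseases is not reproduced here; how the authors aggregate per-disease values is not stated in the abstract.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Standard imbalance-aware metrics from one binary confusion matrix."""
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "f_measure": f_measure,
        "g_mean": math.sqrt(sensitivity * specificity),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

# Hypothetical counts for one disease label.
m = binary_metrics(tp=90, fp=5, tn=95, fn=10)
```

G-mean and balanced accuracy both combine sensitivity and specificity, so a classifier that ignores the minority class scores poorly on them even when plain accuracy looks high.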
5
Shi X, Wang X, Hou X, Tian Q, Hui M. Gene Mining and Flavour Metabolism Analyses of Wickerhamomyces anomalus Y-1 Isolated From a Chinese Liquor Fermentation Starter. Front Microbiol 2022; 13:891387. [PMID: 35586860 PMCID: PMC9108772 DOI: 10.3389/fmicb.2022.891387] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/07/2022] [Accepted: 04/04/2022] [Indexed: 11/13/2022] [Open access]
Abstract
Luzhou-flavoured liquor is one of China's most popular distilled liquors. Hundreds of flavour components have been detected in this liquor, with esters as its primary flavouring substances. Among these esters, ethyl hexanoate is the main component. Yeast is an important functional microorganism that produces ethyl hexanoate. The synthesis of ethyl hexanoate in yeast mainly involves the lipase/esterase synthesis pathway, alcohol transferase pathway and alcohol dehydrogenase pathway. In this study, whole-genome sequencing of W. anomalus Y-1 isolated from a Chinese liquor fermentation starter, a fermented wheat starter containing brewing microorganisms, was carried out using the Illumina HiSeq X Ten platform. The sequence had a length of 15,127,803 bp with 34.56% GC content, encoding 7,024 CDS sequences, 69 tRNAs and 1 rRNA. Then, genome annotation was performed using three high-quality databases, namely, the COG, KEGG and GO databases. The annotation results showed that the ko7019 pathway of gene 6,340 contained the Eht1p enzyme, which was considered a putative acyltransferase similar to Eeb1p and had 51.57% homology with two known medium-chain fatty acid ethyl ester synthases, namely, Eht1 and Eeb1. Ethyl hexanoate in W. anomalus was found to be synthesised through the alcohol acyltransferase pathway, while acyl-coenzyme A and alcohol were synthesised under the catalytic action of Eht1p. The results of this study are beneficial to the exploration of key genes of ester synthesis and provide a reference for the improvement of liquor flavour.
Affiliation(s)
- Xin Shi
  - College of Biological Engineering, Henan University of Technology, Zhengzhou, China
- Xin Wang
  - College of Biological Engineering, Henan University of Technology, Zhengzhou, China
  - Industrial Microorganism Preservation and Breeding Henan Engineering Laboratory, Zhengzhou, China
- Xiaoge Hou
  - College of Biological Engineering, Henan University of Technology, Zhengzhou, China
  - School of Food and Bioengineering, Henan College of Animal Husbandry Economics, Zhengzhou, China
- Qing Tian
  - College of Biological Engineering, Henan University of Technology, Zhengzhou, China
- Ming Hui
  - College of Biological Engineering, Henan University of Technology, Zhengzhou, China
  - Industrial Microorganism Preservation and Breeding Henan Engineering Laboratory, Zhengzhou, China
  - *Correspondence: Ming Hui,
6
Zhou L, Tang Y, Yan G. A New Estimation Method for the Biological Interaction Predicting Problems. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1415-1423. [PMID: 33406043 DOI: 10.1109/tcbb.2021.3049642] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
Over the past decades, computational methods have been developed to predict various interactions in biological problems. These methods usually treat the prediction problem as a semi-supervised or positive-unlabeled (PU) learning problem. Researchers focused on the prediction of unlabeled samples and hoped to find novel interactions in the datasets they collected. However, most of the computational methods could only predict a small proportion of undiscovered interactions, and the total number was unknown. In this paper, we developed an estimation method with deep learning to calculate the number of undiscovered interactions in the unlabeled samples, derived its asymptotic interval estimation, and applied it successfully to a compound synergism dataset, a drug-target interaction (DTI) dataset and a microRNA-disease interaction dataset. Moreover, this method could reveal which dataset contained more undiscovered interactions and would serve as guidance for experimental validation. Furthermore, we compared our method with some mixture proportion estimators and demonstrated the efficacy of our method. Finally, we proved that AUC and AUPR were related to the number of undiscovered interactions, which can be regarded as another evaluation indicator for computational methods.
7
Liu L, Zhu S. Computational Methods for Prediction of Human Protein-Phenotype Associations: A Review. PHENOMICS (CHAM, SWITZERLAND) 2021; 1:171-185. [PMID: 36939789 PMCID: PMC9590544 DOI: 10.1007/s43657-021-00019-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/23/2021] [Revised: 06/05/2021] [Accepted: 06/16/2021] [Indexed: 12/01/2022]
Abstract
Deciphering the relationship between human proteins (genes) and phenotypes is one of the fundamental tasks in phenomics research. The Human Phenotype Ontology (HPO) builds upon a standardized logical vocabulary to describe the abnormal phenotypes encountered in human diseases and paves the way towards the computational analysis of their genetic causes. To date, many computational methods have been proposed to predict the HPO annotations of proteins. In this paper, we conduct a comprehensive review of the existing approaches to predicting HPO annotations of novel proteins, identifying missing HPO annotations, and prioritizing candidate proteins with respect to a certain HPO term. For each topic, we first give a formalized description of the problem, and then systematically revisit the published literature, highlighting the advantages and disadvantages of each approach, followed by a discussion of the challenges and promising future directions. In addition, we point out several potential topics worthy of exploration, including the selection of negative HPO annotations and the detection of HPO misannotations. We believe that this review will provide insight to researchers in the field of computational phenotype analysis in terms of comprehending and developing novel prediction algorithms.
Affiliation(s)
- Lizhi Liu
  - School of Computer Science, Fudan University, Shanghai, 200433 China
- Shanfeng Zhu
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, 200433 China
  - Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, 200433 China
  - MOE Frontiers Center for Brain Science, Fudan University, Shanghai, 200433 China
  - Zhangjiang Fudan International Innovation Center, Shanghai, 200433 China
  - Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, 200433 China
8
Chen S, Gan M, Lv H, Jiang R. DeepCAPE: A Deep Convolutional Neural Network for the Accurate Prediction of Enhancers. GENOMICS PROTEOMICS & BIOINFORMATICS 2021; 19:565-577. [PMID: 33581335 PMCID: PMC9040020 DOI: 10.1016/j.gpb.2019.04.006] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 10/24/2018] [Revised: 03/15/2019] [Accepted: 04/29/2019] [Indexed: 12/12/2022]
Abstract
The establishment of a landscape of enhancers across human cells is crucial to deciphering the mechanism of gene regulation, cell differentiation, and disease development. High-throughput experimental approaches, which contain successfully reported enhancers in typical cell lines, are still too costly and time-consuming to perform systematic identification of enhancers specific to different cell lines. Existing computational methods, capable of predicting regulatory elements purely relying on DNA sequences, lack the power of cell line-specific screening. Recent studies have suggested that chromatin accessibility of a DNA segment is closely related to its potential function in regulation, and thus may provide useful information in identifying regulatory elements. Motivated by the aforementioned understanding, we integrate DNA sequences and chromatin accessibility data to accurately predict enhancers in a cell line-specific manner. We proposed DeepCAPE, a deep convolutional neural network to predict enhancers via the integration of DNA sequences and DNase-seq data. Benefitting from the well-designed feature extraction mechanism and skip connection strategy, our model not only consistently outperforms existing methods in the imbalanced classification of cell line-specific enhancers against background sequences, but also has the ability to self-adapt to different sizes of datasets. Besides, with the adoption of auto-encoder, our model is capable of making cross-cell line predictions. We further visualize kernels of the first convolutional layer and show the match of identified sequence signatures and known motifs. We finally demonstrate the potential ability of our model to explain functional implications of putative disease-associated genetic variants and discriminate disease-related enhancers. The source code and detailed tutorial of DeepCAPE are freely available at https://github.com/ShengquanChen/DeepCAPE.
Affiliation(s)
- Shengquan Chen
  - MOE Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
- Mingxin Gan
  - Department of Management Science and Engineering, School of Economics and Management, University of Science and Technology Beijing, Beijing 100083, China
- Hairong Lv
  - MOE Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
- Rui Jiang
  - MOE Key Laboratory of Bioinformatics, Research Department of Bioinformatics at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China
9
Ju Z, Wang SY. Computational Identification of Lysine Glutarylation Sites Using Positive-Unlabeled Learning. Curr Genomics 2020; 21:204-211. [PMID: 33071614 PMCID: PMC7521029 DOI: 10.2174/1389202921666200511072327] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 12/25/2019] [Revised: 04/12/2020] [Accepted: 04/13/2020] [Indexed: 12/27/2022] [Open access]
Abstract
Background: As a new type of protein acylation modification, lysine glutarylation has been found to play a crucial role in metabolic processes and mitochondrial functions. To further explore the biological mechanisms and functions of glutarylation, it is significant to predict the potential glutarylation sites. In the existing glutarylation site predictors, experimentally verified glutarylation sites are treated as positive samples and non-verified lysine sites as the negative samples to train predictors. However, the non-verified lysine sites may contain some glutarylation sites which have not been experimentally identified yet.
Methods: In this study, experimentally verified glutarylation sites are treated as the positive samples, whereas the remaining non-verified lysine sites are treated as unlabeled samples. A bioinformatics tool named PUL-GLU was developed to identify glutarylation sites using a positive-unlabeled learning algorithm.
Results: Experimental results show that PUL-GLU significantly outperforms the current glutarylation site predictors. Therefore, PUL-GLU can be a powerful tool for accurate identification of protein glutarylation sites.
Conclusion: A user-friendly web-server for PUL-GLU is available at http://bioinform.cn/pul_glu/.
Affiliation(s)
- Zhe Ju
  - College of Science, Shenyang Aerospace University, Shenyang 110136, P.R. China
- Shi-Yun Wang
  - College of Science, Shenyang Aerospace University, Shenyang 110136, P.R. China
10
Lan C, Chandrasekaran SN, Huan J. On the Unreported-Profile-is-Negative Assumption for Predictive Cheminformatics. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1352-1363. [PMID: 31056508 DOI: 10.1109/tcbb.2019.2913855] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/09/2023]
Abstract
In cheminformatics, compound-target binding profiles have been a main source of data for research. For data repositories that only provide positive profiles, a popular assumption is that unreported profiles are all negative. In this paper, we caution the audience not to take this assumption for granted, and present empirical evidence of its ineffectiveness from a machine learning perspective. Our examination is based on a setting where binding profiles are used as features to train predictive models; we show that (1) prediction performance degrades when the assumption fails and (2) explicit recovery of unreported profiles improves prediction performance. In particular, we propose a framework that jointly recovers profiles and learns a predictive model, and show that it achieves further performance improvement. The presented study not only suggests applying matrix recovery methods to recover unreported profiles, but also initiates a new missing-feature problem, which we call Learning with Positive and Unknown Features.
11
Using Two-dimensional Principal Component Analysis and Rotation Forest for Prediction of Protein-Protein Interactions. Sci Rep 2018; 8:12874. [PMID: 30150728 PMCID: PMC6110764 DOI: 10.1038/s41598-018-30694-1] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 03/15/2018] [Accepted: 07/17/2018] [Indexed: 11/09/2022] [Open access]
Abstract
The interaction among proteins is essential in all life activities, and it is the basis of all the metabolic activities of cells. By studying protein-protein interactions (PPIs), researchers can better interpret the function of proteins and decode the phenomena of life, which is of great practical value, especially in the design of new drugs. Although many high-throughput techniques have been devised for large-scale detection of PPIs, these methods are still expensive and time-consuming. For this reason, there is a strong need to develop computational methods for predicting PPIs at the entire proteome scale. In this article, we propose a new approach to predict PPIs using a Rotation Forest (RF) classifier combined with a matrix-based protein sequence representation. We apply the Position-Specific Scoring Matrix (PSSM), which contains biological evolution information, to represent protein sequences and extract the features through the two-dimensional Principal Component Analysis (2DPCA) algorithm. The descriptors are then sent to the rotation forest classifier for classification. We obtained 97.43% prediction accuracy with 94.92% sensitivity at a precision of 99.93% when the proposed method was applied to the PPI data of yeast. To evaluate the performance of the proposed method, we compared it with other methods on the same dataset and validated it on independent data. The results show that the proposed method is an appropriate and promising method for predicting PPIs.
12
Li Z, Liao B, Li Y, Liu W, Chen M, Cai L. Gene function prediction based on combining gene ontology hierarchy with multi-instance multi-label learning. RSC Adv 2018; 8:28503-28509. [PMID: 35542493 PMCID: PMC9083914 DOI: 10.1039/c8ra05122d] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 06/14/2018] [Accepted: 07/12/2018] [Indexed: 12/04/2022] [Open access]
Abstract
Gene function annotation, an important part of genome annotation, is the main challenge in the post-genome era. Sequencing in the Human Genome Project has produced whole-genome data, providing abundant biological information for the study of gene function annotation. To obtain useful knowledge from such a large amount of data, a potential strategy is to apply machine learning methods to mine these data and predict gene function. In this study, we improved multi-instance hierarchical clustering by using the gene ontology hierarchy to annotate gene function, combining the gene ontology hierarchy with a multi-instance multi-label learning framework. Then, we used the multi-label support vector machine (MLSVM) and multi-label k-nearest neighbor (MLKNN) algorithms to predict gene function. Finally, we verified our method on four yeast expression datasets. The results of the simulated experiments showed that our method is efficient.
Affiliation(s)
- Zejun Li
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan 410082, China
  - School of Computer and Information Science, Hunan Institute of Technology, Hengyang 412002, China
- Bo Liao
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan 410082, China
- Yun Li
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan 410082, China
- Wenhua Liu
  - School of Computer and Information Science, Hunan Institute of Technology, Hengyang 412002, China
- Min Chen
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan 410082, China
  - School of Computer and Information Science, Hunan Institute of Technology, Hengyang 412002, China
- Lijun Cai
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan 410082, China
13
Sastry A, Monk J, Tegel H, Uhlen M, Palsson BO, Rockberg J, Brunk E. Machine learning in computational biology to accelerate high-throughput protein expression. Bioinformatics 2018; 33:2487-2495. [PMID: 28398465 DOI: 10.1093/bioinformatics/btx207] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 12/08/2016] [Accepted: 04/05/2017] [Indexed: 01/21/2023] [Open access]
Abstract
Motivation: The Human Protein Atlas (HPA) enables the simultaneous characterization of thousands of proteins across various tissues to pinpoint their spatial location in the human body. This has been achieved through transcriptomics and high-throughput immunohistochemistry-based approaches, where over 40 000 unique human protein fragments have been expressed in E. coli. These datasets enable quantitative tracking of entire cellular proteomes and present new avenues for understanding molecular-level properties influencing expression and solubility. Results: Combining computational biology and machine learning identifies protein properties that hinder the HPA high-throughput antibody production pipeline. We predict protein expression and solubility with accuracies of 70% and 80%, respectively, based on a subset of key properties (aromaticity, hydropathy and isoelectric point). We guide the selection of protein fragments based on these characteristics to optimize high-throughput experimentation. Availability and implementation: We present the machine learning workflow as a series of IPython notebooks hosted on GitHub (https://github.com/SBRG/Protein_ML). The workflow can be used as a template for analysis of further expression and solubility datasets. Contact: ebrunk@ucsd.edu or johanr@biotech.kth.se. Supplementary information: Supplementary data are available at Bioinformatics online.
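Two of the sequence-derived properties named in this abstract are simple to compute, as the sketch below shows. It assumes common definitions that the abstract itself does not spell out: aromaticity as the fraction of F, W and Y residues, and hydropathy as the mean Kyte-Doolittle value (often called GRAVY). Isoelectric point is omitted because it needs a pKa table and an iterative solve. The example fragment is hypothetical, and the authors' notebooks may compute these features differently.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def aromaticity(seq):
    """Fraction of aromatic residues (Phe, Trp, Tyr) in the sequence."""
    return sum(seq.count(aa) for aa in "FWY") / len(seq)

def gravy(seq):
    """Grand average of hydropathy: mean Kyte-Doolittle value over all residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)

# Hypothetical protein fragment used only to exercise the functions.
fragment = "MKWVTFISLLFLFSSAYS"
features = {"aromaticity": aromaticity(fragment), "gravy": gravy(fragment)}
```

Features like these could then be fed to any standard classifier to predict expression or solubility labels; a positive `gravy` value indicates a net hydrophobic fragment.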
Affiliation(s)
- Anand Sastry
  - Department of Bioengineering, University of California, San Diego, CA, USA
- Jonathan Monk
  - Department of Bioengineering, University of California, San Diego, CA, USA
- Hanna Tegel
  - KTH - Royal Institute of Technology, Department of Proteomics and Nanobiotechnology, SE-106 91 Stockholm, Sweden
- Mathias Uhlen
  - KTH - Royal Institute of Technology, Department of Proteomics and Nanobiotechnology, SE-106 91 Stockholm, Sweden
  - The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2800 Lyngby, Denmark
- Bernhard O Palsson
  - Department of Bioengineering, University of California, San Diego, CA, USA
  - The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2800 Lyngby, Denmark
- Johan Rockberg
  - KTH - Royal Institute of Technology, Department of Proteomics and Nanobiotechnology, SE-106 91 Stockholm, Sweden
- Elizabeth Brunk
  - Department of Bioengineering, University of California, San Diego, CA, USA
  - The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2800 Lyngby, Denmark
14
Prediction of protein-protein interactions by label propagation with protein evolutionary and chemical information derived from heterogeneous network. J Theor Biol 2017. [DOI: 10.1016/j.jtbi.2017.06.003] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 11/20/2022]
15
Havugimana PC, Hu P, Emili A. Protein complexes, big data, machine learning and integrative proteomics: lessons learned over a decade of systematic analysis of protein interaction networks. Expert Rev Proteomics 2017; 14:845-855. [PMID: 28918672 DOI: 10.1080/14789450.2017.1374179] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 01/18/2023]
Abstract
OVERVIEW Elucidation of the networks of physical (functional) interactions present in cells and tissues is fundamental for understanding the molecular organization of biological systems, the mechanistic basis of essential and disease-related processes, and for functional annotation of previously uncharacterized proteins (via guilt-by-association or -correlation). After a decade in the field, we felt it timely to document our own experiences in the systematic analysis of protein interaction networks. Areas covered: Researchers worldwide have contributed innovative experimental and computational approaches that have driven the rapidly evolving field of 'functional proteomics'. These include mass spectrometry-based methods to characterize macromolecular complexes on a global-scale and sophisticated data analysis tools - most notably machine learning - that allow for the generation of high-quality protein association maps. Expert commentary: Here, we recount some key lessons learned, with an emphasis on successful workflows, and challenges, arising from our own and other groups' ongoing efforts to generate, interpret and report proteome-scale interaction networks in increasingly diverse biological contexts.
Affiliation(s)
- Pierre C Havugimana
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON, Canada.,Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
| | - Pingzhao Hu
- Department of Biochemistry and Medical Genetics, University of Manitoba, Winnipeg, MB, Canada
| | - Andrew Emili
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON, Canada.,Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
|
16
|
Zhou C, Yu H, Ding Y, Guo F, Gong XJ. Multi-scale encoding of amino acid sequences for predicting protein interactions using gradient boosting decision tree. PLoS One 2017; 12:e0181426. [PMID: 28792503 PMCID: PMC5549711 DOI: 10.1371/journal.pone.0181426] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Received: 01/03/2017] [Accepted: 06/30/2017] [Indexed: 11/23/2022] Open
Abstract
Nowadays a number of computational approaches have been developed to effectively and accurately predict protein interactions. However, most of these methods typically perform worse when other biological data sources (e.g., protein structure information, protein domains, or gene neighborhood information) are not available. In the present work, we propose a method for predicting protein interactions that makes full use of the physicochemical characteristics of amino acids. A protein sequence is encoded at multiple scales by seven properties of amino acids, including their qualitative and quantitative descriptions. Five kinds of protein descriptors (frequency, composition, transformation, distribution and auto covariance) are extracted from these encodings to represent each protein sequence. The newly formed feature representation, consisting of 347 dimensions, is able to capture not only the compositional and positional information of amino acids in the sequence but also their statistical significance. Based on this feature representation, the gradient boosting decision tree algorithm is introduced to predict the protein interaction class. When the proposed method is tested with the PPI data of S. cerevisiae, it achieves a prediction accuracy of 95.28% at a Matthews correlation coefficient of 90.68%. Compared with state-of-the-art works on H. pylori and Human, the accuracies can be raised to 89.27% and 98.00%, respectively. Extensive experiments are performed for a crossover protein-protein interaction network, and the prediction accuracies are also very promising. Because of the learning capabilities of the gradient boosting decision tree and the multi-scale feature representation scheme, the proposed method might be a useful tool for future proteomics studies.
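As an illustration of one of the five descriptor families named in this abstract, the following is a minimal sketch of an auto covariance descriptor computed from a single amino acid property. It uses the common lag-based definition, AC(lag) = 1/(n - lag) * sum over i of (x_i - mean)(x_{i+lag} - mean). The property values in TOY_SCALE and the function name are hypothetical and are not taken from the paper, whose exact encoding and normalisation may differ.

```python
from statistics import fmean

# Hypothetical, illustrative property values for a few residues only.
# A real encoding would use a published physicochemical scale covering
# all 20 amino acids.
TOY_SCALE = {"A": 0.5, "C": 0.8, "D": -1.0, "K": -0.9, "L": 1.0}


def auto_covariance(sequence, scale, max_lag):
    """Return [AC(1), ..., AC(max_lag)] for one property scale."""
    values = [scale[residue] for residue in sequence]
    n = len(values)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    mean = fmean(values)
    centered = [v - mean for v in values]
    features = []
    for lag in range(1, max_lag + 1):
        # Average product of centered values that are `lag` positions apart.
        total = sum(centered[i] * centered[i + lag] for i in range(n - lag))
        features.append(total / (n - lag))
    return features


features = auto_covariance("ACDKLLAC", TOY_SCALE, max_lag=3)
```

Each property scale contributes max_lag values, so concatenating the results over several scales gives a fixed-length vector regardless of sequence length, which is what a tree-based classifier needs as input.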
Affiliation(s)
- Chang Zhou
- School of Computer Science and Technology, Tianjin University, Tianjin, China, 300072
- Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, China, 300072
| | - Hua Yu
- School of Computer Science and Technology, Tianjin University, Tianjin, China, 300072
- Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, China, 300072
| | - Yijie Ding
- School of Computer Science and Technology, Tianjin University, Tianjin, China, 300072
- Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, China, 300072
| | - Fei Guo
- School of Computer Science and Technology, Tianjin University, Tianjin, China, 300072
- Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, China, 300072
| | - Xiu-Jun Gong
- School of Computer Science and Technology, Tianjin University, Tianjin, China, 300072
- Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, China, 300072
|
17
|
Hameed PN, Verspoor K, Kusljic S, Halgamuge S. Positive-Unlabeled Learning for inferring drug interactions based on heterogeneous attributes. BMC Bioinformatics 2017; 18:140. [PMID: 28249566 PMCID: PMC5333429 DOI: 10.1186/s12859-017-1546-7] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Received: 10/09/2016] [Accepted: 02/13/2017] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND Investigating and understanding drug-drug interactions (DDIs) is important in improving the effectiveness of clinical care. DDIs can occur when two or more drugs are administered together. Experimental DDI detection methods are costly and time-consuming. Hence, there is great interest in developing efficient and useful computational methods for inferring potential DDIs. Standard binary classifiers require both positives and negatives for training. In a DDI context, drug pairs that are known to interact can serve as positives for predictive methods. However, negatives, i.e. drug pairs that have been confirmed to have no interaction, are scarce. To address this lack of negatives, we introduce a Positive-Unlabeled Learning method for inferring potential DDIs. RESULTS The proposed method consists of three steps: i) applying Growing Self Organizing Maps to infer negatives from the unlabeled dataset; ii) using a pairwise similarity function to quantify the overlap between individual features of drugs; and iii) using a support vector machine classifier for inferring DDIs. We obtained 6036 DDIs from the DrugBank database. Using the proposed approach, we inferred 589 drug pairs that are likely not to interact with each other; these drug pairs are used as representative data for the negative class in binary classification for DDI prediction. Moreover, we classify the predicted DDIs as Cytochrome P450 (CYP) enzyme-Dependent and CYP-Independent interactions based on their locations on the Growing Self Organizing Map, due to the particular importance of these enzymes in clinically significant interaction effects. Further, we provide a case study on three predicted CYP-Dependent DDIs to evaluate the clinical relevance of this study. CONCLUSION Our proposed approach showed an absolute improvement in F1-score of 14% and 38%, depending on the choice of similarity function, in comparison to the method that randomly selects unlabeled data points as likely negatives.
We inferred 5300 possible CYP-Dependent DDIs and 592 CYP-Independent DDIs with the highest posterior probabilities. Our discoveries can be used to improve clinical care as well as the research outcomes of drug development.
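The first step of the method, inferring likely negatives from unlabeled data, can be sketched in simplified form. The example below does not use Growing Self Organizing Maps. It swaps in a much simpler heuristic (take the unlabeled points farthest from the centroid of the known positives) purely to show the shape of a positive-unlabeled pipeline. The function names and the toy feature vectors are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def likely_negatives(positives, unlabeled, k):
    """Pick the k unlabeled vectors farthest from the positive centroid.

    This distance heuristic stands in for the GSOM-based step described
    in the abstract; it is an assumption made for illustration only.
    """
    center = centroid(positives)
    ranked = sorted(unlabeled, key=lambda p: math.dist(p, center), reverse=True)
    return ranked[:k]


# Toy 2-D feature vectors for drug pairs (illustrative values only).
positives = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1)]
unlabeled = [(1.1, 1.0), (5.0, 5.0), (0.8, 0.9), (-4.0, 3.0)]
negatives = likely_negatives(positives, unlabeled, k=2)
```

The selected points would then serve as the negative class when training a standard binary classifier, while the remaining unlabeled pairs stay candidates for prediction.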
Affiliation(s)
- Pathima Nusrath Hameed
- Department of Mechanical Engineering, University of Melbourne, Parkville, Melbourne, 3010, Australia.,Data61, Victoria Research Lab, West Melbourne, 3003, Australia.,Department of Computer Science, University of Ruhuna, Matara, 81000, Sri Lanka.
| | - Karin Verspoor
- Department of Computing and Information Systems, University of Melbourne, Parkville, Melbourne, 3010, Australia
| | - Snezana Kusljic
- Department of Nursing, University of Melbourne, Parkville, Melbourne, 3010, Australia.,The Florey Institute of Neuroscience and Mental Health, University of Melbourne, Parkville, Melbourne, 3010, Australia
| | - Saman Halgamuge
- Research School of Engineering, College of Engineering and Computer Science, The Australian National University, Canberra, 2601, ACT, Australia
|
18
|
SELF-BLM: Prediction of drug-target interactions via self-training SVM. PLoS One 2017; 12:e0171839. [PMID: 28192537 PMCID: PMC5305209 DOI: 10.1371/journal.pone.0171839] [Citation(s) in RCA: 43] [Impact Index Per Article: 6.1] [Received: 09/29/2016] [Accepted: 01/26/2017] [Indexed: 01/08/2023] Open
Abstract
Predicting drug-target interactions is important for the development of novel drugs and the repositioning of drugs. To predict such interactions, there are a number of methods based on drug and target protein similarity. Although these methods, such as the bipartite local model (BLM), show promise, they often categorize unknown interactions as negative interactions. Therefore, these methods are not ideal for finding potential drug-target interactions that have not yet been validated as positive interactions. Thus, here we propose a method that integrates machine learning techniques, such as self-training support vector machine (SVM) and BLM, to develop a self-training bipartite local model (SELF-BLM) that facilitates the identification of potential interactions. The method first categorizes unlabeled interactions and negative interactions among unknown interactions using a clustering method. Then, using the BLM method and self-training SVM, the unlabeled interactions are self-trained and final local classification models are constructed. When applied to four classes of proteins that include enzymes, G-protein coupled receptors (GPCRs), ion channels, and nuclear receptors, SELF-BLM showed the best performance for predicting not only known interactions but also potential interactions in three protein classes compared to other related studies. The implemented software and supporting data are available at https://github.com/GIST-CSBL/SELF-BLM.
|
20
|
Fu G, Wang J, Yang B, Yu G. NegGOA: negative GO annotations selection using ontology structure. Bioinformatics 2016; 32:2996-3004. [PMID: 27318205 DOI: 10.1093/bioinformatics/btw366] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Received: 02/25/2016] [Accepted: 06/01/2016] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Predicting the biological functions of proteins is one of the key challenges in the post-genomic era. Computational models have demonstrated the utility of applying machine learning methods to predict protein function. Most prediction methods explicitly require a set of negative examples, i.e. proteins that are known not to carry out a particular function. However, Gene Ontology (GO) almost always provides only the knowledge that proteins carry out a particular function, and functional annotations of proteins are incomplete. GO structurally organizes tens of thousands of GO terms, and a protein is annotated with several (or dozens) of these terms. For these reasons, the negative examples of a protein can greatly help in distinguishing its true positive examples within such a large candidate GO space. RESULTS In this paper, we present a novel approach (called NegGOA) to select negative examples. Specifically, NegGOA takes advantage of the ontology structure, the available annotations and the potential for additional annotations of a protein to choose negative examples of the protein. We compare NegGOA with other negative example selection algorithms and find that NegGOA produces far fewer false negatives than they do. We incorporate the selected negative examples into an efficient function prediction model to predict the functions of proteins in Yeast, Human, Mouse and Fly. NegGOA also demonstrates better accuracy than the compared algorithms across various evaluation metrics. In addition, NegGOA suffers less from incomplete annotations of proteins than the compared methods. AVAILABILITY AND IMPLEMENTATION The Matlab and R codes are available at https://sites.google.com/site/guoxian85/neggoa CONTACT gxyu@swu.edu.cn SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
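One way the ontology structure constrains negative selection can be sketched as follows. Under the true path rule, a protein annotated with a term is implicitly annotated with all of that term's ancestors, so neither the term nor its ancestors can be negatives for that protein. The code below only applies this exclusion step. It is not the NegGOA algorithm, and the toy ontology and function names are hypothetical.

```python
# Hypothetical toy ontology: each term maps to its direct parent terms.
PARENTS = {
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "DNA binding": ["binding"],
    "binding": ["molecular function"],
    "molecular function": [],
}


def ancestors(term, parents):
    """Collect every ancestor of `term` by walking parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def candidate_negatives(annotated, parents):
    """Terms that are neither annotated nor implied by the true path rule."""
    excluded = set(annotated)
    for term in annotated:
        excluded |= ancestors(term, parents)
    return set(parents) - excluded


candidates = candidate_negatives({"kinase activity"}, PARENTS)
```

The remaining candidates are only possible negatives. Because annotations are incomplete, a selection method still has to rank them by how unlikely a future annotation is.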
Affiliation(s)
- Guangyuan Fu
- College of Computer and Information Science, Southwest University, Chongqing 400715, China
| | - Jun Wang
- College of Computer and Information Science, Southwest University, Chongqing 400715, China
| | - Bo Yang
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
| | - Guoxian Yu
- College of Computer and Information Science, Southwest University, Chongqing 400715, China.,Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
|
21
|
Pathogenic Network Analysis Predicts Candidate Genes for Cervical Cancer. Comput Math Methods Med 2016; 2016:3186051. [PMID: 27034707 PMCID: PMC4789371 DOI: 10.1155/2016/3186051] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 12/07/2015] [Revised: 01/25/2016] [Accepted: 02/07/2016] [Indexed: 12/15/2022]
Abstract
Purpose. The objective of our study was to predict candidate genes in cervical cancer (CC) using a network-based strategy and to understand the pathogenic process of CC. Methods. A pathogenic network of CC was extracted based on known pathogenic genes (seed genes) and differentially expressed genes (DEGs) between CC and normal controls. Subsequently, cluster analysis was performed to identify the subnetworks in the pathogenic network using ClusterONE. Each gene in the pathogenic network was assigned a weight value, and then candidate genes were obtained based on the weight distribution. Finally, pathway enrichment analysis for candidate genes was performed. Results. In this work, a total of 330 DEGs were identified between CC and normal controls. From the pathogenic network, 2 intensely connected clusters were extracted, and a total of 52 candidate genes with weight values greater than 0.10 were detected. Among these candidate genes, VIM had the highest weight value. Moreover, candidate genes MMP1, CDC45, and CAT were enriched in pathways in cancer, cell cycle, and methane metabolism, respectively. Conclusion. Candidate pathogenic genes including MMP1, CDC45, CAT, and VIM might be involved in the pathogenesis of CC. We believe that our results can provide theoretical guidelines for future clinical application.
|
22
|
iDPF-PseRAAAC: A Web-Server for Identifying the Defensin Peptide Family and Subfamily Using Pseudo Reduced Amino Acid Alphabet Composition. PLoS One 2015; 10:e0145541. [PMID: 26713618 PMCID: PMC4694767 DOI: 10.1371/journal.pone.0145541] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.9] [Received: 09/17/2015] [Accepted: 12/04/2015] [Indexed: 11/29/2022] Open
Abstract
Defensins, one of the most abundant classes of antimicrobial peptides, are an essential part of the innate immunity that has evolved in most living organisms, from lower organisms to humans. To identify specific defensins as interesting antifungal leads, in this study we constructed a more rigorous benchmark dataset and developed the iDPF-PseRAAAC server to predict the defensin family and subfamily. When reduced dipeptide compositions were used, the overall accuracy of the proposed method increased to 95.10% for the defensin family and 98.39% for the vertebrate subfamily, which is higher than the accuracy of other methods. The jackknife test shows an improvement of more than 4% over the previous method. A free online server was further established for the convenience of most experimental scientists at http://wlxy.imu.edu.cn/college/biostation/fuwu/iDPF-PseRAAAC/index.asp. A friendly guide is provided to describe how to use the web server. We anticipate that iDPF-PseRAAAC may become a useful high-throughput tool for both basic research and drug design.
|
23
|
Pazos Obregón F, Papalardo C, Castro S, Guerberoff G, Cantera R. Putative synaptic genes defined from a Drosophila whole body developmental transcriptome by a machine learning approach. BMC Genomics 2015; 16:694. [PMID: 26370122 PMCID: PMC4570697 DOI: 10.1186/s12864-015-1888-3] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 02/27/2015] [Accepted: 09/01/2015] [Indexed: 12/02/2022] Open
Abstract
BACKGROUND Assembly and function of neuronal synapses require the coordinated expression of a yet undetermined set of genes. Although roughly a thousand genes are expected to be important for this function in Drosophila melanogaster, just a few hundreds of them are known so far. RESULTS In this work we trained three learning algorithms to predict a "synaptic function" for genes of Drosophila using data from a whole-body developmental transcriptome published by others. Using statistical and biological criteria to analyze and combine the predictions, we obtained a gene catalogue that is highly enriched in genes of relevance for Drosophila synapse assembly and function but still not recognized as such. CONCLUSIONS The utility of our approach is that it reduces the number of genes to be tested through hypothesis-driven experimentation.
Affiliation(s)
- Flavio Pazos Obregón
- Departamento de Biología del Neurodesarrollo, Instituto de Investigaciones Biológicas Clemente Estable, Avenida Italia 3318, PC 11600, Montevideo, Uruguay.
| | - Cecilia Papalardo
- Instituto de Matemática y Estadística "Prof. Ing. Rafael Laguardia", Facultad de Ingeniería, Universidad de la República, Montevideo, Uruguay.
| | - Sebastián Castro
- Instituto de Matemática y Estadística "Prof. Ing. Rafael Laguardia", Facultad de Ingeniería, Universidad de la República, Montevideo, Uruguay.
| | - Gustavo Guerberoff
- Instituto de Matemática y Estadística "Prof. Ing. Rafael Laguardia", Facultad de Ingeniería, Universidad de la República, Montevideo, Uruguay.
| | - Rafael Cantera
- Departamento de Biología del Neurodesarrollo, Instituto de Investigaciones Biológicas Clemente Estable, Avenida Italia 3318, PC 11600, Montevideo, Uruguay.
- Zoology Department, Stockholm University, Stockholm, Sweden.
|
24
|
Detecting protein-protein interactions with a novel matrix-based protein sequence representation and support vector machines. Biomed Res Int 2015; 2015:867516. [PMID: 26000305 PMCID: PMC4426769 DOI: 10.1155/2015/867516] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.2] [Received: 10/01/2014] [Revised: 01/09/2015] [Accepted: 01/09/2015] [Indexed: 11/27/2022]
Abstract
Proteins and their interactions lie at the heart of most underlying biological processes. Consequently, correct detection of protein-protein interactions (PPIs) is of fundamental importance for understanding the molecular mechanisms in biological systems. Although advances in high-throughput experimental technology make it possible to detect a large number of PPIs, the data generated through these methods are unreliable and may not include all possible PPIs. To address this problem, this study develops a novel computational approach to effectively detect protein interactions. The approach is based on a novel matrix-based representation of protein sequences combined with the support vector machine (SVM) algorithm, and fully considers the sequence order and dipeptide information of the protein primary sequence. When applied to yeast PPI datasets, the proposed method reaches 90.06% prediction accuracy with 94.37% specificity at a sensitivity of 85.74%, indicating that this predictor is a useful tool for predicting PPIs. The results also demonstrate that our approach can be a helpful supplement to the interactions that have been detected experimentally.
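The three figures quoted in this abstract are standard confusion-matrix metrics, and a short sketch makes their relationship concrete. The counts below are hypothetical and are not the paper's data. Note that when positives and negatives are equally numerous, accuracy equals the mean of sensitivity and specificity; the reported values are consistent with that case, since (85.74 + 94.37) / 2 is approximately 90.06, although the abstract itself does not state the class balance.

```python
def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of true interactions recovered
    specificity = tn / (tn + fp)  # fraction of non-interactions rejected
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
    }


# Hypothetical counts for a balanced test set of 100 positive and
# 100 negative protein pairs.
metrics = binary_metrics(tp=80, fp=10, tn=90, fn=20)
```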
|
25
|
Mousavian Z, Masoudi-Nejad A. Drug-target interaction prediction via chemogenomic space: learning-based methods. Expert Opin Drug Metab Toxicol 2014; 10:1273-87. [PMID: 25112457 DOI: 10.1517/17425255.2014.950222] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.5] [Indexed: 01/09/2023]
Abstract
INTRODUCTION Identification of the interaction between drugs and target proteins is a crucial task in genomic drug discovery. In silico prediction is an appropriate alternative to the laborious and costly experimental process of identifying drug-target interactions. Developing a variety of computational methods opens a new direction in analyzing and detecting new drug-target pairs. AREAS COVERED In this review, we will focus on chemogenomic methods which have established a learning framework for predicting drug-target interactions. Learning-based methods are classified into supervised and semi-supervised, and the supervised learning methods are studied as two separate parts including similarity-based methods and feature-based methods. EXPERT OPINION In spite of many improvements for pharmacology applications by learning-based methods, there are many oversimplified settings in the construction of predictive models that may lead to over-optimistic results in drug-target interaction prediction.
Affiliation(s)
- Zaynab Mousavian
- University of Tehran, Institute of Biochemistry and Biophysics, Laboratory of Systems Biology and Bioinformatics (LBB), Tehran, Iran
|
26
|
Youngs N, Penfold-Brown D, Bonneau R, Shasha D. Negative example selection for protein function prediction: the NoGO database. PLoS Comput Biol 2014; 10:e1003644. [PMID: 24922051 PMCID: PMC4055410 DOI: 10.1371/journal.pcbi.1003644] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Received: 11/24/2013] [Accepted: 04/08/2014] [Indexed: 12/28/2022] Open
Abstract
Negative examples – genes that are known not to carry out a given protein function – are rarely recorded in genome and proteome annotation databases, such as the Gene Ontology database. Negative examples are required, however, for several of the most powerful machine learning methods for integrative protein function prediction. Most protein function prediction efforts have relied on a variety of heuristics for the choice of negative examples. Determining the accuracy of methods for negative example prediction is itself a non-trivial task, given that the Open World Assumption as applied to gene annotations rules out many traditional validation metrics. We present a rigorous comparison of these heuristics, utilizing a temporal holdout, and a novel evaluation strategy for negative examples. We add to this comparison several algorithms adapted from Positive-Unlabeled learning scenarios in text-classification, which are the current state of the art methods for generating negative examples in low-density annotation contexts. Lastly, we present two novel algorithms of our own construction, one based on empirical conditional probability, and the other using topic modeling applied to genes and annotations. We demonstrate that our algorithms achieve significantly fewer incorrect negative example predictions than the current state of the art, using multiple benchmarks covering multiple organisms. Our methods may be applied to generate negative examples for any type of method that deals with protein function, and to this end we provide a database of negative examples in several well-studied organisms, for general use (The NoGO database, available at: bonneaulab.bio.nyu.edu/nogo.html). Many machine learning methods have been applied to the task of predicting the biological function of proteins based on a variety of available data. 
The majority of these methods require negative examples: proteins that are known not to perform a function, in order to achieve meaningful predictions, but negative examples are often not available. In addition, past heuristic methods for negative example selection suffer from a high error rate. Here, we rigorously compare two novel algorithms against past heuristics, as well as some algorithms adapted from a similar task in text-classification. Through this comparison, performed on several different benchmarks, we demonstrate that our algorithms make significantly fewer mistakes when predicting negative examples. We also provide a database of negative examples for general use in machine learning for protein function prediction (The NoGO database, available at: bonneaulab.bio.nyu.edu/nogo.html).
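The temporal holdout idea mentioned in this abstract can be sketched briefly: negatives are predicted from an older annotation snapshot, and any predicted negative that later receives a positive annotation is counted as an error. The sketch below shows only this counting step under assumed inputs. The gene and term identifiers are hypothetical, and the paper's actual evaluation strategy is more involved.

```python
def contradicted_fraction(predicted_negatives, later_annotations):
    """Fraction of predicted (gene, term) negatives that were later annotated."""
    if not predicted_negatives:
        return 0.0
    contradicted = predicted_negatives & later_annotations
    return len(contradicted) / len(predicted_negatives)


# Hypothetical (gene, term) pairs predicted as negatives from the older snapshot.
predicted = {
    ("geneA", "TERM_1"),
    ("geneB", "TERM_1"),
    ("geneC", "TERM_2"),
    ("geneD", "TERM_2"),
}
# Hypothetical positive annotations that appear only in the newer snapshot.
later = {("geneB", "TERM_1"), ("geneE", "TERM_3")}

rate = contradicted_fraction(predicted, later)
```

Because annotation databases are incomplete, a low contradicted fraction only bounds the error from below: a predicted negative that is never annotated may still be a true positive that has not been studied yet.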
Affiliation(s)
- Noah Youngs
- Department of Computer Science, New York University, New York, New York, United States of America
| | - Duncan Penfold-Brown
- Social Media and Political Participation Lab, New York University, New York, New York, United States of America
| | - Richard Bonneau
- Department of Computer Science, New York University, New York, New York, United States of America
- Department of Biology, New York University, New York, New York, United States of America
- Center for Genomics and Systems Biology, Department of Biology, New York University, New York, New York, United States of America
| | - Dennis Shasha
- Department of Computer Science, New York University, New York, New York, United States of America
- Center for Genomics and Systems Biology, Department of Biology, New York University, New York, New York, United States of America
|
27
|
Wang X, Xie Y, Gao P, Zhang S, Tan H, Yang F, Lian R, Tian J, Xu G. A metabolomics-based method for studying the effect of yfcC gene in Escherichia coli on metabolism. Anal Biochem 2014; 451:48-55. [DOI: 10.1016/j.ab.2014.01.018] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 09/11/2013] [Revised: 01/07/2014] [Accepted: 01/21/2014] [Indexed: 10/25/2022]
|
28
|
Zhang ZY, Sun KD, Wang SQ. Enhanced community structure detection in complex networks with partial background information. Sci Rep 2013; 3:3241. [PMID: 24247657 PMCID: PMC4894381 DOI: 10.1038/srep03241] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.2] [Received: 04/05/2013] [Accepted: 10/31/2013] [Indexed: 11/09/2022] Open
Abstract
Community structure detection in complex networks is important since it can help better understand the network topology and how the network works. However, there is still not a clear and widely-accepted definition of community structure, and in practice, different models may give very different results of communities, making it hard to explain the results. In this paper, different from the traditional methodologies, we design an enhanced semi-supervised learning framework for community detection, which can effectively incorporate the available prior information to guide the detection process and can make the results more explainable. By logical inference, the prior information is more fully utilized. The experiments on both the synthetic and the real-world networks confirm the effectiveness of the framework.
Affiliation(s)
- Zhong-Yuan Zhang
- School of Statistics and Mathematics, Central University of Finance and Economics, P.R.China
|
29
|
Abstract
BACKGROUND Computational methods that make use of heterogeneous biological datasets to predict gene function provide a cost-effective and rapid way for annotating genomes. A common framework shared by many such methods is to construct a combined functional association network from multiple networks representing different sources of data, and use this combined network as input to network-based or kernel-based learning algorithms. In these methods, a key factor contributing to the prediction accuracy is the network quality, which is the ability of the network to reflect the functional relatedness of gene pairs. To improve the network quality, a large effort has been spent on developing methods for network integration. These methods, however, produce networks, which then remain unchanged, and nearly no effort has been made to optimize the networks after their construction. RESULTS Here, we propose an alternative method to improve the network quality. The proposed method takes as input a combined network produced by an existing network integration algorithm, and reconstructs this network to better represent the co-functionality relationships between gene pairs. At the core of the method is a learning algorithm that can learn a measure of functional similarity between genes, which we then use to reconstruct the input network. In experiments with yeast and human, the proposed method produced improved networks and achieved more accurate results than two other leading gene function prediction approaches. CONCLUSIONS The results show that it is possible to improve the accuracy of network-based gene function prediction methods by optimizing combined networks with appropriate similarity measures learned from data. The proposed learning procedure can handle noisy training data and scales well to large genomes.
Affiliation(s)
- Tu Minh Phuong
- Department of Computer Science, Posts & Telecommunications Institute of Technology, Hanoi, Viet Nam
| | - Ngo Phuong Nhung
- KRDB Research Center, Free University of Bolzano, Bolzano, Italy
|
30
|
Hu P, Jiang H, Emili A. Incorporating Correlations among Gene Ontology Terms into Predicting Protein Functions. Bioinformatics 2013. [DOI: 10.4018/978-1-4666-3604-0.ch045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2022] Open
Abstract
The authors describe a new strategy that has better prediction performance than previous methods, which gives additional insights about the importance of the dependence between functional terms when inferring protein function.
Affiliation(s)
- Pingzhao Hu
- York University, Canada & University of Toronto, Canada
|
33
|
Standage DS, Brendel VP. ParsEval: parallel comparison and analysis of gene structure annotations. BMC Bioinformatics 2012; 13:187. [PMID: 22852583 PMCID: PMC3439248 DOI: 10.1186/1471-2105-13-187] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Received: 02/07/2012] [Accepted: 07/16/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Accurate gene structure annotation is a fundamental but somewhat elusive goal of genome projects, as witnessed by the fact that (model) genomes typically undergo several cycles of re-annotation. In many cases, it is not only different versions of annotations that need to be compared but also different sources of annotation of the same genome, derived from distinct gene prediction workflows. Such comparisons are of interest to annotation providers, prediction software developers, and end-users, who all need to assess what is common and what is different among distinct annotation sources. We developed ParsEval, a software application for pairwise comparison of sets of gene structure annotations. ParsEval calculates several statistics that highlight the similarities and differences between the two sets of annotations provided. These statistics are presented in an aggregate summary report, with additional details provided as individual reports specific to non-overlapping, gene-model-centric genomic loci. Genome browser styled graphics embedded in these reports help visualize the genomic context of the annotations. Output from ParsEval is both easily read and parsed, enabling systematic identification of problematic gene models for subsequent focused analysis. RESULTS ParsEval is capable of analyzing annotations for large eukaryotic genomes on typical desktop or laptop hardware. In comparison to existing methods, ParsEval exhibits a considerable performance improvement, both in terms of runtime and memory consumption. Reports from ParsEval can provide relevant biological insights into the gene structure annotations being compared. CONCLUSIONS Implemented in C, ParsEval provides the quickest and most feature-rich solution for genome annotation comparison to date. The source code is freely available (under an ISC license) at http://parseval.sourceforge.net/.
Affiliation(s)
- Daniel S Standage
- Department of Genetics, Development, and Cell Biology, Iowa State University, Ames, Iowa 50011, USA
|
34
|
Constructing support vector machine ensemble with segmentation for imbalanced datasets. Neural Comput Appl 2012. [DOI: 10.1007/s00521-012-1041-z] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Indexed: 10/28/2022]
|
35
|
Jiang JQ, McQuay LJ. Predicting protein function by multi-label correlated semi-supervised learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2012; 9:1059-1069. [PMID: 22595236 DOI: 10.1109/tcbb.2011.156] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Indexed: 05/31/2023]
Abstract
Assigning biological functions to uncharacterized proteins is a fundamental problem in the postgenomic era. The increasing availability of large amounts of data on protein-protein interactions (PPIs) has led to the emergence of a considerable number of computational methods for determining protein function in the context of a network. These algorithms, however, treat each functional class in isolation and thereby often suffer from the difficulty of the scarcity of labeled data. In reality, different functional classes are naturally dependent on one another. We propose a new algorithm, Multi-label Correlated Semi-supervised Learning (MCSL), to incorporate the intrinsic correlations among functional classes into protein function prediction by leveraging the relationships provided by the PPI network and the functional class network. The guiding intuition is that the classification function should be sufficiently smooth on subgraphs where the respective topologies of these two networks are a good match. We encode this intuition as regularized learning with intraclass and interclass consistency, which can be understood as an extension of the graph-based learning with local and global consistency (LGC) method. Cross validation on the yeast proteome illustrates that MCSL consistently outperforms several state-of-the-art methods. Most notably, it effectively overcomes the problem associated with scarcity of label data. The supplementary files are freely available at http://sites.google.com/site/csaijiang/MCSL.
Affiliation(s)
- Jonathan Q Jiang
- Department of Information System, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong.
|
36
|
Kılıç C, Tan M. Positive unlabeled learning for deriving protein interaction networks. Network Modeling Analysis in Health Informatics and Bioinformatics 2012. [DOI: 10.1007/s13721-012-0012-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 10/28/2022]
|
37
|
Zhao XM, Cheung YM, Huang DS. Analysis of gene expression data using RPEM algorithm in normal mixture model with dynamic adjustment of learning rate. Int J Pattern Recogn 2011. [DOI: 10.1142/s0218001410008056] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 11/18/2022]
Abstract
Microarray technology is a useful tool for monitoring the expression levels of thousands of genes simultaneously. Recently, mixture modeling has been used to extract expression signatures from gene expression profiles. In general, two separate steps are utilized to estimate the number of classes and model parameters, respectively. However, such a method is often time-consuming and leads to suboptimal solutions. In this paper, we therefore apply a one-step approach, namely Rival Penalized Expectation-Maximization (RPEM) algorithm, to analyze the gene expression data. The RPEM algorithm is capable of estimating the parameters of normal mixture model, while determining the number of classes automatically at the same time. Furthermore, we speed up the learning procedure of RPEM by proposing a new mechanism to adjust the learning rate dynamically. The numerical results on real gene expression data demonstrate that our proposed method is indeed effective and efficient.
Affiliation(s)
- Xing-Ming Zhao
- Institute of Systems Biology, Shanghai University, Shanghai 200444, P. R. China
- Yiu-Ming Cheung
- Department of Computer Science, Hong Kong Baptist University, Hong Kong, P. R. China
- De-Shuang Huang
- Intelligent Computing Lab, Institute of Intelligent Machines, Chinese Academy of Sciences, P. O. Box 1130, Hefei, Anhui 230031, P. R. China
|
38
|
Sapkota A, Liu X, Zhao XM, Cao Y, Liu J, Liu ZP, Chen L. DIPOS: database of interacting proteins in Oryza sativa. Molecular BioSystems 2011; 7:2615-21. [PMID: 21713282 DOI: 10.1039/c1mb05120b] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Indexed: 01/09/2023]
Abstract
Rice is an important crop throughout the world and is the staple food for about half the world's population. For better breeding and improved production, we need to know the function of rice molecules, which carry out their function through interactions with each other. The database of interacting proteins in Oryza sativa (DIPOS) provides comprehensive information on interacting proteins in rice, where the interactions are predicted using two computational methods, i.e., interolog-based and domain-based methods. DIPOS contains 14 614 067 pairwise interactions among 27 746 proteins, covering about 41% of the whole Oryza sativa proteome. Furthermore, each interaction is assigned a confidence score, which further enables biologists to sort out the important proteins. Biological explanations of pathways and interactions are also provided based on the database. Public access to the DIPOS is available at and .
Affiliation(s)
- Achyut Sapkota
- Institute of Systems Biology, Shanghai University, Shanghai 200444, China
|
39
|
Ahmed KS, Saloma NH, Kadah YM. Improving the prediction of yeast protein function using weighted protein-protein interactions. Theor Biol Med Model 2011; 8:11. [PMID: 21524280 PMCID: PMC3104947 DOI: 10.1186/1742-4682-8-11] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 03/10/2011] [Accepted: 04/27/2011] [Indexed: 11/21/2022] Open
Abstract
Background Bioinformatics can be used to predict protein function, leading to an understanding of cellular activities, and equally-weighted protein-protein interactions (PPI) are normally used to predict such protein functions. The present study provides a weighting strategy for PPI to improve the prediction of protein functions. The weights are dependent on the local and global network topologies and the number of experimental verification methods. The proposed methods were applied to the yeast proteome and integrated with the neighbour counting method to predict the functions of unknown proteins. Results A new technique to weight interactions in the yeast proteome is presented. The weights are related to the network topology (local and global) and the number of identified methods, and the results revealed improvement in the sensitivity and specificity of prediction in terms of cellular role and cellular locations. This method (new weights) was compared with a method that utilises interactions with the same weight and it was shown to be superior. Conclusions A new method for weighting the interactions in protein-protein interaction networks is presented. Experimental results concerning yeast proteins demonstrated that weighting interactions integrated with the neighbor counting method improved the sensitivity and specificity of prediction in terms of two functional categories: cellular role and cell locations.
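As an illustration of the idea in this abstract, the following is a minimal sketch of neighbour counting with weighted interactions. It is not the authors' implementation: the function name `predict_function`, the toy proteins, and the edge weights are all hypothetical, and the abstract does not give the actual weighting formula (which depends on network topology and the number of experimental methods), so the weights are simply supplied as inputs.

```python
from collections import defaultdict


def predict_function(protein, interactions, annotations, top_k=1):
    """Rank candidate functions for `protein` by summed neighbour edge weight.

    interactions: dict mapping an undirected pair (a, b) to a weight.
    annotations:  dict mapping a protein to its set of known functions.
    """
    scores = defaultdict(float)
    for (a, b), weight in interactions.items():
        if protein not in (a, b):
            continue
        neighbour = b if a == protein else a
        # Each annotated neighbour votes for its functions with the edge weight.
        for function in annotations.get(neighbour, ()):
            scores[function] += weight
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Hypothetical toy network: P1 is the protein of unknown function.
interactions = {
    ("P1", "P2"): 0.9,
    ("P1", "P3"): 0.2,
    ("P1", "P4"): 0.3,
    ("P2", "P3"): 0.5,
}
annotations = {
    "P2": {"transport"},
    "P3": {"metabolism"},
    "P4": {"metabolism"},
}

weighted = predict_function("P1", interactions, annotations)
# Equal weights reproduce plain (unweighted) neighbour counting.
unweighted = predict_function(
    "P1", {pair: 1.0 for pair in interactions}, annotations
)
```

In this toy case the two schemes disagree: equal weights favour the function held by the larger number of neighbours, while the weighted version favours the function supported by the single high-confidence interaction.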
Affiliation(s)
- Khaled S Ahmed
- Department of Bio-electronics, MTI, El-Haddaba Elwosta, Cairo, Egypt
|
40
|
Zhou YZ, Gao Y, Zheng YY. Prediction of Protein-Protein Interactions Using Local Description of Amino Acid Sequence. Communications in Computer and Information Science 2011. [DOI: 10.1007/978-3-642-22456-0_37] [Citation(s) in RCA: 54] [Impact Index Per Article: 4.2] [Indexed: 12/04/2022]
|
41
|
Liu X, Tang WH, Zhao XM, Chen L. A network approach to predict pathogenic genes for Fusarium graminearum. PLoS One 2010; 5:e13021. [PMID: 20957229 PMCID: PMC2949387 DOI: 10.1371/journal.pone.0013021] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.1] [Received: 01/18/2010] [Accepted: 08/17/2010] [Indexed: 11/18/2022] Open
Abstract
Fusarium graminearum is the pathogenic agent of Fusarium head blight (FHB), a destructive disease of wheat and barley that causes huge economic losses and, by contaminating food, health problems in humans. Identifying pathogenic genes can shed light on the pathogenesis underlying the interaction between F. graminearum and its plant host. However, it is difficult to detect pathogenic genes for this destructive pathogen by time-consuming and expensive molecular biology experiments in the lab. Computational methods provide an alternative way to solve this problem. Since pathogenesis is a complicated procedure that involves complex regulations and interactions, the molecular interaction network of F. graminearum can give clues to potential pathogenic genes. Furthermore, the gene expression data of F. graminearum before and after its invasion of the plant host can also provide useful information. In this paper, a novel systems biology approach is presented to predict pathogenic genes of F. graminearum based on the molecular interaction network and gene expression data. With a small number of known pathogenic genes as seed genes, a subnetwork that consists of potential pathogenic genes is identified from the protein-protein interaction network (PPIN) of F. graminearum, where the genes in the subnetwork are further required to be differentially expressed before and after the invasion of the pathogenic fungus. Therefore, the candidate genes in the subnetwork are expected to be involved in the same biological processes as the seed genes, which implies that they are potential pathogenic genes. The prediction results show that most of the pathogenic genes of F. graminearum are enriched in two important signal transduction pathways, the G protein-coupled receptor pathway and the MAPK signaling pathway, which are known to be related to pathogenesis in other fungi. In addition, several pathogenic genes predicted by our method are verified in other pathogenic fungi, which demonstrates the effectiveness of the proposed method. The results presented in this paper not only provide guidelines for future experimental verification, but also shed light on the pathogenesis of the destructive fungus F. graminearum.
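The seed-based selection described in this abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the published method: the abstract does not say how the subnetwork is extracted, so the sketch assumes candidates are direct PPI neighbours of seed genes, and the function name `candidate_genes` and all gene identifiers are hypothetical.

```python
def candidate_genes(ppi_edges, seeds, diff_expressed):
    """Return non-seed genes that interact with a seed and are differentially expressed.

    ppi_edges:      iterable of undirected (gene_a, gene_b) interactions.
    seeds:          set of known pathogenic genes.
    diff_expressed: set of genes differentially expressed before vs. after invasion.
    """
    neighbours = set()
    for a, b in ppi_edges:
        # Collect every gene that shares an edge with at least one seed.
        if a in seeds:
            neighbours.add(b)
        if b in seeds:
            neighbours.add(a)
    # Keep only differentially expressed neighbours that are not seeds themselves.
    return (neighbours & diff_expressed) - seeds


# Hypothetical toy data: S1 and S2 are seed genes.
ppi_edges = [("S1", "G1"), ("S1", "G2"), ("G2", "G3"), ("S2", "G4"), ("S1", "S2")]
seeds = {"S1", "S2"}
diff_expressed = {"G1", "G3", "G4", "S2"}

candidates = candidate_genes(ppi_edges, seeds, diff_expressed)
```

Here G2 is dropped because it is not differentially expressed, and G3 is dropped because it has no direct interaction with a seed; a multi-hop or scored subnetwork search would treat G3 differently.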
Affiliation(s)
- Xiaoping Liu
- Institute of Systems Biology, Shanghai University, Shanghai, China
- School of Communication and Information Engineering, Shanghai University, Shanghai, China
- Wei-Hua Tang
- National Key Laboratory of Plant Molecular Genetics, Institute of Plant Physiology and Ecology, Chinese Academy of Sciences, Shanghai, China
- Xing-Ming Zhao
- Institute of Systems Biology, Shanghai University, Shanghai, China
- National Key Laboratory of Plant Molecular Genetics, Institute of Plant Physiology and Ecology, Chinese Academy of Sciences, Shanghai, China
- Luonan Chen
- Institute of Systems Biology, Shanghai University, Shanghai, China
- Key Laboratory of Systems Biology, SIBS-Novo Nordisk Translational Research Centre for PreDiabetes, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
|
42
|
Xia JF, Zhao XM, Huang DS. Predicting protein-protein interactions from protein sequences using meta predictor. Amino Acids 2010; 39:1595-9. [PMID: 20386937 DOI: 10.1007/s00726-010-0588-1] [Citation(s) in RCA: 65] [Impact Index Per Article: 4.6] [Received: 11/24/2009] [Accepted: 03/27/2010] [Indexed: 11/24/2022]
Abstract
A novel method is proposed for predicting protein-protein interactions (PPIs) based on a meta approach, which predicts PPIs using a support vector machine that combines the results of six independent state-of-the-art predictors. A significant improvement in prediction performance is observed on Saccharomyces cerevisiae and Helicobacter pylori datasets. In addition, we used the final prediction model trained on the PPI dataset of S. cerevisiae to predict interactions in other species. The results reveal that our meta model is also capable of performing cross-species predictions. The source code and the datasets are available at http://home.ustc.edu.cn/~jfxia/Meta_PPI.html.
Affiliation(s)
- Jun-Feng Xia
- Intelligent Computing Laboratory, Hefei Institute of Intelligent Machines, Chinese Academy of Sciences, P.O. Box 1130, Hefei, Anhui, 230031, China
|
43
|
Zhao XM, Zhang XW, Tang WH, Chen L. FPPI: Fusarium graminearum protein-protein interaction database. J Proteome Res 2010; 8:4714-21. [PMID: 19673500 DOI: 10.1021/pr900415b] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.4] [Indexed: 11/29/2022]
Abstract
The fungal pathogen Fusarium graminearum (teleomorph Gibberella zeae) is the causal agent of several destructive crop diseases. Identifying interactions among F. graminearum proteins and understanding their functions can provide insights into pathogenic mechanisms underlying F. graminearum-host interactions. The F. graminearum protein-protein interaction (FPPI) database provides comprehensive information on protein-protein interactions (PPIs) of F. graminearum, predicted based on both interologs from several PPI databases of seven species and domain-domain interactions experimentally determined based on protein structures. FPPI contains 223,166 interactions among 7406 proteins for F. graminearum. To the best of our knowledge, it is the first PPI map for this destructive fungus, which is thereby expected to shed light on biological functions of F. graminearum proteins. The predicted interactome covers about 52% of the whole F. graminearum proteome, and each interaction is assigned a score as the confidence for the predicted PPI. In particular, we constructed a core PPI data set with high confidence that consists of 27,102 interactions and 3745 proteins. To verify the reliability of the predicted interactome, we conducted yeast two-hybrid experiments on 3 randomly selected predictions from the core PPI data set, among which one pair of proteins was confirmed to indeed interact with each other, thereby proving high confidence in the core PPI data set. In addition, FPPI contains other functional information for F. graminearum genes, including homologues in other species deposited in different databases, the inferred functional characteristics, and so on. We further constructed an intuitive query interface for the database that provides easy access to the important features of proteins. In summary, FPPI is a rich source of information for system-level understanding of gene functions and biological processes in F. graminearum. Public access to the FPPI database is available at http://csb.shu.edu.cn/fppi.
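Interolog-based prediction, one of the two sources this abstract mentions, can be illustrated with a short sketch: an interaction known in a source species is transferred to the target species when both partners have orthologs there. The code below is an assumption-laden toy, not the FPPI pipeline; the function name `transfer_interologs`, the identifiers, and the proteome size are hypothetical, and confidence scoring and domain-domain evidence are left out.

```python
from itertools import product


def transfer_interologs(source_ppis, orthologs):
    """Predict target-species interactions from source-species interactions.

    source_ppis: iterable of (protein_a, protein_b) pairs in the source species.
    orthologs:   dict mapping a source protein to its target-species orthologs.
    """
    predicted = set()
    for a, b in source_ppis:
        # Every ortholog of `a` is paired with every ortholog of `b`;
        # a pair is skipped automatically if either side has no ortholog.
        for ta, tb in product(orthologs.get(a, ()), orthologs.get(b, ())):
            predicted.add(tuple(sorted((ta, tb))))  # store as undirected pair
    return predicted


# Hypothetical toy data: Y* are source-species proteins, FG* target proteins.
source_ppis = [("YA", "YB"), ("YB", "YC"), ("YC", "YD")]
orthologs = {"YA": ["FG1"], "YB": ["FG2", "FG3"], "YC": ["FG4"]}

predicted = transfer_interologs(source_ppis, orthologs)

# Proteome coverage, in the sense of the percentage quoted in the abstract:
# the fraction of target proteins that take part in a predicted interaction.
covered = {protein for pair in predicted for protein in pair}
proteome_size = 8  # hypothetical
coverage = len(covered) / proteome_size
```

The YC-YD interaction yields no prediction because YD has no ortholog in the mapping, which is why interolog transfer alone typically covers only part of a proteome.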
Affiliation(s)
- Xing-Ming Zhao
- Institute of Plant Physiology and Ecology, Chinese Academy of Sciences, Shanghai 200032, China.
|
44
|
Identifying translation initiation sites in prokaryotes using support vector machine. J Theor Biol 2010; 262:644-9. [DOI: 10.1016/j.jtbi.2009.10.023] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 04/24/2009] [Revised: 10/12/2009] [Accepted: 10/12/2009] [Indexed: 11/17/2022]
|
45
|
Yang P, Xu L, Zhou BB, Zhang Z, Zomaya AY. A particle swarm based hybrid system for imbalanced medical data sampling. BMC Genomics 2009; 10 Suppl 3:S34. [PMID: 19958499 PMCID: PMC2788388 DOI: 10.1186/1471-2164-10-s3-s34] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.1] [Indexed: 01/13/2023] Open
Abstract
BACKGROUND Medical and biological data commonly have small sample sizes, missing values, and, most importantly, imbalanced class distributions. In this study we propose a particle swarm based hybrid system for remedying the class imbalance problem in medical and biological data mining. This hybrid system combines the particle swarm optimization (PSO) algorithm with multiple classifiers and evaluation metrics for evaluation fusion. Samples from the majority class are ranked using multiple objectives according to their merit in compensating for the class imbalance, and then combined with the minority class to form a balanced dataset. RESULTS One important finding of this study is that different classifiers and metrics often provide different evaluation results. Nevertheless, the proposed hybrid system demonstrates consistent improvements over several alternative methods with three different metrics. The sampling results also demonstrate good generalization on different types of classification algorithms, indicating the advantage of information fusion applied in the hybrid system. CONCLUSION The experimental results demonstrate that, unlike many currently available methods, which often perform unevenly with different datasets, the proposed hybrid system has a better generalization property, which alleviates the method-data dependency problem. From the biological perspective, the system provides an indication for further investigation of the highly ranked samples, which may result in the discovery of new conditions or disease subtypes.
Affiliation(s)
- Pengyi Yang
- School of Information Technologies (J12), The University of Sydney, NSW 2006, Australia.
|
46
|
Zhao XM, Chen L, Aihara K. A discriminative approach for identifying domain-domain interactions from protein-protein interactions. Proteins 2009; 78:1243-53. [DOI: 10.1002/prot.22643] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.4] [Indexed: 11/06/2022]
|
47
|
Li X, Chen H, Li J, Zhang Z. Gene function prediction with gene interaction networks: a context graph kernel approach. IEEE Transactions on Information Technology in Biomedicine 2009; 14:119-28. [PMID: 19789115 DOI: 10.1109/titb.2009.2033116] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.3] [Indexed: 11/10/2022]
Abstract
Predicting gene functions is a challenge for biologists in the postgenomic era. Interactions among genes and their products compose networks that can be used to infer gene functions. Most previous studies adopt a linkage assumption, i.e., they assume that gene interactions indicate functional similarities between connected genes. In this study, we propose to use a gene's context graph, i.e., the gene interaction network associated with the focal gene, to infer its functions. In a kernel-based machine-learning framework, we design a context graph kernel to capture the information in context graphs. Our experimental study on a testbed of p53-related genes demonstrates the advantage of using indirect gene interactions and shows the empirical superiority of the proposed approach over linkage-assumption-based methods, such as the algorithm to minimize inconsistent connected genes and diffusion kernels.
Affiliation(s)
- Xin Li
- Department of Information Systems, City University of Hong Kong, Kowloon Tong, Hong Kong.
|
48
|
Zhao XM, Chen L, Aihara K. Protein function prediction with high-throughput data. Amino Acids 2008; 35:517-30. [PMID: 18427717 DOI: 10.1007/s00726-008-0077-y] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Received: 01/31/2008] [Accepted: 03/13/2008] [Indexed: 12/12/2022]
Abstract
Protein function prediction is one of the main challenges in the post-genomic era. The availability of large amounts of high-throughput data provides an alternative approach to handling this problem from the computational viewpoint. In this review, we provide a comprehensive description of the computational methods that are currently applicable to protein function prediction, especially from the perspective of machine learning. Machine learning techniques can generally be classified as supervised learning, semi-supervised learning and unsupervised learning. By classifying the existing computational methods for protein annotation into these three groups, we are able to present a comprehensive framework for protein annotation based on machine learning techniques. In addition to describing recently developed theoretical methodologies, we also cover representative databases and software tools that are widely utilized in the prediction of protein function.
Affiliation(s)
- Xing-Ming Zhao
- ERATO Aihara Complexity Modelling Project, JST, Tokyo, 151-0064, Japan
|
49
|
Liu ZP, Wu LY, Wang Y, Zhang XS, Chen L. Bridging protein local structures and protein functions. Amino Acids 2008; 35:627-50. [PMID: 18421562 PMCID: PMC7088341 DOI: 10.1007/s00726-008-0088-8] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Received: 02/21/2008] [Accepted: 03/10/2008] [Indexed: 12/11/2022]
Abstract
One of the major goals of molecular and evolutionary biology is to understand the functions of proteins by extracting functional information from protein sequences, structures and interactions. In this review, we summarize the repertoire of methods currently being applied and report recent progress in the field of in silico annotation of protein function based on the accumulation of vast amounts of sequence and structure data. In particular, we emphasize the newly developed structure-based methods, which are able to identify local structural motifs and reveal their relationship with protein functions. These methods include computational tools to identify the structural motifs and reveal the strong relationship between these pre-computed local structures and protein functions. We also discuss remaining problems and possible directions for this exciting and challenging area.
Affiliation(s)
- Zhi-Ping Liu
- Academy of Mathematics and Systems Science, Chinese Academy of Sciences, 100080, Beijing, China
|