1
Wen JW, Zhang HL, Du PF. Vislocas: Vision transformers for identifying protein subcellular mis-localization signatures of different cancer subtypes from immunohistochemistry images. Comput Biol Med 2024; 174:108392. [PMID: 38608321 DOI: 10.1016/j.compbiomed.2024.108392] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2024] [Revised: 03/22/2024] [Accepted: 04/01/2024] [Indexed: 04/14/2024]
Abstract
Proteins must be sorted to specific subcellular compartments to perform their functions. Abnormal protein subcellular localizations are related to many diseases. Although many efforts have been made to predict protein subcellular localization from various kinds of static information, including sequences, structures and interactions, such static information cannot predict protein mis-localization events in diseases. In contrast, immunohistochemistry (IHC) images, which are widely applied in clinical diagnosis, contain information that can be used to find protein mis-localization events in disease states. In this study, we create the Vislocas method, which is capable of finding mis-localized proteins from IHC images as markers of cancer subtypes. By combining CNNs and vision transformer encoders, Vislocas can automatically extract image features at both the global and local levels. Vislocas can be trained with full-sized IHC images from scratch. It is the first attempt to create an end-to-end IHC image-based protein subcellular location predictor. Vislocas achieved comparable or better performance than state-of-the-art methods. We applied Vislocas to find significant protein mis-localization events in different subtypes of glioma, melanoma and skin cancer. The mis-localized proteins, which were found purely from IHC images by Vislocas, are consistent with clinical or experimental results in the literature. All code of Vislocas has been deposited in a GitHub repository (https://github.com/JingwenWen99/Vislocas). All datasets of Vislocas have been deposited in Zenodo (https://zenodo.org/records/10632698).
Affiliation(s)
- Jing-Wen Wen
- College of Intelligence and Computing, Tianjin University, Tianjin, 300350, China.
- Han-Lin Zhang
- College of Intelligence and Computing, Tianjin University, Tianjin, 300350, China.
- Pu-Feng Du
- College of Intelligence and Computing, Tianjin University, Tianjin, 300350, China.
2
Hu Y, Liu C, Yang J, Zhong M, Qian B, Chen J, Zhang Y, Song J. HMGB1 is involved in viral replication and the inflammatory response in coxsackievirus A16-infected 16HBE cells via proteomic analysis and identification. Virol J 2023; 20:178. [PMID: 37559147 PMCID: PMC10410909 DOI: 10.1186/s12985-023-02150-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2023] [Accepted: 08/02/2023] [Indexed: 08/11/2023] Open
Abstract
Coxsackievirus A16 (CV-A16) remains an important pathogen that causes hand, foot and mouth disease (HFMD) in young children and infants worldwide. Previous studies indicated that CV-A16 infection is usually mild or self-limiting, but it can also trigger severe neurological complications and even death. However, there are currently no vaccines or antiviral compounds available to either prevent or treat CV-A16 infection. Therefore, investigation of the virus‒host interaction and identification of host proteins that play a crucial regulatory role in the pathogenesis of CV-A16 infection may provide a novel strategy to develop antiviral drugs. Here, to increase our understanding of the interaction of CV-A16 with the host cell, we analyzed changes in the proteome of 16HBE cells in response to CV-A16 using tandem mass tag (TMT) in combination with LC‒MS/MS. There were 6615 proteins quantified, and 172 proteins showed a significant alteration during CV-A16 infection. These differentially regulated proteins were involved in fundamental biological processes and signaling pathways, including metabolic processes, cytokine‒cytokine receptor interactions, B-cell receptor signaling pathways, and neuroactive ligand‒receptor interactions. Further bioinformatics analysis revealed the characteristics of the protein domains and subcellular localization of these differentially expressed proteins. Then, to validate the proteomics data, 3 randomly selected proteins were examined by Western blotting and immunofluorescence, and their changes in expression were consistent with the TMT results. Finally, among these differentially regulated proteins, we primarily focused on HMGB1 based on its potential effects on viral replication and virus infection-induced inflammatory responses.
Overexpression of HMGB1 decreased viral replication and upregulated the release of inflammatory cytokines, whereas deletion of HMGB1 increased viral replication and downregulated the release of inflammatory cytokines. In conclusion, the results of this study help further elucidate the potential molecular pathogenesis of CV-A16 based on numerous protein changes and the functions of HMGB1, which was found to be involved in viral replication and the inflammatory response; this may facilitate the development of new antiviral therapies as well as innovative diagnostic methods.
Affiliation(s)
- Yajie Hu
- Department of Pulmonary and Critical Care Medicine, The First People's Hospital of Yunnan Province, Kunming, China
- The Affiliated Hospital of Kunming University of Science and Technology, Kunming, Yunnan, China
- Chen Liu
- Department of Pulmonary and Critical Care Medicine, The First People's Hospital of Yunnan Province, Kunming, China
- The Affiliated Hospital of Kunming University of Science and Technology, Kunming, Yunnan, China
- Jinghui Yang
- The Affiliated Hospital of Kunming University of Science and Technology, Kunming, Yunnan, China
- Department of Pediatrics, The First People's Hospital of Yunnan Province, Kunming, China
- Mingmei Zhong
- Department of Pulmonary and Critical Care Medicine, The First People's Hospital of Yunnan Province, Kunming, China
- The Affiliated Hospital of Kunming University of Science and Technology, Kunming, Yunnan, China
- Baojiang Qian
- Department of Pulmonary and Critical Care Medicine, The First People's Hospital of Yunnan Province, Kunming, China
- The Affiliated Hospital of Kunming University of Science and Technology, Kunming, Yunnan, China
- Juan Chen
- Department of Pulmonary and Critical Care Medicine, The First People's Hospital of Yunnan Province, Kunming, China
- The Affiliated Hospital of Kunming University of Science and Technology, Kunming, Yunnan, China
- Yunhui Zhang
- Department of Pulmonary and Critical Care Medicine, The First People's Hospital of Yunnan Province, Kunming, China.
- The Affiliated Hospital of Kunming University of Science and Technology, Kunming, Yunnan, China.
- Jie Song
- Institute of Medical Biology, Chinese Academy of Medical Science and Peking Union Medical College, Yunnan Key Laboratory of Vaccine Research and Development on Severe Infectious Diseases, Kunming, China.
3
Song J, Zhao G, Li H, Yang Y, Yu Y, Hu Y, Li Y, Li J, Hu Y. Tandem mass tag (TMT) labeling-based quantitative proteomic analysis reveals the cellular protein characteristics of 16HBE cells infected with coxsackievirus A10 and the potential effect of HMGB1 on viral replication. Arch Virol 2023; 168:217. [PMID: 37524962 DOI: 10.1007/s00705-023-05821-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2023] [Accepted: 05/23/2023] [Indexed: 08/02/2023]
Abstract
Coxsackievirus A10 (CV-A10) is recognized as one of the most important pathogens associated with hand, foot, and mouth disease (HFMD) in young children under 5 years of age worldwide, and it can lead to fatal neurological complications. However, available commercial vaccines fail to protect against CV-A10. Therefore, there is an urgent need to study new protein targets of CV-A10 and develop novel vaccine-based therapeutic strategies. Advances in proteomics in recent years have enabled a comprehensive understanding of host-pathogen interactions. Here, to study CV-A10-host interactions, a global quantitative proteomic analysis was conducted to investigate the molecular characteristics of host cell proteins and identify key host proteins involved in CV-A10 infection. Using tandem mass tagging (TMT)-based mass spectrometry, a total of 6615 host proteins were quantified, with 293 proteins being differentially regulated. To ensure the validity and reliability of the proteomics data, three randomly selected proteins were verified by Western blot analysis, and the results were consistent with the TMT results. Further functional analysis showed that the upregulated and downregulated proteins were associated with diverse biological activities and signaling pathways, such as metabolic processes, biosynthetic processes, the AMPK signaling pathway, the neurotrophin signaling pathway, the MAPK signaling pathway, and GABAergic synaptic signaling. Moreover, subsequent bioinformatics analysis demonstrated that these differentially expressed proteins contained distinct domains, were localized in different subcellular components, and generated a complex network. Finally, high-mobility group box 1 (HMGB1) might be a key host factor involved in CV-A10 replication.
In summary, our findings provide comprehensive insights into the proteomic profile during CV-A10 infection, deepen our understanding of the relationship between CV-A10 and host cells, and establish a proteomic signature for this viral infection. Moreover, the observed effect of HMGB1 on CV-A10 replication suggests that it might be a potential therapeutic target for the treatment of CV-A10 infection.
Affiliation(s)
- Jie Song
- Institute of Medical Biology, Yunnan Key Laboratory of Vaccine Research and Development on Severe Infectious Diseases, Chinese Academy of Medical Science and Peking Union Medical College, Kunming, China.
- Guifang Zhao
- Institute of Medical Biology, Yunnan Key Laboratory of Vaccine Research and Development on Severe Infectious Diseases, Chinese Academy of Medical Science and Peking Union Medical College, Kunming, China
- Hui Li
- Institute of Medical Biology, Yunnan Key Laboratory of Vaccine Research and Development on Severe Infectious Diseases, Chinese Academy of Medical Science and Peking Union Medical College, Kunming, China
- Yan Yang
- Institute of Medical Biology, Yunnan Key Laboratory of Vaccine Research and Development on Severe Infectious Diseases, Chinese Academy of Medical Science and Peking Union Medical College, Kunming, China
- Yue Yu
- Institute of Medical Biology, Yunnan Key Laboratory of Vaccine Research and Development on Severe Infectious Diseases, Chinese Academy of Medical Science and Peking Union Medical College, Kunming, China
- Yunguang Hu
- Institute of Medical Biology, Yunnan Key Laboratory of Vaccine Research and Development on Severe Infectious Diseases, Chinese Academy of Medical Science and Peking Union Medical College, Kunming, China
- Yadong Li
- Institute of Medical Biology, Yunnan Key Laboratory of Vaccine Research and Development on Severe Infectious Diseases, Chinese Academy of Medical Science and Peking Union Medical College, Kunming, China
- Jiang Li
- Institute of Medical Biology, Yunnan Key Laboratory of Vaccine Research and Development on Severe Infectious Diseases, Chinese Academy of Medical Science and Peking Union Medical College, Kunming, China
- Yajie Hu
- Department of Pulmonary and Critical Care Medicine, The First People's Hospital of Yunnan Province, Kunming, China.
4
Wang RH, Luo T, Zhang HL, Du PF. PLA-GNN: Computational inference of protein subcellular location alterations under drug treatments with deep graph neural networks. Comput Biol Med 2023; 157:106775. [PMID: 36921458 DOI: 10.1016/j.compbiomed.2023.106775] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 01/24/2023] [Revised: 02/21/2023] [Accepted: 03/09/2023] [Indexed: 03/12/2023]
Abstract
Aberrant protein sorting has been observed in many conditions, including complex diseases, drug treatments, and environmental stresses. It is important to systematically identify protein mis-localization events in a given condition. Experimental methods for finding mis-localized proteins are always costly and time-consuming. Predicting protein subcellular localizations has been studied for many years. However, only a handful of existing works considered protein subcellular location alterations. We proposed a computational method for identifying alterations of protein subcellular locations under drug treatments. We took three drugs, TSA (trichostatin A), bortezomib and tacrolimus, as examples for this study. By introducing dynamic protein-protein interaction networks, graph neural network algorithms were applied to aggregate topological information under different conditions. We systematically reported potential protein mis-localization events under drug treatments. As far as we know, this is the first attempt to find protein mis-localization events computationally in drug treatment conditions. The literature supports that a number of proteins, which are highly related to the pharmacological mechanisms of these drugs, may undergo protein localization alterations. We name our method PLA-GNN (Protein Localization Alteration by Graph Neural Networks). It can be extended to other drugs and other conditions. All datasets and code of this study have been deposited in a GitHub repository (https://github.com/quinlanW/PLA-GNN).
Affiliation(s)
- Ren-Hua Wang
- College of Intelligence and Computing, Tianjin University, Tianjin, 300350, China.
- Tao Luo
- College of Intelligence and Computing, Tianjin University, Tianjin, 300350, China.
- Han-Lin Zhang
- College of Intelligence and Computing, Tianjin University, Tianjin, 300350, China.
- Pu-Feng Du
- College of Intelligence and Computing, Tianjin University, Tianjin, 300350, China.
5
Mou M, Pan Z, Lu M, Sun H, Wang Y, Luo Y, Zhu F. Application of Machine Learning in Spatial Proteomics. J Chem Inf Model 2022; 62:5875-5895. [PMID: 36378082 DOI: 10.1021/acs.jcim.2c01161] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Indexed: 11/16/2022]
Abstract
Spatial proteomics is an interdisciplinary field that investigates the localization and dynamics of proteins, and it has gained extensive attention in recent years, especially subcellular proteomics. A large body of evidence indicates that the subcellular localization of proteins is associated with various cellular processes and disease progression. Mass spectrometry (MS)-based and imaging-based experimental approaches have been developed to acquire large-scale spatial proteomic data. To allow the reliable analysis of increasingly complex spatial proteomics data, machine learning (ML) methods have been widely used in both MS-based and imaging-based spatial proteomic data analysis pipelines. Here, we comprehensively survey the applications of ML in spatial proteomics from the following aspects: (1) data resources for the spatial proteome are comprehensively introduced; (2) the roles of different ML algorithms in data analysis pipelines are elaborated; (3) successful applications of spatial proteomics and several analytical tools integrating ML methods are presented; (4) challenges existing in modern ML-based spatial proteomics studies are discussed. This review provides guidelines for researchers seeking to apply ML methods to analyze spatial proteomic data and can facilitate insightful understanding of cell biology as well as future research in the medical and drug discovery communities.
Affiliation(s)
- Minjie Mou
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Ziqi Pan
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Mingkun Lu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Huaicheng Sun
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Yunxia Wang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Yongchao Luo
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Feng Zhu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
6
Li GP, Du PF, Shen ZA, Liu HY, Luo T. DPPN-SVM: Computational Identification of Mis-Localized Proteins in Cancers by Integrating Differential Gene Expressions With Dynamic Protein-Protein Interaction Networks. Front Genet 2020; 11:600454. [PMID: 33193746 PMCID: PMC7644922 DOI: 10.3389/fgene.2020.600454] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 08/30/2020] [Accepted: 10/07/2020] [Indexed: 12/29/2022] Open
Abstract
Eukaryotic cells contain numerous components, which are known as subcellular compartments or subcellular organelles. Proteins must be sorted to the proper subcellular compartments to carry out their molecular functions. Mis-localized proteins are related to various cancers. Identifying mis-localized proteins is important in understanding the pathology of cancers and in developing therapies. However, experimental methods, which are used to determine protein subcellular locations, are always costly and time-consuming. We tried to identify cancer-related mis-localized proteins in three different cancers using computational approaches. By integrating gene expression profiles and dynamic protein-protein interaction networks, we established DPPN-SVM (Dynamic Protein-Protein Network with Support Vector Machine), a predictive model using the SVM classifier with diffusion kernels. With this predictive model, we identified a number of mis-localized proteins. Since we introduced the dynamic protein-protein network, which has never been considered in existing works, our model is capable of identifying more mis-localized proteins than existing studies. As far as we know, this is the first study to incorporate dynamic protein-protein interaction networks in identifying mis-localized proteins in cancers.
Affiliation(s)
- Guang-Ping Li
- College of Intelligence and Computing, Tianjin University, Tianjin, China
- Pu-Feng Du
- College of Intelligence and Computing, Tianjin University, Tianjin, China
- Zi-Ang Shen
- College of Intelligence and Computing, Tianjin University, Tianjin, China
- Hang-Yu Liu
- College of Intelligence and Computing, Tianjin University, Tianjin, China
- Tao Luo
- College of Intelligence and Computing, Tianjin University, Tianjin, China
7
Miao YY, Zhao W, Li GP, Gao Y, Du PF. Predicting Endoplasmic Reticulum Resident Proteins Using Auto-Cross Covariance Transformation With a U-Shaped Residue Weight-Transfer Function. Front Genet 2020; 10:1231. [PMID: 31921288 PMCID: PMC6932965 DOI: 10.3389/fgene.2019.01231] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/09/2019] [Accepted: 11/06/2019] [Indexed: 11/13/2022] Open
Abstract
Background: The endoplasmic reticulum (ER) is an important organelle in eukaryotic cells. It is involved in many important biological processes, such as cell metabolism, protein synthesis, and post-translational modification. The proteins that reside within the ER are called ER-resident proteins. These proteins are closely related to the biological functions of the ER. The difference between ER-resident proteins and non-resident proteins should be carefully studied. Methods: We developed a support vector machine (SVM)-based method. We developed a U-shaped weight-transfer function and used it, along with the positional-specific physicochemical properties (PSPCP), to integrate sequence order information, signal peptide information, and evolutionary information. Results: Our method achieved over 86% accuracy in a jackknife test. We also achieved roughly 86% sensitivity and 67% specificity in an independent dataset test. Our method is capable of identifying ER-resident proteins.
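Illustrative note (not from the paper): the accuracy, sensitivity, and specificity quoted in this abstract are standard confusion-matrix ratios. The sketch below shows how they are typically computed for a binary predictor; the function name and the toy labels are assumptions made for illustration and do not reproduce the authors' data or code.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity, accuracy) for binary labels.

    Label 1 stands for the positive class (here, hypothetically, ER-resident)
    and label 0 for the negative class.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    accuracy = (tp + tn) / len(y_true) if y_true else 0.0
    return sensitivity, specificity, accuracy


# Toy labels, invented for this example only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
sens, spec, acc = sensitivity_specificity(y_true, y_pred)
print(sens, spec, acc)
```

A predictor can score high sensitivity with noticeably lower specificity, as reported above, when it labels many non-resident proteins as resident.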
Affiliation(s)
- Yang-Yang Miao
- College of Intelligence and Computing, Tianjin University, Tianjin, China
- School of Chemical Engineering, Tianjin University, Tianjin, China
- Wei Zhao
- College of Intelligence and Computing, Tianjin University, Tianjin, China
- Guang-Ping Li
- College of Intelligence and Computing, Tianjin University, Tianjin, China
- Yang Gao
- School of Medicine, Nankai University, Tianjin, China
- Pu-Feng Du
- College of Intelligence and Computing, Tianjin University, Tianjin, China
8
Hoseini ASH, Mirzarezaee M. Prediction of Protein Sub-Mitochondria Locations Using Protein Interaction Networks. IRANIAN JOURNAL OF BIOTECHNOLOGY 2018; 16:e1933. [PMID: 31457027 PMCID: PMC6697825 DOI: 10.15171/ijb.1933] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 05/29/2017] [Revised: 01/11/2018] [Accepted: 01/13/2018] [Indexed: 01/09/2023]
Abstract
Background: Prediction of protein localization is among the most important issues in bioinformatics and is used to predict the locations of proteins in cells and organelles such as mitochondria. In this study, several machine learning algorithms are applied for the prediction of intracellular protein locations. These algorithms use features extracted from protein sequences. In contrast, protein interactions have been less investigated. Objectives: As protein interactions usually occur in the same or adjacent places, using this feature to find the location should be efficient and effective. This study did not aim at increasing the total accuracy of the conducted research. Instead, it focused on features derived from protein interactions and on how their use leads to higher accuracy. Materials and Methods: In this study, we examined the protein interaction network as one of the features for prediction of protein localization and its effects on the prediction results. In this regard, we gathered some of the most common features, including Amino Acid Composition, Dipeptide Compositions, Pseudo Amino Acid Compositions (PseAAC), Position Specific Scoring Matrix (PSSM), Functional Domain, Gene Ontology information, and pair-wise sequence alignment. The results of the classification are compared to those obtained using protein interactions. To achieve this goal, different machine learning algorithms were tested. Results: The best results with a single feature set were obtained using the SVM classifier with the PseAAC feature. The accuracy of combining all features with PPI data, using the Decision Tree and Random Forest classifiers, was 82.49% and 83.35%, respectively. In another experiment, using only protein interaction data with different cut-off points gave an accuracy of 93.035% for protein location prediction.
Conclusion: Overall, it was shown that protein interactions have a significant impact on the prediction of mitochondrial protein locations. This feature alone can distinguish the locations well. Using this feature, the accuracy of the results is raised by up to 5%.
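Illustrative note (not from the paper): two of the sequence features listed in this abstract, Amino Acid Composition and Dipeptide Composition, are simple frequency vectors over the 20 standard residues. The sketch below shows one common way to compute them; the function names, the handling of non-standard residues (skipped), and the toy sequence are assumptions for illustration only.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """Return a 20-dimensional vector of residue frequencies."""
    counts = dict.fromkeys(AMINO_ACIDS, 0)
    for residue in seq.upper():
        if residue in counts:  # non-standard residues are skipped (assumption)
            counts[residue] += 1
    total = sum(counts.values())
    return [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return a 400-dimensional vector of adjacent residue-pair frequencies."""
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[dp] / total if total else 0.0 for dp in DIPEPTIDES]


# Toy sequence, invented for this example only.
aac = amino_acid_composition("ACCA")
dpc = dipeptide_composition("ACCA")
print(len(aac), len(dpc))
```

Vectors like these would be concatenated with other descriptors before being passed to a classifier; how the paper combined them with interaction data is not specified in the abstract.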
Affiliation(s)
- Mitra Mirzarezaee
- Department of Computer Engineering, Science and Research branch, Islamic Azad University, Tehran, Iran
- School of Biological Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
9
Qiao S, Yan B, Li J. Ensemble learning for protein multiplex subcellular localization prediction based on weighted KNN with different features. APPL INTELL 2017. [DOI: 10.1007/s10489-017-1029-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 12/14/2022]
10
Jiao YS, Du PF. Predicting protein submitochondrial locations by incorporating the positional-specific physicochemical properties into Chou's general pseudo-amino acid compositions. J Theor Biol 2017; 416:81-87. [DOI: 10.1016/j.jtbi.2016.12.026] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.7] [Received: 09/21/2016] [Revised: 12/06/2016] [Accepted: 12/30/2016] [Indexed: 11/26/2022]
11
Jiao Y, Du P. Performance measures in evaluating machine learning based bioinformatics predictors for classifications. QUANTITATIVE BIOLOGY 2016. [DOI: 10.1007/s40484-016-0081-2] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.8] [Indexed: 10/20/2022]
12
Breckels LM, Holden SB, Wojnar D, Mulvey CM, Christoforou A, Groen A, Trotter MWB, Kohlbacher O, Lilley KS, Gatto L. Learning from Heterogeneous Data Sources: An Application in Spatial Proteomics. PLoS Comput Biol 2016; 12:e1004920. [PMID: 27175778 PMCID: PMC4866734 DOI: 10.1371/journal.pcbi.1004920] [Citation(s) in RCA: 35] [Impact Index Per Article: 4.4] [Received: 10/07/2015] [Accepted: 04/16/2016] [Indexed: 11/19/2022] Open
Abstract
Sub-cellular localisation of proteins is an essential post-translational regulatory mechanism that can be assayed using high-throughput mass spectrometry (MS). These MS-based spatial proteomics experiments enable us to pinpoint the sub-cellular distribution of thousands of proteins in a specific system under controlled conditions. Recent advances in high-throughput MS methods have yielded a plethora of experimental spatial proteomics data for the cell biology community. Yet, there are many third-party data sources, such as immunofluorescence microscopy or protein annotations and sequences, which represent a rich and vast source of complementary information. We present a unique transfer learning classification framework that utilises a nearest-neighbour or support vector machine system, to integrate heterogeneous data sources to considerably improve on the quantity and quality of sub-cellular protein assignment. We demonstrate the utility of our algorithms through evaluation of five experimental datasets, from four different species in conjunction with four different auxiliary data sources to classify proteins to tens of sub-cellular compartments with high generalisation accuracy. We further apply the method to an experiment on pluripotent mouse embryonic stem cells to classify a set of previously unknown proteins, and validate our findings against a recent high resolution map of the mouse stem cell proteome. The methodology is distributed as part of the open-source Bioconductor pRoloc suite for spatial proteomics data analysis. Sub-cellular localisation of proteins is critical to their function in all cellular processes; proteins localising to their intended micro-environment, e.g. organelles, vesicles or macro-molecular complexes, will meet the interaction partners and biochemical conditions suitable to pursue their molecular function.
Therefore, sound data and methods to reliably and systematically study protein localisation, and hence their mis-localisation and the disruption of protein trafficking, that are relied upon by the cell biology community, are essential. Here we present a method to infer protein localisation relying on the optimal integration of experimental mass spectrometry-based data and auxiliary sources, such as GO annotation, outputs from third-party software, protein-protein interactions or immunocytochemistry data. We found that the application of transfer learning algorithms across these diverse data sources considerably improves on the quantity and reliability of sub-cellular protein assignment, compared to single data classifiers previously applied to infer sub-cellular localisation using experimental data only. We show how our method does not compromise biologically relevant experimental-specific signal after integration with heterogeneous freely available third-party resources. The integration of different data sources is an important challenge in the data intensive world of biology and we anticipate the transfer learning methods presented here will prove useful to many areas of biology, to unify data obtained from different but complementary sources.
Affiliation(s)
- Lisa M. Breckels
- Computational Proteomics Unit, Department of Biochemistry, University of Cambridge, Cambridge, United Kingdom
- Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, Cambridge, United Kingdom
- Sean B. Holden
- Computer Laboratory, University of Cambridge, Cambridge, United Kingdom
- David Wojnar
- Quantitative Biology Center, Universität Tübingen, Tübingen, Germany
- Claire M. Mulvey
- Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, Cambridge, United Kingdom
- Andy Christoforou
- Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, Cambridge, United Kingdom
- Arnoud Groen
- Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, Cambridge, United Kingdom
- Oliver Kohlbacher
- Quantitative Biology Center, Universität Tübingen, Tübingen, Germany
- Center for Bioinformatics, Universität Tübingen, Tübingen, Germany
- Biomolecular Interactions, Max Planck Institute for Developmental Biology, Tübingen, Germany
- Kathryn S. Lilley
- Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, Cambridge, United Kingdom
- Laurent Gatto
- Computational Proteomics Unit, Department of Biochemistry, University of Cambridge, Cambridge, United Kingdom
- Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, Cambridge, United Kingdom
13
Prediction of Golgi-resident protein types using general form of Chou's pseudo-amino acid compositions: Approaches with minimal redundancy maximal relevance feature selection. J Theor Biol 2016; 402:38-44. [PMID: 27155042 DOI: 10.1016/j.jtbi.2016.04.032] [Citation(s) in RCA: 44] [Impact Index Per Article: 5.5] [Received: 03/19/2016] [Revised: 04/19/2016] [Accepted: 04/26/2016] [Indexed: 11/20/2022]
Abstract
Recently, several efforts have been made in predicting Golgi-resident proteins. However, it is still a challenging task to identify the type of a Golgi-resident protein. Precise prediction of the type of a Golgi-resident protein plays a key role in understanding its molecular functions in various biological processes. In this paper, we proposed to use a mutual information based feature selection scheme with the general form of Chou's pseudo-amino acid compositions to predict Golgi-resident protein types. The positional specific physicochemical properties were applied in Chou's pseudo-amino acid compositions. We achieved 91.24% prediction accuracy in a jackknife test with 49 selected features. It has the best performance among all the present predictors. This result indicates that our computational model can be useful in identifying Golgi-resident protein types.
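Illustrative note (not from the paper): a jackknife test, as used for the accuracy reported above, holds out each sample in turn, trains on the rest, and predicts the held-out sample. The sketch below shows the procedure with a 1-nearest-neighbour classifier standing in for the real predictor, which the abstract does not specify; the function names, the labels `type_A`/`type_B`, and the toy feature vectors are all assumptions for illustration.

```python
import math


def nearest_neighbor_predict(train_x, train_y, query):
    """Predict the label of the training point closest to `query` (Euclidean)."""
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return train_y[best]


def jackknife_accuracy(features, labels):
    """Leave-one-out accuracy: each sample is predicted from all the others."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_neighbor_predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)


# Toy data, invented for this example only: two clusters plus one outlier.
features = [
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],   # type_A cluster
    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0],   # type_B cluster
    [2.0, 2.0],                           # type_B sample lying near type_A
]
labels = ["type_A"] * 3 + ["type_B"] * 4
loo_acc = jackknife_accuracy(features, labels)
print(round(loo_acc, 4))
```

Because every sample is tested exactly once and the split is deterministic, the jackknife gives a single reproducible accuracy figure for a given dataset and classifier.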
Collapse
|
14
|
Jiao YS, Du PF. Predicting Golgi-resident protein types using pseudo amino acid compositions: Approaches with positional specific physicochemical properties. J Theor Biol 2015; 391:35-42. [PMID: 26702543 DOI: 10.1016/j.jtbi.2015.11.009] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.3] [Received: 08/11/2015] [Revised: 11/17/2015] [Accepted: 11/19/2015] [Indexed: 11/24/2022]
Abstract
Knowing the type of a Golgi-resident protein is an important step in understanding its molecular functions as well as its role in biological processes. In this paper, we developed a novel computational method to predict Golgi-resident protein types using positional specific physicochemical properties and analysis of variance based feature selection methods. Our method achieved 86.9% prediction accuracy in leave-one-out cross-validations with only 59 features. Our method has the potential to be applied in predicting a wide range of protein attributes.
Affiliation(s)
- Ya-Sen Jiao
- School of Computer Science and Technology, Tianjin University, Tianjin 300072, China
| | - Pu-Feng Du
- School of Computer Science and Technology, Tianjin University, Tianjin 300072, China.
|
15
|
Identification of Chemical Toxicity Using Ontology Information of Chemicals. Comput Math Methods Med 2015; 2015:246374. [PMID: 26508991 PMCID: PMC4609800 DOI: 10.1155/2015/246374] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/26/2015] [Revised: 03/20/2015] [Accepted: 03/22/2015] [Indexed: 12/26/2022]
Abstract
With advances in combinatorial chemistry, the number of synthetic compounds has surged. However, we have limited knowledge about them. On the other hand, the speed of designing new drugs is very slow. One of the key causes is the unacceptable toxicity of chemicals. If one can correctly identify the toxicity of chemicals, unsuitable chemicals can be discarded at an early stage, thereby accelerating the study of new drugs and reducing R&D costs. In this study, a new prediction method based on ontology information of chemicals was built to identify chemical toxicities. Compared with a previous method, our method is quite effective. We hope that the proposed method may give new insights into the study of chemical toxicity and other attributes of chemicals.
|
16
|
Woo YH, Ansari H, Otto TD, Klinger CM, Kolisko M, Michálek J, Saxena A, Shanmugam D, Tayyrov A, Veluchamy A, Ali S, Bernal A, del Campo J, Cihlář J, Flegontov P, Gornik SG, Hajdušková E, Horák A, Janouškovec J, Katris NJ, Mast FD, Miranda-Saavedra D, Mourier T, Naeem R, Nair M, Panigrahi AK, Rawlings ND, Padron-Regalado E, Ramaprasad A, Samad N, Tomčala A, Wilkes J, Neafsey DE, Doerig C, Bowler C, Keeling PJ, Roos DS, Dacks JB, Templeton TJ, Waller RF, Lukeš J, Oborník M, Pain A. Chromerid genomes reveal the evolutionary path from photosynthetic algae to obligate intracellular parasites. eLife 2015; 4:e06974. [PMID: 26175406 PMCID: PMC4501334 DOI: 10.7554/elife.06974] [Citation(s) in RCA: 143] [Impact Index Per Article: 15.9] [Received: 02/16/2015] [Accepted: 06/16/2015] [Indexed: 12/18/2022] [Open Access]
Abstract
The eukaryotic phylum Apicomplexa encompasses thousands of obligate intracellular parasites of humans and animals with immense socio-economic and health impacts. We sequenced nuclear genomes of Chromera velia and Vitrella brassicaformis, free-living non-parasitic photosynthetic algae closely related to apicomplexans. Proteins from key metabolic pathways and from the endomembrane trafficking systems associated with a free-living lifestyle have been progressively and non-randomly lost during adaptation to parasitism. The free-living ancestor contained a broad repertoire of genes many of which were repurposed for parasitic processes, such as extracellular proteins, components of a motility apparatus, and DNA- and RNA-binding protein families. Based on transcriptome analyses across 36 environmental conditions, Chromera orthologs of apicomplexan invasion-related motility genes were co-regulated with genes encoding the flagellar apparatus, supporting the functional contribution of flagella to the evolution of invasion machinery. This study provides insights into how obligate parasites with diverse life strategies arose from a once free-living phototrophic marine alga. DOI: http://dx.doi.org/10.7554/eLife.06974.001

Single-celled parasites cause many severe diseases in humans and animals. The apicomplexans form probably the most successful group of these parasites and include the parasites that cause malaria. Apicomplexans infect a broad range of hosts, including humans, reptiles, birds, and insects, and often have complicated life cycles. For example, the malaria-causing parasites spread by moving from humans to female mosquitoes and then back to humans. Despite significant differences amongst apicomplexans, these single-celled parasites also share a number of features that are not seen in other living species. How and when these features arose remains unclear. It is known from previous work that apicomplexans are closely related to single-celled algae. But unlike apicomplexans, which depend on a host animal to survive, these algae live freely in their environment, often in close association with corals. Woo et al. have now sequenced the genomes of two photosynthetic algae that are thought to be close living relatives of the apicomplexans. These genomes were then compared to each other and to the genomes of other algae and apicomplexans. These comparisons reconfirmed that the two algae that were studied were close relatives of the apicomplexans. Further analyses suggested that thousands of genes were lost as an ancient free-living algae evolved into the apicomplexan ancestor, and further losses occurred as these early parasites evolved into modern species. The lost genes were typically those that are important for free-living organisms, but are either a hindrance to, or not needed in, a parasitic lifestyle. Some of the ancestor's genes, especially those that coded for the building blocks of flagella (structures which free-living algae use to move around), were repurposed in ways that helped the apicomplexans to invade their hosts. Understanding this repurposing process in greater detail will help to identify key molecules in these deadly parasites that could be targeted by drug treatments. It will also offer answers to one of the most fascinating questions in evolutionary biology: how parasites have evolved from free-living organisms. DOI: http://dx.doi.org/10.7554/eLife.06974.002
Affiliation(s)
- Yong H Woo
- Pathogen Genomics Laboratory, Biological and Environmental Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Hifzur Ansari
- Pathogen Genomics Laboratory, Biological and Environmental Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Thomas D Otto
- Parasite Genomics, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, United Kingdom
| | | | - Martin Kolisko
- Canadian Institute for Advanced Research, Department of Botany, University of British Columbia, Vancouver, Canada
| | - Jan Michálek
- Institute of Parasitology, Biology Centre, Czech Academy of Sciences, České Budějovice, Czech Republic
| | - Alka Saxena
- Pathogen Genomics Laboratory, Biological and Environmental Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | | | - Annageldi Tayyrov
- Pathogen Genomics Laboratory, Biological and Environmental Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Alaguraj Veluchamy
- Ecology and Evolutionary Biology Section, Institut de Biologie de l'Ecole Normale Supérieure, CNRS UMR8197 INSERM U1024, Paris, France
| | - Shahjahan Ali
- Bioscience Core Laboratory, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Axel Bernal
- Department of Biology, University of Pennsylvania, Philadelphia, United States
| | - Javier del Campo
- Canadian Institute for Advanced Research, Department of Botany, University of British Columbia, Vancouver, Canada
| | - Jaromír Cihlář
- Institute of Parasitology, Biology Centre, Czech Academy of Sciences, České Budějovice, Czech Republic
| | - Pavel Flegontov
- Institute of Parasitology, Biology Centre, Czech Academy of Sciences, České Budějovice, Czech Republic
| | | | - Eva Hajdušková
- Institute of Parasitology, Biology Centre, Czech Academy of Sciences, České Budějovice, Czech Republic
| | - Aleš Horák
- Institute of Parasitology, Biology Centre, Czech Academy of Sciences, České Budějovice, Czech Republic
| | - Jan Janouškovec
- Canadian Institute for Advanced Research, Department of Botany, University of British Columbia, Vancouver, Canada
| | | | - Fred D Mast
- Seattle Biomedical Research Institute, Seattle, United States
| | - Diego Miranda-Saavedra
- Centro de Biología Molecular Severo Ochoa, CSIC/Universidad Autónoma de Madrid, Madrid, Spain
| | - Tobias Mourier
- Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, Copenhagen, Denmark
| | - Raeece Naeem
- Pathogen Genomics Laboratory, Biological and Environmental Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Mridul Nair
- Pathogen Genomics Laboratory, Biological and Environmental Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Aswini K Panigrahi
- Bioscience Core Laboratory, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Neil D Rawlings
- European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, United Kingdom
| | - Eriko Padron-Regalado
- Pathogen Genomics Laboratory, Biological and Environmental Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Abhinay Ramaprasad
- Pathogen Genomics Laboratory, Biological and Environmental Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Nadira Samad
- School of Botany, University of Melbourne, Parkville, Australia
| | - Aleš Tomčala
- Institute of Parasitology, Biology Centre, Czech Academy of Sciences, České Budějovice, Czech Republic
| | - Jon Wilkes
- Wellcome Trust Centre For Molecular Parasitology, Institute of Infection, Immunity and Inflammation, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, United Kingdom
| | - Daniel E Neafsey
- Broad Genome Sequencing and Analysis Program, Broad Institute of MIT and Harvard, Cambridge, United States
| | - Christian Doerig
- Department of Microbiology, Monash University, Clayton, Australia
| | - Chris Bowler
- Ecology and Evolutionary Biology Section, Institut de Biologie de l'Ecole Normale Supérieure, CNRS UMR8197 INSERM U1024, Paris, France
| | - Patrick J Keeling
- Canadian Institute for Advanced Research, Department of Botany, University of British Columbia, Vancouver, Canada
| | - David S Roos
- Department of Biology, University of Pennsylvania, Philadelphia, United States
| | - Joel B Dacks
- Department of Cell Biology, University of Alberta, Edmonton, Canada
| | - Thomas J Templeton
- Department of Microbiology and Immunology, Weill Cornell Medical College, New York, United States
| | - Ross F Waller
- School of Botany, University of Melbourne, Parkville, Australia
| | - Julius Lukeš
- Institute of Parasitology, Biology Centre, Czech Academy of Sciences, České Budějovice, Czech Republic
| | - Miroslav Oborník
- Institute of Parasitology, Biology Centre, Czech Academy of Sciences, České Budějovice, Czech Republic
| | - Arnab Pain
- Pathogen Genomics Laboratory, Biological and Environmental Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
|
17
|
Prediction of drug indications based on chemical interactions and chemical similarities. Biomed Res Int 2015; 2015:584546. [PMID: 25821813 PMCID: PMC4363546 DOI: 10.1155/2015/584546] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 08/09/2014] [Accepted: 09/11/2014] [Indexed: 12/13/2022]
Abstract
Discovering potential indications of novel or approved drugs is a key step in drug development. Previous computational approaches can be categorized as disease-centric or drug-centric according to their starting point, or as small-scale or large-scale applications according to the diversity of the datasets. Here, a classifier has been constructed on a large drug indication dataset to predict the indications of a drug, based on the assumption that interactive/associated drugs or drugs with similar structures are more likely to target the same diseases. To evaluate the classifier, it was run five times on a dataset of 1,573 drugs retrieved from the Comprehensive Medicinal Chemistry database using 5-fold cross-validation, yielding five 1st order prediction accuracies that were all approximately 51.48%. Meanwhile, the model yielded an accuracy of 50.00% for the 1st order prediction in an independent test on a dataset of 32 other drugs for which drug repositioning has been confirmed. Interestingly, some clinically repurposed drug indications that were not included in the datasets were successfully identified by our method. These results suggest that our method may become a useful tool to associate novel molecules with new indications or existing drugs with alternative indications.
|
18
|
Shi SP, Xu HD, Wen PP, Qiu JD. Progress and challenges in predicting protein methylation sites. Mol Biosyst 2015; 11:2610-9. [DOI: 10.1039/c5mb00259a] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 11/21/2022]
Abstract
We review the progress in the prediction of protein methylation sites in the past 10 years and discuss the challenges that are faced while developing novel predictors in the future.
Affiliation(s)
- Shao-Ping Shi
- Department of Chemistry, Nanchang University, Nanchang, China
- Department of Mathematics
| | - Hao-Dong Xu
- Department of Chemistry, Nanchang University, Nanchang, China
| | - Ping-Ping Wen
- Department of Chemistry, Nanchang University, Nanchang, China
| | - Jian-Ding Qiu
- Department of Chemistry, Nanchang University, Nanchang, China
|
19
|
Chen L, Lu J, Zhang N, Huang T, Cai YD. A hybrid method for prediction and repositioning of drug Anatomical Therapeutic Chemical classes. Mol Biosyst 2014; 10:868-77. [PMID: 24492783 DOI: 10.1039/c3mb70490d] [Citation(s) in RCA: 58] [Impact Index Per Article: 5.8] [Indexed: 02/04/2023]
Abstract
In the Anatomical Therapeutic Chemical (ATC) classification system, therapeutic drugs are divided into 14 main classes according to the organ or system on which they act and their chemical, pharmacological and therapeutic properties. This system, recommended by the World Health Organization (WHO), provides a global standard for classifying medical substances and serves as a tool for international drug utilization research to improve quality of drug use. In view of this, it is necessary to develop effective computational prediction methods to identify the ATC-class of a given drug, which thereby could facilitate further analysis of this system. In this study, we initiated an attempt to develop a prediction method and to gain insights from it by utilizing ontology information of drug compounds. Since only about one-fourth of drugs in the ATC classification system have ontology information, a hybrid prediction method combining the ontology information, chemical interaction information and chemical structure information of drug compounds was proposed for the prediction of drug ATC-classes. As a result, by using the Jackknife test, the 1st prediction accuracies for identifying the 14 main ATC-classes in the training dataset, the internal validation dataset and the external validation dataset were 75.90%, 75.70% and 66.36%, respectively. Analysis of some samples with false-positive predictions in the internal and external validation datasets indicated that some of them may even have a relationship with the false-positive predicted ATC-class, suggesting novel uses of these drugs. It was conceivable that the proposed method could be used as an efficient tool to identify ATC-classes of novel drugs or to discover novel uses of known drugs.
Affiliation(s)
- Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, People's Republic of China.
|
20
|
Lu J, Huang G, Li HP, Feng KY, Chen L, Zheng MY, Cai YD. Prediction of cancer drugs by chemical-chemical interactions. PLoS One 2014; 9:e87791. [PMID: 24498372 PMCID: PMC3912061 DOI: 10.1371/journal.pone.0087791] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 10/04/2013] [Accepted: 12/31/2013] [Indexed: 11/19/2022] [Open Access]
Abstract
Cancer, which is a leading cause of death worldwide, places a heavy burden on health-care systems. In this study, an order-prediction model was built to predict a series of cancer drug indications based on chemical-chemical interactions. According to the confidence scores of their interactions, the order from the most likely cancer to the least likely one was obtained for each query drug. The 1st order prediction accuracy on the training dataset was 55.93%, evaluated by a jackknife test, while it was 55.56% and 59.09% on a validation test dataset and an independent test dataset, respectively. The proposed method outperformed a popular method based on molecular descriptors. Moreover, it was verified that some drugs were effective for the 'wrong' predicted indications, indicating that some 'wrong' drug indications were actually correct indications. Encouraged by these promising results, we expect that the method may become a useful tool for the prediction of drug indications.
Affiliation(s)
- Jing Lu
- Department of Medicinal Chemistry, School of Pharmacy, Yantai University, Yantai, Shandong, People’s Republic of China
| | - Guohua Huang
- Institute of Systems Biology, Shanghai University, Shanghai, People’s Republic of China
- Department of Mathematics, Shaoyang University, Shaoyang, Hunan, People’s Republic of China
| | - Hai-Peng Li
- CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, People’s Republic of China
| | - Kai-Yan Feng
- Beijing Genomics Institute, Shenzhen Beishan Industrial zone, Shenzhen, People’s Republic of China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, People’s Republic of China
- * E-mail: (LC); (MYZ); (YDC)
| | - Ming-Yue Zheng
- State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Shanghai, People’s Republic of China
- * E-mail: (LC); (MYZ); (YDC)
| | - Yu-Dong Cai
- Institute of Systems Biology, Shanghai University, Shanghai, People’s Republic of China
- * E-mail: (LC); (MYZ); (YDC)
|
21
|
Predicting human protein subcellular locations by the ensemble of multiple predictors via protein-protein interaction network with edge clustering coefficients. PLoS One 2014; 9:e86879. [PMID: 24466278 PMCID: PMC3900678 DOI: 10.1371/journal.pone.0086879] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 10/14/2013] [Accepted: 12/18/2013] [Indexed: 12/14/2022] [Open Access]
Abstract
One of the fundamental tasks in biology is to identify the functions of all proteins to reveal the primary machinery of a cell. Knowledge of the subcellular locations of proteins will provide key hints to reveal their functions and to understand the intricate pathways that regulate biological processes at the cellular level. Protein subcellular location prediction has been extensively studied in the past two decades. Many methods have been developed based on protein primary sequences as well as protein-protein interaction networks. In this paper, we propose to use the protein-protein interaction network as an infrastructure to integrate existing sequence-based predictors. When predicting the subcellular locations of a given protein, not only the protein itself but also all its interacting partners were considered. Unlike existing methods, our method requires neither comprehensive knowledge of the protein-protein interaction network nor experimentally annotated subcellular locations for most proteins in the network. Besides, our method can be used as a framework to integrate multiple predictors. Our method achieved an absolute-true rate of 56% on the human proteome, which is higher than that of the state-of-the-art methods.
|
22
|
Li X, Wu X, Wu G. Robust feature generation for protein subchloroplast location prediction with a weighted GO transfer model. J Theor Biol 2014; 347:84-94. [PMID: 24423409 DOI: 10.1016/j.jtbi.2014.01.003] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 03/12/2013] [Revised: 10/17/2013] [Accepted: 01/03/2014] [Indexed: 10/25/2022]
Abstract
Chloroplasts are crucial organelles of green plants and eukaryotic algae since they conduct photosynthesis. Predicting the subchloroplast location of a protein can provide important insights for understanding its biological functions. The performance of subchloroplast location prediction algorithms often depends on deriving predictive and succinct features from genomic and proteomic data. In this work, a novel weighted Gene Ontology (GO) transfer model is proposed to generate discriminating features from sequence data and GO Categories. This model contains two components. First, we transfer the GO terms of the homologous protein, and then assign the bit-score as weights to GO features. Second, we employ term-selection methods to determine weights for GO terms. This model is capable of improving prediction accuracy due to the tolerance of the noise derived from homolog knowledge transfer. The proposed weighted GO transfer method based on bit-score and a logarithmic transformation of CHI-square (WS-LCHI) performs better than the baseline models, and also outperforms the four off-the-shelf subchloroplast prediction methods.
Affiliation(s)
- Xiaomei Li
- School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, PR China.
| | - Xindong Wu
- School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, PR China; Department of Computer Science, University of Vermont, Burlington, VT 50405, USA.
| | - Gongqing Wu
- School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, PR China.
|
23
|
Du P, Xu C. Predicting multisite protein subcellular locations: progress and challenges. Expert Rev Proteomics 2014; 10:227-37. [DOI: 10.1586/epr.13.16] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.0] [Indexed: 11/08/2022]
|
24
|
Prediction of gene phenotypes based on GO and KEGG pathway enrichment scores. Biomed Res Int 2013; 2013:870795. [PMID: 24312912 PMCID: PMC3838811 DOI: 10.1155/2013/870795] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.8] [Received: 08/19/2013] [Accepted: 09/23/2013] [Indexed: 11/18/2022]
Abstract
Observing what phenotype the overexpression or knockdown of a gene can cause is the basic method of investigating gene functions. Many advanced biotechnologies, such as RNAi, were developed to study gene phenotypes, but there are still many limitations. Besides the time and cost, the knockdown of some genes may be lethal, which makes the observation of other phenotypes impossible. For ethical and technological reasons, the knockdown of genes in complex species, such as mammals, is extremely difficult. Thus, we proposed a new sequence-based computational method, called the kNNA-based method, for gene phenotype prediction. Unlike traditional sequence-based computational methods, our method regards the multiple phenotypes as a whole network, which can rank the possible phenotypes associated with the query protein and shows a more comprehensive view of the protein's biological effects. From the prediction results for yeast, we also find some additional related features, including GO and KEGG information, that contribute more to identifying protein phenotypes. This method can be applied to gene phenotype prediction in other species.
|
25
|
Alignment free comparison: k word voting model and its applications. J Theor Biol 2013; 335:276-82. [DOI: 10.1016/j.jtbi.2013.06.037] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 12/22/2012] [Revised: 04/25/2013] [Accepted: 06/26/2013] [Indexed: 02/06/2023]
|
26
|
Predicting drugs side effects based on chemical-chemical interactions and protein-chemical interactions. Biomed Res Int 2013; 2013:485034. [PMID: 24078917 PMCID: PMC3776367 DOI: 10.1155/2013/485034] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Received: 06/18/2013] [Accepted: 07/30/2013] [Indexed: 11/18/2022]
Abstract
A drug side effect is an undesirable effect which occurs in addition to the intended therapeutic effect of the drug. The unexpected side effects that many patients suffer from are the major causes of large-scale drug withdrawal. To address the problem, pharmaceutical industries urgently need computational methods for predicting the side effects of drugs. In this study, a novel computational method was developed to predict the side effects of drug compounds by hybridizing the chemical-chemical and protein-chemical interactions. Compared to most of the previous works, our method can rank the potential side effects for any query drug according to their predicted level of risk. A training dataset and test datasets were constructed from the benchmark dataset that contains 835 drug compounds to evaluate the method. By a jackknife test on the training dataset, the 1st order prediction accuracy was 86.30%, while it was 89.16% on the test dataset. It is expected that the new method may become a useful tool for drug design, and that the findings obtained by hybridizing various interactions in a network system may provide useful insights for conducting in-depth pharmacological research as well, particularly at the level of systems biomedicine.
|
27
|
SubMito-PSPCP: predicting protein submitochondrial locations by hybridizing positional specific physicochemical properties with pseudoamino acid compositions. Biomed Res Int 2013; 2013:263829. [PMID: 24027753 PMCID: PMC3763570 DOI: 10.1155/2013/263829] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Received: 05/12/2013] [Revised: 07/10/2013] [Accepted: 07/20/2013] [Indexed: 11/17/2022]
Abstract
Knowing the submitochondrial location of a mitochondrial protein is an important step in understanding its function. We developed a new method for predicting protein submitochondrial locations by introducing a new concept: positional specific physicochemical properties. With the framework of general form pseudoamino acid compositions, our method used only about 100 features to represent protein sequences, which is much simpler than the existing methods. On the dataset of SubMito, our method achieved over 93% overall accuracy, with 98.60% for inner membrane, 93.90% for matrix, and 70.70% for outer membrane, which are comparable to all state-of-the-art methods. As our method can be used as a general method to upgrade all pseudoamino-acid-composition-based methods, it should be very useful in future studies. We implement our method as an online service: SubMito-PSPCP.
|
28
|
Using over-represented tetrapeptides to predict protein submitochondria locations. Acta Biotheor 2013; 61:259-68. [PMID: 23475502 DOI: 10.1007/s10441-013-9181-9] [Citation(s) in RCA: 62] [Impact Index Per Article: 5.6] [Received: 07/03/2012] [Accepted: 02/23/2013] [Indexed: 01/25/2023]
Abstract
The mitochondrion is a key organelle of the eukaryotic cell that provides the energy for cellular activities. Correctly identifying the submitochondria locations of proteins can provide plentiful information for understanding their functions. However, using wet-experimental methods to recognize submitochondria locations of proteins is time-consuming and costly. Thus, it is highly desirable to develop a bioinformatics method to predict the submitochondria locations of mitochondrion proteins. In this work, a novel method based on a support vector machine was developed to predict the submitochondria locations of mitochondrion proteins by using over-represented tetrapeptides selected with the binomial distribution. A reliable and rigorous benchmark dataset including 495 mitochondrion proteins with sequence identity ≤25% was constructed for testing and evaluating the proposed model. Jackknife cross-validation showed that 91.1% of the 495 mitochondrion proteins can be correctly predicted. Subsequently, our model was evaluated on three existing benchmark datasets. The overall accuracies are 94.0, 94.7 and 93.4%, respectively, suggesting that the proposed model is potentially useful in the realm of mitochondrion proteome research. Based on this model, we built a predictor called TetraMito which is freely available at http://lin.uestc.edu.cn/server/TetraMito.
|
29
|
Satori CP, Henderson MM, Krautkramer EA, Kostal V, Distefano MM, Arriaga EA. Bioanalysis of eukaryotic organelles. Chem Rev 2013; 113:2733-811. [PMID: 23570618 PMCID: PMC3676536 DOI: 10.1021/cr300354g] [Citation(s) in RCA: 87] [Impact Index Per Article: 7.9] [Indexed: 12/18/2022]
Affiliation(s)
- Chad P. Satori
  - Department of Chemistry, University of Minnesota, Twin Cities, Minneapolis, MN, USA, 55455
- Michelle M. Henderson
  - Department of Chemistry, University of Minnesota, Twin Cities, Minneapolis, MN, USA, 55455
- Elyse A. Krautkramer
  - Department of Chemistry, University of Minnesota, Twin Cities, Minneapolis, MN, USA, 55455
- Vratislav Kostal
  - Tescan, Libusina trida 21, Brno, 623 00, Czech Republic
  - Institute of Analytical Chemistry ASCR, Veveri 97, Brno, 602 00, Czech Republic
- Mark M. Distefano
  - Department of Chemistry, University of Minnesota, Twin Cities, Minneapolis, MN, USA, 55455
- Edgar A. Arriaga
  - Department of Chemistry, University of Minnesota, Twin Cities, Minneapolis, MN, USA, 55455
|
30
|
Breckels LM, Gatto L, Christoforou A, Groen AJ, Lilley KS, Trotter MWB. The effect of organelle discovery upon sub-cellular protein localisation. J Proteomics 2013; 88:129-40. [PMID: 23523639 DOI: 10.1016/j.jprot.2013.02.019] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Received: 11/19/2012] [Revised: 02/16/2013] [Accepted: 02/21/2013] [Indexed: 12/31/2022]
Abstract
Prediction of protein sub-cellular localisation by employing quantitative mass spectrometry experiments is an expanding field. Several methods have led to the assignment of proteins to specific subcellular localisations by partial separation of organelles across a fractionation scheme coupled with computational analysis. Methods developed to analyse organelle data have largely employed supervised machine learning algorithms to map unannotated abundance profiles to known protein-organelle associations. Such approaches are likely to make association errors if organelle-related groupings present in experimental output are not included in data used to create a protein-organelle classifier. Currently, there is no automated way to detect organelle-specific clusters within such datasets. In order to address the above issues we adapted a phenotype discovery algorithm, originally created to filter image-based output for RNAi screens, to identify putative subcellular groupings in organelle proteomics experiments. We were able to mine datasets to a deeper level and extract interesting phenotype clusters for more comprehensive evaluation in an unbiased fashion upon application of this approach. Organelle-related protein clusters were identified beyond those sufficiently annotated for use as training data. Furthermore, we propose avenues for the incorporation of observations made into general practice for the classification of protein-organelle membership from quantitative MS experiments.
BIOLOGICAL SIGNIFICANCE: Protein sub-cellular localisation plays an important role in molecular interactions, signalling and transport mechanisms. The prediction of protein localisation by quantitative mass-spectrometry (MS) proteomics is a growing field and an important endeavour in improving protein annotation. Several such approaches use gradient-based separation of cellular organelle content to measure relative protein abundance across distinct gradient fractions. The distribution profiles are commonly mapped in silico to known protein-organelle associations via supervised machine learning algorithms, to create classifiers that associate unannotated proteins to specific organelles. These strategies are prone to error, however, if organelle-related groupings present in experimental output are not represented, for example owing to the lack of existing annotation, when creating the protein-organelle mapping. Here, the application of a phenotype discovery approach to LOPIT gradient-based MS data identifies candidate organelle phenotypes for further evaluation in an unbiased fashion. Software implementation and usage guidelines are provided for application to wider protein-organelle association experiments. In the wider context, semi-supervised organelle discovery is discussed as a paradigm with which to generate new protein annotations from MS-based organelle proteomics experiments.
Affiliation(s)
- L M Breckels
  - Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, CB2 1QR, UK
|
31
|
Chen L, Lu J, Zhang J, Feng KR, Zheng MY, Cai YD. Predicting chemical toxicity effects based on chemical-chemical interactions. PLoS One 2013; 8:e56517. [PMID: 23457578 PMCID: PMC3574107 DOI: 10.1371/journal.pone.0056517] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Received: 10/16/2012] [Accepted: 01/10/2013] [Indexed: 12/02/2022]
Abstract
Toxicity is a major contributor to high attrition rates of new chemical entities in drug discovery. In this study, an order-classifier was built to predict a series of toxic effects based on data concerning chemical-chemical interactions, under the assumption that interactive compounds are more likely to share similar toxicity profiles. According to their interaction confidence scores, the order from the most likely toxicity to the least was obtained for each compound. Ten test groups, each containing one training dataset and one test dataset, were constructed from a benchmark dataset consisting of 17,233 compounds. By a Jackknife test on each of these test groups, the 1st order prediction accuracies of the training dataset and the test dataset were all approximately 79.50%, substantially higher than the rate of 25.43% achieved by random guesses. Encouraged by the promising results, we expect that our method will become a useful tool in screening out drugs with high toxicity.
Affiliation(s)
- Lei Chen
  - College of Information Engineering, Shanghai Maritime University, Shanghai, China
- Jing Lu
  - Drug Discovery and Design Center (DDDC), Shanghai Institute of Materia Medica, Shanghai, China
- Jian Zhang
  - Department of Ophthalmology, Shanghai First People’s Hospital Affiliated to Shanghai Jiaotong University, Shanghai, China
- Kai-Rui Feng
  - Simcyp Limited, Blades Enterprise Centre, Sheffield, United Kingdom
- Ming-Yue Zheng (corresponding author)
  - Drug Discovery and Design Center (DDDC), Shanghai Institute of Materia Medica, Shanghai, China
- Yu-Dong Cai (corresponding author)
  - Institute of Systems Biology, Shanghai University, Shanghai, China
|
32
|
Improved proteomic profiling of the cell surface of culture-expanded human bone marrow multipotent stromal cells. J Proteomics 2012; 78:1-14. [PMID: 23153793 DOI: 10.1016/j.jprot.2012.10.028] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.7] [Received: 07/31/2012] [Revised: 10/11/2012] [Accepted: 10/31/2012] [Indexed: 02/06/2023]
Abstract
A comprehensive analysis of the membrane proteome is essential to explain the biology of multipotent stromal cells and to identify reliable protein biomarkers for the isolation and tracking of cells during differentiation and maturation. However, proteomic analysis of membrane proteins is challenging, and they are noticeably under-represented in numerous proteomic studies. Here we introduce a new approach, which includes high pressure-assisted membrane protein extraction, protein fractionation by gel-eluted liquid fraction entrapment electrophoresis (GELFREE), and combined use of liquid chromatography MALDI and ESI tandem mass spectrometry. This report presents the first comprehensive proteomic analysis of the membrane proteome of undifferentiated and culture-expanded human bone marrow multipotent stromal cells (hBM-MSC) obtained from different human donors. Gene ontology mapping using the Ingenuity Pathway Analysis and DAVID programs revealed the largest membrane proteomic dataset for hBM-MSC reported to date. Collectively, the new workflow enabled us to identify at least two-fold more membrane proteins compared to published results on hBM-MSC. A total of 84 CDs were identified, including 14 CDs identified for the first time. This dataset can serve as a basis for further exploration of self-renewal, differentiation and characterization of hBM-MSC.
|
33
|
Subcellular localization prediction for human internal and organelle membrane proteins with projected gene ontology scores. J Theor Biol 2012; 313:61-7. [DOI: 10.1016/j.jtbi.2012.08.016] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.8] [Received: 03/30/2012] [Revised: 07/05/2012] [Accepted: 08/15/2012] [Indexed: 11/15/2022]
|
34
|
Yang L, Zhang X, Zhu H. Alignment free comparison: similarity distribution between the DNA primary sequences based on the shortest absent word. J Theor Biol 2011; 295:125-31. [PMID: 22138094 DOI: 10.1016/j.jtbi.2011.11.021] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 08/09/2011] [Revised: 11/18/2011] [Accepted: 11/19/2011] [Indexed: 11/15/2022]
Abstract
This work proposes an alignment-free comparison model for DNA primary sequences. In this paper, we treat both strands of the DNA rather than a single strand. We define the shortest absent word of the double strands between the DNA sequences, and some of its properties are studied to speed up the algorithm that searches for the shortest absent word. We present a novel model for comparison, in which the similarity distribution is introduced to describe the similarity between the sequences. A distance measure is derived from the Shannon entropy and is used in phylogenetic analysis. Experiments show that our model performs well in sequence analysis.
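To make the notion of an absent word concrete, here is a minimal brute-force sketch that finds the shortest words missing from both strands of a single DNA sequence. It illustrates the basic idea only. The paper's between-sequence definition, its speed-up properties and its similarity distribution are not reproduced, and the function names are hypothetical.

```python
from itertools import product

# Translation table for complementing the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def shortest_absent_words(seq):
    """Return (k, words): the smallest k for which some k-mer occurs on
    neither strand, and all such k-mers in lexicographic order."""
    seq = seq.upper()
    strands = (seq, reverse_complement(seq))
    k = 1
    while True:
        # Collect every k-mer present on either strand.
        present = {s[i:i + k] for s in strands for i in range(len(s) - k + 1)}
        absent = [
            "".join(p) for p in product("ACGT", repeat=k)
            if "".join(p) not in present
        ]
        if absent:
            return k, absent
        k += 1
```

The search enumerates all 4^k candidate words at each length, which is adequate for illustration because the shortest absent word of a sequence of length n has length at most about log4(2n) + 1.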
Affiliation(s)
- Lianping Yang
  - College of Sciences, Northeastern University, Shenyang, China
|