1
Ullah M, Hadi F, Song J, Yu DJ. PScL-2LSAESM: bioimage-based prediction of protein subcellular localization by integrating heterogeneous features with the two-level SAE-SM and mean ensemble method. Bioinformatics 2023; 39:6839969. [PMID: 36413068 PMCID: PMC9947927 DOI: 10.1093/bioinformatics/btac727] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2022] [Revised: 11/02/2022] [Accepted: 11/21/2022] [Indexed: 11/23/2022] Open
Abstract
MOTIVATION Over the past decades, a variety of in silico methods have been developed to predict protein subcellular localization within cells. However, a common and major challenge in the design and development of such methods is how to effectively utilize the heterogeneous feature sets extracted from bioimages. In this regard, limited efforts have been undertaken. RESULTS We propose a new two-level stacked autoencoder network (termed 2L-SAE-SM) to improve prediction performance by integrating the heterogeneous feature sets. In particular, in the first level of 2L-SAE-SM, each optimal heterogeneous feature set is fed to train our designed stacked autoencoder network (SAE-SM). All the trained SAE-SMs in the first level can output the decision sets based on their respective optimal heterogeneous feature sets, known as 'intermediate decision' sets. Such intermediate decision sets are then ensembled using the mean ensemble method to generate the 'intermediate feature' set for the second-level SAE-SM. Using the proposed framework, we further develop a novel predictor, referred to as PScL-2LSAESM, to characterize image-based protein subcellular localization. Extensive benchmarking experiments on the latest benchmark training and independent test datasets collected from the Human Protein Atlas databank demonstrate the effectiveness of the proposed 2L-SAE-SM framework for the integration of heterogeneous feature sets. Moreover, performance comparison of the proposed PScL-2LSAESM with current state-of-the-art methods further illustrates that PScL-2LSAESM clearly outperforms them for the task of protein subcellular localization. AVAILABILITY AND IMPLEMENTATION https://github.com/csbio-njust-edu/PScL-2LSAESM. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
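The mean ensemble step described in this abstract can be illustrated with a minimal sketch. It assumes that each first-level model emits one per-class score vector per image and that the 'intermediate feature' for an image is the element-wise mean of those vectors; the function name `mean_ensemble` and the toy scores are hypothetical, and the authors' implementation may differ.

```python
from typing import List


def mean_ensemble(decision_sets: List[List[List[float]]]) -> List[List[float]]:
    """Average per-class scores across first-level models.

    decision_sets[m][i][c] is the score that model m assigns to class c
    for sample i. Returns one averaged score vector per sample.
    """
    n_models = len(decision_sets)
    n_samples = len(decision_sets[0])
    n_classes = len(decision_sets[0][0])
    averaged = []
    for i in range(n_samples):
        row = [
            sum(decision_sets[m][i][c] for m in range(n_models)) / n_models
            for c in range(n_classes)
        ]
        averaged.append(row)
    return averaged


# Hypothetical outputs of three first-level models for two images, three classes.
first_level_outputs = [
    [[0.75, 0.25, 0.0], [0.0, 0.5, 0.5]],
    [[0.5, 0.5, 0.0], [0.25, 0.25, 0.5]],
    [[0.25, 0.75, 0.0], [0.5, 0.0, 0.5]],
]

# These averaged vectors would serve as input to a second-level model.
intermediate_features = mean_ensemble(first_level_outputs)
```

Averaging keeps the intermediate feature the same size as a single decision set, regardless of how many first-level models contribute.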
Affiliation(s)
- Matee Ullah
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Fazal Hadi
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Dong-Jun Yu
  - To whom correspondence should be addressed.
2
Hu JX, Yang Y, Xu YY, Shen HB. GraphLoc: a graph neural network model for predicting protein subcellular localization from immunohistochemistry images. Bioinformatics 2022; 38:4941-4948. [DOI: 10.1093/bioinformatics/btac634] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2022] [Revised: 09/07/2022] [Accepted: 09/15/2022] [Indexed: 11/14/2022] Open
Abstract
Motivation
Recognition of protein subcellular distribution patterns and identification of location biomarker proteins in cancer tissues are important for understanding protein functions and related diseases. Immunohistochemical (IHC) images enable visualizing the distribution of proteins at the tissue level, providing an important resource for protein localization studies. In the past decades, several image-based protein subcellular location prediction methods have been developed, but their prediction accuracy still has considerable room for improvement because of the complexity of protein patterns resulting from multi-label proteins and the variation of location patterns across cell types or states.
Results
Here, we propose a multi-label multi-instance model based on deep graph convolutional neural networks, GraphLoc, to recognize protein subcellular location patterns. GraphLoc builds a graph of multiple IHC images for one protein, learns protein-level representations by graph convolutions, and predicts multi-label information by a dynamic threshold method. Our results show that GraphLoc is a promising model for image-based protein subcellular location prediction with model interpretability. Furthermore, we apply GraphLoc to the identification of candidate location biomarkers and potential members of protein networks. A large portion of the predicted results have supporting evidence in the existing literature, and the new candidates also provide guidance for further experimental screening.
Availability
The dataset and code are available at: www.csbio.sjtu.edu.cn/bioinf/GraphLoc.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jin-Xian Hu
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
- Yang Yang
  - Department of Computer Science and Engineering, Center for Brain-Like Computing and Machine Intelligence, Shanghai Jiao Tong University, Shanghai 200240, China
- Ying-Ying Xu
  - School of Biomedical Engineering and Guangdong Provincial Key Laboratory of Medical Image Processing, Southern Medical University, Guangzhou 510515, China
  - Guangdong Province Engineering Laboratory for Medical Imaging and Diagnostic Technology, Southern Medical University, Guangzhou 510515, China
- Hong-Bin Shen
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
3
Ullah M, Hadi F, Song J, Yu DJ. PScL-DDCFPred: an ensemble deep learning-based approach for characterizing multiclass subcellular localization of human proteins from bioimage data. Bioinformatics 2022; 38:4019-4026. [PMID: 35771606 PMCID: PMC9890309 DOI: 10.1093/bioinformatics/btac432] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 03/21/2022] [Revised: 06/03/2022] [Accepted: 06/28/2022] [Indexed: 02/04/2023] Open
Abstract
MOTIVATION Characterization of protein subcellular localization has become an important and long-standing task in bioinformatics and computational biology, which provides valuable information for elucidating various cellular functions of proteins and guiding drug design. RESULTS Here, we develop a novel bioimage-based computational approach, termed PScL-DDCFPred, to accurately predict protein subcellular localizations in human tissues. PScL-DDCFPred first extracts multiview image features, including global and local features, as base or pure features; next, it applies a new integrative feature selection method based on stepwise discriminant analysis and generalized discriminant analysis to identify the optimal feature sets from the extracted pure features; finally, a classifier based on deep neural network (DNN) and deep-cascade forest (DCF) is established. Stringent 10-fold cross-validation tests on the new protein subcellular localization training dataset, constructed from the Human Protein Atlas databank, illustrate that PScL-DDCFPred achieves a better performance than several existing state-of-the-art methods. Moreover, the independent test set further illustrates the generalization capability and superiority of PScL-DDCFPred over existing predictors. In-depth analysis shows that the excellent performance of PScL-DDCFPred can be attributed to three critical factors, namely the effective combination of the DNN and DCF models, complementarity of global and local features, and use of the optimal feature sets selected by the integrative feature selection algorithm. AVAILABILITY AND IMPLEMENTATION https://github.com/csbio-njust-edu/PScL-DDCFPred. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Matee Ullah
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Fazal Hadi
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Jiangning Song
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
  - Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
- Dong-Jun Yu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
4
Wang F, Wei L. Multi-scale deep learning for the imbalanced multi-label protein subcellular localization prediction based on immunohistochemistry images. Bioinformatics 2022; 38:2602-2611. [PMID: 35212728 DOI: 10.1093/bioinformatics/btac123] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 11/15/2021] [Revised: 02/09/2022] [Accepted: 02/24/2022] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION The development of microscopic imaging techniques enables us to study protein subcellular locations from the tissue level down to the cell level, contributing to the rapid development of image-based protein subcellular location prediction approaches. However, existing methods suffer from intrinsic limitations, such as poor feature representation ability, data imbalance, and the multi-label nature of the classification problem, which greatly impact model performance and generalization. RESULTS In this study, we propose MSTLoc, a novel multi-scale end-to-end deep learning model to identify protein subcellular locations in an imbalanced multi-label immunohistochemistry (IHC) image dataset. In MSTLoc, we deploy a deep convolutional neural network to extract multi-scale features from the IHC images, aggregate the high-level and low-level features via feature fusion to sufficiently exploit the dependencies amongst various subcellular locations, and utilize a Vision Transformer (ViT) to model the relationship amongst the features and enhance the feature representation ability. We demonstrate that the proposed MSTLoc achieves better performance than current state-of-the-art models in multi-label subcellular location prediction. Through feature visualization and interpretation analysis, we demonstrate that, compared with hand-crafted features, the multi-scale deep features learnt by our model are better at capturing discriminative patterns underlying protein subcellular locations, and that features from different scales are complementary for improving performance. Finally, case study results indicate that MSTLoc can successfully identify some biomarkers from proteins that are closely involved in cancer development. For the convenient use of our method, we establish a user-friendly webserver available at http://server.wei-group.net/MSTLoc. AVAILABILITY AND IMPLEMENTATION http://server.wei-group.net/MSTLoc. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Fengsheng Wang
  - School of Software, Shandong University, Jinan, China
  - Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China
- Leyi Wei
  - School of Software, Shandong University, Jinan, China
  - Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China
5
Tu Y, Lei H, Shen HB, Yang Y. SIFLoc: a self-supervised pre-training method for enhancing the recognition of protein subcellular localization in immunofluorescence microscopic images. Brief Bioinform 2022; 23:6527276. [DOI: 10.1093/bib/bbab605] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 11/19/2021] [Revised: 12/15/2021] [Accepted: 12/27/2021] [Indexed: 12/19/2022] Open
Abstract
With the rapid growth of high-resolution microscopy imaging data, revealing the subcellular map of human proteins has become a central task in the spatial proteome. The cell atlas of the Human Protein Atlas (HPA) provides precious resources for recognizing subcellular localization patterns at the cell level, and the large-scale annotated data enable learning via advanced deep neural networks. However, the existing predictors still suffer from the imbalanced class distribution and the lack of labeled data for minor classes. Thus, it is necessary to develop new methods for coping with these issues. We leverage the self-supervised learning protocol to address these problems. Specifically, we propose a pre-training scheme, called SIFLoc, to enhance the conventional supervised learning framework. The pre-training features a hybrid data augmentation method and a modified contrastive loss function, aiming to learn good feature representations from microscopic images. The experiments are performed on a large-scale immunofluorescence microscopic image dataset collected from the HPA database. Using the same deep neural networks as the classifier, the model pre-trained via SIFLoc not only outperforms the model without pre-training by a large margin but also shows advantages over the state-of-the-art self-supervised learning methods. In particular, SIFLoc significantly improves the prediction accuracy for minor organelles.
Affiliation(s)
- Yanlun Tu
  - Department of Computer Science and Engineering, Shanghai Jiao Tong University, 200240 Shanghai, China
- Houchao Lei
  - Department of Computer Science and Engineering, Shanghai Jiao Tong University, 200240 Shanghai, China
- Hong-Bin Shen
  - Department of Computer Science and Engineering, Shanghai Jiao Tong University, 200240 Shanghai, China
  - Institute of Image Processing and Pattern Recognition and Key Laboratory of System Control and Information Processing, Shanghai Jiao Tong University, 200240 Shanghai, China
- Yang Yang
  - Department of Computer Science and Engineering, Shanghai Jiao Tong University, 200240 Shanghai, China
6
Wang G, Xue MQ, Shen HB, Xu YY. Learning protein subcellular localization multi-view patterns from heterogeneous data of imaging, sequence and networks. Brief Bioinform 2022; 23:6499983. [PMID: 35018423 DOI: 10.1093/bib/bbab539] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 09/08/2021] [Revised: 11/03/2021] [Accepted: 11/20/2021] [Indexed: 11/13/2022] Open
Abstract
Location proteomics seeks to provide automated high-resolution descriptions of protein location patterns within cells. Many efforts have been undertaken in location proteomics over the past decades, thereby producing plenty of automated predictors for protein subcellular localization. However, most of these predictors are trained on either high-throughput microscopic images or protein amino acid sequences alone. Unifying heterogeneous protein data sources has yet to be exploited. In this paper, we present a pipeline called sequence, image, network-based protein subcellular locator (SIN-Locator) that constructs a multi-view description of proteins by integrating multiple data types, including images of protein expression in cells or tissues, amino acid sequences and protein-protein interaction networks, to classify the patterns of protein subcellular locations. Proteins were encoded by both handcrafted features and deep learning features, and multiple combining methods were implemented. Our experimental results indicated that optimal integrations can considerably enhance the classification accuracy, and the utility of SIN-Locator has been demonstrated by applying it to newly released proteins in the Human Protein Atlas. Furthermore, we also investigate the contribution of different data sources and the influence of partial absence of data. This work is anticipated to provide clues for the reconciliation and combination of multi-source data for protein location analysis.
Affiliation(s)
- Ge Wang
  - School of Biomedical Engineering and Guangdong Provincial Key Laboratory of Medical Image Processing, Southern Medical University, Guangzhou 510515, China
  - Guangdong Province Engineering Laboratory for Medical Imaging and Diagnostic Technology, Southern Medical University, Guangzhou 510515, China
- Min-Qi Xue
  - School of Biomedical Engineering and Guangdong Provincial Key Laboratory of Medical Image Processing, Southern Medical University, Guangzhou 510515, China
  - Guangdong Province Engineering Laboratory for Medical Imaging and Diagnostic Technology, Southern Medical University, Guangzhou 510515, China
- Hong-Bin Shen
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200240, China
  - School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Ying-Ying Xu
  - School of Biomedical Engineering and Guangdong Provincial Key Laboratory of Medical Image Processing, Southern Medical University, Guangzhou 510515, China
  - Guangdong Province Engineering Laboratory for Medical Imaging and Diagnostic Technology, Southern Medical University, Guangzhou 510515, China
7
Wang G, Zhai YJ, Xue ZZ, Xu YY. Improving Protein Subcellular Location Classification by Incorporating Three-Dimensional Structure Information. Biomolecules 2021; 11:1607. [PMID: 34827605 PMCID: PMC8615982 DOI: 10.3390/biom11111607] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/30/2021] [Revised: 10/27/2021] [Accepted: 10/27/2021] [Indexed: 12/12/2022] Open
Abstract
The subcellular locations of proteins are closely related to their functions. In the past few decades, the application of machine learning algorithms to predict protein subcellular locations has been an important topic in proteomics. However, most studies in this field used only amino acid sequences as the data source. Only a few works focused on other protein data types. For example, three-dimensional structures, which contain far more functional protein information than sequences, remain to be explored. In this work, we extracted various handcrafted features to describe the protein structures from physical, chemical, and topological aspects, as well as the learned features obtained by deep neural networks. We then used these features to classify the protein subcellular locations. Our experimental results demonstrated that some of these structural features have a certain effect on the protein location classification, and can help improve the performance of sequence-based location predictors. Our method provides a new view for the analysis of protein spatial distribution, and is anticipated to be used in revealing the relationships between protein structures and functions.
Affiliation(s)
- Ge Wang
  - School of Biomedical Engineering, Southern Medical University, Guangzhou 510515, China
  - Guangdong Provincial Key Laboratory of Medical Imaging Processing, Southern Medical University, Guangzhou 510515, China
  - Guangdong Province Engineering Laboratory for Medical Imaging and Diagnostic Technology, Southern Medical University, Guangzhou 510515, China
- Yu-Jia Zhai
  - Guangzhou Women and Children’s Medical Center, Department of Pharmacy, Guangzhou Medical University, Guangzhou 510623, China
- Zhen-Zhen Xue
  - School of Biomedical Engineering, Southern Medical University, Guangzhou 510515, China
  - Guangdong Provincial Key Laboratory of Medical Imaging Processing, Southern Medical University, Guangzhou 510515, China
  - Guangdong Province Engineering Laboratory for Medical Imaging and Diagnostic Technology, Southern Medical University, Guangzhou 510515, China
  - Paul C. Lauterbur Research Center for Biomedical Imaging, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
- Ying-Ying Xu
  - School of Biomedical Engineering, Southern Medical University, Guangzhou 510515, China
  - Guangdong Provincial Key Laboratory of Medical Imaging Processing, Southern Medical University, Guangzhou 510515, China
  - Guangdong Province Engineering Laboratory for Medical Imaging and Diagnostic Technology, Southern Medical University, Guangzhou 510515, China
8
Hu JX, Yang Y, Xu YY, Shen HB. Incorporating label correlations into deep neural networks to classify protein subcellular location patterns in immunohistochemistry images. Proteins 2021; 90:493-503. [PMID: 34546597 DOI: 10.1002/prot.26244] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 09/16/2020] [Revised: 03/16/2021] [Accepted: 09/13/2021] [Indexed: 12/17/2022]
Abstract
Analysis of protein subcellular localization is a critical part of proteomics. In recent years, as both the number and quality of microscopic images have increased rapidly, many automated methods, especially convolutional neural networks (CNN), have been developed to predict protein subcellular location(s) based on bioimages, but their performance suffers from some inherent properties of the problem. First, many microscopic images have non-informative or noisy sections, like unstained stroma and nonspecific background, which affect the extraction of protein expression information. Second, the patterns of protein subcellular localization are very complex, as many proteins localize to more than one compartment. In this study, we propose a new label-correlation enhanced deep neural network, laceDNN, to classify the subcellular locations of multi-label proteins from immunohistochemistry images. The model uses small representative patches as input to alleviate the image noise issue, and its backbone is a hybrid architecture of CNN and recurrent neural network, where the former network extracts representative image features and the latter learns the organelle dependency relationships. Our experimental results indicate that the proposed model can improve the performance of multi-label protein subcellular classification.
Affiliation(s)
- Jin-Xian Hu
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
- Yang Yang
  - Department of Computer Science and Engineering, Center for Brain-Like Computing and Machine Intelligence, Shanghai Jiao Tong University, Shanghai, China
- Ying-Ying Xu
  - School of Biomedical Engineering and Guangdong Provincial Key Laboratory of Medical Image Processing, Southern Medical University, Guangzhou, China
- Hong-Bin Shen
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
9
Wang H, Ding Y, Tang J, Zou Q, Guo F. Identify RNA-associated subcellular localizations based on multi-label learning using Chou's 5-steps rule. BMC Genomics 2021; 22:56. [PMID: 33451286 PMCID: PMC7811227 DOI: 10.1186/s12864-020-07347-7] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 08/05/2020] [Accepted: 12/22/2020] [Indexed: 12/04/2022] Open
Abstract
BACKGROUND Biological functions of biomolecules rely on the cellular compartments where they are located. Importantly, RNAs are assigned to specific locations in a cell, enabling the cell to carry out diverse biochemical processes concurrently. However, many existing RNA subcellular localization classifiers only solve the problem of single-label classification. It is of great practical significance to extend RNA subcellular localization to a multi-label classification problem. RESULTS In this study, we extract multi-label classification datasets of RNA-associated subcellular localizations for various types of RNAs, and then construct subcellular localization datasets for four RNA categories. In order to study Homo sapiens, we further establish human RNA subcellular localization datasets. Furthermore, we utilize different nucleotide property composition models to extract effective features that adequately represent the important information of nucleotide sequences. In the most critical part, we address a major challenge: fusing the multivariate information through multiple kernel learning based on the Hilbert-Schmidt independence criterion. The optimal combined kernel can be put into an integration support vector machine model for identifying multi-label RNA subcellular localizations. Our method obtained average precision values of 0.703, 0.757, 0.787, and 0.800 on the four RNA datasets, respectively. CONCLUSION Our novel method outperforms other prediction tools on novel benchmark datasets. Moreover, we establish a user-friendly web server with the implementation of our method.
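The abstract reports results as average precision but does not define the metric. As a hedged illustration only, the sketch below implements the ranking-based average precision commonly used for multi-label problems: for each sample, every true label contributes the fraction of labels ranked at or above it that are also true. The function name and toy data are hypothetical, and the paper's exact definition may differ.

```python
from typing import List, Set


def multilabel_average_precision(
    scores: List[List[float]], relevant: List[Set[int]]
) -> float:
    """Ranking-based average precision, averaged over samples.

    scores[i][c] is the predicted score of label c for sample i;
    relevant[i] is the set of true label indices for sample i.
    Samples without any true label are skipped.
    """
    per_sample = []
    for sample_scores, true_labels in zip(scores, relevant):
        if not true_labels:
            continue
        total = 0.0
        for y in true_labels:
            # Rank of y = number of labels scored at least as high as y.
            rank = sum(1 for v in sample_scores if v >= sample_scores[y])
            # True labels among those ranked at or above y.
            hits = sum(
                1 for t in true_labels if sample_scores[t] >= sample_scores[y]
            )
            total += hits / rank
        per_sample.append(total / len(true_labels))
    if not per_sample:
        raise ValueError("no sample has a true label")
    return sum(per_sample) / len(per_sample)


# Toy data: two samples, four candidate localizations (hypothetical).
toy_scores = [[0.9, 0.2, 0.6, 0.1], [0.1, 0.8, 0.3, 0.6]]
toy_truth = [{0, 2}, {0, 1}]
toy_ap = multilabel_average_precision(toy_scores, toy_truth)  # 0.875
```

In the first sample both true labels are ranked on top (precision 1.0); in the second, one true label is ranked last, giving (1.0 + 0.5) / 2 = 0.75, so the mean is 0.875.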
Affiliation(s)
- Hao Wang
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Yijie Ding
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
- Jijun Tang
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
  - School of Computational Science and Engineering, University of South Carolina, Columbia, SC 29208, USA
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Fei Guo
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
10
Su R, He L, Liu T, Liu X, Wei L. Protein subcellular localization based on deep image features and criterion learning strategy. Brief Bioinform 2020; 22:6035269. [PMID: 33320936 DOI: 10.1093/bib/bbaa313] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 08/22/2020] [Revised: 09/26/2020] [Accepted: 10/14/2020] [Indexed: 01/05/2023] Open
Abstract
The spatial distribution of the proteome at subcellular levels provides clues for protein functions and is thus important to human biology and medicine. Imaging-based methods are among the most important approaches for predicting protein subcellular location. Although deep neural networks have shown impressive performance in a number of imaging tasks, their application to protein subcellular localization has not been sufficiently explored. In this study, we developed a deep imaging-based approach to localize proteins at subcellular levels. Based on deep image features extracted from convolutional neural networks (CNNs), both single-label and multi-label locations can be accurately predicted. In particular, multi-label prediction is quite a challenging task. Here we developed a criterion learning strategy to exploit the label-attribute relevancy and label-label relevancy. A criterion that was used to determine the final label set was automatically obtained during the learning procedure. We identified an optimal CNN architecture that could give the best results. In addition, experiments show that, compared with hand-crafted features, the deep features yield more accurate predictions with fewer features. The implementation of the proposed method is available at https://github.com/RanSuLab/ProteinSubcellularLocation.
Affiliation(s)
- Ran Su
  - School of Computer Software, College of Intelligence and Computing, Tianjin University, China
- Linlin He
  - School of Computer Software, College of Intelligence and Computing, Tianjin University, China
- Tianling Liu
  - School of Computer Software, College of Intelligence and Computing, Tianjin University, China
- Xiaofeng Liu
  - Key Laboratory of Breast Cancer Prevention and Therapy, Ministry of Education, Tianjin Medical University Cancer Institute and Hospital, National Clinical Research Center of Cancer, Key Laboratory of Cancer Prevention and Therapy, Tianjin, China
- Leyi Wei
  - School of Software, Shandong University, China
11
Schormann W, Hariharan S, Andrews DW. A reference library for assigning protein subcellular localizations by image-based machine learning. J Cell Biol 2020; 219:133635. [PMID: 31968357 PMCID: PMC7055006 DOI: 10.1083/jcb.201904090] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 04/16/2019] [Revised: 09/30/2019] [Accepted: 12/15/2019] [Indexed: 12/11/2022] Open
Abstract
Confocal micrographs of EGFP fusion proteins localized at key cell organelles in murine and human cells were acquired for use as subcellular localization landmarks. For each of the respective 789,011 and 523,319 optically validated cell images, morphology and statistical features were measured. Machine learning algorithms using these features permit automated assignment of the localization of other proteins and dyes in both cell types with very high accuracy. Automated assignment of subcellular localizations for model tail-anchored proteins with randomly mutated C-terminal targeting sequences allowed the discovery of motifs responsible for targeting to mitochondria, endoplasmic reticulum, and the late secretory pathway. Analysis of directed mutants enabled refinement of these motifs and characterization of protein distributions within cellular subcompartments.
Affiliation(s)
- Wiebke Schormann
  - Biological Sciences, Sunnybrook Research Institute, Toronto, Canada
- David W Andrews
  - Biological Sciences, Sunnybrook Research Institute, Toronto, Canada
  - Department of Biochemistry, University of Toronto, Toronto, Canada
  - Department of Medical Biophysics, University of Toronto, Toronto, Canada
12
Gao J, Miao Z, Zhang Z, Wei H, Kurgan L. Prediction of Ion Channels and their Types from Protein Sequences: Comprehensive Review and Comparative Assessment. Curr Drug Targets 2020; 20:579-592. [PMID: 30360734 DOI: 10.2174/1389450119666181022153942] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 08/25/2018] [Revised: 10/03/2018] [Accepted: 10/04/2018] [Indexed: 12/20/2022]
Abstract
BACKGROUND Ion channels are a large and growing protein family. Many of them are associated with diseases, and consequently, they are targets for over 700 drugs. Discovery of new ion channels is facilitated by computational methods that predict ion channels and their types from protein sequences. However, these methods have never been comprehensively compared and evaluated. OBJECTIVE We offer a first-of-its-kind comprehensive survey of the sequence-based predictors of ion channels. We describe eight predictors that include five methods that predict ion channels, their types, and four classes of the voltage-gated channels. We also develop and use a new benchmark dataset to perform comparative empirical analysis of the three currently available predictors. RESULTS While several methods that rely on different designs were published, only a few of them are currently available and offer a broad scope of predictions. Support and availability after publication should be required when new methods are considered for publication. Empirical analysis shows strong performance for the prediction of ion channels and modest performance for the prediction of ion channel types and voltage-gated channel classes. We identify a substantial weakness of current methods: they cannot accurately predict ion channels that are categorized into multiple classes/types. CONCLUSION Several predictors of ion channels are available to end users. They offer practical levels of predictive quality. Methods that rely on a larger and more diverse set of predictive inputs (such as PSIONplus) are more accurate. New tools that address multi-label prediction of ion channels should be developed.
Affiliation(s)
- Jianzhao Gao
  - School of Mathematical Sciences and LPMC, Nankai University, Tianjin, China
- Zhen Miao
  - College of Life Sciences, Nankai University, Tianjin, China
- Zhaopeng Zhang
  - School of Mathematical Sciences and LPMC, Nankai University, Tianjin, China
- Hong Wei
  - School of Mathematical Sciences and LPMC, Nankai University, Tianjin, China
- Lukasz Kurgan
  - Department of Computer Science, Virginia Commonwealth University, Richmond, United States
13
PSIONplus m Server for Accurate Multi-Label Prediction of Ion Channels and Their Types. Biomolecules 2020; 10:biom10060876. [PMID: 32517331 PMCID: PMC7355608 DOI: 10.3390/biom10060876] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 02/25/2020] [Revised: 05/28/2020] [Accepted: 06/04/2020] [Indexed: 11/26/2022] Open
Abstract
Computational prediction of ion channels facilitates the identification of putative ion channels from protein sequences. Several predictors of ion channels and their types were developed in the last quindecennial. While they offer reasonably accurate predictions, they also suffer from a few shortcomings including lack of availability, parallel prediction mode, single-label prediction (inability to predict multiple channel subtypes), and incomplete scope (inability to predict subtypes of the voltage-gated channels). We developed a first-of-its-kind PSIONplusm method that performs sequential multi-label prediction of ion channels and their subtypes for both voltage-gated and ligand-gated channels. PSIONplusm sequentially combines the outputs produced by three support vector machine-based models from the PSIONplus predictor and is available as a webserver. Empirical tests show that PSIONplusm outperforms current methods for the multi-label prediction of the ion channel subtypes. This includes the existing single-label methods that are available to the users, a naïve multi-label predictor that combines results produced by multiple single-label methods, and methods that make predictions based on sequence alignment and domain annotations. We also found that the current methods (including PSIONplusm) fail to accurately predict a few of the least frequently occurring ion channel subtypes. Thus, new predictors should be developed when a larger quantity of annotated ion channels becomes available to train predictive models.
14
Yang F, Liu Y, Wang Y, Yin Z, Yang Z. MIC_Locator: a novel image-based protein subcellular location multi-label prediction model based on multi-scale monogenic signal representation and intensity encoding strategy. BMC Bioinformatics 2019; 20:522. [PMID: 31655541 PMCID: PMC6815465 DOI: 10.1186/s12859-019-3136-3] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 06/14/2019] [Accepted: 10/09/2019] [Indexed: 12/20/2022] Open
Abstract
Background Protein subcellular localization plays a crucial role in understanding cell function. Proteins need to be in the right place at the right time, and combine with the corresponding molecules to fulfill their functions. Furthermore, prediction of protein subcellular location not only guides drug design and development by pointing to potential molecular targets but also plays an essential role in genome annotation. Current image-based protein subcellular localization methods share three common drawbacks: obsolete datasets whose label information has not been updated, stereotypical feature descriptors restricted to the spatial domain or grey level, and single-function prediction algorithms that can handle only single-label databases. Results In this paper, a novel human protein subcellular localization prediction model, MIC_Locator, is proposed. Firstly, the latest datasets are collected and collated as our benchmark dataset, instead of obsolete data, for training the prediction model. Secondly, Fourier transformation, Riesz transformation, the Log-Gabor filter and an intensity coding strategy are employed to obtain frequency features based on three components of the monogenic signal at different frequency scales. Thirdly, a chained prediction model is proposed to handle multi-label instead of single-label datasets. The experimental results show that MIC_Locator achieves 60.56% subset accuracy and outperforms the majority of existing prediction models, and that the frequency features and intensity coding strategy help improve classification accuracy. Conclusions Our results demonstrate that frequency features are more beneficial for improving model performance than features extracted from the spatial domain, and that the MIC_Locator proposed in this paper can speed up validation of protein annotation, knowledge of protein function and proteomics research.
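The subset accuracy quoted in this abstract is a strict multi-label metric: a sample counts as correct only when its predicted label set matches the true label set exactly. The sketch below illustrates the metric in plain Python; the function name `subset_accuracy` and the toy labels are assumptions made for illustration and are not taken from the MIC_Locator code or data.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label set equals the true set exactly."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    exact_matches = sum(set(t) == set(p) for t, p in zip(y_true, y_pred))
    return exact_matches / len(y_true)


# Hypothetical labels for four images (illustrative only, not data from the paper).
y_true = [{"nucleus"}, {"cytoplasm", "nucleus"}, {"mitochondria"}, {"golgi"}]
y_pred = [{"nucleus"}, {"cytoplasm"}, {"mitochondria"}, {"golgi", "er"}]

# The second prediction misses a label and the fourth adds one, so only 2 of 4 match.
print(subset_accuracy(y_true, y_pred))  # 0.5
```

Because partially correct predictions earn no credit, subset accuracy is usually much lower than per-label accuracy on the same predictions.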
Affiliation(s)
- Fan Yang
  - School of Communications and Electronics, Jiangxi Science & Technology Normal University, Nanchang, 330003, China
  - Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, MA, 02115, USA
- Yang Liu
  - School of Communications and Electronics, Jiangxi Science & Technology Normal University, Nanchang, 330003, China
- Yanbin Wang
  - School of Communications and Electronics, Jiangxi Science & Technology Normal University, Nanchang, 330003, China
- Zhijian Yin
  - School of Communications and Electronics, Jiangxi Science & Technology Normal University, Nanchang, 330003, China
- Zhen Yang
  - School of Communications and Electronics, Jiangxi Science & Technology Normal University, Nanchang, 330003, China
15
Li F, Zhang Y, Purcell AW, Webb GI, Chou KC, Lithgow T, Li C, Song J. Positive-unlabelled learning of glycosylation sites in the human proteome. BMC Bioinformatics 2019; 20:112. [PMID: 30841845 PMCID: PMC6404354 DOI: 10.1186/s12859-019-2700-1] [Citation(s) in RCA: 52] [Impact Index Per Article: 10.4] [Received: 07/05/2018] [Accepted: 02/22/2019] [Indexed: 11/30/2022] Open
Abstract
BACKGROUND As an important type of post-translational modification (PTM), protein glycosylation plays a crucial role in protein stability and protein function. The abundance and ubiquity of protein glycosylation across three domains of life involving Eukarya, Bacteria and Archaea demonstrate its roles in regulating a variety of signalling and metabolic pathways. Mutations on and in the proximity of glycosylation sites are highly associated with human diseases. Accordingly, accurate prediction of glycosylation can complement laboratory-based methods and greatly benefit experimental efforts for characterization and understanding of functional roles of glycosylation. For this purpose, a number of supervised-learning approaches have been proposed to identify glycosylation sites, demonstrating a promising predictive performance. To train a conventional supervised-learning model, both reliable positive and negative samples are required. However, in practice, a large portion of negative samples (i.e. non-glycosylation sites) are mislabelled due to the limitation of current experimental technologies. Moreover, supervised algorithms often fail to take advantage of large volumes of unlabelled data, which can aid in model learning in conjunction with positive samples (i.e. experimentally verified glycosylation sites). RESULTS In this study, we propose a positive unlabelled (PU) learning-based method, PA2DE (V2.0), based on the AlphaMax algorithm for protein glycosylation site prediction. The predictive performance of this proposed method was evaluated by a range of glycosylation data collected over a ten-year period based on an interval of three years. Experiments using both benchmarking and independent tests show that our method outperformed the representative supervised-learning algorithms (including support vector machines and random forests) and one-class learners, as well as currently available prediction methods in terms of F1 score, accuracy and AUC measures. 
In addition, we developed an online web server as an implementation of the optimized model (available at http://glycomine.erc.monash.edu/Lab/GlycoMine_PU/) to facilitate community-wide efforts for accurate prediction of protein glycosylation sites. CONCLUSION The proposed PU learning approach achieved a competitive predictive performance compared with currently available methods. This PU learning schema may also be effectively employed and applied to address the prediction problems of other important types of protein PTM sites and functional sites.
Affiliation(s)
- Fuyi Li
  - Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800 Australia
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800 Australia
- Yang Zhang
  - College of Information Engineering, Northwest A and F University, Yangling, 712100 Shaanxi China
- Anthony W. Purcell
  - Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800 Australia
- Geoffrey I. Webb
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800 Australia
- Kuo-Chen Chou
  - Gordon Life Science Institute, Boston, MA 02478 USA
  - Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 610054 China
- Trevor Lithgow
  - Infection and Immunity Program, Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, VIC 3800 Australia
- Chen Li
  - Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800 Australia
  - Department of Biology, Institute of Molecular Systems Biology, ETH Zürich, 8093 Zürich, Switzerland
- Jiangning Song
  - Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800 Australia
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800 Australia
16
Wang S, Yue Y. Protein subnuclear localization based on a new effective representation and intelligent kernel linear discriminant analysis by dichotomous greedy genetic algorithm. PLoS One 2018; 13:e0195636. [PMID: 29649330 PMCID: PMC5896989 DOI: 10.1371/journal.pone.0195636] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 12/15/2017] [Accepted: 03/26/2018] [Indexed: 01/03/2023] Open
Abstract
A wide variety of methods have been proposed in protein subnuclear localization to improve the prediction accuracy. However, one important trend among these methods is to build a fused representation from multiple feature representations, and the fusion process takes a lot of time. In view of this, this paper proposes a method that combines a new single feature representation with a new algorithm to obtain a good recognition rate. Specifically, based on the position-specific scoring matrix (PSSM), we proposed a new expression, the correlation position-specific scoring matrix (CoPSSM), as the protein feature representation. Based on the classic nonlinear dimension reduction algorithm, kernel linear discriminant analysis (KLDA), we added a new discriminant criterion and proposed a dichotomous greedy genetic algorithm (DGGA) to intelligently select its kernel bandwidth parameter. Two public datasets with the Jackknife test and a KNN classifier were used for the numerical experiments. The results showed that the overall success rate (OSR) with the single representation CoPSSM is larger than that with many relevant representations. The OSR of the proposed method can reach as high as 87.444% and 90.3361% for these two datasets, respectively, outperforming many current methods. To show the generalization of the proposed algorithm, two extra standard protein subcellular datasets were chosen for an extended experiment, and the prediction accuracy by the Jackknife test and the independent test is still considerable.
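The evaluation protocol mentioned in this abstract, a Jackknife (leave-one-out) test with a KNN classifier, can be illustrated with a short sketch. It holds out each sample in turn, classifies it from all remaining samples, and reports the fraction classified correctly as the overall success rate. The function names and toy points are assumptions made for illustration; the CoPSSM features, KLDA projection and DGGA parameter search from the paper are not reproduced here.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=1):
    """Majority vote among the k training points closest to x (Euclidean distance)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def jackknife_success_rate(X, y, k=1):
    """Leave each sample out once, predict it from the rest, return the hit rate."""
    correct = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if knn_predict(rest_X, rest_y, X[i], k) == y[i]:
            correct += 1
    return correct / len(X)


# Hypothetical 2-D feature vectors forming two well-separated classes.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (1.1, 1.0), (1.0, 1.1)]
y = ["A", "A", "A", "B", "B", "B"]

print(jackknife_success_rate(X, y, k=1))  # 1.0
```

With n samples the test trains n classifiers, which is cheap for a lazy learner such as KNN but can be costly for models that need an explicit fitting step.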
Affiliation(s)
- Shunfang Wang
  - School of Information Science and Engineering, Yunnan University, Kunming, PR China
- Yaoting Yue
  - School of Information Science and Engineering, Yunnan University, Kunming, PR China
17
Riemenschneider M, Herbst A, Rasch A, Gorlatch S, Heider D. eccCL: parallelized GPU implementation of Ensemble Classifier Chains. BMC Bioinformatics 2017; 18:371. [PMID: 28818036 PMCID: PMC5561639 DOI: 10.1186/s12859-017-1783-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 11/02/2016] [Accepted: 08/08/2017] [Indexed: 11/30/2022] Open
Abstract
Background Multi-label classification has recently gained great attention in diverse fields of research, e.g., in biomedical applications such as protein function prediction or drug resistance testing in HIV. In this context, the concept of Classifier Chains has been shown to improve prediction accuracy, especially when applied as Ensemble Classifier Chains. However, these techniques lack computational efficiency when applied to large amounts of data, e.g., derived from next-generation sequencing experiments. By adapting algorithms for the use of graphics processing units, computational efficiency can be greatly improved due to parallelization of computations. Results Here, we provide a parallelized and optimized graphics processing unit implementation (eccCL) of Classifier Chains and Ensemble Classifier Chains. In addition to the OpenCL implementation, we provide an R-Package with an easy-to-use R-interface for parallelized graphics processing unit usage. Conclusion eccCL is a handy implementation of Classifier Chains on GPUs, which is able to process more than 25,000 instances per second, and thus can be used efficiently in high-throughput experiments. The software is available at http://www.heiderlab.de.
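The idea behind Classifier Chains and their ensemble variant can be sketched on the CPU in a few lines. Labels are predicted one at a time, each step sees the labels predicted before it as extra features, and an ensemble takes a majority vote over chains with different label orders. This is only a conceptual illustration under stated assumptions: it uses a 1-nearest-neighbour base learner, all function names are hypothetical, and it does not reflect the OpenCL kernels or the R interface of eccCL.

```python
import math


def nn_label(train_rows, train_targets, row):
    """1-nearest-neighbour base learner: return the target of the closest row."""
    best = min(range(len(train_rows)), key=lambda i: math.dist(train_rows[i], row))
    return train_targets[best]


def chain_predict(X_train, Y_train, x, order):
    """Predict binary labels in the given order (a permutation of label indices)."""
    predicted = {}
    for pos, label in enumerate(order):
        earlier = order[:pos]
        # Training side: features are extended with the TRUE earlier labels.
        rows = [list(xi) + [yi[e] for e in earlier] for xi, yi in zip(X_train, Y_train)]
        targets = [yi[label] for yi in Y_train]
        # Query side: features are extended with the PREDICTED earlier labels.
        query = list(x) + [predicted[e] for e in earlier]
        predicted[label] = nn_label(rows, targets, query)
    return [predicted[j] for j in range(len(order))]


def ensemble_chain_predict(X_train, Y_train, x, orders):
    """Majority vote per label over several chains (ties are resolved as 1)."""
    votes = [chain_predict(X_train, Y_train, x, order) for order in orders]
    n_labels = len(orders[0])
    return [int(2 * sum(v[j] for v in votes) >= len(votes)) for j in range(n_labels)]


# Hypothetical toy data: label 0 follows feature 0, label 1 follows feature 1.
X_train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
Y_train = [[0, 0], [0, 1], [1, 0], [1, 1]]

print(ensemble_chain_predict(X_train, Y_train, (0.9, 0.1), [[0, 1], [1, 0]]))  # [1, 0]
```

Each chain in the ensemble is independent of the others, which is what makes this family of methods a natural candidate for parallel execution.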
Affiliation(s)
- Mona Riemenschneider
  - Department of Bioinformatics, Straubing Center of Science, Petersgasse 18, Straubing, 94315, Germany
- Alexander Herbst
  - Institute of Computer Science, University of Münster, Einsteinstr. 62, Münster, 48149, Germany
- Ari Rasch
  - Institute of Computer Science, University of Münster, Einsteinstr. 62, Münster, 48149, Germany
- Sergei Gorlatch
  - Institute of Computer Science, University of Münster, Einsteinstr. 62, Münster, 48149, Germany
- Dominik Heider
  - Department of Bioinformatics, Straubing Center of Science, Petersgasse 18, Straubing, 94315, Germany
  - Wissenschaftszentrum Weihenstephan, Technische Universität München, Alte Akademie 8, Freising, 85354, Germany
  - Present Address: Department of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Str. 6, Marburg, 35032, Germany
18
El-Manzalawy Y, Munoz EE, Lindner SE, Honavar V. PlasmoSEP: Predicting surface-exposed proteins on the malaria parasite using semisupervised self-training and expert-annotated data. Proteomics 2016; 16:2967-2976. [PMID: 27714937 PMCID: PMC5600274 DOI: 10.1002/pmic.201600249] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 06/08/2016] [Revised: 08/31/2016] [Accepted: 10/05/2016] [Indexed: 01/09/2023]
Abstract
Accurate and comprehensive identification of surface-exposed proteins (SEPs) in parasites is a key step in developing novel subunit vaccines. However, the reliability of MS-based high-throughput methods for proteome-wide mapping of SEPs continues to be limited due to high rates of false positives (i.e., proteins mistakenly identified as surface exposed) as well as false negatives (i.e., SEPs not detected due to low expression or other technical limitations). We propose a framework called PlasmoSEP for the reliable identification of SEPs using a novel semisupervised learning algorithm that combines SEPs identified by high-throughput experiments and expert annotation of high-throughput data to augment labeled data for training a predictive model. Our experiments using high-throughput data from the Plasmodium falciparum surface-exposed proteome provide several novel high-confidence predictions of SEPs in P. falciparum and also confirm expert annotations for several others. Furthermore, PlasmoSEP predicts that 25 of 37 experimentally identified SEPs in Plasmodium yoelii salivary gland sporozoites are likely to be SEPs. Finally, PlasmoSEP predicts several novel SEPs in P. yoelii and Plasmodium vivax malaria parasites that can be validated for further vaccine studies. Our computational framework can be easily adapted to improve the interpretation of data from high-throughput studies.
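Self-training, the semisupervised strategy named in the title of this entry, can be sketched generically: fit a model on the labelled samples, pseudo-label only those unlabelled samples the model is confident about, add them to the training set, and repeat until nothing new qualifies. The sketch below is an assumption-laden illustration and not the PlasmoSEP algorithm; it uses a nearest-centroid model, a distance margin as the confidence measure, and hypothetical names, thresholds and data throughout.

```python
import math


def centroids(X, y):
    """Mean feature vector per class."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for d, value in enumerate(xi):
            acc[d] += value
        counts[yi] = counts.get(yi, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict_with_margin(cents, x):
    """Nearest centroid label plus the distance gap to the runner-up class."""
    ranked = sorted((math.dist(x, c), label) for label, c in cents.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else float("inf")
    return ranked[0][1], margin


def self_train(X_lab, y_lab, X_unlab, min_margin=0.5, max_rounds=10):
    """Iteratively move confidently predicted samples into the labelled set."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        cents = centroids(X_lab, y_lab)
        remaining, added = [], False
        for x in pool:
            label, margin = predict_with_margin(cents, x)
            if margin >= min_margin:  # confident: accept the pseudo-label
                X_lab.append(x)
                y_lab.append(label)
                added = True
            else:  # ambiguous: keep unlabelled for a later round
                remaining.append(x)
        pool = remaining
        if not added:
            break
    return centroids(X_lab, y_lab), pool


# Hypothetical 2-D features: one labelled example per class, three unlabelled.
X_lab, y_lab = [(0.0, 0.0), (4.0, 4.0)], ["non-SEP", "SEP"]
X_unlab = [(0.5, 0.5), (3.5, 3.5), (2.0, 2.0)]

final_centroids, still_unlabelled = self_train(X_lab, y_lab, X_unlab)
print(still_unlabelled)  # [(2.0, 2.0)]
```

The point equidistant from both classes is never pseudo-labelled, which shows how a confidence threshold limits the risk of reinforcing wrong labels.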
Affiliation(s)
- Yasser El-Manzalawy
  - College of Information Sciences and Technology, Pennsylvania State University, PA, USA
- Elyse E Munoz
  - Center for Malaria Research, Department of Biochemistry and Molecular Biology, Pennsylvania State University, PA, USA
- Scott E Lindner
  - Center for Malaria Research, Department of Biochemistry and Molecular Biology, Pennsylvania State University, PA, USA
- Vasant Honavar
  - College of Information Sciences and Technology, Pennsylvania State University, PA, USA