1
Saha S, Chatterjee P, Basu S, Nasipuri M. EPI-SF: essential protein identification in protein interaction networks using sequence features. PeerJ 2024; 12:e17010. [PMID: 38495766 PMCID: PMC10944162 DOI: 10.7717/peerj.17010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/24/2023] [Accepted: 02/05/2024] [Indexed: 03/19/2024] Open
Abstract
Proteins are considered indispensable for facilitating an organism's viability, reproductive capabilities, and other fundamental physiological functions. Conventional biological assays are characterized by prolonged duration, extensive labor requirements, and financial expenses in order to identify essential proteins. Therefore, it is widely accepted that employing computational methods is the most expeditious and effective approach to successfully discerning essential proteins. Despite being a popular choice in machine learning (ML) applications, the deep learning (DL) method is not suggested for this specific research work based on sequence features due to the restricted availability of high-quality training sets of positive and negative samples. However, some DL works on limited availability of data are also executed at recent times which will be our future scope of work. Conventional ML techniques are thus utilized in this work due to their superior performance compared to DL methodologies. In consideration of the aforementioned, a technique called EPI-SF is proposed here, which employs ML to identify essential proteins within the protein-protein interaction network (PPIN). The protein sequence is the primary determinant of protein structure and function. So, initially, relevant protein sequence features are extracted from the proteins within the PPIN. These features are subsequently utilized as input for various machine learning models, including XGB Boost Classifier, AdaBoost Classifier, logistic regression (LR), support vector classification (SVM), Decision Tree model (DT), Random Forest model (RF), and Naïve Bayes model (NB). The objective is to detect the essential proteins within the PPIN. The primary investigation conducted on yeast examined the performance of various ML models for yeast PPIN. 
Among these models, the RF model was the most effective, with precision, recall, F1-score, and AUC values of 0.703, 0.720, 0.711, and 0.745, respectively. It also outperforms other state-of-the-art methods, both those based on traditional centrality measures such as betweenness centrality (BC) and closeness centrality (CC) and deep learning methods such as DeepEP, as emphasized in the results section. Because of this favorable performance, EPI-SF is then used to predict novel essential proteins in the human PPIN. Because viruses tend to selectively target essential proteins involved in disease transmission within the human PPIN, the probable involvement of these proteins in COVID-19 and other related severe diseases is also investigated.
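As a quick consistency check on the reported metrics, the F1-score is the harmonic mean of precision and recall. The sketch below recomputes it from the precision and recall quoted in the abstract; the function name `f1_score` is illustrative and not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention: F1 is 0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported in the abstract for the RF model.
rf_f1 = f1_score(0.703, 0.720)
print(round(rf_f1, 3))  # 0.711
```

The recomputed value agrees with the F1-score of 0.711 stated in the abstract.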
Affiliation(s)
- Sovan Saha
  - Department of Computer Science & Engineering (Artificial Intelligence & Machine Learning), Techno Main Salt Lake, Kolkata, West Bengal, India
- Piyali Chatterjee
  - Department of Computer Science & Engineering, Netaji Subhash Engineering College, Kolkata, West Bengal, India
- Subhadip Basu
  - Department of Computer Science & Engineering, Jadavpur University, Kolkata, West Bengal, India
- Mita Nasipuri
  - Department of Computer Science & Engineering, Jadavpur University, Kolkata, West Bengal, India
2
Sengupta K, Saha S, Halder AK, Chatterjee P, Nasipuri M, Basu S, Plewczynski D. PFP-GO: Integrating protein sequence, domain and protein-protein interaction information for protein function prediction using ranked GO terms. Front Genet 2022; 13:969915. [PMID: 36246645 PMCID: PMC9556876 DOI: 10.3389/fgene.2022.969915] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2022] [Accepted: 08/31/2022] [Indexed: 11/13/2022] Open
Abstract
Protein function prediction is gradually emerging as an essential field in biological and computational studies. Though the latter has clinched a significant footprint, it has been observed that the application of computational information gathered from multiple sources has more significant influence than the one derived from a single source. Considering this fact, a methodology, PFP-GO, is proposed where heterogeneous sources like Protein Sequence, Protein Domain, and Protein-Protein Interaction Network have been processed separately for ranking each individual functional GO term. Based on this ranking, GO terms are propagated to the target proteins. While Protein sequence enriches the sequence-based information, Protein Domain and Protein-Protein Interaction Networks embed structural/functional and topological based information, respectively, during the phase of GO ranking. Performance analysis of PFP-GO is also based on Precision, Recall, and F-Score. The same was found to perform reasonably better when compared to the other existing state-of-art. PFP-GO has achieved an overall Precision, Recall, and F-Score of 0.67, 0.58, and 0.62, respectively. Furthermore, we check some of the top-ranked GO terms predicted by PFP-GO through multilayer network propagation that affect the 3D structure of the genome. The complete source code of PFP-GO is freely available at https://sites.google.com/view/pfp-go/.
Affiliation(s)
- Kaustav Sengupta
  - Laboratory of Functional and Structural Genomics, Center of New Technologies, University of Warsaw, Warsaw, Poland
  - Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
  - Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
- Sovan Saha
  - Department of Computer Science and Engineering, Institute of Engineering and Management, Kolkata, West Bengal, India
- Anup Kumar Halder
  - Laboratory of Functional and Structural Genomics, Center of New Technologies, University of Warsaw, Warsaw, Poland
  - Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
- Piyali Chatterjee
  - Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, India
- Mita Nasipuri
  - Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
- Subhadip Basu
  - Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
  - *Correspondence: Subhadip Basu, Dariusz Plewczynski,
- Dariusz Plewczynski
  - Laboratory of Functional and Structural Genomics, Center of New Technologies, University of Warsaw, Warsaw, Poland
  - Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
  - *Correspondence: Subhadip Basu, Dariusz Plewczynski,
3
In silico Methods for Identification of Potential Therapeutic Targets. Interdiscip Sci 2022; 14:285-310. [PMID: 34826045 PMCID: PMC8616973 DOI: 10.1007/s12539-021-00491-y] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 05/19/2021] [Revised: 10/19/2021] [Accepted: 11/01/2021] [Indexed: 11/01/2022]
Abstract
At the initial stage of drug discovery, identifying novel targets with maximal efficacy and minimal side effects can improve the success rate and portfolio value of drug discovery projects while simultaneously reducing cycle time and cost. However, harnessing the full potential of big data to narrow the range of plausible targets through existing computational methods remains a key issue in this field. This paper reviews two categories of in silico methods, comparative genomics and network-based methods, for finding potential therapeutic targets among cellular functions based on understanding their related biological processes. In addition to describing the principles, databases, software, and applications, we discuss some recent studies and prospects of the methods. While comparative genomics is mostly applied to infectious diseases, network-based methods can be applied to infectious and non-infectious diseases. Nonetheless, the methods often complement each other in their advantages and disadvantages. The information reported here guides toward improving the application of big data-driven computational methods for therapeutic target discovery.
4
Singh G, Gupta D. In-Silico Functional Annotation of Plasmodium falciparum Hypothetical Proteins to Identify Novel Drug Targets. Front Genet 2022; 13:821516. [PMID: 35444689 PMCID: PMC9013929 DOI: 10.3389/fgene.2022.821516] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/24/2021] [Accepted: 03/07/2022] [Indexed: 11/16/2022] Open
Abstract
Plasmodium falciparum is one of the plasmodium species responsible for the majority of life-threatening malaria cases. The current antimalarial therapies are becoming less effective due to growing drug resistance, leading to the urgent requirement for alternative and more effective antimalarial drugs or vaccines. To facilitate the novel drug discovery or vaccine development efforts, recent advances in sequencing technologies provide valuable information about the whole genome of the parasite, yet a lot more needs to be deciphered due to its incomplete proteome annotation. Surprisingly, out of the 5,389 proteins currently annotated in the Plasmodium falciparum 3D7 strain, 1,626 proteins (∼30% data) are annotated as hypothetical proteins. In parasite genomic studies, the challenge to annotate hypothetical proteins is often ignored, which may obscure the crucial information related to the pathogenicity of the parasite. In this study, we attempt to characterize hypothetical proteins of the parasite to identify novel drug targets using a computational pipeline. The study reveals that out of the overall pool of the hypothetical proteins, 266 proteins have conserved functional signatures. Furthermore, the pathway analysis of these proteins revealed that 23 proteins have an essential role in various biochemical, signalling and metabolic pathways. Additionally, all the proteins (266) were subjected to computational structure analysis. We could successfully model 11 proteins. We validated and checked the structural stability of the models by performing molecular dynamics simulation. Interestingly, eight proteins show stable conformations, and seven proteins are specific for Plasmodium falciparum, based on homology analysis. Lastly, mapping the seven shortlisted hypothetical proteins on the Plasmodium falciparum protein-protein interaction network revealed 3,299 nodes and 2,750,692 edges. 
Our study revealed interesting functional details of seven hypothetical proteins of the parasite, which help learn more about the less-studied molecules and their interactions, providing valuable clues to unravel the role of these proteins via future experimental validation.
Affiliation(s)
- Gagandeep Singh
  - Translational Bioinformatics Group, International Centre for Genetic Engineering and Biotechnology, New Delhi, India
- Dinesh Gupta
  - Translational Bioinformatics Group, International Centre for Genetic Engineering and Biotechnology, New Delhi, India
5
Loaiza CD, Duhan N, Kaundal R. GreeningDB: A Database of Host-Pathogen Protein-Protein Interactions and Annotation Features of the Bacteria Causing Huanglongbing HLB Disease. Int J Mol Sci 2021; 22:ijms221910897. [PMID: 34639237 PMCID: PMC8509195 DOI: 10.3390/ijms221910897] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/26/2021] [Revised: 09/25/2021] [Accepted: 09/27/2021] [Indexed: 11/16/2022] Open
Abstract
The Citrus genus comprises some of the most important and commonly cultivated fruit plants. Within the last decade, citrus greening disease (also known as huanglongbing or HLB) has emerged as the biggest threat for the citrus industry. This disease does not have a cure yet and, thus, many efforts have been made to find a solution to this devastating condition. There are challenges in the generation of high-yield resistant cultivars, in part due to the limited and sparse knowledge about the mechanisms that are used by the Liberibacter bacteria to proliferate the infection in Citrus plants. Here, we present GreeningDB, a database implemented to provide the annotation of Liberibacter proteomes, as well as the host–pathogen comparactomics tool, a novel platform to compare the predicted interactomes of two HLB host–pathogen systems. GreeningDB is built to deliver a user-friendly interface, including network visualization and links to other resources. We hope that by providing these characteristics, GreeningDB can become a central resource to retrieve HLB-related protein annotations, and thus, aid the community that is pursuing the development of molecular-based strategies to mitigate this disease’s impact. The database is freely available at http://bioinfo.usu.edu/GreeningDB/ (accessed on 11 August 2021).
Affiliation(s)
- Cristian D. Loaiza
  - Department of Plants, Soils and Climate, Utah State University, Logan, UT 84322, USA; (C.D.L.); (N.D.)
- Naveen Duhan
  - Department of Plants, Soils and Climate, Utah State University, Logan, UT 84322, USA; (C.D.L.); (N.D.)
- Rakesh Kaundal
  - Department of Plants, Soils and Climate, Utah State University, Logan, UT 84322, USA; (C.D.L.); (N.D.)
  - Bioinformatics Facility, Center for Integrated BioSystems, Utah State University, Logan, UT 84322, USA
  - Department of Computer Science, Utah State University, Logan, UT 84322, USA
  - Correspondence: ; Tel.: +1-(435)-797-4117; Fax: +1-(435)-797-2766
6
Saha S, Chatterjee P, Nasipuri M, Basu S. Detection of spreader nodes in human-SARS-CoV protein-protein interaction network. PeerJ 2021; 9:e12117. [PMID: 34567845 PMCID: PMC8428263 DOI: 10.7717/peerj.12117] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 06/18/2021] [Accepted: 08/15/2021] [Indexed: 12/20/2022] Open
Abstract
The entire world is witnessing the coronavirus pandemic (COVID-19), caused by a novel coronavirus (n-CoV) generally distinguished as Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). SARS-CoV-2 promotes fatal chronic respiratory disease followed by multiple organ failure, ultimately putting an end to human life. International Committee on Taxonomy of Viruses (ICTV) has reached a consensus that SARS-CoV-2 is highly genetically similar (up to 89%) to the Severe Acute Respiratory Syndrome Coronavirus (SARS-CoV), which had an outbreak in 2003. With this hypothesis, current work focuses on identifying the spreader nodes in the SARS-CoV-human protein-protein interaction network (PPIN) to find possible lineage with the disease propagation pattern of the current pandemic. Various PPIN characteristics like edge ratio, neighborhood density, and node weight have been explored for defining a new feature spreadability index by which spreader proteins and protein-protein interaction (in the form of network edges) are identified. Top spreader nodes with a high spreadability index have been validated by Susceptible-Infected-Susceptible (SIS) disease model, first using a synthetic PPIN followed by a SARS-CoV-human PPIN. The ranked edges highlight the path of entire disease propagation from SARS-CoV to human PPIN (up to level-2 neighborhood). The developed network attribute, spreadability index, and the generated SIS model, compared with the other network centrality-based methodologies, perform better than the existing state-of-art.
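The Susceptible-Infected-Susceptible (SIS) validation mentioned above can be illustrated with a minimal discrete-time simulation. This is only a sketch under stated assumptions: the toy star graph, the infection probability `beta`, the recovery probability `gamma`, and the function names are hypothetical, and the code implements neither the paper's spreadability index nor its exact SIS configuration.

```python
import random


def sis_step(adj, infected, beta, gamma, rng):
    """One synchronous SIS update on an undirected graph.

    adj maps each node to the set of its neighbours; infected is the
    set of currently infected nodes.
    """
    nxt = set()
    for node in adj:
        if node in infected:
            # An infected node recovers (becomes susceptible) with prob. gamma.
            if rng.random() >= gamma:
                nxt.add(node)
        else:
            # Each infected neighbour transmits independently with prob. beta.
            k = sum(1 for nb in adj[node] if nb in infected)
            if k and rng.random() < 1 - (1 - beta) ** k:
                nxt.add(node)
    return nxt


def simulate_sis(adj, seeds, beta, gamma, steps, seed=0):
    """Return the number of infected nodes after each step (step 0 included)."""
    rng = random.Random(seed)
    infected = set(seeds)
    history = [len(infected)]
    for _ in range(steps):
        infected = sis_step(adj, infected, beta, gamma, rng)
        history.append(len(infected))
    return history


# Hypothetical toy network: one hub "H" connected to four leaves.
star = {"H": {"a", "b", "c", "d"},
        "a": {"H"}, "b": {"H"}, "c": {"H"}, "d": {"H"}}

print(simulate_sis(star, {"H"}, beta=0.3, gamma=0.1, steps=10))
```

Seeding the infection at candidate spreader nodes and comparing the resulting infection counts is one simple way to compare node rankings under such a model.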
Affiliation(s)
- Sovan Saha
  - Computer Science and Engineering, Institute of Engineering and Management, Kolkata, West Bengal, India
- Piyali Chatterjee
  - Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, West Bengal, India
- Mita Nasipuri
  - Computer Science and Engineering, Jadavpur University, Kolkata, West Bengal, India
- Subhadip Basu
  - Computer Science and Engineering, Jadavpur University, Kolkata, West Bengal, India
7
alfaNET: A Database of Alfalfa-Bacterial Stem Blight Protein-Protein Interactions Revealing the Molecular Features of the Disease-causing Bacteria. Int J Mol Sci 2021; 22:ijms22158342. [PMID: 34361108 PMCID: PMC8348475 DOI: 10.3390/ijms22158342] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 07/09/2021] [Revised: 07/25/2021] [Accepted: 07/26/2021] [Indexed: 02/03/2023] Open
Abstract
Alfalfa has emerged as one of the most important forage crops, owing to its wide adaptation and high biomass production worldwide. In the last decade, the emergence of bacterial stem blight (caused by Pseudomonas syringae pv. syringae ALF3) in alfalfa has caused around 50% yield losses in the United States. Studies are being conducted to decipher the roles of the key genes and pathways regulating the disease, but due to the sparse knowledge about the infection mechanisms of Pseudomonas, the development of resistant cultivars is hampered. The database alfaNET is an attempt to assist researchers by providing comprehensive Pseudomonas proteome annotations, as well as a host–pathogen interactome tool, which predicts the interactions between host and pathogen based on orthology. alfaNET is a user-friendly and efficient tool and includes other features such as subcellular localization annotations of pathogen proteins, gene ontology (GO) annotations, network visualization, and effector protein prediction. Users can also browse and search the database using particular keywords or proteins with a specific length. Additionally, the BLAST search tool enables the user to perform a homology sequence search against the alfalfa and Pseudomonas proteomes. With the successful implementation of these attributes, alfaNET will be a beneficial resource to the research community engaged in implementing molecular strategies to mitigate the disease. alfaNET is freely available for public use at http://bioinfo.usu.edu/alfanet/.
8
Suratanee A, Buaboocha T, Plaimas K. Prediction of Human-Plasmodium vivax Protein Associations From Heterogeneous Network Structures Based on Machine-Learning Approach. Bioinform Biol Insights 2021; 15:11779322211013350. [PMID: 34188457 PMCID: PMC8212370 DOI: 10.1177/11779322211013350] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 01/13/2021] [Accepted: 04/04/2021] [Indexed: 11/24/2022] Open
Abstract
Malaria caused by Plasmodium vivax can lead to severe morbidity and death. In addition, resistance has been reported to existing drugs in treating this malaria. Therefore, the identification of new human proteins associated with malaria is urgently needed for the development of additional drugs. In this study, we established an analysis framework to predict human-P. vivax protein associations using network topological profiles from a heterogeneous network structure of human and P. vivax, machine-learning techniques and statistical analysis. Novel associations were predicted and ranked to determine the importance of human proteins associated with malaria. With the best-ranking score, 411 human proteins were identified as promising proteins. Their regulations and functions were statistically analyzed, which led to the identification of proteins involved in the regulation of membrane and vesicle formation, and proteasome complexes as potential targets for the treatment of P. vivax malaria. In conclusion, by integrating related data, our analysis was efficient in identifying potential targets providing an insight into human-parasite protein associations. Furthermore, generalizing this model could allow researchers to gain further insights into other diseases and enhance the field of biomedical science.
Affiliation(s)
- Apichat Suratanee
  - Department of Mathematics, Faculty of Applied Science, King Mongkut’s University of Technology North Bangkok, Bangkok, Thailand
- Teerapong Buaboocha
  - Department of Biochemistry, Faculty of Science, Chulalongkorn University, Bangkok, Thailand
  - Omics Sciences and Bioinformatics Center, Faculty of Science, Chulalongkorn University, Bangkok, Thailand
- Kitiporn Plaimas
  - Omics Sciences and Bioinformatics Center, Faculty of Science, Chulalongkorn University, Bangkok, Thailand
  - Advanced Virtual and Intelligent Computing (AVIC) Center, Department of Mathematics and Computer Science, Faculty of Science, Chulalongkorn University, Bangkok, Thailand
9
Acharya D, Dutta TK. Elucidating the network features and evolutionary attributes of intra- and interspecific protein-protein interactions between human and pathogenic bacteria. Sci Rep 2021; 11:190. [PMID: 33420198 PMCID: PMC7794237 DOI: 10.1038/s41598-020-80549-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/24/2020] [Accepted: 12/09/2020] [Indexed: 01/08/2023] Open
Abstract
Host–pathogen interaction is one of the most powerful determinants involved in coevolutionary processes covering a broad range of biological phenomena at molecular, cellular, organismal and/or population level. The present study explored host–pathogen interaction from the perspective of human–bacteria protein–protein interaction based on large-scale interspecific and intraspecific interactome data for human and three pathogenic bacterial species, Bacillus anthracis, Francisella tularensis and Yersinia pestis. The network features revealed a preferential enrichment of intraspecific hubs and bottlenecks for both human and bacterial pathogens in the interspecific human–bacteria interaction. Analyses unveiled that these bacterial pathogens interact mostly with human party-hubs that may enable them to affect desired functional modules, leading to pathogenesis. Structural features of pathogen-interacting human proteins indicated an abundance of protein domains, providing opportunities for interspecific domain-domain interactions. Moreover, these interactions do not always occur with high-affinity, as we observed that bacteria-interacting human proteins are rich in protein-disorder content, which correlates positively with the number of interacting pathogen proteins, facilitating low-affinity interspecific interactions. Furthermore, functional analyses of pathogen-interacting human proteins revealed an enrichment in regulation of processes like metabolism, immune system, cellular localization and transport apart from divulging functional competence to bind enzyme/protein, nucleic acids and cell adhesion molecules, necessary for host-microbial cross-talk.
Affiliation(s)
- Debarun Acharya
  - Department of Microbiology, Bose Institute, P-1/12, CIT Scheme VII M, Kolkata, West Bengal, 700 054, India
- Tapan K Dutta
  - Department of Microbiology, Bose Institute, P-1/12, CIT Scheme VII M, Kolkata, West Bengal, 700 054, India
10
Heterogeneous Network Model to Identify Potential Associations Between Plasmodium vivax and Human Proteins. Int J Mol Sci 2020; 21:ijms21041310. [PMID: 32075230 PMCID: PMC7072978 DOI: 10.3390/ijms21041310] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 01/04/2020] [Revised: 01/29/2020] [Accepted: 02/12/2020] [Indexed: 02/06/2023] Open
Abstract
Integration of multiple sources and data levels provides a great insight into the complex associations between human and malaria systems. In this study, a meta-analysis framework was developed based on a heterogeneous network model for integrating human-malaria protein similarities, a human protein interaction network, and a Plasmodium vivax protein interaction network. An iterative network propagation was performed on the heterogeneous network until we obtained stabilized weights. The association scores were calculated for qualifying a novel potential human-malaria protein association. This method provided a better performance compared to random experiments. After that, the stabilized network was clustered into association modules. The potential association candidates were then thoroughly analyzed by statistical enrichment analysis with protein complexes and known drug targets. The most promising target proteins were the succinate dehydrogenase protein complex in the human citrate (TCA) cycle pathway and the nicotinic acetylcholine receptor in the human central nervous system. Promising associations and potential drug targets were also provided for further studies and designs in therapeutic approaches for malaria at a systematic level. In conclusion, this method is efficient to identify new human-malaria protein associations and can be generalized to infer other types of association studies to further advance biomedical science.
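The idea of iterating a network propagation until the weights stabilize can be sketched with a generic random walk with restart. The abstract does not specify the propagation rule, so everything here is an assumption for illustration: the update formula, the restart weight `alpha`, the tolerance `tol`, and the toy path graph are hypothetical and are not the authors' heterogeneous-network model.

```python
def propagate(adj, seeds, alpha=0.5, tol=1e-9, max_iter=1000):
    """Random walk with restart on an undirected graph.

    Repeats p <- (1 - alpha) * W p + alpha * p0 until the L1 change
    falls below tol. W spreads each node's score equally over its
    neighbours; p0 puts uniform weight on the seed nodes.
    Assumes every node has at least one neighbour.
    """
    nodes = sorted(adj)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        new = {}
        for n in nodes:
            # Score flowing into n from each neighbour m.
            inflow = sum(p[m] / len(adj[m]) for m in adj[n])
            new[n] = (1 - alpha) * inflow + alpha * p0[n]
        delta = sum(abs(new[n] - p[n]) for n in nodes)
        p = new
        if delta < tol:
            break  # scores have stabilized
    return p


# Hypothetical toy network: a path a - b - c - d, seeded at "a".
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
scores = propagate(path, {"a"})
for node in sorted(scores):
    print(node, round(scores[node], 4))
```

In this toy case the stabilized scores decrease with distance from the seed, which is the behaviour that lets propagated scores rank candidate associations.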
11
Saha S, Chatterjee P, Basu S, Nasipuri M, Plewczynski D. FunPred 3.0: improved protein function prediction using protein interaction network. PeerJ 2019; 7:e6830. [PMID: 31198622 PMCID: PMC6535044 DOI: 10.7717/peerj.6830] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 07/20/2018] [Accepted: 03/21/2019] [Indexed: 11/23/2022] Open
Abstract
Proteins are the most versatile macromolecules in living systems and perform crucial biological functions. In the advent of the post-genomic era, the next generation sequencing is done routinely at the population scale for a variety of species. The challenging problem is to massively determine the functions of proteins that are yet not characterized by detailed experimental studies. Identification of protein functions experimentally is a laborious and time-consuming task involving many resources. We therefore propose the automated protein function prediction methodology using in silico algorithms trained on carefully curated experimental datasets. We present the improved protein function prediction tool FunPred 3.0, an extended version of our previous methodology FunPred 2, which exploits neighborhood properties in protein–protein interaction network (PPIN) and physicochemical properties of amino acids. Our method is validated using the available functional annotations in the PPIN network of Saccharomyces cerevisiae in the latest Munich information center for protein (MIPS) dataset. The PPIN data of S. cerevisiae in MIPS dataset includes 4,554 unique proteins in 13,528 protein–protein interactions after the elimination of the self-replicating and the self-interacting protein pairs. Using the developed FunPred 3.0 tool, we are able to achieve the mean precision, the recall and the F-score values of 0.55, 0.82 and 0.66, respectively. FunPred 3.0 is then used to predict the functions of unpredicted protein pairs (incomplete and missing functional annotations) in MIPS dataset of S. cerevisiae. The method is also capable of predicting the subcellular localization of proteins along with its corresponding functions. The code and the complete prediction results are available freely at: https://github.com/SovanSaha/FunPred-3.0.git.
Affiliation(s)
- Sovan Saha
  - Department of Computer Science and Engineering, Dr. Sudhir Chandra Sur Degree Engineering College, Kolkata, West Bengal, India
- Piyali Chatterjee
  - Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, India
- Subhadip Basu
  - Department of Computer Science and Engineering, Jadavpur University, Kolkata, West Bengal, India
- Mita Nasipuri
  - Department of Computer Science and Engineering, Jadavpur University, Kolkata, West Bengal, India
- Dariusz Plewczynski
  - Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Warsaw, Poland
  - Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
12
Basu S, Plewczynski D. Emerging and threatening infectious diseases. Brief Funct Genomics 2019; 17:372-373. [PMID: 30476067 DOI: 10.1093/bfgp/ely038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2018] [Accepted: 10/19/2018] [Indexed: 11/14/2022] Open
Affiliation(s)
- Subhadip Basu
  - Department of Computer Science and Engineering, Jadavpur University, India
- Dariusz Plewczynski
  - Center of New Technologies, University of Warsaw, Warsaw, Poland
  - Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warsaw, Poland
13
Lian X, Yang S, Li H, Fu C, Zhang Z. Machine-Learning-Based Predictor of Human–Bacteria Protein–Protein Interactions by Incorporating Comprehensive Host-Network Properties. J Proteome Res 2019; 18:2195-2205. [DOI: 10.1021/acs.jproteome.9b00074] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Indexed: 02/08/2023]
Affiliation(s)
- Xianyi Lian
  - State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing 100193, China
- Shiping Yang
  - State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing 100193, China
- Hong Li
  - Key Laboratory of Tropical Biological Resources of Ministry of Education, Hainan University, Haikou, 570228, China
- Chen Fu
  - State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing 100193, China
- Ziding Zhang
  - State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing 100193, China
14
Saha S, Prasad A, Chatterjee P, Basu S, Nasipuri M. Protein function prediction from protein-protein interaction network using gene ontology based neighborhood analysis and physico-chemical features. J Bioinform Comput Biol 2018; 16:1850025. [PMID: 30400756 DOI: 10.1142/s0219720018500257] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Indexed: 11/18/2022]
Abstract
Protein Function Prediction from Protein-Protein Interaction Network (PPIN) and physico-chemical features using the Gene Ontology (GO) classification are indeed very useful for assigning biological or biochemical functions to a protein. They also lead to the identification of those significant proteins which are responsible for the generation of various diseases whose drugs are still yet to be discovered. So, the prediction of GO functional terms from PPIN and sequence is an important field of study. In this work, we have proposed a methodology, Multi Label Protein Function Prediction (ML_PFP) which is based on Neighborhood analysis empowered with physico-chemical features of constituent amino acids to predict the functional group of unannotated protein. A protein does not perform functions in isolation rather it performs functions in a group by interacting with others. So a protein is involved in many functions or, in other words, may be associated with multiple functional groups or labels or GO terms. Though functional group of other known interacting partner protein and its physico-chemical features provide useful information, assignment of multiple labels to unannotated protein is a very challenging task. Here, we have taken Homo sapiens or Human PPIN as well as Saccharomyces cerevisiae or yeast PPIN along with their GO terms to predict functional groups or GO terms of unannotated proteins. This work has become very challenging as both Human and Yeast protein dataset are voluminous and complex in nature and multi-label functional groups assignment has also added a new dimension to this challenge. Our algorithm has been observed to achieve a better performance in Cellular Function, Molecular Function and Biological Process of both yeast and human network when compared with the other existing state-of-the-art methodologies which will be discussed in detail in the results section.
Affiliation(s)
- Sovan Saha
  - Department of Computer Science & Engineering, Dr. Sudhir Chandra Sur Degree Engineering College, 540, Dum Dum Road, Near Dum Dum Jn. Station, Surermath, Kolkata 700074, India
- Abhimanyu Prasad
  - Department of Computer Science & Engineering, Dr. Sudhir Chandra Sur Degree Engineering College, 540, Dum Dum Road, Near Dum Dum Jn. Station, Surermath, Kolkata 700074, India
- Piyali Chatterjee
  - Department of Computer Science & Engineering, Netaji Subhash Engineering College, Techno City, Panchpota, Garia, Kolkata 700152, India
- Subhadip Basu
  - Department of Computer Science & Engineering, Jadavpur University, 188, Raja S.C. Mallick Road, Kolkata 700032, India
- Mita Nasipuri
  - Department of Computer Science & Engineering, Jadavpur University, 188, Raja S.C. Mallick Road, Kolkata 700032, India