1
Kushwah AS, Dixit H, Upadhyay V, Verma SK, Prasad R. The study of iron- and copper-binding proteome of Fusarium oxysporum and its effector candidates. Proteins 2024; 92:1097-1112. [PMID: 38666709 DOI: 10.1002/prot.26696] [Received: 07/05/2023] [Revised: 03/26/2024] [Accepted: 04/08/2024] [Indexed: 08/07/2024]
Abstract
Fusarium oxysporum f. sp. lycopersici is a phytopathogen that causes vascular wilt disease in tomato plants. The survival tactics of both pathogens and hosts depend on intricate interactions between host plants and pathogenic microbes. Iron-binding proteins (IBPs) and copper-binding proteins (CBPs) play a crucial role in these interactions by participating in enzyme reactions, virulence, metabolism, and transport processes. We employed high-throughput computational tools at the sequence and structural levels to investigate the IBPs and CBPs of F. oxysporum. A total of 124 IBPs and 37 CBPs were identified in the proteome of Fusarium. The ranking of amino acids based on their affinity for binding with iron is Glu > His > Asp > Asn > Cys, and for copper it is His > Asp > Cys. The functional annotation, determination of subcellular localization, and Gene Ontology analysis of these putative IBPs and CBPs have unveiled their potential involvement in a diverse array of cellular and biological processes. Three iron-binding glycosyl hydrolase family proteins, along with four CBPs with carbohydrate-binding domains, have been identified as potential effector candidates. These proteins are distinct from the host Solanum lycopersicum proteome. Moreover, they are known to be located extracellularly and function as enzymes that degrade the host cell wall during pathogen-host interactions. The insights gained from this report on the role of metal ions in plant-pathogen interactions can help develop a better understanding of their fundamental biology and control vascular wilt disease in tomato plants.
Affiliation(s)
- Ankita Singh Kushwah
  - Department of Biosciences and Bioengineering, Indian Institute of Technology Roorkee, Roorkee, Uttarakhand, India
- Himisha Dixit
  - Centre for Computational Biology & Bioinformatics, Central University of Himachal Pradesh, Kangra, Himachal Pradesh, India
- Vipin Upadhyay
  - Centre for Computational Biology & Bioinformatics, Central University of Himachal Pradesh, Kangra, Himachal Pradesh, India
- Shailender Kumar Verma
  - Centre for Computational Biology & Bioinformatics, Central University of Himachal Pradesh, Kangra, Himachal Pradesh, India
  - Department of Environmental Studies, University of Delhi, North Campus, Delhi, India
- Ramasare Prasad
  - Department of Biosciences and Bioengineering, Indian Institute of Technology Roorkee, Roorkee, Uttarakhand, India
2
Yadav AK, Gupta PK, Singh TR. PMTPred: machine-learning-based prediction of protein methyltransferases using the composition of k-spaced amino acid pairs. Mol Divers 2024:10.1007/s11030-024-10937-2. [PMID: 39033257 DOI: 10.1007/s11030-024-10937-2] [Received: 05/06/2024] [Accepted: 07/10/2024] [Indexed: 07/23/2024]
Abstract
Protein methyltransferases (PMTs) are a group of enzymes that catalyze the transfer of a methyl group to their substrates. These enzymes play an important role in epigenetic regulation and can methylate various substrates, including DNA, RNA, proteins, and small-molecule secondary metabolites. Dysregulation of methyltransferases is implicated in various human cancers. Given the well-recognized significance of PMTs, reliable and efficient identification methods are essential. In the present work, we propose a machine-learning-based method for the identification of PMTs. Various sequence-based features were calculated, and prediction models were trained with several machine-learning algorithms using tenfold cross-validation. When each model was evaluated on the dataset, the SVM-based CKSAAP model achieved the highest prediction accuracy with balanced sensitivity and specificity. This SVM model also outperformed deep-learning algorithms for the prediction of PMTs. In addition, cross-database validation was performed to ensure the robustness of the model. Feature importance was assessed using Shapley additive explanations (SHAP) values, providing insights into the contributions of different features to the model's predictions. Finally, the SVM-based CKSAAP model was implemented in a standalone tool, PMTPred, owing to its consistent performance during independent testing and cross-database evaluation. We believe that PMTPred will be a useful and efficient tool for the identification of PMTs. PMTPred is freely available for download at https://github.com/ArvindYadav7/PMTPred and http://www.bioinfoindia.org/PMTPred/home.html for research and academic use.
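The composition of k-spaced amino acid pairs (CKSAAP) named in this abstract can be illustrated with a short sketch. The version below follows the commonly used definition (for each gap k, count ordered residue pairs separated by k positions and normalize by the number of pairs); the function name `cksaap`, the default `k_max=2`, and the toy sequence are assumptions for illustration and are not taken from PMTPred.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# 400 ordered residue pairs, in a fixed order shared by every gap k.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def cksaap(sequence, k_max=2):
    """Return CKSAAP features for gaps k = 0..k_max as one flat list.

    Illustrative sketch only: k_max and the handling of non-standard
    residues are assumptions, not PMTPred's documented settings.
    """
    sequence = sequence.upper()
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        n_pairs = 0
        # A k-spaced pair is residue i together with residue i + k + 1.
        for i in range(len(sequence) - k - 1):
            pair = sequence[i] + sequence[i + k + 1]
            if pair in counts:  # skip pairs containing non-standard residues
                counts[pair] += 1
                n_pairs += 1
        # Normalize so the 400 values for each gap sum to 1 (or stay 0).
        features.extend(counts[p] / n_pairs if n_pairs else 0.0 for p in PAIRS)
    return features


# Toy sequence (hypothetical, not a real methyltransferase).
toy_vector = cksaap("MKVLAAGIVKV", k_max=2)
```

With `k_max=2` the vector has 3 x 400 = 1200 entries; such a vector could then be passed to any classifier, for example an SVM as described in the abstract.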
Affiliation(s)
- Arvind Kumar Yadav
  - Department of Biotechnology and Bioinformatics, Jaypee University of Information Technology, Solan 173234, Himachal Pradesh, India
- Pradeep Kumar Gupta
  - Department of Computer Science and Engineering, Jaypee University of Information Technology, Solan 173234, Himachal Pradesh, India
  - School of Computing, Department of Data Science and Engineering, Mohan Babu University, Tirupati 517102, Andhra Pradesh, India
- Tiratha Raj Singh
  - Department of Biotechnology and Bioinformatics, Jaypee University of Information Technology, Solan 173234, Himachal Pradesh, India
  - Centre of Excellence in Healthcare Technologies and Informatics (CHETI), Department of Biotechnology and Bioinformatics, Jaypee University of Information Technology, Solan 173234, Himachal Pradesh, India
3
Madugula SS, Pujar P, Nammi B, Wang S, Jayasinghe-Arachchige VM, Pham T, Mashburn D, Artiles M, Liu J. Identification of Family-Specific Features in Cas9 and Cas12 Proteins: A Machine Learning Approach Using Complete Protein Feature Spectrum. J Chem Inf Model 2024; 64:4897-4911. [PMID: 38838358 DOI: 10.1021/acs.jcim.4c00625] [Indexed: 06/07/2024]
Abstract
The recent development of CRISPR-Cas technology holds promise to correct gene-level defects for genetic diseases. The key element of the CRISPR-Cas system is the Cas protein, a nuclease that can edit the gene of interest assisted by guide RNA. However, these Cas proteins suffer from inherent limitations such as large size, low cleavage efficiency, and off-target effects, hindering their widespread application as a gene editing tool. Therefore, there is a need to identify novel Cas proteins with improved editing properties, for which it is necessary to understand the underlying features governing the Cas families. In this study, we aim to elucidate the unique protein features associated with Cas9 and Cas12 families and identify the features distinguishing each family from non-Cas proteins. Here, we built Random Forest (RF) binary classifiers to distinguish Cas12 and Cas9 proteins from non-Cas proteins, respectively, using the complete protein feature spectrum (13,494 features) encoding various physicochemical, topological, constitutional, and coevolutionary information on Cas proteins. Furthermore, we built multiclass RF classifiers differentiating Cas9, Cas12, and non-Cas proteins. All the models were evaluated rigorously on the test and independent data sets. The Cas12 and Cas9 binary models achieved a high overall accuracy of 92% and 95% on their respective independent data sets, while the multiclass classifier achieved an F1 score of close to 0.98. We observed that Quasi-Sequence-Order (QSO) descriptors like Schneider.lag and Composition descriptors like charge, volume, and polarizability are predominant in the Cas12 family. Conversely, Amino Acid Composition descriptors, especially Tripeptide Composition (TPC), predominate in the Cas9 family.
Four of the top 10 descriptors identified in Cas9 classification are tripeptides PWN, PYY, HHA, and DHI, which are seen to be conserved across all Cas9 proteins and located within different catalytically important domains of the Streptococcus pyogenes Cas9 (SpCas9) structure. Among these, DHI and HHA are well-known to be involved in the DNA cleavage activity of the SpCas9 protein. Mutation studies have highlighted the significance of the PWN tripeptide in PAM recognition and DNA cleavage activity of SpCas9, while Y450 from the PYY tripeptide plays a crucial role in reducing off-target effects and improving the specificity in SpCas9. Leveraging our machine learning (ML) pipeline, we identified numerous Cas9 and Cas12 family-specific features. These features offer valuable insights for future experimental and computational studies aiming at designing Cas systems with enhanced gene-editing properties. These features suggest plausible structural modifications that can effectively guide the development of Cas proteins with improved editing capabilities.
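To make the Tripeptide Composition (TPC) descriptor concrete, the following is a minimal sketch under the usual definition: the frequency of each of the 20^3 = 8000 possible tripeptides among all overlapping three-residue windows of a sequence. The function name and the toy sequence are hypothetical; the toy sequence merely embeds the four tripeptides named above and is not a real Cas9 fragment.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TRIPEPTIDES = ["".join(t) for t in product(AMINO_ACIDS, repeat=3)]  # 8000 entries
TRIPEPTIDE_INDEX = {t: i for i, t in enumerate(TRIPEPTIDES)}


def tripeptide_composition(sequence):
    """Return the 8000-dimensional tripeptide frequency vector of a sequence."""
    sequence = sequence.upper()
    vector = [0.0] * len(TRIPEPTIDES)
    n_windows = 0
    # Slide a window of three residues along the sequence, one step at a time.
    for i in range(len(sequence) - 2):
        index = TRIPEPTIDE_INDEX.get(sequence[i:i + 3])
        if index is not None:  # ignore windows with non-standard residues
            vector[index] += 1
            n_windows += 1
    if n_windows:
        vector = [count / n_windows for count in vector]
    return vector


# Hypothetical toy sequence containing DHI, PWN, PYY and HHA.
toy_tpc = tripeptide_composition("MDHIPWNAPYYHHA")
pwn_frequency = toy_tpc[TRIPEPTIDE_INDEX["PWN"]]
```

In a pipeline like the one described, each such frequency would be one column of the feature matrix, so a classifier's feature-importance ranking can point back to individual tripeptides.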
Affiliation(s)
- Sita Sirisha Madugula
  - Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center, 3500 Camp Bowie Blvd, Fort Worth, Texas 76107, United States
- Pranav Pujar
  - Department of Industrial, Manufacturing and Systems Engineering, University of Texas at Arlington, 701 South Nedderman Drive, Arlington, Texas 76019, United States
- Bharani Nammi
  - Department of Industrial, Manufacturing and Systems Engineering, University of Texas at Arlington, 701 South Nedderman Drive, Arlington, Texas 76019, United States
- Shouyi Wang
  - Department of Industrial, Manufacturing and Systems Engineering, University of Texas at Arlington, 701 South Nedderman Drive, Arlington, Texas 76019, United States
- Vindi M Jayasinghe-Arachchige
  - Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center, 3500 Camp Bowie Blvd, Fort Worth, Texas 76107, United States
- Tyler Pham
  - School of Biomedical Sciences, University of North Texas Health Science Center, 3500 Camp Bowie Blvd, Fort Worth, Texas 76107, United States
- Dominic Mashburn
  - Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center, 3500 Camp Bowie Blvd, Fort Worth, Texas 76107, United States
- Maria Artiles
  - School of Biomedical Sciences, University of North Texas Health Science Center, 3500 Camp Bowie Blvd, Fort Worth, Texas 76107, United States
- Jin Liu
  - Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center, 3500 Camp Bowie Blvd, Fort Worth, Texas 76107, United States
  - School of Biomedical Sciences, University of North Texas Health Science Center, 3500 Camp Bowie Blvd, Fort Worth, Texas 76107, United States
4
Ghafoor H, Asim MN, Ibrahim MA, Ahmed S, Dengel A. CAPTURE: Comprehensive anti-cancer peptide predictor with a unique amino acid sequence encoder. Comput Biol Med 2024; 176:108538. [PMID: 38759585 DOI: 10.1016/j.compbiomed.2024.108538] [Received: 01/08/2024] [Revised: 04/26/2024] [Accepted: 04/28/2024] [Indexed: 05/19/2024]
Abstract
The key properties of anticancer peptides (ACPs), including bioactivity, high efficacy, low toxicity, and lack of drug resistance, make them ideal candidates for cancer therapies. To explore the potential of ACPs and accelerate the development of cancer therapies, 53 artificial-intelligence-supported computational predictors have been developed for ACP versus non-ACP classification, but only one predictor has been developed for annotating ACP functional types. Moreover, these predictors extract amino acid distribution patterns to transform peptide sequences into statistical vectors that are then fed to classifiers for discriminating peptide sequences and annotating peptide functional classes. Overall, these predictors fail to extract diverse types of amino acid distribution patterns from peptide sequences. This paper presents a unique CARE encoder that transforms peptide sequences into statistical vectors by extracting four different types of distribution patterns: correlation, distribution, composition, and transition. On a public benchmark dataset, the potential of the proposed encoder is explored under two evaluation settings, intrinsic and extrinsic. Extrinsic evaluation indicates that 12 different machine learning classifiers achieve superior performance with the proposed encoder compared to 55 existing encoders. Furthermore, intrinsic evaluation reveals that, unlike existing encoders, the proposed encoder generates more discriminative clusters for the ACP and non-ACP classes. Across 8 public benchmark ACP versus non-ACP classification datasets, the CAPTURE predictor, which combines the proposed encoder with an AdaBoost classifier, outperforms existing predictors by an average of 1%, 4%, and 2% in accuracy, recall, and MCC, respectively. In a generalizability case study across 7 benchmark antimicrobial peptide classification datasets, CAPTURE surpasses existing predictors by an average of 2% in AU-ROC.
The CAPTURE predictive pipeline, combined with the label powerset method, outperforms the state-of-the-art ACP functional type predictor by 5%, 5%, 5%, 6%, and 3% in average accuracy, subset accuracy, precision, recall, and F1, respectively. The CAPTURE web application is available at https://sds_genetic_analysis.opendfki.de/CAPTURE.
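The label powerset method mentioned here turns a multi-label problem into a single multi-class one by treating every distinct combination of labels as its own class. The sketch below shows only that transformation and the subset-accuracy metric; the function names and the functional-type labels ("breast", "lung", "colon") are hypothetical placeholders, not CAPTURE's actual classes or code.

```python
def label_powerset_fit(label_sets):
    """Assign one class id to each distinct label combination, in first-seen order."""
    classes = {}
    for labels in label_sets:
        key = frozenset(labels)
        if key not in classes:
            classes[key] = len(classes)
    return classes


def label_powerset_transform(label_sets, classes):
    """Replace each label set by the class id of its combination."""
    return [classes[frozenset(labels)] for labels in label_sets]


def label_powerset_inverse(class_ids, classes):
    """Map predicted class ids back to sorted label lists."""
    inverse = {class_id: key for key, class_id in classes.items()}
    return [sorted(inverse[c]) for c in class_ids]


def subset_accuracy(true_sets, predicted_sets):
    """Fraction of samples whose predicted label set matches exactly."""
    hits = sum(set(t) == set(p) for t, p in zip(true_sets, predicted_sets))
    return hits / len(true_sets)


# Hypothetical multi-label annotations for four peptides.
annotations = [{"breast", "lung"}, {"colon"}, {"lung", "breast"}, {"breast"}]
classes = label_powerset_fit(annotations)
class_ids = label_powerset_transform(annotations, classes)
restored = label_powerset_inverse(class_ids, classes)
```

Any single-label classifier can then be trained on `class_ids`; its predictions are mapped back to label sets with `label_powerset_inverse`. One known trade-off of this design is that label combinations absent from the training data can never be predicted.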
Affiliation(s)
- Hina Ghafoor
  - Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
- Muhammad Nabeel Asim
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
- Muhammad Ali Ibrahim
  - Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
- Sheraz Ahmed
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
- Andreas Dengel
  - Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
5
Tan Q, Xiao J, Chen J, Wang Y, Zhang Z, Zhao T, Li Y. ifDEEPre: large protein language-based deep learning enables interpretable and fast predictions of enzyme commission numbers. Brief Bioinform 2024; 25:bbae225. [PMID: 38942594 PMCID: PMC11213619 DOI: 10.1093/bib/bbae225] [Received: 12/11/2023] [Revised: 03/26/2024] [Accepted: 04/22/2024] [Indexed: 06/30/2024] Open
Abstract
Accurate understanding of the biological functions of enzymes is vital for various tasks in both pathologies and industrial biotechnology. However, the existing methods are usually not fast enough and lack explanations on the prediction results, which severely limits their real-world applications. Following our previous work, DEEPre, we propose a new interpretable and fast version (ifDEEPre) by designing novel self-guided attention and incorporating biological knowledge learned via large protein language models to accurately predict the commission numbers of enzymes and confirm their functions. Novel self-guided attention is designed to optimize the unique contributions of representations, automatically detecting key protein motifs to provide meaningful interpretations. Representations learned from raw protein sequences are strictly screened to improve the running speed of the framework, 50 times faster than DEEPre while requiring 12.89 times smaller storage space. Large language modules are incorporated to learn physical properties from hundreds of millions of proteins, extending biological knowledge of the whole network. Extensive experiments indicate that ifDEEPre outperforms all the current methods, achieving more than 14.22% larger F1-score on the NEW dataset. Furthermore, the trained ifDEEPre models accurately capture multi-level protein biological patterns and infer evolutionary trends of enzymes by taking only raw sequences without label information. Meanwhile, ifDEEPre predicts the evolutionary relationships between different yeast sub-species, which are highly consistent with the ground truth. Case studies indicate that ifDEEPre can detect key amino acid motifs, which have important implications for designing novel enzymes. A web server running ifDEEPre is available at https://proj.cse.cuhk.edu.hk/aihlab/ifdeepre/ to provide convenient services to the public. Meanwhile, ifDEEPre is freely available on GitHub at https://github.com/ml4bio/ifDEEPre/.
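Enzyme Commission (EC) numbers are hierarchical: four dot-separated fields that narrow from the main enzyme class down to the specific enzyme. The sketch below splits an EC number into its cumulative levels and measures how deep a prediction agrees with the true annotation. It illustrates the label structure only; the helper names are hypothetical, and how ifDEEPre itself organizes its per-level predictions is not described in this abstract.

```python
# The seven top-level EC classes.
EC_CLASSES = {
    "1": "oxidoreductases",
    "2": "transferases",
    "3": "hydrolases",
    "4": "lyases",
    "5": "isomerases",
    "6": "ligases",
    "7": "translocases",
}


def ec_levels(ec_number):
    """Split e.g. 'EC 3.4.21.4' into ['3', '3.4', '3.4.21', '3.4.21.4']."""
    text = ec_number.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {ec_number!r}")
    return [".".join(parts[: i + 1]) for i in range(4)]


def matching_depth(predicted, actual):
    """Number of leading hierarchy levels (0 to 4) on which two EC numbers agree."""
    depth = 0
    for p, a in zip(ec_levels(predicted), ec_levels(actual)):
        if p != a:
            break
        depth += 1
    return depth


levels = ec_levels("EC 3.4.21.4")        # trypsin, a hydrolase
main_class = EC_CLASSES[levels[0]]
depth = matching_depth("3.4.21.4", "3.4.21.5")
```

Comparing cumulative prefixes rather than individual fields means a mismatch at an early level cannot be masked by coincidentally equal later fields.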
Affiliation(s)
- Qingxiong Tan
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China
- Jin Xiao
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China
- Jiayang Chen
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China
- Yixuan Wang
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China
- Zeliang Zhang
  - Department of Computer Science, University of Rochester, Rochester, New York State, USA
  - School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
- Yu Li
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China
  - The CUHK Shenzhen Research Institute, Nanshan, Shenzhen, China
6
Idhaya T, Suruliandi A, Raja SP. A Comprehensive Review on Machine Learning Techniques for Protein Family Prediction. Protein J 2024; 43:171-186. [PMID: 38427271 DOI: 10.1007/s10930-024-10181-5] [Accepted: 01/19/2024] [Indexed: 03/02/2024]
Abstract
Proteomics is a field dedicated to the analysis of proteins in cells, tissues, and organisms, aiming to gain insights into their structures, functions, and interactions. A crucial aspect within proteomics is protein family prediction, which involves identifying evolutionary relationships between proteins by examining similarities in their sequences or structures. This approach holds great potential for applications such as drug discovery and functional annotation of genomes. However, current methods for protein family prediction have certain limitations, including limited accuracy, high false positive rates, and challenges in handling large datasets. Some methods also rely on homologous sequences or protein structures, which introduce biases and restrict their applicability to specific protein families or structures. To overcome these limitations, researchers have turned to machine learning (ML) approaches that can identify connections between protein features and simplify complex high-dimensional datasets. This paper presents a comprehensive survey of articles that employ various ML techniques for predicting protein families. The primary objective is to explore and improve ML techniques specifically for protein family prediction, thus advancing future research in the field. Through qualitative and quantitative analyses of ML techniques, it is evident that multiple methods utilizing a range of classifiers have been applied for protein family prediction. However, there has been limited focus on developing novel classifiers for protein family classification, highlighting the urgent need for improved approaches in this area. By addressing these challenges, this research aims to enhance the accuracy and effectiveness of protein family prediction, ultimately facilitating advancements in proteomics and its diverse applications.
Affiliation(s)
- T Idhaya
  - Department of Computer Science and Engineering, Manonmaniam Sundaranar University, Tirunelveli, TamilNadu, India
- A Suruliandi
  - Department of Computer Science and Engineering, Manonmaniam Sundaranar University, Tirunelveli, TamilNadu, India
- S P Raja
  - School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, TamilNadu, India
7
Ahmed SH, Bose DB, Khandoker R, Rahman MS. StackDPP: a stacking ensemble based DNA-binding protein prediction model. BMC Bioinformatics 2024; 25:111. [PMID: 38486135 PMCID: PMC10941422 DOI: 10.1186/s12859-024-05714-9] [Received: 12/08/2022] [Accepted: 02/20/2024] [Indexed: 03/17/2024] Open
Abstract
BACKGROUND DNA-binding proteins (DNA-BPs) are proteins that bind and interact with DNA. DNA-BPs regulate and affect numerous biological processes, such as transcription, DNA replication, repair, and organization of chromosomal DNA. Very few proteins, however, are DNA-binding in nature. Therefore, it is necessary to develop an efficient predictor for identifying DNA-BPs. RESULTS In this work, we propose new benchmark datasets for the DNA-binding protein prediction problem. We discovered several quality concerns with the widely used benchmark datasets, PDB1075 (for training) and PDB186 (for independent testing), which necessitated the preparation of new benchmark datasets. Our proposed datasets UNIPROT1424 and UNIPROT356 can be used for model training and independent testing, respectively. We retrained selected state-of-the-art DNA-BP predictors on the new dataset and reported their performance results. We also trained a novel predictor using the new benchmark dataset. We extracted features from various feature categories, then used a Random Forest classifier and Recursive Feature Elimination with Cross-validation (RFECV) to select the optimal set of 452 features. We then proposed a stacking ensemble architecture as our final prediction model. Named Stacking Ensemble Model for DNA-binding Protein Prediction, or StackDPP for short, our model achieved 0.92, 0.92, and 0.93 accuracy in 10-fold cross-validation, jackknife, and independent testing, respectively. CONCLUSION StackDPP performed very well in cross-validation testing and outperformed all the state-of-the-art prediction models in independent testing. Its performance scores in cross-validation testing generalized very well to the independent test set. The source code of the model is publicly available at https://github.com/HasibAhmed1624/StackDPP. Therefore, we expect that this generalized model can be adopted by researchers and practitioners to identify novel DNA-binding proteins.
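Jackknife testing, as reported here alongside 10-fold cross-validation, is leave-one-out evaluation: each sample is held out once, the model is fitted on the rest, and the held-out sample is predicted. The sketch below shows the procedure with a deliberately simple nearest-centroid classifier standing in for the real model; the classifier, the function names, and the toy data are assumptions for illustration and are unrelated to StackDPP's stacking ensemble.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Predict the label whose class centroid is closest to x (toy stand-in model)."""
    sums, counts = {}, {}
    for features, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    best_label, best_dist = None, float("inf")
    for label, acc in sums.items():
        centroid = [s / counts[label] for s in acc]
        dist = sum((c - v) ** 2 for c, v in zip(centroid, x))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def jackknife_accuracy(X, y, predict=nearest_centroid_predict):
    """Leave-one-out accuracy: hold out each sample once, train on the others."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)


# Toy two-feature data with two well-separated classes (hypothetical values).
toy_X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
toy_y = [0, 0, 0, 1, 1, 1]
toy_accuracy = jackknife_accuracy(toy_X, toy_y)
```

Because every sample requires a fresh fit, the jackknife costs n model fits for n samples, which is why k-fold cross-validation is often reported next to it for larger datasets.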
Affiliation(s)
- Sheikh Hasib Ahmed
  - Department of CSE, BUET, ECE Building, West Palashi, Dhaka, 1000, Bangladesh
- Rafi Khandoker
  - Department of CSE, BUET, ECE Building, West Palashi, Dhaka, 1000, Bangladesh
- M Saifur Rahman
  - Department of CSE, BUET, ECE Building, West Palashi, Dhaka, 1000, Bangladesh
8
Dutta S, Zunjare RU, Sil A, Mishra DC, Arora A, Gain N, Chand G, Chhabra R, Muthusamy V, Hossain F. Prediction of matrilineal specific patatin-like protein governing in-vivo maternal haploid induction in maize using support vector machine and di-peptide composition. Amino Acids 2024; 56:20. [PMID: 38460024 DOI: 10.1007/s00726-023-03368-0] [Received: 02/17/2023] [Accepted: 12/05/2023] [Indexed: 03/11/2024]
Abstract
The mutant matrilineal (mtl) gene encoding patatin-like phospholipase activity is involved in in-vivo maternal haploid induction in maize. Doubling of chromosomes in haploids by colchicine treatment leads to complete fixation of inbreds in just one generation, compared to 6-7 generations of selfing. Thus, knowledge of patatin-like proteins in other crops assumes great significance for in-vivo haploid induction. So far, no online tool is available that can classify unknown proteins as patatin-like proteins. Here, we aimed to optimize a machine learning-based algorithm to predict the patatin-like phospholipase activity of unknown proteins. Four different kernels [radial basis function (RBF), sigmoid, polynomial, and linear] were used for building support vector machine (SVM) classifiers using six different sequence-based compositional features (AAC, DPC, GDPC, CTDC, CTDT, and GAAC). A total of 1170 protein sequences were analyzed, comprising patatin-like proteins (585 sequences) from various monocots, dicots, and microbes and non-patatin-like proteins (585 sequences) from different subspecies of Zea mays. RBF and polynomial kernels were quite promising in the prediction of patatin-like proteins. Among the six sequence-based compositional features, di-peptide composition attained >90% prediction accuracy using RBF and polynomial kernels. Using mutual information, the most explanatory dipeptides, those contributing most to the prediction process, were identified. The knowledge generated in this study can be utilized in other crops prior to the initiation of any experiment. The developed SVM model opens a new paradigm for scientists working on in-vivo haploid induction in commercial crops. This is the first report of machine learning for the identification of proteins with patatin-like activity.
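The four SVM kernels compared in this abstract differ only in how they score the similarity of two feature vectors. The sketch below writes them out in their standard textbook forms; the parameter names `gamma`, `coef0`, and `degree` and their default values are assumptions chosen for illustration, since the study's actual hyperparameters are not given here.

```python
import math


def dot(x, y):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    """K(x, y) = <x, y>"""
    return dot(x, y)


def polynomial_kernel(x, y, gamma=1.0, coef0=0.0, degree=3):
    """K(x, y) = (gamma * <x, y> + coef0) ** degree"""
    return (gamma * dot(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    """K(x, y) = tanh(gamma * <x, y> + coef0)"""
    return math.tanh(gamma * dot(x, y) + coef0)


# Two toy feature vectors (hypothetical values, e.g. two dipeptide frequencies).
u, v = [1.0, 2.0], [3.0, 4.0]
kernel_values = {
    "linear": linear_kernel(u, v),
    "polynomial": polynomial_kernel(u, v, coef0=1.0, degree=2),
    "rbf": rbf_kernel(u, v, gamma=0.5),
    "sigmoid": sigmoid_kernel(u, v, gamma=0.01),
}
```

The RBF kernel depends only on the distance between the two vectors, while the other three depend on their inner product, which is one reason the kernels can rank differently on composition features such as DPC.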
Affiliation(s)
- Suman Dutta
  - ICAR-Indian Agricultural Research Institute, New Delhi, India
- Anirban Sil
  - ICAR-Indian Agricultural Research Institute, New Delhi, India
- Alka Arora
  - ICAR-Indian Agricultural Statistical Research Institute, New Delhi, India
- Nisrita Gain
  - ICAR-Indian Agricultural Research Institute, New Delhi, India
- Gulab Chand
  - ICAR-Indian Agricultural Research Institute, New Delhi, India
- Rashmi Chhabra
  - ICAR-Indian Agricultural Research Institute, New Delhi, India
- Firoz Hossain
  - ICAR-Indian Agricultural Research Institute, New Delhi, India
9
Zhong G, Deng L. ACPScanner: Prediction of Anticancer Peptides by Integrated Machine Learning Methodologies. J Chem Inf Model 2024; 64:1092-1104. [PMID: 38277774 DOI: 10.1021/acs.jcim.3c01860] [Indexed: 01/28/2024]
Abstract
Novel therapeutic alternatives for cancer treatment are increasingly attracting global research attention. Although chemotherapy remains a primary clinical solution, it often results in significant side effects for patients. In recent years, anticancer peptides (ACPs) have emerged as promising candidates for highly specific anticancer drugs, and a number of computational approaches have been developed to identify ACPs. However, existing methods do not recognize specific types of anticancer function. In this article, we propose ACPScanner, an integrated approach that first distinguishes ACPs from non-ACPs and then predicts several specific activity types for potential ACPs. We incorporate sequence features, physicochemical properties, secondary structure information, and deep representation learning embeddings generated by artificial intelligence methods to build the feature space. Customized deep learning and statistical learning methods are combined to form an integral architecture for the comprehensive two-level prediction task. To the best of our knowledge, ACPScanner is the first approach for specific ACP activity prediction. The comparative evaluation illustrates that ACPScanner achieves competitive prediction performance in both prediction phases in independent tests. We establish a web server at http://acpscanner.denglab.org to provide convenient usage of ACPScanner and make the predictive framework, source code, and data sets publicly available.
Affiliation(s)
- Guolun Zhong
  - School of Computer Science and Engineering, Central South University, Changsha 410000, China
- Lei Deng
  - School of Computer Science and Engineering, Central South University, Changsha 410000, China
10
Satalkar V, Degaga GD, Li W, Pang YT, McShan AC, Gumbart JC, Mitchell JC, Torres MP. Generative β-hairpin design using a residue-based physicochemical property landscape. Biophys J 2024:S0006-3495(24)00070-5. [PMID: 38297834 DOI: 10.1016/j.bpj.2024.01.029] [Received: 10/23/2023] [Revised: 12/20/2023] [Accepted: 01/25/2024] [Indexed: 02/02/2024] Open
Abstract
De novo peptide design is a new frontier that has broad application potential in the biological and biomedical fields. Most existing models for de novo peptide design are largely based on sequence homology that can be restricted based on evolutionarily derived protein sequences and lack the physicochemical context essential in protein folding. Generative machine learning for de novo peptide design is a promising way to synthesize theoretical data that are based on, but unique from, the observable universe. In this study, we created and tested a custom peptide generative adversarial network intended to design peptide sequences that can fold into the β-hairpin secondary structure. This deep neural network model is designed to establish a preliminary foundation of the generative approach based on physicochemical and conformational properties of 20 canonical amino acids, for example, hydrophobicity and residue volume, using extant structure-specific sequence data from the PDB. The beta generative adversarial network model robustly distinguishes secondary structures of β hairpin from α helix and intrinsically disordered peptides with an accuracy of up to 96% and generates artificial β-hairpin peptide sequences with minimum sequence identities around 31% and 50% when compared against the current NCBI PDB and nonredundant databases, respectively. These results highlight the potential of generative models specifically anchored by physicochemical and conformational property features of amino acids to expand the sequence-to-structure landscape of proteins beyond evolutionary limits.
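A residue-based physicochemical representation of the kind described here replaces each amino acid letter with one or more numeric property values. The sketch below does this for a single property, hydrophobicity, using the Kyte-Doolittle hydropathy scale. The choice of that scale, the function names, and the example peptide are assumptions for illustration; the abstract does not state which scales or encodings the authors' network uses.

```python
# Kyte-Doolittle hydropathy index for the 20 canonical amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_by_property(sequence, scale=KYTE_DOOLITTLE):
    """Map each residue to its property value; raises KeyError on unknown residues."""
    return [scale[residue] for residue in sequence.upper()]


def mean_property(sequence, scale=KYTE_DOOLITTLE):
    """Average property value over the whole peptide."""
    values = encode_by_property(sequence, scale)
    return sum(values) / len(values)


# Illustrative 16-residue peptide (chosen only as an example input).
peptide = "GEWTYDDATKTFTVTE"
profile = encode_by_property(peptide)
average_hydropathy = mean_property(peptide)
```

Stacking several such per-residue profiles (for example hydrophobicity and residue volume, with a second scale supplied through the `scale` argument) would give a position-by-property matrix that a neural network can consume instead of raw letters.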
Affiliation(s)
- Vardhan Satalkar
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia
- Gemechis D Degaga
  - Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee
- Wei Li
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia
- Yui Tik Pang
  - School of Physics, Georgia Institute of Technology, Atlanta, Georgia
- Andrew C McShan
  - School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, Georgia
- James C Gumbart
  - School of Physics, Georgia Institute of Technology, Atlanta, Georgia
  - School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, Georgia
- Julie C Mitchell
  - Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee
- Matthew P Torres
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia
  - School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, Georgia
| |
Collapse
|
11
|
Madugula SS, Pujar P, Bharani N, Wang S, Jayasinghe-Arachchige VM, Pham T, Mashburn D, Artilis M, Liu J. Identification of Family-Specific Features in Cas9 and Cas12 Proteins: A Machine Learning Approach Using Complete Protein Feature Spectrum. bioRxiv 2024:2024.01.22.576286. [PMID: 38328240 PMCID: PMC10849529 DOI: 10.1101/2024.01.22.576286] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/09/2024]
Abstract
The recent development of CRISPR-Cas technology holds promise to correct gene-level defects for genetic diseases. The key element of the CRISPR-Cas system is the Cas protein, a nuclease that can edit the gene of interest assisted by guide RNA. However, these Cas proteins suffer from inherent limitations like large size, low cleavage efficiency, and off-target effects, hindering their widespread application as a gene editing tool. Therefore, there is a need to identify novel Cas proteins with improved editing properties, for which it is necessary to understand the underlying features governing the Cas families. In the current study, we aim to elucidate the unique protein attributes associated with Cas9 and Cas12 families and identify the features that distinguish each family from the other. Here, we built Random Forest (RF) binary classifiers to distinguish Cas12 and Cas9 proteins from non-Cas proteins, respectively, using the complete protein feature spectrum (13,495 features) encoding various physicochemical, topological, constitutional, and coevolutionary information of Cas proteins. Furthermore, we built multiclass RF classifiers differentiating Cas9, Cas12, and non-Cas proteins. All the models were evaluated rigorously on the test and independent datasets. The Cas12 and Cas9 binary models achieved a high overall accuracy of 95% and 97% on their respective independent datasets, while the multiclass classifier achieved a high F1 score of 0.97. We observed that Quasi-sequence-order descriptors like Schneider-lag descriptors and Composition descriptors like charge, volume, and polarizability are essential for the Cas12 family. More interestingly, we discovered that Amino Acid Composition descriptors, especially the Tripeptide Composition (TPC) descriptors, are important for the Cas9 family.
Four of the identified important descriptors of Cas9 classification are the tripeptides PWN, PYY, HHA, and DHI, which are seen to be conserved across all the Cas9 proteins and are located within different catalytically important domains of the Cas9 protein structure. Among these four tripeptides, DHI and HHA are well known to be involved in the DNA cleavage activity of the Cas9 protein. We therefore propose that the other two tripeptides, PWN and PYY, may also be essential for the Cas9 family. Our identified important descriptors enhance the understanding of the catalytic mechanisms of Cas9 and Cas12 proteins and provide valuable insights into the design of novel Cas systems with enhanced gene-editing properties.
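As a rough illustration of the Tripeptide Composition (TPC) descriptor mentioned in this abstract, the following minimal Python sketch counts overlapping tripeptides in a sequence and normalizes by the number of windows. It is not the authors' feature pipeline: the function name `tripeptide_composition` and the toy sequence are assumptions made for illustration, and the sketch returns only the tripeptides actually observed rather than the full 8,000-dimensional vector a descriptor package would normally emit.

```python
from collections import Counter


def tripeptide_composition(sequence: str) -> dict[str, float]:
    """Relative frequency of each overlapping tripeptide in `sequence`.

    Hypothetical helper for illustration; only observed tripeptides are
    returned (a sparse view of the 20**3 = 8000 possible descriptors).
    """
    seq = sequence.upper()
    windows = len(seq) - 2  # number of overlapping 3-residue windows
    if windows <= 0:
        return {}
    counts = Counter(seq[i:i + 3] for i in range(windows))
    return {tri: n / windows for tri, n in counts.items()}


# Toy sequence (made up, not a real Cas9 fragment).
toy = "MDHIPWNDHI"
tpc = tripeptide_composition(toy)
print(tpc["DHI"], tpc["PWN"])
```

A classifier such as a random forest would then consume these frequencies as columns, one per tripeptide, alongside the other descriptor families named in the abstract.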
Affiliation(s)
- Sita Sirisha Madugula
  - Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center, Fort Worth, Texas, United States
- Pranav Pujar
  - Department of Industrial, Manufacturing and Systems Engineering, University of Texas at Arlington, Arlington, Texas, United States
- Nammi Bharani
  - Department of Industrial, Manufacturing and Systems Engineering, University of Texas at Arlington, Arlington, Texas, United States
- Shouyi Wang
  - Department of Industrial, Manufacturing and Systems Engineering, University of Texas at Arlington, Arlington, Texas, United States
- Vindi M. Jayasinghe-Arachchige
  - Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center, Fort Worth, Texas, United States
- Tyler Pham
  - Graduate School of Biomedical Sciences, University of North Texas Health Science Center, Fort Worth, Texas
- Dominic Mashburn
  - Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center, Fort Worth, Texas, United States
- Maria Artilis
  - Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center, Fort Worth, Texas, United States
- Jin Liu
  - Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center, Fort Worth, Texas, United States
  - Graduate School of Biomedical Sciences, University of North Texas Health Science Center, Fort Worth, Texas
12
Chung CR, Liou JT, Wu LC, Horng JT, Lee TY. Multi-label classification and features investigation of antimicrobial peptides with various functional classes. iScience 2023; 26:108250. [PMID: 38025779 PMCID: PMC10679894 DOI: 10.1016/j.isci.2023.108250] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2023] [Revised: 07/15/2023] [Accepted: 10/16/2023] [Indexed: 12/01/2023] Open
Abstract
The challenge of drug-resistant bacteria to global public health has led to increased attention on antimicrobial peptides (AMPs) as a targeted therapeutic alternative with a lower risk of resistance. However, high production costs and limitations in functional class prediction have hindered progress in this field. In this study, we used multi-label classifiers with binary relevance and algorithm adaptation techniques to predict different functions of AMPs across a wide range of pathogen categories, including bacteria, mammalian cells, fungi, viruses, and cancer cells. Our classifiers attained promising AUC scores varying from 0.8492 to 0.9126 on independent testing data. Forward feature selection identified sequence order and charge as critical, with specific amino acids (C and E) as discriminative. These findings provide valuable insights for the design of antimicrobial peptides (AMPs) with multiple functionalities, thus contributing to the broader effort to combat drug-resistant pathogens.
Affiliation(s)
- Chia-Ru Chung
  - Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
- Jhen-Ting Liou
  - Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
- Li-Ching Wu
  - Department of Biomedical Sciences and Engineering, National Central University, Taoyuan, Taiwan
- Jorng-Tzong Horng
  - Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
  - Department of Bioinformatics and Medical Engineering, Asia University, Taoyuan City, Taiwan
- Tzong-Yi Lee
  - Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu City, Taiwan
  - Center for Intelligent Drug Systems and Smart Biodevices (IDS2B), National Yang Ming Chiao Tung University, Hsinchu City, Taiwan
13
Liu Y, Guan S, Jiang T, Fu Q, Ma J, Cui Z, Ding Y, Wu H. DNA protein binding recognition based on lifelong learning. Comput Biol Med 2023; 164:107094. [PMID: 37459792 DOI: 10.1016/j.compbiomed.2023.107094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2023] [Revised: 05/09/2023] [Accepted: 05/27/2023] [Indexed: 09/09/2023]
Abstract
In recent years, research in the field of bioinformatics has focused on predicting the raw sequences of proteins, and some scholars consider DNA-binding protein prediction as a classification task. Many statistical and machine learning-based methods have been widely used in DNA-binding proteins research. The aforementioned methods are indeed more efficient than those based on manual classification, but there is still room for improvement in terms of prediction accuracy and speed. In this study, researchers used Average Blocks, Discrete Cosine Transform, Discrete Wavelet Transform, Global encoding, Normalized Moreau-Broto Autocorrelation and Pseudo position-specific scoring matrix to extract evolutionary features. A dynamic deep network based on lifelong learning architecture was then proposed in order to fuse six features and thus allow for more efficient classification of DNA-binding proteins. The multi-feature fusion allows for a more accurate description of the desired protein information than single features. This model offers a fresh perspective on the dichotomous classification problem in bioinformatics and broadens the application field of lifelong learning. The researchers ran trials on three datasets and contrasted them with other classification techniques to show the model's effectiveness in this study. The findings demonstrated that the model used in this research was superior to other approaches in terms of single-sample specificity (81.0%, 83.0%) and single-sample sensitivity (82.4%, 90.7%), and achieves high accuracy on the benchmark dataset (88.4%, 80.0%, and 76.6%).
Affiliation(s)
- Yongsan Liu
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China
- ShiXuan Guan
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China
- TengSheng Jiang
  - Gusu School, Nanjing Medical University, Suzhou, Jiangsu, China
- Qiming Fu
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China
- Jieming Ma
  - School of Intelligent Engineering, Xijiao Liverpool University, Suzhou, 215123, China
- Zhiming Cui
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China
- Yijie Ding
  - Yangtze Delta Region Institute, University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
- Hongjie Wu
  - School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China
14
Yan J, Zhang B, Zhou M, Campbell-Valois FX, Siu SWI. A deep learning method for predicting the minimum inhibitory concentration of antimicrobial peptides against Escherichia coli using Multi-Branch-CNN and Attention. mSystems 2023; 8:e0034523. [PMID: 37431995 PMCID: PMC10506472 DOI: 10.1128/msystems.00345-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2023] [Accepted: 05/31/2023] [Indexed: 07/12/2023] Open
Abstract
Antimicrobial peptides (AMPs) are a promising alternative to antibiotics to combat drug resistance in pathogenic bacteria. However, the development of AMPs with high potency and specificity remains a challenge, and new tools to evaluate antimicrobial activity are needed to accelerate the discovery process. Therefore, we proposed MBC-Attention, a combination of a multi-branch convolution neural network architecture and attention mechanisms to predict the experimental minimum inhibitory concentration of peptides against Escherichia coli. The optimal MBC-Attention model achieved an average Pearson correlation coefficient (PCC) of 0.775 and a root mean squared error (RMSE) of 0.533 (log μM) in three independent tests of randomly drawn sequences from the data set. This results in a 5-12% improvement in PCC and a 6-13% improvement in RMSE compared to 17 traditional machine learning models and 2 optimally tuned models using random forest and support vector machine. Ablation studies confirmed that the two proposed attention mechanisms, global attention and local attention, contributed largely to performance improvement. IMPORTANCE Antimicrobial peptides (AMPs) are potential candidates for replacing conventional antibiotics to combat drug resistance in pathogenic bacteria. Therefore, it is necessary to evaluate the antimicrobial activity of AMPs quantitatively. However, wet-lab experiments are labor-intensive and time-consuming. To accelerate the evaluation process, we develop a deep learning method called MBC-Attention to regress the experimental minimum inhibitory concentration of AMPs against Escherichia coli. The proposed model outperforms traditional machine learning methods. Data, scripts to reproduce experiments, and the final production models are available on GitHub.
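The two regression metrics reported in this abstract, the Pearson correlation coefficient (PCC) and the root mean squared error (RMSE), can be sketched in plain Python as follows. This is a generic illustration of the metrics, not the MBC-Attention evaluation code; the function names and the example values (standing in for log μM MIC values) are invented for the sketch.

```python
import math


def pearson_cc(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


# Made-up experimental vs. predicted values, for illustration only.
measured = [0.5, 1.2, 1.9, 2.4]
predicted = [0.7, 1.0, 2.1, 2.0]
print(pearson_cc(measured, predicted), rmse(measured, predicted))
```

PCC rewards predictions that rank and scale with the measurements, while RMSE penalizes absolute deviation, which is why a study of this kind typically reports both.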
Affiliation(s)
- Jielu Yan
  - PAMI Research Group, Department of Computer and Information Science, University of Macau, Taipa, Macau, China
- Bob Zhang
  - PAMI Research Group, Department of Computer and Information Science, University of Macau, Taipa, Macau, China
- Mingliang Zhou
  - School of Computer Science, Chongqing University, Shapingba, Chongqing, China
- François-Xavier Campbell-Valois
  - Host-Microbe Interactions Laboratory, Center for Chemical and Synthetic Biology, Department of Chemistry and Biomolecular Sciences, University of Ottawa, Ottawa, Ontario, Canada
  - Centre for Infection, Immunity, and Inflammation, University of Ottawa, Ottawa, Ontario, Canada
  - Department of Biochemistry, Microbiology and Immunology, University of Ottawa, Ottawa, Ontario, Canada
- Shirley W. I. Siu
  - Institute of Science and Environment, University of Saint Joseph, Macau, China
15
Jain A, Jain T, Mishra GK, Chandrakar K, Mukherjee K, Tiwari SP. Molecular characterization, putative structure and function, and expression profile of OAS1 gene in the endometrium of goats (Capra hircus). Reprod Biol 2023; 23:100760. [PMID: 37023663 DOI: 10.1016/j.repbio.2023.100760] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/04/2022] [Revised: 02/18/2023] [Accepted: 03/16/2023] [Indexed: 04/07/2023]
Abstract
An interferon-inducible gene, 2'-5'-oligoadenylate synthetase-1 (OAS1), plays an essential role in uterine receptivity and conceptus development by controlling cell growth and differentiation, in addition to its anti-viral activity. Because the OAS1 gene has not yet been studied in caprine (cp) species, the present study was designed to amplify, sequence, characterize, and analyze in silico the coding sequence of cpOAS1. Further, the expression profile of cpOAS1 in the endometrium of pregnant and cyclic does was determined by quantitative real-time PCR and western blot. An 890 bp fragment of cpOAS1 was amplified and sequenced. The nucleotide and deduced amino acid sequences showed 99.6-72.3% identity with those of ruminants and non-ruminants. A constructed phylogenetic tree revealed that Ovis aries and Capra hircus differ from large ungulates. Various post-translational modification (PTM) sites (21 phosphorylation and two sumoylation sites), eight cysteines, and 14 immunogenic sites were found in cpOAS1. The OAS1_C domain, which is associated with anti-viral enzymatic activity, cell growth, and differentiation, was found in cpOAS1. The proteins interacting with cpOAS1 include Mx1 and ISG17, well-known proteins that have anti-viral activity and play an important role during early pregnancy in ruminants. CpOAS1 protein (42/46 kDa and/or 69/71 kDa) was detected in the endometrium of pregnant and cyclic does. Both cpOAS1 mRNA and protein were expressed at higher levels (P<0.05) in the endometrium of pregnant does than in that of cyclic does. In conclusion, the cpOAS1 sequence is similar in structure, and probably also in function, to that of other species, and its expression is higher during early pregnancy.
Affiliation(s)
- Asit Jain
  - Molecular Genetics Laboratory, Department of Animal Genetics and Breeding, College of Veterinary Science and Animal Husbandry, Dau Shri Vasudev Chandrakar Kamdhenu Vishwavidyalaya (DSVCKV), Anjora, Durg, Chhattisgarh, India
- Tripti Jain
  - Molecular Genetics Laboratory, Department of Animal Genetics and Breeding, College of Veterinary Science and Animal Husbandry, Dau Shri Vasudev Chandrakar Kamdhenu Vishwavidyalaya (DSVCKV), Anjora, Durg, Chhattisgarh, India
- Girish Kumar Mishra
  - Molecular Genetics Laboratory, Department of Animal Genetics and Breeding, College of Veterinary Science and Animal Husbandry, Dau Shri Vasudev Chandrakar Kamdhenu Vishwavidyalaya (DSVCKV), Anjora, Durg, Chhattisgarh, India
- Khushboo Chandrakar
  - Molecular Genetics Laboratory, Department of Animal Genetics and Breeding, College of Veterinary Science and Animal Husbandry, Dau Shri Vasudev Chandrakar Kamdhenu Vishwavidyalaya (DSVCKV), Anjora, Durg, Chhattisgarh, India
- Kishore Mukherjee
  - Molecular Genetics Laboratory, Department of Animal Genetics and Breeding, College of Veterinary Science and Animal Husbandry, Dau Shri Vasudev Chandrakar Kamdhenu Vishwavidyalaya (DSVCKV), Anjora, Durg, Chhattisgarh, India
- Sita Prasad Tiwari
  - Molecular Genetics Laboratory, Department of Animal Genetics and Breeding, College of Veterinary Science and Animal Husbandry, Dau Shri Vasudev Chandrakar Kamdhenu Vishwavidyalaya (DSVCKV), Anjora, Durg, Chhattisgarh, India
16
Choudhury N, Sahu TK, Rao AR, Rout AK, Behera BK. An Improved Machine Learning-Based Approach to Assess the Microbial Diversity in Major North Indian River Ecosystems. Genes (Basel) 2023; 14:genes14051082. [PMID: 37239442 DOI: 10.3390/genes14051082] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 03/28/2023] [Revised: 05/08/2023] [Accepted: 05/12/2023] [Indexed: 05/28/2023] Open
Abstract
The rapidly evolving high-throughput sequencing (HTS) technologies generate voluminous genomic and metagenomic sequences, which can help classify the microbial communities with high accuracy in many ecosystems. Conventionally, the rule-based binning techniques are used to classify the contigs or scaffolds based on either sequence composition or sequence similarity. However, the accurate classification of the microbial communities remains a major challenge due to massive data volumes at hand as well as a requirement of efficient binning methods and classification algorithms. Therefore, we attempted here to implement iterative K-Means clustering for the initial binning of metagenomics sequences and applied various machine learning algorithms (MLAs) to classify the newly identified unknown microbes. The cluster annotation was achieved through the BLAST program of NCBI, which resulted in the grouping of assembled scaffolds into five classes, i.e., bacteria, archaea, eukaryota, viruses and others. The annotated cluster sequences were used to train machine learning algorithms (MLAs) to develop prediction models to classify unknown metagenomic sequences. In this study, we used metagenomic datasets of samples collected from the Ganga (Kanpur and Farakka) and the Yamuna (Delhi) rivers in India for clustering and training the MLA models. Further, the performance of MLAs was evaluated by 10-fold cross validation. The results revealed that the developed model based on the Random Forest had a superior performance compared to the other considered learning algorithms. The proposed method can be used for annotating the metagenomic scaffolds/contigs being complementary to existing methods of metagenomic data analysis. An offline predictor source code with the best prediction model is available at (https://github.com/Nalinikanta7/metagenomics).
Affiliation(s)
- Nalinikanta Choudhury
  - ICAR-Indian Agricultural Research Institute, New Delhi 110012, India
  - ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
- Tanmaya Kumar Sahu
  - ICAR-Indian Grassland and Fodder Research Institute, Jhansi 284003, India
- Atmakuri Ramakrishna Rao
  - ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
  - Indian Council of Agricultural Research (ICAR), New Delhi 110001, India
- Ajaya Kumar Rout
  - ICAR-Central Inland Fisheries Research Institute, West Bengal 700120, India
  - Rani Lakshmi Bai Central Agricultural University, Jhansi 284003, India
- Bijay Kumar Behera
  - ICAR-Central Inland Fisheries Research Institute, West Bengal 700120, India
  - Rani Lakshmi Bai Central Agricultural University, Jhansi 284003, India
17
Wang M, Yan L, Jia J, Lai J, Zhou H, Yu B. DE-MHAIPs: Identification of SARS-CoV-2 phosphorylation sites based on differential evolution multi-feature learning and multi-head attention mechanism. Comput Biol Med 2023; 160:106935. [PMID: 37120990 PMCID: PMC10140648 DOI: 10.1016/j.compbiomed.2023.106935] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2023] [Revised: 03/12/2023] [Accepted: 04/13/2023] [Indexed: 05/02/2023]
Abstract
The rapid spread of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) around the world affects the normal lives of people all over the world. The computational methods can be used to accurately identify SARS-CoV-2 phosphorylation sites. In this paper, a new prediction model of SARS-CoV-2 phosphorylation sites, called DE-MHAIPs, is proposed. First, we use six feature extraction methods to extract protein sequence information from different perspectives. For the first time, we use a differential evolution (DE) algorithm to learn individual feature weights and fuse multi-information in a weighted combination. Next, Group LASSO is used to select a subset of good features. Then, the important protein information is given higher weight through multi-head attention. After that, the processed data is fed into long short-term memory network (LSTM) to further enhance model's ability to learn features. Finally, the data from LSTM are input into fully connected neural network (FCN) to predict SARS-CoV-2 phosphorylation sites. The AUC values of the S/T and Y datasets under 5-fold cross-validation reach 91.98% and 98.32%, respectively. The AUC values of the two datasets on the independent test set reach 91.72% and 97.78%, respectively. The experimental results show that the DE-MHAIPs method exhibits excellent predictive ability compared with other methods.
Affiliation(s)
- Minghui Wang
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China
- Lu Yan
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China
- Jihua Jia
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China
- Jiali Lai
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China
- Hongyan Zhou
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China
- Bin Yu
  - College of Information Science and Technology, School of Data Science, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Data Science, University of Science and Technology of China, Hefei, 230027, China
18
Özdilek AS, Atakan A, Özsarı G, Acar A, Atalay MV, Doğan T, Rifaioğlu AS. ProFAB-open protein functional annotation benchmark. Brief Bioinform 2023; 24:7025464. [PMID: 36736370 DOI: 10.1093/bib/bbac627] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2022] [Revised: 11/12/2022] [Accepted: 12/25/2022] [Indexed: 02/05/2023] Open
Abstract
As the number of protein sequences increases in biological databases, computational methods are required to provide accurate functional annotation with high coverage. Although several machine learning methods have been proposed for this purpose, there are still two main issues: (i) construction of reliable positive and negative training and validation datasets, and (ii) fair evaluation of their performances based on predefined experimental settings. To address these issues, we have developed ProFAB: Open Protein Functional Annotation Benchmark, which is a platform providing an infrastructure for a fair comparison of protein function prediction methods. ProFAB provides filtered and preprocessed protein annotation datasets and enables the training and evaluation of function prediction methods via several options. We believe that ProFAB will be useful for both computational and experimental researchers by enabling the utilization of ready-to-use datasets and machine learning algorithms for protein function prediction based on Gene Ontology terms and Enzyme Commission numbers. ProFAB is available at https://github.com/kansil/ProFAB and https://profab.kansil.org.
Affiliation(s)
- A Samet Özdilek
  - Department of Health Informatics, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
- Ahmet Atakan
  - Department of Computer Engineering, Middle East Technical University, Ankara, Turkey
  - Department of Computer Engineering, Erzincan Binali Yıldırım University, Erzincan, Turkey
- Gökhan Özsarı
  - Department of Computer Engineering, Middle East Technical University, Ankara, Turkey
  - Department of Computer Engineering, Niğde Ömer Halisdemir University, Niğde, Turkey
- Aybar Acar
  - Department of Health Informatics, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
- M Volkan Atalay
  - Department of Computer Engineering, Middle East Technical University, Ankara, Turkey
- Tunca Doğan
  - Department of Computer Engineering and Artificial Intelligence Engineering, Hacettepe University, Ankara, Turkey
  - Department of Bioinformatics, Graduate School of Health Sciences, Hacettepe University, Ankara, Turkey
- Ahmet S Rifaioğlu
  - Department of Electrical-Electronics Engineering, İskenderun Technical University, Hatay, Turkey
  - Institute for Computational Biomedicine, Faculty of Medicine, Heidelberg University and Heidelberg University Hospital, Heidelberg, Germany
19
Li M, Shi W, Zhang F, Zeng M, Li Y. A Deep Learning Framework for Predicting Protein Functions With Co-Occurrence of GO Terms. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:833-842. [PMID: 35476573 DOI: 10.1109/tcbb.2022.3170719] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 05/04/2023]
Abstract
The understanding of protein functions is critical to many biological problems such as the development of new drugs and new crops. To reduce the huge gap between the increase of protein sequences and annotations of protein functions, many methods have been proposed to deal with this problem. These methods use Gene Ontology (GO) to classify the functions of proteins and consider one GO term as a class label. However, they ignore the co-occurrence of GO terms that is helpful for protein function prediction. We propose a new deep learning model, named DeepPFP-CO, which uses Graph Convolutional Network (GCN) to explore and capture the co-occurrence of GO terms to improve the protein function prediction performance. In this way, we can further deduce the protein functions by fusing the predicted propensity of the center function and its co-occurrence functions. We use Fmax and AUPR to evaluate the performance of DeepPFP-CO and compare DeepPFP-CO with state-of-the-art methods such as DeepGOPlus and DeepGOA. The computational results show that DeepPFP-CO outperforms DeepGOPlus and other methods. Moreover, we further analyze our model at the protein level. The results have demonstrated that DeepPFP-CO improves the performance of protein function prediction. DeepPFP-CO is available at https://csuligroup.com/DeepPFP/.
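The Fmax metric named in this abstract is commonly computed in the protein-centric way popularized by the CAFA challenges: for each score threshold, precision is averaged over proteins that receive at least one prediction, recall is averaged over all proteins, and the best resulting F-score is reported. The sketch below follows that general definition and is not DeepPFP-CO's evaluation code; the function name, the 0.01 threshold grid, and the toy GO-like term labels are assumptions, and every protein is assumed to have at least one true term.

```python
def fmax(true_terms, scores, thresholds=None):
    """Protein-centric Fmax.

    true_terms: dict mapping protein id -> set of true terms.
    scores:     dict mapping protein id -> dict of term -> predicted score.
    """
    if thresholds is None:
        thresholds = [t / 100 for t in range(1, 101)]  # assumed grid
    best = 0.0
    for tau in thresholds:
        precisions, recalls = [], []
        for protein, truth in true_terms.items():
            predicted = {term for term, s in scores.get(protein, {}).items()
                         if s >= tau}
            tp = len(predicted & truth)
            if predicted:  # precision only over proteins with predictions
                precisions.append(tp / len(predicted))
            recalls.append(tp / len(truth) if truth else 0.0)
        if not precisions:
            continue  # nothing predicted at this threshold
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best


# Toy annotations with made-up term labels.
truth = {"P1": {"GO:a", "GO:b"}, "P2": {"GO:c"}}
preds = {"P1": {"GO:a": 0.9, "GO:b": 0.4, "GO:c": 0.2},
         "P2": {"GO:c": 0.8, "GO:a": 0.3}}
print(fmax(truth, preds))
```

AUPR, the other metric mentioned, would sweep the same thresholds and integrate precision over recall instead of taking the maximum F-score.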
20
Liu J, Tang X, Guan X. Grain protein function prediction based on self-attention mechanism and bidirectional LSTM. Brief Bioinform 2023; 24:6886418. [PMID: 36567619 DOI: 10.1093/bib/bbac493] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2022] [Revised: 10/13/2022] [Accepted: 10/18/2022] [Indexed: 12/27/2022] Open
Abstract
With the development of genome sequencing technology, using computational methods to predict grain protein function has become an important task in bioinformatics. The experimental dataset comprises protein data from four grains: soybean, maize, indica rice, and japonica rice. In this paper, a novel neural network algorithm, Chemical-SA-BiLSTM, is proposed for grain protein function prediction. The algorithm fuses the chemical properties of proteins with their amino acid sequences and combines the self-attention mechanism with a bidirectional Long Short-Term Memory network. The experimental results show that Chemical-SA-BiLSTM outperforms other classical neural network algorithms and predicts protein function more accurately, demonstrating its effectiveness for grain protein function prediction. The source code of our method is available at https://github.com/HwaTong/Chemical-SA-BiLSTM.
Affiliation(s)
- Jing Liu
  - College of Information Engineering, Shanghai Maritime University, 201306, Shanghai, China
- Xinghua Tang
  - College of Information Engineering, Shanghai Maritime University, 201306, Shanghai, China
- Xiao Guan
  - School of Health Science and Engineering, University of Shanghai for Science and Technology, 200093, Shanghai, China
21
Ardern Z, Chakraborty S, Lenk F, Kaster AK. Elucidating the functional roles of prokaryotic proteins using big data and artificial intelligence. FEMS Microbiol Rev 2023; 47:fuad003. [PMID: 36725215 PMCID: PMC9960493 DOI: 10.1093/femsre/fuad003] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 08/12/2022] [Revised: 01/11/2023] [Accepted: 01/31/2023] [Indexed: 02/03/2023] Open
Abstract
Annotating protein sequences according to their biological functions is one of the key steps in understanding microbial diversity, metabolic potentials, and evolutionary histories. However, even in the best-studied prokaryotic genomes, not all proteins can be characterized by classical in vivo, in vitro, and/or in silico methods-a challenge rapidly growing alongside the advent of next-generation sequencing technologies and their enormous extension of 'omics' data in public databases. These so-called hypothetical proteins (HPs) represent a huge knowledge gap and hidden potential for biotechnological applications. Opportunities for leveraging the available 'Big Data' have recently proliferated with the use of artificial intelligence (AI). Here, we review the aims and methods of protein annotation and explain the different principles behind machine and deep learning algorithms including recent research examples, in order to assist both biologists wishing to apply AI tools in developing comprehensive genome annotations and computer scientists who want to contribute to this leading edge of biological research.
Affiliation(s)
- Zachary Ardern
- Institute for Biological Interfaces 5 (Institut für Biologische Grenzflächen IBG 5), Karlsruhe Institute of Technology (KIT), 76344 Eggenstein-Leopoldshafen, Germany
- Wellcome Trust Sanger Institute, Hinxton, Saffron Walden CB10 1RQ, United Kingdom
| | - Sagarika Chakraborty
- Institute for Biological Interfaces 5 (Institut für Biologische Grenzflächen IBG 5), Karlsruhe Institute of Technology (KIT), 76344 Eggenstein-Leopoldshafen, Germany
| | - Florian Lenk
- Institute for Biological Interfaces 5 (Institut für Biologische Grenzflächen IBG 5), Karlsruhe Institute of Technology (KIT), 76344 Eggenstein-Leopoldshafen, Germany
| | - Anne-Kristin Kaster
- Institute for Biological Interfaces 5 (Institut für Biologische Grenzflächen IBG 5), Karlsruhe Institute of Technology (KIT), 76344 Eggenstein-Leopoldshafen, Germany
| |
|
22
|
Shomali A, Vafaei Sadi MS, Bakhtiarizadeh MR, Aliniaeifard S, Trewavas A, Calvo P. Identification of intelligence-related proteins through a robust two-layer predictor. Commun Integr Biol 2022; 15:253-264. [DOI: 10.1080/19420889.2022.2143101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Open
Affiliation(s)
- Aida Shomali
- Department of Horticulture, College of Aburaihan, University of Tehran, Tehran, Iran
| | | | | | - Sasan Aliniaeifard
- Department of Horticulture, College of Aburaihan, University of Tehran, Tehran, Iran
| | - Anthony Trewavas
- School of Biological Sciences, Institute of Molecular Plant Science, University of Edinburgh, UK
| | - Paco Calvo
- Minimal Intelligence Lab, University of Murcia, Spain
| |
|
23
|
Sarker B, Khare N, Devignes MD, Aridhi S. Improving automatic GO annotation with semantic similarity. BMC Bioinformatics 2022; 23:433. [PMID: 36510133 PMCID: PMC9743508 DOI: 10.1186/s12859-022-04958-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2022] [Accepted: 09/19/2022] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND: Automatic functional annotation of proteins is an open research problem in bioinformatics. The growing number of protein entries in public databases, for example in UniProtKB, poses challenges in manual functional annotation. Manual annotation requires expert human curators to search and read related research articles, interpret the results, and assign the annotations to the proteins. Thus, it is a time-consuming and expensive process. Therefore, designing computational tools to perform automatic annotation leveraging the high quality manual annotations that already exist in UniProtKB/SwissProt is an important research problem. RESULTS: In this paper, we extend and adapt the GrAPFI (graph-based automatic protein function inference) (Sarker et al. in BMC Bioinform 21, 2020; Sarker et al., in: Proceedings of 7th international conference on complex networks and their applications, Cambridge, 2018) method for automatic annotation of proteins with gene ontology (GO) terms, renaming it GrAPFI-GO. The original GrAPFI method uses label propagation in a similarity graph where proteins are linked through the domains, families, and superfamilies that they share. Here, we also explore various types of similarity measures based on common neighbors in the graph. Moreover, GO terms are arranged in a hierarchical manner according to semantic parent-child relations. Therefore, we propose an efficient pruning and post-processing technique that integrates both semantic similarity and hierarchical relations between the GO terms. We produce experimental results comparing the GrAPFI-GO method with and without considering common neighbors similarity. We also test the performance of GrAPFI-GO and other annotation tools for GO annotation on a benchmark of proteins with and without the proposed pruning and post-processing procedure.
CONCLUSION: Our results show that the proposed semantic hierarchical post-processing potentially improves the performance of GrAPFI-GO and of other annotation tools as well. Thus, GrAPFI-GO provides an original, efficient, and reusable procedure that exploits the semantic relations among the GO terms to improve the automatic annotation of protein functions.
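The label-propagation idea described in this abstract can be illustrated with a small sketch. It is not the GrAPFI-GO implementation: the Jaccard weighting over shared domains, the function names, and the toy annotation labels are all assumptions made for illustration, and the pruning and semantic post-processing steps are not reproduced.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity between two sets of domain identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def propagate_go_terms(query_domains, annotated, min_score=0.0):
    """Score candidate terms for a query protein from annotated neighbours.

    `annotated` maps a protein id to a (domains, terms) pair. Each neighbour
    that shares at least one domain with the query votes for its own terms,
    weighted by the Jaccard similarity (an assumed edge weight). Scores are
    normalised by the total weight of all contributing neighbours.
    """
    scores = defaultdict(float)
    total_weight = 0.0
    for domains, terms in annotated.values():
        weight = jaccard(query_domains, domains)
        if weight == 0.0:
            continue  # no shared domain, so no edge in the similarity graph
        total_weight += weight
        for term in terms:
            scores[term] += weight
    if total_weight == 0.0:
        return {}
    normalised = {term: s / total_weight for term, s in scores.items()}
    return {term: s for term, s in normalised.items() if s > min_score}


# Toy data with hypothetical domain and term labels.
annotated = {
    "P1": ({"D1", "D2"}, {"term_a"}),
    "P2": ({"D2", "D3"}, {"term_b"}),
    "P3": ({"D9"}, {"term_c"}),
}
predictions = propagate_go_terms({"D1", "D2"}, annotated)
```

In this toy case only P1 and P2 share a domain with the query, so `term_c` receives no score at all.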
Affiliation(s)
- Bishnu Sarker
- CNRS, Inria, LORIA, University of Lorraine, 54000 Nancy, France
- Khulna University of Engineering and Technology, Khulna, Bangladesh
- School of Applied Computational Sciences, Meharry Medical College, Nashville, TN, USA
| | - Navya Khare
- CNRS, Inria, LORIA, University of Lorraine, 54000 Nancy, France
- International Institute of Information Technology, Hyderabad, India
| | | | - Sabeur Aridhi
- CNRS, Inria, LORIA, University of Lorraine, 54000 Nancy, France
| |
|
24
|
Gu X, Ding Y, Xiao P, He T. A GHKNN model based on the physicochemical property extraction method to identify SNARE proteins. Front Genet 2022; 13:935717. [PMID: 36506312 PMCID: PMC9727185 DOI: 10.3389/fgene.2022.935717] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2022] [Accepted: 11/02/2022] [Indexed: 11/24/2022] Open
Abstract
SNARE proteins are of great importance, and their loss of function can lead to a variety of diseases. SNARE proteins are known as membrane fusion proteins and are crucial for mediating vesicle fusion. An accurate method for identifying SNARE proteins is therefore needed. Through extensive experiments, we have developed a binary classification model based on the graph-regularized k-local hyperplane distance nearest neighbor (GHKNN) model. The model uses the physicochemical property extraction method to extract protein sequence features and the SMOTE method to upsample protein sequence features. The combination achieves the most accurate performance for identifying all protein sequences. Finally, we compare the model based on GHKNN binary classification with other classifiers and measure them using four different metrics: SN, SP, ACC, and MCC. In experiments, the model performs significantly better than other classifiers.
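The four metrics named in this abstract (SN, SP, ACC, and MCC) follow from the confusion matrix of a binary classifier. The sketch below uses their standard definitions; the function name and the example counts are illustrative assumptions, not values from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, accuracy, and MCC from raw counts."""
    total = tp + fp + tn + fn
    sn = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity (recall)
    sp = tn / (tn + fp) if (tn + fp) else 0.0   # specificity
    acc = (tp + tn) / total if total else 0.0   # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


# Hypothetical counts for an imbalanced test set.
metrics = binary_metrics(tp=40, fp=10, tn=90, fn=10)
```

MCC is often reported alongside accuracy on imbalanced data such as this because it uses all four cells of the confusion matrix.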
Affiliation(s)
- Xingyue Gu
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- *Correspondence: Pengfeng Xiao; Tao He; Yijie Ding
| | - Pengfeng Xiao
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, China
- *Correspondence: Pengfeng Xiao; Tao He; Yijie Ding
| | - Tao He
- Beidahuang Industry Group General Hospital, Harbin, China
- *Correspondence: Pengfeng Xiao; Tao He; Yijie Ding
| |
|
25
|
Zhao X, Zhai J, Liu T, Wang G. Ensemble classification based feature selection: a case of identification on plant pentatricopeptide repeat proteins. Brief Bioinform 2022; 23:6760138. [PMID: 36239380 DOI: 10.1093/bib/bbac369] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2022] [Revised: 07/20/2022] [Accepted: 08/05/2022] [Indexed: 12/14/2022] Open
Abstract
A variable selection framework has previously been proposed to identify plant pentatricopeptide repeat (PPR) proteins. It is an effective feature selection strategy that focuses on classification performance. Random forest has been used as the classifier, with certain variables automatically selected to discriminate between PPR functional and non-functional proteins. However, samples regarded as PPR functional proteins are wrongly classified at a high rate. In this paper, we improve the framework in order to achieve better classification results. Modifications are made to the framework to better identify PPR functional proteins. Instead of random forest, a hybrid ensemble classifier is built with its base classifiers derived from six different classification methods. In addition, an incremental strategy and a clustering by search in descending order are alternatively used for feature selection, which can effectively select the most representative variables for identification of PPR proteins. Furthermore, different base classifiers alternately play an important role in the ensemble classifier as the feature dimension increases. The experimental results demonstrate the effectiveness of our improvements.
Affiliation(s)
- Xudong Zhao
- College of Information and Computer Engineering, Northeast Forestry University, No. 26, Hexing Road, 150040, Heilongjiang Province, China
| | - Jingwen Zhai
- College of Information and Computer Engineering, Northeast Forestry University, No. 26, Hexing Road, 150040, Heilongjiang Province, China
| | - Tong Liu
- College of Information and Computer Engineering, Northeast Forestry University, No. 26, Hexing Road, 150040, Heilongjiang Province, China
| | - Guohua Wang
- College of Information and Computer Engineering, Northeast Forestry University, No. 26, Hexing Road, 150040, Heilongjiang Province, China.,State Key Laboratory of Tree Genetics and Breeding, Northeast Forestry University, No. 26, Hexing Road, 150040, Heilongjiang Province, China
| |
|
26
|
Chou HH, Hsu CT, Hsu CW, Yao KH, Wang HC, Hsieh SY. Novel Algorithm for Improved Protein Classification Using Graph Similarity. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3135-3143. [PMID: 34748498 DOI: 10.1109/tcbb.2021.3125836] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/13/2023]
Abstract
Considerable sequence data are produced in genome annotation projects that relate to molecular levels, structural similarities, and molecular and biological functions. In structural genomics, the most essential task involves resolving protein structures efficiently with hardware or software, understanding these structures, and assigning their biological functions. Understanding the characteristics and functions of proteins enables the exploration of the molecular mechanisms of life. In this paper, we examine the problems of protein classification. Because they perform similar biological functions, proteins in the same family usually share similar structural characteristics. We employed this premise in designing a classification algorithm. In this algorithm, auxiliary graphs are used to represent proteins, with every amino acid in a protein mapped to a vertex in a graph. Moreover, the links between amino acids correspond to the edges between the vertices. The proposed algorithm classifies proteins according to the similarities in their graphical structures. The proposed algorithm is efficient and accurate in distinguishing proteins from different families and outperformed related algorithms experimentally.
|
27
|
Liu S, Cui C, Chen H, Liu T. Ensemble learning-based feature selection for phosphorylation site detection. Front Genet 2022; 13:984068. [PMID: 36338976 PMCID: PMC9634105 DOI: 10.3389/fgene.2022.984068] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 07/01/2022] [Accepted: 10/05/2022] [Indexed: 11/18/2022] Open
Abstract
SARS-CoV-2 is prevalent all over the world, causing more than six million deaths and seriously affecting human health. At present, there is no specific drug against SARS-CoV-2. Protein phosphorylation is an important way to understand the mechanism of SARS-CoV-2 infection. It is often expensive and time-consuming to identify phosphorylation sites with specific modified residues through experiments. A method that uses machine learning to make predictions about them is proposed. As all the methods of extracting protein sequence features are knowledge-driven, these features may not be effective for detecting phosphorylation sites without a complete understanding of the mechanism of the protein. Moreover, redundant features also have a great impact on how well the model fits. To solve these problems, we propose a feature selection method based on ensemble learning, which first extracts protein sequence features based on knowledge, then quantifies the importance score of each feature based on data, and finally uses the subset of important features as the final features to predict phosphorylation sites.
Affiliation(s)
- Songbo Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Chengmin Cui
- Beijing Institute of Control Engineering, China Academy of Space Technology, Beijing, China
| | - Huipeng Chen
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- *Correspondence: Huipeng Chen
| | - Tong Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| |
|
28
|
Zhao J, Jiang H, Zou G, Lin Q, Wang Q, Liu J, Ma L. CNNArginineMe: A CNN structure for training models for predicting arginine methylation sites based on the One-Hot encoding of peptide sequence. Front Genet 2022; 13:1036862. [PMID: 36324513 PMCID: PMC9618650 DOI: 10.3389/fgene.2022.1036862] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 09/06/2022] [Accepted: 10/04/2022] [Indexed: 11/30/2022] Open
Abstract
Protein arginine methylation (PRme), a post-translational modification, plays a critical role in numerous cellular processes and regulates critical cellular functions. Though several in silico models for predicting PRme sites have been reported, new models may need to be developed because of the significant increase in identified PRme sites. In this study, we constructed multiple machine-learning and deep-learning models. The deep-learning CNN model combined with One-Hot encoding, dubbed CNNArginineMe, showed the best performance. CNNArginineMe performed best on the AUC metric in comparisons with several reported predictors. Additionally, we employed CNNArginineMe to predict the arginine methylation proteome and performed functional analysis. The arginine methylated proteome is significantly enriched in the amyotrophic lateral sclerosis (ALS) pathway. CNNArginineMe is freely available at https://github.com/guoyangzou/CNNArginineMe.
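One-hot encoding of a peptide, as mentioned in this abstract, turns each residue into a 20-dimensional indicator vector. The sketch below is a minimal illustration; the alphabet order, the all-zero rows for unknown or padding characters, and the example peptide are assumptions and may differ from what CNNArginineMe uses.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(peptide):
    """Encode a peptide as a list of 20-element 0/1 rows, one row per residue.

    Characters outside the 20 standard amino acids (for example 'X' or a
    padding symbol) are encoded as all-zero rows, which is an assumption.
    """
    matrix = []
    for residue in peptide.upper():
        row = [0] * len(AMINO_ACIDS)
        position = INDEX.get(residue)
        if position is not None:
            row[position] = 1
        matrix.append(row)
    return matrix


# Hypothetical window centred on an arginine (R).
encoded = one_hot_encode("GGRGX")
```

The resulting length-by-20 matrix is the kind of input a 1D convolutional layer can scan along the sequence axis.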
Affiliation(s)
- Jiaojiao Zhao
- Cancer Institute of the Affiliated Hospital of Qingdao University and Qingdao Cancer Institute, Qingdao University, Qingdao, China
- School of Basic Medicine, Qingdao University, Qingdao, China
| | - Haoqiang Jiang
- School of Basic Medicine, Qingdao University, Qingdao, China
| | - Guoyang Zou
- School of Basic Medicine, Qingdao University, Qingdao, China
| | - Qian Lin
- Cancer Institute of the Affiliated Hospital of Qingdao University and Qingdao Cancer Institute, Qingdao University, Qingdao, China
| | - Qiang Wang
- Oncology Department, Shandong Second Provincial General Hospital, Jinan, China
| | - Jia Liu
- Department of Pharmacology, School of Pharmacy, Qingdao University, Qingdao, China
| | - Leina Ma
- Cancer Institute of the Affiliated Hospital of Qingdao University and Qingdao Cancer Institute, Qingdao University, Qingdao, China
- *Correspondence: Leina Ma
| |
|
29
|
Bheemireddy S, Sandhya S, Srinivasan N, Sowdhamini R. Computational tools to study RNA-protein complexes. Front Mol Biosci 2022; 9:954926. [PMID: 36275618 PMCID: PMC9585174 DOI: 10.3389/fmolb.2022.954926] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2022] [Accepted: 09/20/2022] [Indexed: 11/19/2022] Open
Abstract
RNA is the key player in many cellular processes such as signal transduction, replication, transport, cell division, transcription, and translation. These diverse functions are accomplished through interactions of RNA with proteins. However, protein–RNA interactions are still poorly understood in contrast to protein–protein and protein–DNA interactions. This knowledge gap can be attributed to the limited availability of protein-RNA structures along with the experimental difficulties in studying these complexes. Recent progress in computational resources has expanded the number of tools available for studying protein-RNA interactions at various molecular levels. These include tools for predicting interacting residues from primary sequences, modelling protein-RNA complexes, predicting hotspots in these complexes, and gaining insight into the dynamics of their interactions. Each of these tools has its strengths and limitations, which makes it important to select an optimal approach for the question of interest. Here we present a mini review of computational tools to study different aspects of protein-RNA interactions, with a focus on overall application, development of the field, and future perspectives.
Affiliation(s)
- Sneha Bheemireddy
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore, India
| | - Sankaran Sandhya
- Department of Biotechnology, Faculty of Life and Allied Health Sciences, M.S. Ramaiah University of Applied Sciences, Bengaluru, India
- *Correspondence: Sankaran Sandhya; Ramanathan Sowdhamini
| | | | - Ramanathan Sowdhamini
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore, India
- National Centre for Biological Sciences, TIFR, GKVK Campus, Bangalore, India
- Institute of Bioinformatics and Applied Biotechnology, Bangalore, India
- *Correspondence: Sankaran Sandhya; Ramanathan Sowdhamini
| |
|
30
|
A novel deep learning-assisted hybrid network for plasmodium falciparum parasite mitochondrial proteins classification. PLoS One 2022; 17:e0275195. [PMID: 36201724 PMCID: PMC9536844 DOI: 10.1371/journal.pone.0275195] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2022] [Accepted: 09/12/2022] [Indexed: 11/18/2022] Open
Abstract
Plasmodium falciparum is a parasitic protozoan that can cause malaria, which is a deadly disease. Therefore, the accurate identification of malaria parasite mitochondrial proteins is essential for understanding their functions and identifying novel drug targets. For classifying protein sequences, several adaptive statistical techniques have been devised. Despite significant gains, prediction performance is still constrained by the lack of appropriate feature descriptors and learning strategies in current systems. Moreover, good ground truth data is important for Artificial Intelligence (AI)-based models, but such data is lacking in the literature. Therefore, in this work, we propose a novel hybrid network that combines a 1D Convolutional Neural Network (CNN) and a Bidirectional Gated Recurrent Unit (BGRU) to classify the malaria parasite mitochondrial proteins. Furthermore, we curate sequence data collected from the National Center for Biotechnology Information (NCBI) and UniProtKB/Swiss-Prot protein databanks to prepare a dataset that can be used by the research community to evaluate AI-based algorithms. We obtain 4204 cases after preprocessing the collected data and denote this set of proteins as PF4204. Finally, we conduct an ablation study on several conventional and deep models using the PF4204 and the benchmark PF2095 datasets. The proposed model 'CNN-BGRU' obtains accuracy values of 0.9096 and 0.9857 on the PF4204 and PF2095 datasets, respectively. In addition, CNN-BGRU is compared with state-of-the-art methods, where the results illustrate that it can extract robust features and identify proteins accurately.
|
31
|
Chen D, Li S, Chen Y. ISTRF: Identification of sucrose transporter using random forest. Front Genet 2022; 13:1012828. [PMID: 36171889 PMCID: PMC9511101 DOI: 10.3389/fgene.2022.1012828] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/05/2022] [Accepted: 08/22/2022] [Indexed: 12/05/2022] Open
Abstract
Sucrose transporter (SUT) is a type of transmembrane protein that exists widely in plants and plays a significant role in the transportation of sucrose and the specific signal sensing process of sucrose. Therefore, identifying sucrose transporters is significant to the study of seed development and plant flowering and growth. In this study, a random forest-based model named ISTRF was proposed to identify sucrose transporters. First, a database containing 382 SUT proteins and 911 non-SUT proteins was constructed based on the UniProt and PFAM databases. Second, k-separated-bigrams-PSSM was exploited to represent protein sequences. Third, the Borderline-SMOTE algorithm was used to counter the effect of imbalanced training data on identification performance. Finally, the random forest algorithm was used to train the identification model. Ten-fold cross-validation results showed that k-separated-bigrams-PSSM was the most distinguishable feature for identifying sucrose transporters. The Borderline-SMOTE algorithm can improve the performance of the identification model. Furthermore, random forest was superior to other classifiers on almost all indicators. Compared with other identification models, ISTRF has the best general performance and makes great improvements in identifying sucrose transporter proteins.
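The k-separated-bigrams-PSSM feature mentioned in this abstract can be sketched as follows. The formula used here, T[m][n] = sum over i of P[i][m] * P[i+k][n], is the definition commonly given for this descriptor and is an assumption about what ISTRF computes; any normalisation of the PSSM is not shown, and the toy matrix is purely illustrative.

```python
def k_separated_bigrams(pssm, k=1):
    """Flattened k-separated bigram features from a PSSM.

    `pssm` is a list of rows, one per residue position, each holding one
    score per amino acid column. Entry (m, n) of the feature matrix sums
    the product of column m at position i and column n at position i + k.
    A real PSSM has 20 columns, giving 400 features.
    """
    length = len(pssm)
    width = len(pssm[0]) if pssm else 0
    features = [[0.0] * width for _ in range(width)]
    for i in range(length - k):
        for m in range(width):
            left = pssm[i][m]
            if left == 0:
                continue  # a zero factor contributes nothing to this row
            for n in range(width):
                features[m][n] += left * pssm[i + k][n]
    return [value for row in features for value in row]


# Toy 3-position, 2-column matrix standing in for a PSSM.
toy_pssm = [[1, 0], [0, 1], [1, 1]]
gap_one = k_separated_bigrams(toy_pssm, k=1)
gap_two = k_separated_bigrams(toy_pssm, k=2)
```

Because the output length depends only on the number of columns, sequences of different lengths map to feature vectors of the same size, which is what a random forest classifier needs.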
Affiliation(s)
- Dong Chen
- College of Electrical and Information Engineering, Qu Zhou University, Quzhou, China
| | - Sai Li
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Yu Chen
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| |
|
32
|
Kha QH, Ho QT, Le NQK. Identifying SNARE Proteins Using an Alignment-Free Method Based on Multiscan Convolutional Neural Network and PSSM Profiles. J Chem Inf Model 2022; 62:4820-4826. [PMID: 36166351 PMCID: PMC9554904 DOI: 10.1021/acs.jcim.2c01034] [Citation(s) in RCA: 33] [Impact Index Per Article: 16.5] [Indexed: 12/03/2022]
Abstract
Background: SNARE proteins play a vital role in membrane fusion and in cellular physiological and pathological processes. Many potential therapeutics for mental diseases or even cancer based on SNAREs are also being developed. Therefore, there is a dire need to predict SNAREs for further manipulation of these essential proteins, which demands new and efficient approaches. Methods: Some computational frameworks have been proposed to overcome the drawbacks of biological methods, which take considerable time and budget to identify SNAREs. However, the performance of existing frameworks was unsatisfactory, as they failed to retain the SNARE sequence order and capture the many hidden features of SNAREs. This paper proposes a novel model built on the multiscan convolutional neural network (CNN) and position-specific scoring matrix (PSSM) profiles to address these limitations. We trained and evaluated our model on the benchmark dataset with fivefold cross-validation and two different independent datasets. Results: Overall, the multiscan CNN was cross-validated on the training set and excelled in SNARE classification, reaching 0.963 in AUC and 0.955 in AUPRC. In addition, with sensitivity, specificity, accuracy, and MCC of 0.842, 0.968, 0.955, and 0.767, respectively, our proposed framework outperformed previous models in the SNARE recognition task. Conclusions: We believe that our model can contribute to discriminating SNARE proteins from general proteins.
Affiliation(s)
- Quang-Hien Kha
- International Master/Ph.D. Program in Medicine, College of Medicine, Taipei Medical University, Taipei 110, Taiwan
| | - Quang-Thai Ho
- College of Information & Communication Technology, Can Tho University, Can Tho 90000, Viet Nam.,Department of Computer Science and Engineering, Yuan Ze University, Chung-Li 32003, Taiwan
| | - Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei 106, Taiwan.,Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei 106, Taiwan.,Translational Imaging Research Center, Taipei Medical University Hospital, Taipei 110, Taiwan
| |
|
33
|
Kha QH, Tran TO, Nguyen TTD, Nguyen VN, Than K, Le NQK. An interpretable deep learning model for classifying adaptor protein complexes from sequence information. Methods 2022; 207:90-96. [PMID: 36174933 DOI: 10.1016/j.ymeth.2022.09.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2022] [Revised: 08/19/2022] [Accepted: 09/22/2022] [Indexed: 11/15/2022] Open
Abstract
Adaptor proteins (APs) are a family of proteins that aid in intracellular membrane trafficking, and their impairments or defects are closely related to various disorders. Traditional methods to identify and classify APs require time and complex techniques, and have since been complemented by machine learning and computational approaches that facilitate the AP recognition task. However, most studies focused on recognizing individual members of the AP family or on separating APs in general from non-APs, lacking a comprehensive strategy to distinguish the complexes of AP subtypes. Herein, we propose a novel method for a new task, discriminating the AP complexes within the AP family, using an interpretable deep neural network architecture on sequence-based encoding features. This work also introduces a benchmark data set of AP complexes originating from the UniProt and GeneOntology databases. To assess the robustness of our proposed method, we compared our performance to various machine learning algorithms and feature extraction strategies. Furthermore, the model's predictions were interpreted using t-distributed stochastic neighbor embedding (t-SNE), uniform manifold approximation and projection (UMAP), and SHapley Additive exPlanations (SHAP) analysis to show the distribution of AP complexes on optimal features. The promising performance of our architecture can assist scientists not only in distinguishing AP complexes but also in analyzing protein sequences in general. Moreover, we have made our work publicly available on GitHub at https://github.com/khanhlee/adaptor-dnn.
Affiliation(s)
- Quang-Hien Kha
- International Master/Ph.D. Program in Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan
| | - Thi-Oanh Tran
- International Ph.D. Program for Cell Therapy and Regeneration Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan
| | - Trinh-Trung-Duong Nguyen
- Personalised Medicine Cluster, Department of Drug Design and Pharmacology, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
| | - Van-Nui Nguyen
- University of Information and Communication Technology, Thai Nguyen University, Thai Nguyen, Viet Nam
| | - Khoat Than
- School of Information and Communication Technology, Hanoi University of Science and Technology, Viet Nam
| | - Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei 106, Taiwan; Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei 106, Taiwan; Translational Imaging Research Center, Taipei Medical University Hospital, Taipei 110, Taiwan.
| |
|
34
|
MethEvo: an accurate evolutionary information-based methylation site predictor. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-07738-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/14/2022]
|
35
|
Prediction of anti-inflammatory peptides by a sequence-based stacking ensemble model named AIPStack. iScience 2022; 25:104967. [PMID: 36093066 PMCID: PMC9449674 DOI: 10.1016/j.isci.2022.104967] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2022] [Revised: 08/09/2022] [Accepted: 08/12/2022] [Indexed: 11/23/2022] Open
Abstract
Accurate and efficient identification of anti-inflammatory peptides (AIPs) is crucial for the treatment of inflammation. Here, we proposed a two-layer stacking ensemble model, AIPStack, to effectively predict AIPs. At first, we constructed a new dataset for model building and validation. Then, peptide sequences were represented by hybrid features, which were fused by two amino acid composition descriptors. Next, the stacking ensemble model was constructed by random forest and extremely randomized tree as the base-classifiers and logistic regression as the meta-classifier to receive the outputs from the base-classifiers. AIPStack achieved an AUC of 0.819, accuracy of 0.755, and MCC of 0.510 on the independent set 3, which were higher than other AIP predictors. Furthermore, the essential sequence features were highlighted by the Shapley Additive exPlanation (SHAP) method. It is anticipated that AIPStack could be used for AIP prediction in a high-throughput manner and facilitate the hypothesis-driven experimental design. AIPStack model was developed for the prediction of anti-inflammatory peptides. The hybrid features were used to describe the peptide sequences. The proposed model AIPStack outperformed existing ones. SHAP was used to highlight the essential features required for AIP prediction.
|
36
|
Canzler S, Fischer M, Ulbricht D, Ristic N, Hildebrand PW, Staritzbichler R. ProteinPrompt: a webserver for predicting protein-protein interactions. BIOINFORMATICS ADVANCES 2022; 2:vbac059. [PMID: 36699419 PMCID: PMC9710678 DOI: 10.1093/bioadv/vbac059] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/21/2021] [Revised: 07/19/2022] [Accepted: 08/14/2022] [Indexed: 01/28/2023]
Abstract
Motivation: Protein-protein interactions (PPIs) play an essential role in a great variety of cellular processes and are therefore of significant interest for the design of new therapeutic compounds as well as the identification of side effects due to unexpected binding. Here, we present ProteinPrompt, a webserver that uses machine learning algorithms to calculate specific, currently unknown PPIs. Our tool is designed to quickly and reliably predict contact propensities based on an input sequence in order to scan large sequence libraries for potential binding partners, with the goal of accelerating and assuring the quality of the laborious process of drug target identification.
Results: We collected and thoroughly filtered a comprehensive database of known binders from several sources, which is available for download. ProteinPrompt provides two complementary search methods of similar accuracy for comparison and consensus building. The default method is a random forest (RF) algorithm that uses the auto-correlations of seven amino acid scales. Alternatively, a graph neural network (GNN) implementation can be selected. Additionally, a consensus prediction is available. For each query sequence, potential binding partners are identified from a protein sequence database. The proteomes of several organisms are available and can be searched for binders. To evaluate the predictive power of the algorithms, we prepared a test dataset that was rigorously filtered for redundancy. No sequence pairs similar to the ones used for training were included in this dataset. With this challenging dataset, the RF method achieved an accuracy rate of 0.88 and an area under the curve of 0.95. The GNN achieved an accuracy rate of 0.86 using the same dataset. Since the underlying learning approaches are unrelated, comparing the results of RF and GNNs reduces the likelihood of errors. The consensus reached an accuracy of 0.89.
Availability and implementation: ProteinPrompt is available online at: http://proteinformatics.org/ProteinPrompt, where training and test data used to optimize the methods are also available. The server makes it possible to scan the human proteome for potential binding partners of an input sequence within minutes. For local offline usage, we furthermore created a ProteinPrompt Docker image which allows for batch submission: https://gitlab.hzdr.de/proteinprompt/ProteinPrompt. In conclusion, we offer a fast, accurate, easy-to-use online service for predicting binding partners from an input sequence.
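The abstract does not spell out how the auto-correlation features are computed. The sketch below assumes the common auto-covariance form over a single per-residue scale; the Kyte-Doolittle hydropathy values stand in for one of the seven scales (which are not named here), and the function name `auto_covariance` and the `max_lag` parameter are hypothetical.

```python
# Kyte-Doolittle hydropathy values, used here only as an example scale
# (assumption: the scales actually used by ProteinPrompt are not listed above).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def auto_covariance(seq, scale=KD, max_lag=3):
    """Return [AC(1), ..., AC(max_lag)] of one amino acid scale along seq."""
    values = [scale[aa] for aa in seq]
    n = len(values)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    mean = sum(values) / n
    centered = [v - mean for v in values]
    # Average product of mean-centered values that are `lag` residues apart.
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]


features = auto_covariance("MKTAYIAKQRQISFVKSHFSRQ", max_lag=3)
print(len(features))  # one value per lag
```

Repeating this for several scales and concatenating the results gives a fixed-length vector regardless of sequence length, which is what a random forest needs as input.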
Affiliation(s)
- David Ulbricht: Institute of Medical Physics and Biophysics, University of Leipzig, 04107 Leipzig, Germany
- Nikola Ristic: Institute of Medical Physics and Biophysics, University of Leipzig, 04107 Leipzig, Germany
- Peter W Hildebrand: Institute of Medical Physics and Biophysics, University of Leipzig, 04107 Leipzig, Germany; Charité – Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Institute of Medical Physics and Biophysics, 10117 Berlin, Germany; Berlin Institute of Health at Charité – Universitätsmedizin Berlin, 10117 Berlin, Germany
37
Chen Y, Li S, Guo J. A method for identifying moonlighting proteins based on linear discriminant analysis and bagging-SVM. Front Genet 2022; 13:963349. [PMID: 36046247 PMCID: PMC9420859 DOI: 10.3389/fgene.2022.963349] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2022] [Accepted: 07/18/2022] [Indexed: 11/13/2022]
Abstract
Moonlighting proteins have at least two independent functions and are widely found in animals, plants and microorganisms. They play important roles in signal transduction, cell growth and movement, tumor inhibition, DNA synthesis and repair, and metabolism of biological macromolecules. Moonlighting proteins are difficult to find through biological experiments, so many researchers identify them through bioinformatics methods, but the accuracy of these methods is relatively low. Therefore, we propose a new method. In this study, we select SVMProt-188D as the feature input, apply a model combining linear discriminant analysis with basic machine learning classifiers to study moonlighting proteins, and perform a bagging ensemble on the best-performing classifier, the support vector machine. With this approach, moonlighting proteins are identified accurately and efficiently. The model achieves an accuracy of 93.26% and an F-score of 0.946 on the MPFit dataset, which is better than the existing MEL-MP model. It also achieves good results on two other moonlighting protein datasets.
38
Mansoor M, Nauman M, Rehman HU, Omar M. Gene Ontology Capsule GAN: an improved architecture for protein function prediction. PeerJ Comput Sci 2022; 8:e1014. [PMID: 36092003 PMCID: PMC9454774 DOI: 10.7717/peerj-cs.1014] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/11/2022] [Accepted: 05/31/2022] [Indexed: 06/15/2023]
Abstract
Proteins are the core of all functions pertaining to living things. They consist of an extended amino acid chain folding into a three-dimensional shape that dictates their behavior. Currently, convolutional neural networks (CNNs) have been pivotal in predicting protein functions based on protein sequences. While this technology is crucial to the niche, the computation cost and translational invariance associated with CNNs make it impossible to detect spatial hierarchies between complex and simpler objects. Therefore, this research utilizes capsule networks, rather than CNNs, to capture spatial information. Since capsule networks focus on hierarchical links, they have a lot of potential for solving structural biology challenges. In comparison to standard CNNs, our results exhibit an improvement in accuracy. Gene Ontology Capsule GAN (GOCAPGAN) achieved an F1 score of 82.6%, a precision score of 90.4% and a recall score of 76.1%.
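As a quick consistency check on the three reported metrics, the F1 score is the harmonic mean of precision and recall. The sketch below only re-derives the F1 value from the precision and recall quoted in the abstract; the helper name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for GOCAPGAN.
f1 = f1_score(0.904, 0.761)
print(round(100 * f1, 1))  # prints 82.6
```

The result agrees with the reported F1 score of 82.6%.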
39
Gao Z, Xia R, Zhang P. Prediction of anti-proliferation effect of [1,2,3]triazolo[4,5-d]pyrimidine derivatives by random forest and mix-kernel function SVM with PSO. Chem Pharm Bull (Tokyo) 2022; 70:684-693. [PMID: 35922903 DOI: 10.1248/cpb.c22-00376] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 11/22/2022]
Abstract
To predict the anti-gastric cancer effect of [1,2,3]triazolo[4,5-d]pyrimidine derivatives (1,2,3-TPD), quantitative structure-activity relationship (QSAR) studies were performed. Based on five descriptors selected from a descriptor pool, four QSAR models were established by the heuristic method (HM), random forest (RF), support vector machine with radial basis kernel function (RBF-SVM), and mix-kernel function support vector machine (MIX-SVM), which combines radial basis and polynomial kernel functions. Furthermore, the model built by RF explained the importance of the descriptors selected by HM. Compared with RBF-SVM, the MIX-SVM enhanced both the generalization and the learning ability of the constructed model, and its multi-parameter optimization problem was solved by the particle swarm optimization (PSO) algorithm with very low complexity and fast convergence. In addition, leave-one-out cross validation (LOO-CV) was adopted to test the robustness of the models, and Q² was used to describe the results. The MIX-SVM model showed the best prediction ability and strongest model robustness: R² = 0.927, Q² = 0.916, MSE = 0.027 for the training set and R² = 0.946, Q² = 0.913, MSE = 0.023 for the test set. This study reveals five key descriptors of 1,2,3-TPD and will help to screen for efficient and novel drugs in the future.
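The abstract says only that the mixed kernel includes a radial basis kernel and a polynomial kernel. A minimal sketch, assuming the usual convex combination of the two; the function names and the parameters `weight`, `gamma`, `degree` and `coef0` are hypothetical, and they are the kind of values an optimizer such as PSO would search over.

```python
import math


def rbf_kernel(x, y, gamma):
    """Radial basis kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def poly_kernel(x, y, degree, coef0):
    """Polynomial kernel: (<x, y> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree


def mix_kernel(x, y, weight=0.5, gamma=0.1, degree=2, coef0=1.0):
    """Assumed form: weight * RBF + (1 - weight) * polynomial."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return (weight * rbf_kernel(x, y, gamma)
            + (1.0 - weight) * poly_kernel(x, y, degree, coef0))


print(mix_kernel([1.0, 2.0], [1.0, 2.0]))  # prints 18.5
```

A convex combination of two valid kernels is itself a valid kernel, so the mixture can be passed to any SVM solver that accepts a custom kernel function.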
Affiliation(s)
- Zhan Gao: College of Computer Science and Technology, Qingdao University
- Runze Xia: College of Computer Science and Technology, Qingdao University
- Peijian Zhang: College of Computer Science and Technology, Qingdao University
40
Zou H, Yang F, Yin Z. Integrating multiple sequence features for identifying anticancer peptides. Comput Biol Chem 2022; 99:107711. [DOI: 10.1016/j.compbiolchem.2022.107711] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2021] [Revised: 05/16/2022] [Accepted: 05/29/2022] [Indexed: 11/03/2022]
41
Liu S, Cui C, Chen H, Liu T. Ensemble Learning-Based Feature Selection for Phage Protein Prediction. Front Microbiol 2022; 13:932661. [PMID: 35910662 PMCID: PMC9335128 DOI: 10.3389/fmicb.2022.932661] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2022] [Accepted: 06/14/2022] [Indexed: 11/14/2022]
Abstract
Phages are highly specific in recognizing their hosts. As natural enemies of bacteria, they have been used many times to treat super bacteria. Identifying phage proteins from the original sequence is very important for understanding the relationship between phages and host bacteria and for developing new antimicrobial agents. However, traditional experimental methods are both expensive and time-consuming. In this study, an ensemble learning-based feature selection method is proposed to find important features for phage protein identification. The method uses four types of protein sequence-derived features, quantifies the importance of each feature by perturbing it and measuring the effect on the results, and finally splices together the important features from the four types. In addition, we analyzed the selected features and their biological significance.
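The perturbation step described above can be illustrated with permutation importance, one common way to perturb a feature. The sketch below is an assumption about the general technique, not the authors' procedure: the toy model, the data, and the names `accuracy` and `perturbation_importance` are all hypothetical.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def perturbation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # perturb feature j only
            perturbed = [row[:j] + [v] + row[j + 1:]
                         for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, perturbed, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy classifier that looks only at feature 0 (illustrative, not from the paper).
def toy_model(row):
    return int(row[0] > 0.5)


X = [[0.1, 0.9], [0.2, 0.1], [0.8, 0.3], [0.9, 0.7], [0.3, 0.5], [0.7, 0.2]]
y = [0, 0, 1, 1, 0, 1]
scores = perturbation_importance(toy_model, X, y)
print(scores)
```

Feature 1 is never read by the toy model, so shuffling it cannot change the accuracy and its importance is exactly zero; features with importance near zero are the ones a selection step would drop.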
Affiliation(s)
- Songbo Liu: School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Chengmin Cui: Beijing Institute of Control Engineering, China Academy of Space Technology, Beijing, China
- Huipeng Chen (corresponding author): School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Tong Liu: School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
42
Indriani F, Mahmudah KR, Purnama B, Satou K. ProtTrans-Glutar: Incorporating Features From Pre-trained Transformer-Based Models for Predicting Glutarylation Sites. Front Genet 2022; 13:885929. [PMID: 35711929 PMCID: PMC9194472 DOI: 10.3389/fgene.2022.885929] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 02/28/2022] [Accepted: 04/26/2022] [Indexed: 11/16/2022]
Abstract
Lysine glutarylation is a post-translational modification (PTM) that plays a regulatory role in various physiological and biological processes. Identifying glutarylated peptides using proteomic techniques is expensive and time-consuming. Therefore, computational models and predictors can prove useful for rapid identification of glutarylation. In this study, we propose a model called ProtTrans-Glutar that classifies a protein sequence as a positive or negative glutarylation site by combining traditional sequence-based features with features derived from a pre-trained transformer-based protein model. The features of the model were constructed by combining several feature sets, namely the distribution feature (from composition/transition/distribution encoding), enhanced amino acid composition (EAAC), and features derived from the ProtT5-XL-UniRef50 model. Combined with random under-sampling and the XGBoost classification method, our model obtained recall, specificity, and AUC scores of 0.7864, 0.6286, and 0.7075, respectively, on an independent test set. The recall and AUC scores were notably higher than those of previous glutarylation prediction models using the same dataset. This high recall score suggests that our method has the potential to identify new glutarylation sites and facilitate further research on the glutarylation process.
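One observation on the three reported scores: the AUC of 0.7075 equals the mean of the reported recall and specificity, which is the value the ROC AUC takes when it is computed from a single operating point (also known as balanced accuracy). Whether the authors computed it this way is an assumption; the sketch below only reproduces the arithmetic, and the helper name `single_point_auc` is hypothetical.

```python
def single_point_auc(recall, specificity):
    """ROC AUC of a classifier evaluated at one threshold.

    The ROC curve through (0, 0), (1 - specificity, recall) and (1, 1)
    encloses an area of (recall + specificity) / 2.
    """
    return (recall + specificity) / 2


auc = single_point_auc(0.7864, 0.6286)
print(round(auc, 4))  # prints 0.7075
```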
Affiliation(s)
- Fatma Indriani: Graduate School of Natural Science and Technology, Kanazawa University, Kanazawa, Japan; Department of Computer Science, Lambung Mangkurat University, Banjarmasin, Indonesia
- Kunti Robiatul Mahmudah: Department of Postgraduate of Mathematics Education, Universitas Ahmad Dahlan, Yogyakarta, Indonesia
- Bedy Purnama: School of Computing, Telkom University, Bandung, Indonesia
- Kenji Satou: Institute of Science and Engineering, Kanazawa University, Kanazawa, Japan
43
Yan J, Zhang B, Zhou M, Kwok HF, Siu SWI. Multi-Branch-CNN: Classification of ion channel interacting peptides using multi-branch convolutional neural network. Comput Biol Med 2022; 147:105717. [PMID: 35752114 DOI: 10.1016/j.compbiomed.2022.105717] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 01/09/2022] [Revised: 05/18/2022] [Accepted: 06/05/2022] [Indexed: 11/03/2022]
Abstract
Ligand peptides that have high affinity for ion channels are critical for regulating ion flux across the plasma membrane. These peptides are now being considered as potential drug candidates for many diseases, such as cardiovascular disease and cancers. In this work, we developed Multi-Branch-CNN, a CNN method with multiple input branches for identifying three types of ion channel peptide binders (sodium, potassium, and calcium) from intra- and inter-feature types. As for its real-world applications, prediction models that are able to recognize novel sequences having high or low similarities to training sequences are required. To this end, we tested our models on two test sets: a general test set including sequences spanning different similarity levels to those of the training set, and a novel-test set consisting of only sequences that bear little resemblance to sequences from the training set. Our experiments showed that the Multi-Branch-CNN method performs better than thirteen traditional ML algorithms (TML13), yielding an improvement in accuracy of 3.2%, 1.2%, and 2.3% on the test sets as well as 8.8%, 14.3%, and 14.6% on the novel-test sets for sodium, potassium, and calcium ion channels, respectively. We confirmed the effectiveness of Multi-Branch-CNN by comparing it to the standard CNN method with one input branch (Single-Branch-CNN) and an ensemble method (TML13-Stack). The data sets, script files to reproduce the experiments, and the final predictive models are freely available at https://github.com/jieluyan/Multi-Branch-CNN.
Affiliation(s)
- Jielu Yan: PAMI Research Group, Department of Computer and Information Science, University of Macau, Taipa, Macao Special Administrative Region of China
- Bob Zhang: PAMI Research Group, Department of Computer and Information Science, University of Macau, Taipa, Macao Special Administrative Region of China
- Mingliang Zhou: School of Computer Science, Chongqing University, Shapingba, Chongqing, China
- Hang Fai Kwok: Department of Biomedical Sciences, Faculty of Health Sciences, University of Macau, Taipa, Macao Special Administrative Region of China
- Shirley W I Siu: Department of Computer and Information Science, University of Macau, Taipa, Macao Special Administrative Region of China; Institute of Science and Environment, University of Saint Joseph, Estr. Marginal da Ilha Verde, Macao Special Administrative Region of China
44
Research on DNA-Binding Protein Identification Method Based on LSTM-CNN Feature Fusion. Comput Math Methods Med 2022; 2022:9705275. [PMID: 35693256 PMCID: PMC9184165 DOI: 10.1155/2022/9705275] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 11/25/2021] [Revised: 12/23/2021] [Accepted: 04/27/2022] [Indexed: 11/29/2022]
Abstract
Proteins are closely tied to life activities, and DNA-binding proteins play an irreplaceable role in them, which makes their study important. Although traditional experimental techniques are highly precise, their cost and efficiency are increasingly unable to meet current needs. Machine learning methods can make up for the deficiencies of biological experimental techniques to a certain extent, but they are not as simple and fast as deep learning for data processing. In this paper, a deep learning framework based on parallel long short-term memory (LSTM) and convolutional neural networks (CNN) is proposed to identify DNA-binding proteins. The model extracts not only features of the protein sequences but also features of evolutionary information. Finally, the two sets of features are combined for training and testing. On the PDB2272 dataset, compared with the PDBP_Fusion model, accuracy (ACC) and Matthews correlation coefficient (MCC) increased by 3.82% and 7.98%, respectively. The experimental results show that the model has certain advantages.
45
Consistent Clustering Pattern of Prokaryotic Genes Based on Base Frequency at the Second Codon Position and its Association with Functional Category Preference. Interdiscip Sci 2022; 14:349-357. [PMID: 34817803 PMCID: PMC9124167 DOI: 10.1007/s12539-021-00493-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2021] [Revised: 11/02/2021] [Accepted: 11/07/2021] [Indexed: 10/26/2022]
Abstract
In 2002, our research group observed a gene clustering pattern based on the base frequency of A versus T at the second codon position in the genome of Vibrio cholerae and found that the functional category distribution of genes in the two clusters was different. With the availability of a large number of sequenced genomes, we performed a systematic investigation of A2–T2 distribution and found that 2694 out of 2764 prokaryotic genomes have an optimal clustering number of two, indicating a consistent pattern. Analysis of the functional categories of the coding genes in each cluster in 1483 prokaryotic genomes indicated that 99.33% of the genomes exhibited a significant difference (p < 0.01) in function distribution between the two clusters. Specifically, functional category P was overrepresented in the small cluster of 98.65% of genomes, whereas categories J, K, and L were overrepresented in the larger cluster of over 98.52% of genomes. Lineage analysis uncovered that these preferences appear consistently across all phyla. Overall, our work revealed an almost universal clustering pattern based on the relative frequency of A2 versus T2 and its role in functional category preference. These findings will promote the understanding of the rationality of theoretical prediction of functional classes of genes from their nucleotide sequences and how protein function is determined by DNA sequence.
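The two quantities being clustered, A2 and T2, are the frequencies of A and T at the second position of each codon in a gene. A minimal sketch of that feature extraction, assuming an in-frame coding sequence; the function name `second_position_frequencies` is hypothetical and the clustering step itself is not shown.

```python
def second_position_frequencies(cds):
    """Return (A2, T2): frequencies of A and T at the second codon position."""
    cds = cds.upper()
    if not cds or len(cds) % 3 != 0:
        raise ValueError("expected a non-empty, in-frame coding sequence")
    second_bases = cds[1::3]  # bases at positions 2, 5, 8, ... (1-based)
    n = len(second_bases)
    return second_bases.count("A") / n, second_bases.count("T") / n


# Codons ATG AAA TTT TAA have second bases T, A, T, A.
a2, t2 = second_position_frequencies("ATGAAATTTTAA")
print(a2, t2)  # prints 0.5 0.5
```

Each gene then becomes one point in the (A2, T2) plane, and a genome becomes a point cloud that can be clustered.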
46
Auriemma Citarella A, Di Biasi L, Risi M, Tortora G. SNARER: new molecular descriptors for SNARE proteins classification. BMC Bioinformatics 2022; 23:148. [PMID: 35462533 PMCID: PMC9035248 DOI: 10.1186/s12859-022-04677-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2021] [Accepted: 03/02/2022] [Indexed: 12/02/2022]
Abstract
Background: SNARE proteins play an important role in different biological functions. This study aims to investigate the contribution of a new class of molecular descriptors (called SNARER) related to the chemical-physical properties of proteins in order to evaluate the performance of binary classifiers for SNARE proteins.
Results: We constructed a balanced dataset of SNARE proteins, D128, and an unbalanced one, DUNI, on which we tested and compared the performance of the new descriptors presented here in combination with the feature sets (GAAC, CTDT, CKSAAP and 188D) already present in the literature. The machine learning algorithms used were Random Forest, k-Nearest Neighbors and AdaBoost, and oversampling and subsampling techniques were applied to the unbalanced dataset. The addition of the SNARER descriptors increases the precision for all considered ML algorithms. In particular, on the unbalanced DUNI dataset the accuracy increases in parallel with the increase in sensitivity, while on the balanced dataset D128 the accuracy increases compared to the counterpart without the SNARER descriptors, with a strong improvement in specificity. Our best result is the combination of our SNARER descriptors with the CKSAAP feature on the D128 dataset, with 92.3% accuracy, 90.1% sensitivity and 95% specificity using the RF algorithm.
Conclusions: The analysis has shown how the introduction of molecular descriptors linked to the chemical-physical and structural characteristics of the proteins can improve the classification performance. Additionally, it was pointed out that performance can change depending on whether a balanced or unbalanced dataset is used. The balanced nature of training can significantly improve forecast accuracy.
47
Identification of the most damaging nsSNPs in the human CFL1 gene and their functional and structural impacts on cofilin-1 protein. Gene 2022; 819:146206. [PMID: 35092861 DOI: 10.1016/j.gene.2022.146206] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 09/26/2021] [Revised: 11/04/2021] [Accepted: 01/13/2022] [Indexed: 01/28/2023]
Abstract
The cofilin-1 protein, encoded by CFL1, is an actin-binding protein that regulates F-actin depolymerization and nucleation activity through phosphorylation and dephosphorylation. CFL1 has been implicated in the development of neurodegenerative diseases (Alzheimer's disease and Huntington's disease), neuronal migration disorders (lissencephaly, epilepsy, and schizophrenia), and neural tube closure defects. Mutations in CFL1 have been associated with impaired neural crest cell migration and neural tube closure defects. In our study, various computational approaches were utilized to explore single-nucleotide polymorphisms (SNPs) in CFL1. The Variation Viewer and gnomAD databases were used to retrieve CFL1 SNPs, including 46 nonsynonymous SNPs (nsSNPs). The functional and structural annotation of SNPs was performed using 12 sequence-based web applications, which identified 20 nsSNPs as being the most likely to be deleterious or disease-causing. The conservation of cofilin-1 protein structures was illustrated using the ConSurf and PROSITE web servers, which projected the 12 most deleterious nsSNPs onto conserved domains, with the potential to disrupt the protein's functionality. These 12 nsSNPs were selected for protein structure construction, and the DynaMut/DUET servers predicted that the protein variants V7G, L84P, and L99A were the most likely to be damaging to the cofilin-1 protein structure or function. The evaluation of molecular docking studies demonstrated that the L99A and L84P cofilin-1 variants reduce the binding affinity for actin compared with the native cofilin-1 structure, and molecular dynamic simulation studies confirmed that these variants might destabilize the protein structure. The consequences of putative mutations on protein-protein interactions and post-translational modification sites in the cofilin-1 protein structure were analyzed. 
This study represents the first complete approach to understanding the effects of nsSNPs within the actin-depolymerizing factor/cofilin family, which suggested that SNPs resulting in L84P (rs199716082) and L99A (rs267603119) variants represent significant CFL1 mutations associated with disease development.
48
Duhan N, Norton JM, Kaundal R. deepNEC: a novel alignment-free tool for the identification and classification of nitrogen biochemical network-related enzymes using deep learning. Brief Bioinform 2022; 23:6553605. [PMID: 35325031 DOI: 10.1093/bib/bbac071] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/12/2021] [Revised: 01/25/2022] [Accepted: 02/10/2022] [Indexed: 11/12/2022]
Abstract
Nitrogen is essential for life and its transformations are an important part of the global biogeochemical cycle. Being an essential nutrient, nitrogen exists in a range of oxidation states from +5 (nitrate) to -3 (ammonium and amino-nitrogen), and its oxidation and reduction reactions catalyzed by microbial enzymes determine its environmental fate. The functional annotation of the genes encoding the core nitrogen network enzymes has a broad range of applications in metagenomics, agriculture, wastewater treatment and industrial biotechnology. This study developed an alignment-free computational approach to determine the predicted nitrogen biochemical network-related enzymes from the sequence itself. We propose deepNEC, a novel end-to-end feature selection and classification model training approach for nitrogen biochemical network-related enzyme prediction. The algorithm was developed using Deep Learning, a class of machine learning algorithms that uses multiple layers to extract higher-level features from the raw input data. The derived protein sequence is used as an input, extracting sequential and convolutional features from raw encoded protein sequences based on classification rather than traditional alignment-based methods for enzyme prediction. Two large datasets of protein sequences, enzymes and non-enzymes were used to train the models with protein sequence features like amino acid composition, dipeptide composition (DPC), conformation transition and distribution, normalized Moreau-Broto (NMBroto), conjoint and quasi order, etc. The k-fold cross-validation and independent testing were performed to validate our model training. 
deepNEC uses a four-tier approach for prediction: in the first phase, it predicts whether a query sequence is an enzyme or a non-enzyme; in the second phase, it classifies enzymes into nitrogen biochemical network-related enzymes or non-nitrogen metabolism enzymes; in the third phase, it classifies predicted enzymes into nine nitrogen metabolism classes; and in the fourth phase, it predicts the enzyme commission number out of 20 classes for nitrogen metabolism. In phase I, the DPC + NMBroto hybrid feature gave the best prediction performance (accuracy of 96.15% in k-fold training and 93.43% in independent testing, with a Matthews correlation coefficient of 0.92 in training and 0.87 in independent testing). In phase II (accuracy of 99.71% in k-fold training and 98.30% in independent testing), phase III (overall accuracy of 99.03% in k-fold training and 98.98% in independent testing) and phase IV (overall accuracy of 99.05% in k-fold training and 98.18% in independent testing), the DPC feature gave the best prediction performance. We have also implemented a homology-based method to remove false negatives. All the models have been implemented on a web server (prediction tool), which is freely available at http://bioinfo.usu.edu/deepNEC/.
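Dipeptide composition (DPC), the feature reported as performing best in most phases, is the relative frequency of each of the 400 ordered amino acid pairs among adjacent residues. A minimal sketch of the standard definition; the function name `dipeptide_composition` is hypothetical and the deepNEC preprocessing may differ in details such as the handling of non-standard residues.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs, in a fixed order so every vector is aligned.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(seq):
    """Return a 400-dimensional vector of adjacent-pair frequencies."""
    seq = seq.upper()
    total = len(seq) - 1  # number of adjacent pairs
    if total < 1:
        raise ValueError("sequence must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
    return [counts[d] / total for d in DIPEPTIDES]


vector = dipeptide_composition("ACAC")  # pairs: AC, CA, AC
print(len(vector), vector[DIPEPTIDES.index("AC")])
```

Because the vector length is fixed at 400 regardless of protein length, sequences of any size can be fed to the same classifier without alignment.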
Affiliation(s)
- Naveen Duhan: Department of Plants, Soils, and Climate, College of Agriculture and Applied Sciences, UT 84322 USA
- Jeanette M Norton: Department of Plants, Soils, and Climate, College of Agriculture and Applied Sciences, UT 84322 USA
- Rakesh Kaundal: Department of Plants, Soils, and Climate, College of Agriculture and Applied Sciences, UT 84322 USA; Bioinformatics Facility, Center for Integrated BioSystems, UT 84322 USA; Department of Computer Science, College of Science, Utah State University, Logan, UT 84322 USA
49
Kabir MN, Wong L. EnsembleFam: towards more accurate protein family prediction in the twilight zone. BMC Bioinformatics 2022; 23:90. [PMID: 35287576 PMCID: PMC8919565 DOI: 10.1186/s12859-022-04626-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/11/2021] [Accepted: 03/02/2022] [Indexed: 11/30/2022]
Abstract
Background: Current protein family modeling methods like profile Hidden Markov Model (pHMM), k-mer based methods, and deep learning-based methods do not provide very accurate protein function prediction for proteins in the twilight zone, due to low sequence similarity to reference proteins with known functions.
Results: We present a novel method EnsembleFam, aiming at better function prediction for proteins in the twilight zone. EnsembleFam extracts the core characteristics of a protein family using similarity and dissimilarity features calculated from sequence homology relations. EnsembleFam trains three separate Support Vector Machine (SVM) classifiers for each family using these features, and an ensemble prediction is made to classify novel proteins into these families. Extensive experiments are conducted using the Clusters of Orthologous Groups (COG) dataset and G Protein-Coupled Receptor (GPCR) dataset. EnsembleFam not only outperforms state-of-the-art methods on the overall dataset but also provides a much more accurate prediction for twilight zone proteins.
Conclusions: EnsembleFam, a machine learning method to model protein families, can be used to better identify members with very low sequence homology. Using EnsembleFam, protein functions can be predicted using just sequence information with better accuracy than state-of-the-art methods.
Affiliation(s)
- Mohammad Neamul Kabir: Department of Computer Science, National University of Singapore, 13 Computing Drive, 117417, Singapore, Singapore
- Limsoon Wong: Department of Computer Science, National University of Singapore, 13 Computing Drive, 117417, Singapore, Singapore
50
Mckenna A, P N Dubey S. Machine Learning Based Predictive Model for the Analysis of Sequence Activity Relationships Using Protein Spectra and Protein Descriptors. J Biomed Inform 2022; 128:104016. [PMID: 35143999 DOI: 10.1016/j.jbi.2022.104016] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/03/2021] [Revised: 12/13/2021] [Accepted: 02/03/2022] [Indexed: 11/26/2022]
Abstract
Accurately establishing the connection between a protein sequence and its function remains a focal point within the field of protein engineering, especially in the context of predicting the effects of mutations. As a result, there has been a continued drive to build accurate and reliable predictive models via machine learning that allow for the virtual screening of many protein mutant sequences, measuring the relationship between sequence and 'fitness' or 'activity', commonly known as a Sequence-Activity-Relationship (SAR). An important preliminary stage in the building of these predictive models is the encoding of the chosen sequences. This work evaluates a range of encoding strategies using the Amino Acid Index database, where the indices are transformed into their spectral form via Digital Signal Processing (DSP) techniques, as well as numerous protein structural and physiochemical descriptors. The encoding strategies are explored on a dataset curated to measure the thermostability of various mutants from a recombination library, designed from parental cytochrome P450s. In this work it was concluded that the implementation of protein spectra in concatenation with protein descriptors, together with the Partial Least Squares Regression (PLS) algorithm, gave the most noteworthy increase in the quality of the predictive models (as described in Encoding Strategy C), highlighting their utility in identifying an SAR. The accompanying software produced for this paper is termed pySAR (Python Sequence-Activity-Relationship), which allows a user to find the optimal arrangement of structural and/or physiochemical properties to encode their specific mutant library dataset; the source code is available at: https://github.com/amckenna41/pySAR.
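The spectral encoding described above can be illustrated in two steps: map each residue to a numeric amino acid index, then take the discrete Fourier transform of the resulting signal. The sketch below is a generic illustration and does not use the pySAR API; the Kyte-Doolittle hydropathy values serve as an example index, and the function name `power_spectrum` is hypothetical.

```python
import cmath
import math

# Kyte-Doolittle hydropathy values, used here as one example amino acid index.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def power_spectrum(seq, index=KD):
    """Encode seq with an amino acid index and return its DFT power spectrum."""
    signal = [index[aa] for aa in seq]
    n = len(signal)
    mean = sum(signal) / n
    signal = [v - mean for v in signal]  # remove the zero-frequency offset
    spectrum = []
    for k in range(n // 2 + 1):  # non-redundant half for a real signal
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(signal))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


# An alternating I/L sequence concentrates its power at the highest frequency.
spectrum = power_spectrum("ILILIL")
print([round(p, 2) for p in spectrum])
```

For equal-length mutants from a recombination library, the spectra have the same length and can be concatenated with other descriptors before regression.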
Affiliation(s)
- Adam Mckenna: School of Electronics, Electrical Engineering and Computer Science, Queen's University of Belfast, University Road, BT7 1NN, Belfast, United Kingdom
- Sandhya P N Dubey: Department of Data Science and Computer Applications, Manipal Institute of Technology, Manipal Academy of Higher Education (MAHE), Manipal, Karnataka 576104, India