1
Yang Q, Xu S, Jiang W, Meng F, Wang S, Sun Z, Chen N, Peng D, Liu J, Xing S. Systematic qualitative proteome-wide analysis of lysine malonylation profiling in Platycodon grandiflorus. Amino Acids 2025; 57:9. [PMID: 39812870 PMCID: PMC11735498 DOI: 10.1007/s00726-024-03432-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/11/2024] [Accepted: 11/25/2024] [Indexed: 01/16/2025]
Abstract
Lysine malonylation has recently been found to affect metabolism and to play an important role in plant life activities. Platycodon grandiflorus is an economic crop and medicinal plant, yet no reports on malonylation in this species exist in the literature. This study provides a qualitative profile of lysine malonylation in P. grandiflorus. A total of 888 lysine-malonylated proteins carrying 1755 modification sites were identified. According to functional annotation, the malonylated proteins were closely related to catalysis, binding, and other reactions. Subcellular localization showed that these proteins were enriched in chloroplasts, cytoplasm, and nuclei, indicating that this modification could regulate various metabolic processes. Motif analysis showed enrichment of alanine (A), cysteine (C), glycine (G), and valine (V) residues surrounding malonylated lysines. Metabolic pathway and protein-protein interaction network analyses suggested that these modifications are mainly involved in plant photosynthesis. Malonylated proteins are also involved in stress and defense responses. This study shows that lysine malonylation can affect a variety of biological processes and metabolic pathways. The findings, reported for the first time in P. grandiflorus, provide important information for further research on this species and on the role of lysine malonylation in environmental stress, photosynthesis, and secondary metabolite enrichment.
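The motif analysis mentioned in this abstract starts from fixed-width sequence windows centred on each modified lysine. The following is a minimal sketch of that first step only, not the authors' pipeline: the function names, the half-width of the window, and the `-` padding character are all assumptions made for illustration, and a real motif analysis would additionally test enrichment against a background set.

```python
from collections import Counter


def lysine_windows(sequence, half_width=7, pad="-"):
    """Return (1-based position, window) for every lysine (K) in the sequence.

    half_width and pad are illustrative choices, not values from the study.
    """
    padded = pad * half_width + sequence + pad * half_width
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Residue i of the original sequence sits at index i + half_width
            # in the padded string, so the window starts at index i.
            windows.append((i + 1, padded[i:i + 2 * half_width + 1]))
    return windows


def flanking_counts(windows, pad="-"):
    """Count residues at all flanking positions (centre and padding excluded)."""
    counts = Counter()
    for _, window in windows:
        centre = len(window) // 2
        for j, residue in enumerate(window):
            if j != centre and residue != pad:
                counts[residue] += 1
    return counts


windows = lysine_windows("MAKVGK", half_width=2)
counts = flanking_counts(windows)
```

Raw counts like these only describe the neighbourhood of each lysine; deciding whether A, C, G, or V is over-represented requires comparing against windows around unmodified lysines.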
Affiliation(s)
- Qingshan Yang
  - College of Pharmacy, Anhui University of Chinese Medicine, Hefei, 230012, China
- Shaowei Xu
  - College of Pharmacy, Anhui University of Chinese Medicine, Hefei, 230012, China
- Weimin Jiang
  - Hunan Key Laboratory for Conservation and Utilization of Biological Resources in the Nanyue Mountainous Region, College of Life Sciences and Environment, Hengyang Normal University, Hengyang, 421008, Hunan, China
- Fei Meng
  - College of Pharmacy, Anhui University of Chinese Medicine, Hefei, 230012, China
  - Institute of Traditional Chinese Medicine Resources Protection and Development, Anhui Academy of Chinese Medicine, Hefei, 230012, China
- Shuting Wang
  - College of Pharmacy, Anhui University of Chinese Medicine, Hefei, 230012, China
- Zongping Sun
  - Engineering Technology Research Center of Anti-Aging, Chinese Herbal Medicine, Fuyang Normal University, Fuyang, 236037, China
- Na Chen
  - Joint Research Center for Chinese Herbal Medicine of Anhui of IHM, Hefei Comprehensive National Science Center, Bozhou, 236814, China
- Daiyin Peng
  - College of Pharmacy, Anhui University of Chinese Medicine, Hefei, 230012, China
  - Institute of Traditional Chinese Medicine Resources Protection and Development, Anhui Academy of Chinese Medicine, Hefei, 230012, China
  - MOE-Anhui Joint Collaborative Innovation Center for Quality Improvement of Anhui Genuine Chinese Medicinal Materials, Hefei, 230038, China
- Juan Liu
  - College of Pharmacy, Anhui University of Chinese Medicine, Hefei, 230012, China
- Shihai Xing
  - College of Pharmacy, Anhui University of Chinese Medicine, Hefei, 230012, China
  - Institute of Traditional Chinese Medicine Resources Protection and Development, Anhui Academy of Chinese Medicine, Hefei, 230012, China
  - Joint Research Center for Chinese Herbal Medicine of Anhui of IHM, Hefei Comprehensive National Science Center, Bozhou, 236814, China

2
Huang G, Cao Y, Dai Q, Chen W. ACP-DPE: A Dual-Channel Deep Learning Model for Anticancer Peptide Prediction. IET Syst Biol 2025; 19:e70010. [PMID: 40119615 PMCID: PMC11928748 DOI: 10.1049/syb2.70010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/27/2024] [Revised: 02/13/2025] [Accepted: 02/20/2025] [Indexed: 03/24/2025]
Abstract
Cancer is a serious and complex disease caused by uncontrolled cell growth and is becoming one of the leading causes of death worldwide. Anticancer peptides (ACPs), bioactive peptides with lower toxicity, are emerging as a promising means of treating cancer effectively. Identifying ACPs is challenging because of the limitations of experimental conditions. To address this, we propose a dual-channel deep learning method, termed ACP-DPE, for ACP prediction. ACP-DPE consists of two parallel channels: one is an embedding layer followed by a bi-directional gated recurrent unit (Bi-GRU) module, and the other is an adaptive embedding layer followed by a dilated convolution module. The Bi-GRU module captures dependencies along the peptide sequence, whereas the dilated convolution module characterises the local relationships among amino acids. Experimental results show that ACP-DPE achieves an accuracy of 82.81% and a sensitivity of 86.63%, surpassing the state-of-the-art method by 3.86% and 5.1%, respectively. These findings demonstrate the effectiveness of ACP-DPE for ACP prediction and highlight its potential as a valuable tool in cancer treatment research.
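To make the dilated convolution channel more concrete, here is a minimal pure-Python sketch of a one-dimensional dilated convolution. It is not the ACP-DPE implementation: the function name and the toy signal and kernel are assumptions, and a real model would apply learned kernels to embedding vectors rather than to scalars.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid (unpadded) 1-D dilated convolution.

    Like deep-learning libraries, this computes a cross-correlation:
    the kernel is not flipped.
    """
    # Number of input positions spanned by one output value.
    span = (len(kernel) - 1) * dilation + 1
    output = []
    for start in range(len(signal) - span + 1):
        # Kernel taps are spaced `dilation` positions apart.
        output.append(
            sum(w * signal[start + k * dilation] for k, w in enumerate(kernel))
        )
    return output


signal = [1, 2, 3, 4, 5, 6]
kernel = [1, 0, -1]
dense = dilated_conv1d(signal, kernel, dilation=1)   # each output spans 3 inputs
sparse = dilated_conv1d(signal, kernel, dilation=2)  # each output spans 5 inputs
```

With the same three-weight kernel, raising the dilation from 1 to 2 widens the span of each output from 3 to 5 positions without adding parameters.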
Affiliation(s)
- Guohua Huang
  - College of Information Science and Engineering, Shaoyang University, Shaoyang, China
  - Hunan Provincial Key Laboratory of Finance & Economics Big Data Science and Technology, Hunan University of Finance and Economics, Changsha, China
- Yujie Cao
  - College of Information Science and Engineering, Shaoyang University, Shaoyang, China
- Qi Dai
  - College of Life Science and Medicine, Zhejiang Sci-Tech University, Hangzhou, China
- Weihong Chen
  - Hunan Provincial Key Laboratory of Finance & Economics Big Data Science and Technology, Hunan University of Finance and Economics, Changsha, China

3
Saraswat A, Sharma U, Gandotra A, Wasan L, Artham S, Maitra A, Singh B. Pred-AHCP: Robust Feature Selection-Enabled Sequence-Specific Prediction of Anti-Hepatitis C Peptides via Machine Learning. J Chem Inf Model 2024; 64:9111-9124. [PMID: 39505690 DOI: 10.1021/acs.jcim.4c00900] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2024]
Abstract
Every year, an estimated 1.5 million people worldwide contract Hepatitis C, a significant contributor to liver problems. Although many studies have explored machine learning's potential to predict antiviral peptides, very few have addressed the problem of predicting peptides against specific viruses such as Hepatitis C. In this study, we demonstrate the application and fine-tuning of machine learning (ML) algorithms to predict peptides that are effective against Hepatitis C virus (HCV). We developed a fine-tuned and explainable ML model that harnesses the amino acid sequence of a peptide to predict its anti-hepatitis C potential. Specifically, features were computed based on sequence and physicochemical properties. The feature selection was performed using a combined strategy of mutual information and variance inflation factor. This facilitated the removal of redundant and multicollinear features, enhancing the model's generalizability in predicting anti-hepatitis C peptides (AHCPs). The model using the random forest algorithm produced the best performance with an accuracy of about 92%. The feature analysis highlights that the distributions of hydrophobicity, polarizability, coil-forming residues, frequency of glycine residues and the existence of dipeptide motifs VL, LV, and CC emerged as the key predictors for identifying AHCPs targeting different components of HCV. The developed model can be accessed through the Pred-AHCP web server, provided at http://tinyurl.com/web-Pred-AHCP. This resource facilitates the prediction and re-engineering of AHCPs for designing peptide-based therapeutics while also proposing an exploration of similar strategies for designing peptide inhibitors effective against other viruses. The developed ML model can also be used for validating peptide sequences generated using generative artificial intelligence methods for further optimization.
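Two of the predictors named in this abstract, glycine frequency and the presence of the dipeptide motifs VL, LV, and CC, can be computed directly from a peptide string. The sketch below shows only that; the function name, the feature keys, and the example peptide are hypothetical, and the Pred-AHCP feature set and its selection step (mutual information plus variance inflation factor) are not reproduced here.

```python
def composition_features(peptide, motifs=("VL", "LV", "CC")):
    """Glycine frequency plus presence flags for selected dipeptide motifs.

    The feature names are illustrative, not those used by Pred-AHCP.
    """
    seq = peptide.upper()
    features = {
        # Fraction of residues that are glycine (0.0 for an empty string).
        "freq_G": seq.count("G") / len(seq) if seq else 0.0,
    }
    for motif in motifs:
        # 1 if the dipeptide occurs anywhere in the sequence, else 0.
        features["has_" + motif] = int(motif in seq)
    return features


features = composition_features("GVLCCG")  # made-up peptide, for illustration
```

Features of this kind are cheap to compute and easy to interpret, which fits the abstract's emphasis on an explainable model.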
Affiliation(s)
- Akash Saraswat
  - Department of Applied Sciences, School of Engineering and Technology, BML Munjal University, Gurugram, Haryana 122413, India
- Utsav Sharma
  - Department of Computer Science and Engineering, School of Engineering and Technology, BML Munjal University, Gurugram, Haryana 122413, India
- Aryan Gandotra
  - Department of Computer Science and Engineering, School of Engineering and Technology, BML Munjal University, Gurugram, Haryana 122413, India
- Lakshit Wasan
  - Department of Computer Science and Engineering, School of Engineering and Technology, BML Munjal University, Gurugram, Haryana 122413, India
- Sainithin Artham
  - Department of Computer Science and Engineering, School of Engineering and Technology, BML Munjal University, Gurugram, Haryana 122413, India
- Arijit Maitra
  - Department of Applied Sciences, School of Engineering and Technology, BML Munjal University, Gurugram, Haryana 122413, India
- Bipin Singh
  - Centre for Life Sciences, Mahindra University, Hyderabad, Telangana 500043, India

4
Wang X, Zhang Z, Liu C. iACP-DFSRA: Identification of Anticancer Peptides Based on a Dual-channel Fusion Strategy of ResCNN and Attention. J Mol Biol 2024; 436:168810. [PMID: 39362624 DOI: 10.1016/j.jmb.2024.168810] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/09/2024] [Revised: 09/10/2024] [Accepted: 09/27/2024] [Indexed: 10/05/2024]
Abstract
Anticancer peptides (ACPs) have been widely applied in the treatment of cancer owing to their good safety, tolerable side effects, and high selectivity. However, the number of experimentally validated ACPs is limited because identifying ACPs is extremely expensive. Hence, accurate and cost-effective identification methods for ACPs are urgently needed. In this work, we propose a deep learning-based model, named iACP-DFSRA, for ACP identification. Specifically, we adopt two kinds of sequence encoding, the ProtBert_BFD pre-trained language model and handcrafted features, to represent protein sequences. LightGBM is then used for feature selection, and the selected features are fed into a ResCNN and an Attention mechanism, respectively, to extract local and global features. Finally, the concatenated features are deeply fused with the Attention mechanism so that the model attends more to key features, and predictions are made by a fully connected layer. The results of 10-fold cross-validation demonstrated that the iACP-DFSRA model delivered improved results in most metrics, with Sp of 94.15%, Sn of 95.32%, Acc of 94.74% and MCC of 89.48%, compared to the latest AACFlow model. Indeed, the iACP-DFSRA model is the only model with Acc > 90% and MCC > 80% on this independent test dataset. We further demonstrated the superiority of our model on additional datasets. In addition, t-SNE and SHAP interpretation analysis demonstrated that it is crucial to use two channels for feature extraction and to use the Attention mechanism for deep fusion, which helps iACP-DFSRA predict ACPs more effectively.
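The four metrics quoted in this abstract (Sn, Sp, Acc, MCC) all follow from the confusion-matrix counts of a binary classifier. The sketch below uses the standard definitions; the function name and the example counts are made up for illustration and are not taken from the paper.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Return 0.0 instead of dividing by zero when a class is absent.
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sn": ratio(tp, tp + fn),                   # true positive rate
        "Sp": ratio(tn, tn + fp),                   # true negative rate
        "Acc": ratio(tp + tn, tp + tn + fp + fn),   # overall accuracy
        "MCC": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Illustrative counts only: 100 positives and 100 negatives.
metrics = binary_metrics(tp=90, tn=80, fp=20, fn=10)
```

When positives and negatives are equally numerous, Acc is the mean of Sn and Sp. The values reported in the abstract satisfy this to rounding, since (94.15 + 95.32) / 2 = 94.735.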
Affiliation(s)
- Xin Wang
  - School of Science, Dalian Maritime University, Dalian 116026, China
- Zimeng Zhang
  - School of Science, Dalian Maritime University, Dalian 116026, China
- Chang Liu
  - School of Science, Dalian Maritime University, Dalian 116026, China

5
Kang Y, Wang H, Qin Y, Liu G, Yu Y, Zhang Y. PSATF-6mA: an integrated learning fusion feature-encoded DNA-6 mA methylcytosine modification site recognition model based on attentional mechanisms. Front Genet 2024; 15:1498884. [PMID: 39600317 PMCID: PMC11588721 DOI: 10.3389/fgene.2024.1498884] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2024] [Accepted: 10/30/2024] [Indexed: 11/29/2024]
Abstract
DNA methylation is of crucial importance for biological genetic expression, such as biological cell differentiation and cellular tumours. The identification of DNA-6mA sites using traditional biological experimental methods requires cumbersome steps and a large amount of time. The advent of neural network technology has facilitated the identification of 6mA sites on cross-species DNA with enhanced efficacy. Nevertheless, the majority of contemporary neural network models for identifying 6mA sites prioritize the design of the identification model, with comparatively limited research conducted on the statistically significant DNA sequence itself. Consequently, this paper focuses on the statistical strategy of DNA double-stranded features, utilising the multi-head self-attention mechanism in neural networks applied to DNA position probabilistic relationships. Furthermore, a new recognition model, PSATF-6mA, is constructed by continually adjusting the attentional tendency of feature fusion through an integrated learning framework. The experimental results, obtained through cross-validation with cross-species data, demonstrate that the PSATF-6mA model outperforms the baseline model. The Matthews correlation coefficient (MCC) for the cross-species dataset of rice and M. musculus genomes can reach a score of 0.982. The present model is expected to assist biologists in more accurately identifying 6mA loci and in formulating new testable biological hypotheses.
Affiliation(s)
- Yanmei Kang
  - School of Cyber Science and Engineering, University of International Relations, Beijing, China
- Hongyuan Wang
  - School of Cyber Science and Engineering, University of International Relations, Beijing, China
- Yubo Qin
  - School of Cyber Science and Engineering, University of International Relations, Beijing, China
- Guanlin Liu
  - School of Cyber Science and Engineering, University of International Relations, Beijing, China
- Yi Yu
  - College of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China
- Yongjian Zhang
  - School of Cyber Science and Engineering, University of International Relations, Beijing, China

6
Yan H, Xu Z. A novel deep ensemble-based model with outlier removal and order-invariant ranking for carbon dioxide emission prediction. Environ Sci Pollut Res Int 2024; 31:57605-57622. [PMID: 39287736 DOI: 10.1007/s11356-024-34817-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2024] [Accepted: 08/23/2024] [Indexed: 09/19/2024]
Abstract
Excessive carbon dioxide (CO2) emissions pose a formidable challenge, driving global climate change and necessitating urgent attention. Striking a balance between curbing CO2 emissions and fostering economic growth hinges upon the ability to reliably forecast CO2 emissions. Such forecasts are indispensable for policymakers as they endeavor to make informed decisions and proactively implement mitigation measures. In this research, we introduce an innovative deep ensemble prediction model for CO2 emissions. This model is constructed around four parallel Long Short-Term Memory (LSTM) neural networks, complemented by a novel Multi-Layer Perceptron (MLP)-based ensemble framework, equipped with an outlier detection mechanism and an order-invariant ranking module. To enhance prediction accuracy and stability, a k-nearest neighbor (KNN)-based outlier detection module is employed to identify non-outliers and reasonable predictions for the ensemble models. Additionally, a novel feature ranking module is proposed to mitigate prediction fluctuations. The performance evaluation of our model is conducted using historical CO2 emission data spanning from 1971 to 2021, encompassing six representative countries. Our findings demonstrate that the proposed methodology outperforms existing approaches across various evaluation metrics, offering considerably reduced prediction variances and greater stability. Moreover, long-term CO2 emission predictions for the corresponding six countries have been provided, which might offer policymakers some basis for making decisions.
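The abstract does not spell out how the KNN-based outlier module decides which base-model predictions to keep, so the following is only one plausible reading, sketched under stated assumptions: each prediction is scored by its mean distance to its k nearest sibling predictions, and predictions whose score exceeds a multiple of the median score are dropped before averaging. The function name, `k`, and `tolerance` are hypothetical, and the plain average stands in for the paper's MLP-based ensemble.

```python
def knn_filtered_mean(predictions, k=2, tolerance=2.0):
    """Average ensemble predictions after discarding k-NN outliers.

    Assumed rule (not from the paper): a prediction is an outlier when its
    mean distance to its k nearest neighbours exceeds tolerance times the
    median of those scores. tolerance should be at least 1.
    """
    if len(predictions) <= k:
        raise ValueError("need more than k predictions")
    scores = []
    for i, p in enumerate(predictions):
        # Distances from prediction i to every other prediction, ascending.
        distances = sorted(
            abs(p - q) for j, q in enumerate(predictions) if j != i
        )
        scores.append(sum(distances[:k]) / k)
    median = sorted(scores)[len(scores) // 2]  # upper median
    kept = [p for p, s in zip(predictions, scores) if s <= tolerance * median]
    return sum(kept) / len(kept), kept


# Four hypothetical base-model forecasts; the last one disagrees strongly.
mean, kept = knn_filtered_mean([10.0, 10.2, 9.9, 15.0])
```

Filtering before combining means a single divergent base model cannot pull the ensemble output far from the consensus, which is the stability benefit the abstract describes.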
Affiliation(s)
- Huan Yan
  - Institute of Industrial Economics, Chinese Academy of Social Science, Beijing, 100006, PR China
- Zhaoyang Xu
  - Department of Paediatrics, Cambridge University, Cambridge, UK

7
Qin Z, Ren H, Zhao P, Wang K, Liu H, Miao C, Du Y, Li J, Wu L, Chen Z. Current computational tools for protein lysine acylation site prediction. Brief Bioinform 2024; 25:bbae469. [PMID: 39316944 PMCID: PMC11421846 DOI: 10.1093/bib/bbae469] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2024] [Revised: 08/20/2024] [Accepted: 09/07/2024] [Indexed: 09/26/2024]
Abstract
As a main subtype of post-translational modification (PTM), protein lysine acylations (PLAs) play crucial roles in regulating diverse functions of proteins. With recent advancements in proteomics technology, the identification of PTM is becoming a data-rich field. A large amount of experimentally verified data is urgently required to be translated into valuable biological insights. With computational approaches, PLA can be accurately detected across the whole proteome, even for organisms with small-scale datasets. Herein, a comprehensive summary of 166 in silico PLA prediction methods is presented, including a single type of PLA site and multiple types of PLA sites. This recapitulation covers important aspects that are critical for the development of a robust predictor, including data collection and preparation, sample selection, feature representation, classification algorithm design, model evaluation, and method availability. Notably, we discuss the application of protein language models and transfer learning to solve the small-sample learning issue. We also highlight the prediction methods developed for functionally relevant PLA sites and species/substrate/cell-type-specific PLA sites. In conclusion, this systematic review could potentially facilitate the development of novel PLA predictors and offer useful insights to researchers from various disciplines.
Affiliation(s)
- Zhaohui Qin
  - Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
- Haoran Ren
  - Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
- Pei Zhao
  - State Key Laboratory of Cotton Biology, Institute of Cotton Research of Chinese Academy of Agricultural Sciences (CAAS), Anyang 455000, China
- Kaiyuan Wang
  - Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
- Huixia Liu
  - Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
- Chunbo Miao
  - Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
- Yanxiu Du
  - Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
- Junzhou Li
  - Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
- Liuji Wu
  - National Key Laboratory of Wheat and Maize Crop Science, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
- Zhen Chen
  - Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China

8
Zhang S, Li YD, Cai YR, Kang XP, Feng Y, Li YC, Chen YH, Li J, Bao LL, Jiang T. Compositional features analysis by machine learning in genome represents linear adaptation of monkeypox virus. Front Genet 2024; 15:1361952. [PMID: 38495668 PMCID: PMC10940399 DOI: 10.3389/fgene.2024.1361952] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/27/2023] [Accepted: 02/21/2024] [Indexed: 03/19/2024]
Abstract
Introduction: The global headlines have been dominated by the sudden and widespread outbreak of monkeypox, a rare and endemic zoonotic disease caused by the monkeypox virus (MPXV). Genomic composition based machine learning (ML) methods have recently shown promise in identifying host adaptability and evolutionary patterns of virus. Our study aimed to analyze the genomic characteristics and evolutionary patterns of MPXV using ML methods. Methods: The open reading frame (ORF) regions of full-length MPXV genomes were filtered and 165 ORFs were selected as clusters with the highest homology. Unsupervised machine learning methods of t-distributed stochastic neighbor embedding (t-SNE), Principal Component Analysis (PCA), and hierarchical clustering were performed to observe the DCR characteristics of the selected ORF clusters. Results: The results showed that MPXV sequences post-2022 showed an obvious linear adaptive evolution, indicating that it has become more adapted to the human host after accumulating mutations. For further accurate analysis, the ORF regions with larger variations were filtered out based on the ranking of homology difference to narrow down the key ORF clusters, which drew the same conclusion of linear adaptability. Then key differential protein structures were predicted by AlphaFold 2, which meant that difference in main domains might be one of the internal reasons for linear adaptive evolution. Discussion: Understanding the process of linear adaptation is critical in the constant evolutionary struggle between viruses and their hosts, playing a significant role in crafting effective measures to tackle viral diseases. Therefore, the present study provides valuable insights into the evolutionary patterns of the MPXV in 2022 from the perspective of genomic composition characteristics analysis through ML methods.
Affiliation(s)
- Sen Zhang
  - State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Academy of Military Medical Sciences, Beijing, China
- Ya-Dan Li
  - College of Basic Medical Sciences, Anhui Medical University, Hefei, China
- Yu-Rong Cai
  - State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Academy of Military Medical Sciences, Beijing, China
  - College of the First Clinical Medical, Inner Mongolia Medical University, Hohhot, China
- Xiao-Ping Kang
  - State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Academy of Military Medical Sciences, Beijing, China
- Ye Feng
  - State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Academy of Military Medical Sciences, Beijing, China
- Yu-Chang Li
  - State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Academy of Military Medical Sciences, Beijing, China
- Yue-Hong Chen
  - State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Academy of Military Medical Sciences, Beijing, China
- Jing Li
  - State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Academy of Military Medical Sciences, Beijing, China
  - College of Basic Medical Sciences, Anhui Medical University, Hefei, China
- Li-Li Bao
  - College of Basic Medical Sciences, Inner Mongolia Medical University, Hohhot, China
- Tao Jiang
  - College of Basic Medical Sciences, Anhui Medical University, Hefei, China

9
Ramazi S, Tabatabaei SAH, Khalili E, Nia AG, Motarjem K. Analysis and review of techniques and tools based on machine learning and deep learning for prediction of lysine malonylation sites in protein sequences. Database (Oxford) 2024; 2024:baad094. [PMID: 38245002 PMCID: PMC10799748 DOI: 10.1093/database/baad094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/17/2023] [Revised: 11/30/2023] [Accepted: 12/20/2023] [Indexed: 01/22/2024]
Abstract
Post-translational modifications are crucial molecular regulatory mechanisms that regulate diverse cellular processes. Malonylation of proteins, a reversible post-translational modification of lysine (K) residues, is linked to a variety of biological functions, such as cellular regulation and pathogenesis. This modification plays a crucial role in metabolic pathways, mitochondrial functions, fatty acid oxidation and other life processes. Accurately identifying malonylation sites is crucial to understanding the molecular mechanism of malonylation, but experimental identification can be challenging and costly. Recently, approaches based on machine learning (ML) have been suggested to address this issue. These procedures have been shown to improve accuracy while lowering cost and time requirements. However, they also have specific shortcomings, including inappropriate feature extraction from protein sequences, high-dimensional features and inefficient underlying classifiers. As a result, there is an urgent need for effective predictors and calculation methods. In this study, we provide a comprehensive analysis and review of existing prediction models, tools and benchmark datasets for predicting malonylation sites in protein sequences, followed by a comparison study. The review covers the specifications of benchmark datasets, the features and encoding methods, the prediction approaches and their embedded ML or deep learning models, and a description and comparison of the existing tools in this domain. To evaluate and compare the prediction capability of the tools, a new set of data was extracted from the most recently updated database, and the tools were assessed on it. Finally, a hybrid architecture consisting of several classifiers, including classical ML models and a deep learning model, is proposed to ensemble the prediction results. This approach demonstrates better performance than all prediction tools included in this study (the source codes of the models presented in this manuscript are available in https://github.com/Malonylation). Database URL: https://github.com/A-Golshan/Malonylation.
Affiliation(s)
- Seyed Amir Hossein Tabatabaei
  - Department of Computer Science, Faculty of Mathematical Sciences, University of Guilan, Namjoo St. Postal, Rasht 41938-33697, Iran
  - Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Jalal AleAhmad, Tehran 14117-13116, Iran
- Elham Khalili
  - Department of Plant Sciences, Faculty of Science, Tarbiat Modares University, Jalal AleAhmad, Tehran 14117-13116, Iran
- Amirhossein Golshan Nia
  - Department of Mathematics and Computer Science, Amirkabir University of Technology, No. 350, Hafez Ave, Tehran 15916-34311, Iran
- Kiomars Motarjem
  - Department of Statistics, Faculty of Mathematical Sciences, Tarbiat Modares University, Jalal AleAhmad, Tehran 14117-13116, Iran

10
Hu F, Li W, Li Y, Hou C, Ma J, Jia C. O-GlcNAcPRED-DL: Prediction of Protein O-GlcNAcylation Sites Based on an Ensemble Model of Deep Learning. J Proteome Res 2024; 23:95-106. [PMID: 38054441 DOI: 10.1021/acs.jproteome.3c00458] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/07/2023]
Abstract
O-linked β-N-acetylglucosamine (O-GlcNAc) is a post-translational modification (i.e., O-GlcNAcylation) on serine/threonine residues of proteins, regulating a plethora of physiological and pathological events. As a dynamic process, O-GlcNAc functions in a site-specific manner. However, the experimental identification of the O-GlcNAc sites remains challenging in many scenarios. Herein, by leveraging the recent progress in cataloguing experimentally identified O-GlcNAc sites and advanced deep learning approaches, we establish an ensemble model, O-GlcNAcPRED-DL, a deep learning-based tool, for the prediction of O-GlcNAc sites. In brief, to make a benchmark O-GlcNAc data set, we extracted the information on O-GlcNAc from the recently constructed database O-GlcNAcAtlas, which contains thousands of experimentally identified and curated O-GlcNAc sites on proteins from multiple species. To overcome the imbalance between positive and negative data sets, we selected five groups of negative data sets in humans and mice to construct an ensemble predictor based on connection of a convolutional neural network and bidirectional long short-term memory. By taking into account three types of sequence information, we constructed four network frameworks, with the systematically optimized parameters used for the models. The thorough comparison analysis on two independent data sets of humans and mice and six independent data sets from other species demonstrated remarkably increased sensitivity and accuracy of the O-GlcNAcPRED-DL models, outperforming other existing tools. Moreover, a user-friendly Web server for O-GlcNAcPRED-DL has been constructed, which is freely available at http://oglcnac.org/pred_dl.
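The abstract handles class imbalance by pairing the positive sites with five groups of negatives and combining the resulting models. The sketch below illustrates that general idea under assumptions that the abstract does not confirm: the negative groups are taken to be disjoint random partitions, and the ensemble score is taken to be a plain average of the per-model scores. All function names and the toy data are hypothetical.

```python
import random


def balanced_training_sets(positives, negatives, n_groups=5, seed=0):
    """Pair the full positive set with each of n_groups disjoint negative groups.

    Random, disjoint partitioning is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    # Striding through the shuffled list yields groups of near-equal size.
    groups = [shuffled[i::n_groups] for i in range(n_groups)]
    return [(list(positives), group) for group in groups]


def ensemble_score(model_scores):
    """Combine per-model probabilities for one site by simple averaging."""
    return sum(model_scores) / len(model_scores)


# Toy identifiers standing in for sequence windows: 10 positives, 50 negatives.
positives = list(range(10))
negatives = list(range(100, 150))
training_sets = balanced_training_sets(positives, negatives)
combined = ensemble_score([0.9, 0.8, 0.7, 0.6, 0.5])
```

Each base model then sees a balanced training set, while the ensemble as a whole still uses every negative example.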
Collapse
Affiliation(s)
- Fengzhu Hu
- School of Science, Dalian Maritime University, Dalian 116026, China
| | - Weiyu Li
- Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, District of Columbia 20007, United States
| | - Yaoxiang Li
- Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, District of Columbia 20007, United States
| | - Chunyan Hou
- Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, District of Columbia 20007, United States
- Junfeng Ma
- Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, District of Columbia 20007, United States
- Cangzhi Jia
- School of Science, Dalian Maritime University, Dalian 116026, China
11
Wang H, Wang W, Zhang S, Hu Z, Yao R, Hadiatullah H, Li P, Zhao G. Identification of novel umami peptides from yeast extract and the mechanism against T1R1/T1R3. Food Chem 2023; 429:136807. [PMID: 37450993 DOI: 10.1016/j.foodchem.2023.136807] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/17/2023] [Revised: 06/21/2023] [Accepted: 07/03/2023] [Indexed: 07/18/2023]
Abstract
Yeast extract was separated by ultrafiltration, gel filtration chromatography, and preparative high-performance liquid chromatography to analyze the umami mechanism. Thirteen umami peptides were screened from the 73 peptides identified in yeast extract using nanoscale ultra-performance liquid chromatography-tandem mass spectrometry and virtual screening. The umami peptides had thresholds ranging from 0.07 to 0.61 mM. DWTDDVEAR exhibited a strong umami taste with a pronounced enhancement effect on monosodium glutamate. Molecular docking studies revealed that specific amino acid residues in the T1R1 subunit, including Arg316, Ser401, and Asp315, played a critical role in the umami perception of these peptides. Overall, the study highlights the potential of natural flavor enhancers and provides insights into the mechanism of umami taste perception.
Affiliation(s)
- Hao Wang
- State Key Laboratory of Food Nutrition and Safety, Key Laboratory of Food Nutrition and Safety, Ministry of Education, College of Food Science and Engineering, Tianjin University of Science & Technology, Tianjin 300457, China
- Wenjun Wang
- State Key Laboratory of Food Nutrition and Safety, Key Laboratory of Food Nutrition and Safety, Ministry of Education, College of Food Science and Engineering, Tianjin University of Science & Technology, Tianjin 300457, China
- Shuyu Zhang
- State Key Laboratory of Food Nutrition and Safety, Key Laboratory of Food Nutrition and Safety, Ministry of Education, College of Food Science and Engineering, Tianjin University of Science & Technology, Tianjin 300457, China
- Zhenhao Hu
- State Key Laboratory of Food Nutrition and Safety, Key Laboratory of Food Nutrition and Safety, Ministry of Education, College of Food Science and Engineering, Tianjin University of Science & Technology, Tianjin 300457, China
- Ruohan Yao
- State Key Laboratory of Food Nutrition and Safety, Key Laboratory of Food Nutrition and Safety, Ministry of Education, College of Food Science and Engineering, Tianjin University of Science & Technology, Tianjin 300457, China
- Hadiatullah Hadiatullah
- Tianjin Key Laboratory for Modern Drug Delivery & High-Efficiency, Collaborative Innovation Center of Chemical Science and Engineering, School of Pharmaceutical Science and Technology, Tianjin University, Tianjin 300072, China
- Pei Li
- The Hubei Provincial Key Laboratory of Yeast Function, Angel Yeast Co. Ltd., Yichang 443003, Hubei, China
- Guozhong Zhao
- State Key Laboratory of Food Nutrition and Safety, Key Laboratory of Food Nutrition and Safety, Ministry of Education, College of Food Science and Engineering, Tianjin University of Science & Technology, Tianjin 300457, China.
12
Zuo D, Yang L, Jin Y, Qi H, Liu Y, Ren L. Machine learning-based models for the prediction of breast cancer recurrence risk. BMC Med Inform Decis Mak 2023; 23:276. [PMID: 38031071 PMCID: PMC10688055 DOI: 10.1186/s12911-023-02377-z] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 07/21/2023] [Accepted: 11/17/2023] [Indexed: 12/01/2023] Open
Abstract
Breast cancer is the most common malignancy diagnosed in women worldwide. The prevalence and incidence of breast cancer are increasing every year; therefore, early diagnosis along with suitable relapse detection is an important strategy for prognosis improvement. This study aimed to compare different machine learning algorithms to select the best model for predicting breast cancer recurrence. The prediction model was developed by using eleven different machine learning (ML) algorithms, including logistic regression (LR), random forest (RF), support vector classification (SVC), extreme gradient boosting (XGBoost), gradient boosting decision tree (GBDT), decision tree, multilayer perceptron (MLP), linear discriminant analysis (LDA), adaptive boosting (AdaBoost), Gaussian naive Bayes (GaussianNB), and light gradient boosting machine (LightGBM). The area under the curve (AUC), accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) and F1 score were used to evaluate the performance of the prognostic model. Based on performance, the optimal ML model was selected, and feature importance was ranked by Shapley Additive Explanation (SHAP) values. Compared to the other 10 algorithms, the AdaBoost algorithm had the best performance for predicting breast cancer recurrence and was adopted in the establishment of the prediction model. Moreover, CA125, CEA, Fbg, and tumor diameter were found to be the most important features in our dataset to predict breast cancer recurrence. More importantly, our study is the first to use the SHAP method to make an AdaBoost-based breast cancer recurrence prediction model more interpretable for clinicians. The AdaBoost algorithm offers a clinical decision support model and successfully identifies the recurrence of breast cancer.
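As a minimal sketch of the threshold-based evaluation metrics named in this abstract (accuracy, sensitivity, specificity, PPV, NPV and F1 score), the function below derives them from confusion-matrix counts. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the study; AUC is left out because it needs ranked scores rather than counts.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return common binary-classification metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (illustrative sketch, no guards).
    """
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    ppv = tp / (tp + fp)                  # positive predictive value (precision)
    npv = tn / (tn + fn)                  # negative predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)  # harmonic mean of PPV and sensitivity
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
m = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Reporting PPV and NPV alongside sensitivity and specificity matters in a recurrence setting because the first pair depends on how common recurrence is in the cohort, while the second pair does not.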
Affiliation(s)
- Duo Zuo
- Department of Clinical Laboratory, Tianjin Medical University Cancer Institute & Hospital, Tianjin, 300060, China
- National Clinical Research Center for Cancer, Tianjin, 300060, China
- Tianjin's Clinical Research Center for Cancer, Tianjin, 300060, China
- Key Laboratory of Cancer Prevention and Therapy, Tianjin, 300060, China
- Key Laboratory of Breast Cancer Prevention and Therapy, Tianjin Medical University, Ministry of Education, Tianjin, 300060, China
- Lexin Yang
- Department of Clinical Laboratory, Tianjin Medical University Cancer Institute & Hospital, Tianjin, 300060, China
- National Clinical Research Center for Cancer, Tianjin, 300060, China
- Tianjin's Clinical Research Center for Cancer, Tianjin, 300060, China
- Key Laboratory of Cancer Prevention and Therapy, Tianjin, 300060, China
- Key Laboratory of Breast Cancer Prevention and Therapy, Tianjin Medical University, Ministry of Education, Tianjin, 300060, China
- Yu Jin
- Department of Clinical Laboratory, Tianjin Medical University Cancer Institute & Hospital, Tianjin, 300060, China
- Tongji University Cancer Center, Shanghai Tenth People's Hospital, School of Medicine, Tongji University, Shanghai, 200072, China
- Huan Qi
- China Mobile Group Tianjin Company Limited, Tianjin, 300308, China
- Yahui Liu
- Department of Clinical Laboratory, Tianjin Medical University Cancer Institute & Hospital, Tianjin, 300060, China
- National Clinical Research Center for Cancer, Tianjin, 300060, China
- Tianjin's Clinical Research Center for Cancer, Tianjin, 300060, China
- Key Laboratory of Cancer Prevention and Therapy, Tianjin, 300060, China
- Key Laboratory of Breast Cancer Prevention and Therapy, Tianjin Medical University, Ministry of Education, Tianjin, 300060, China
- Li Ren
- Department of Clinical Laboratory, Tianjin Medical University Cancer Institute & Hospital, Tianjin, 300060, China.
- National Clinical Research Center for Cancer, Tianjin, 300060, China.
- Tianjin's Clinical Research Center for Cancer, Tianjin, 300060, China.
- Key Laboratory of Cancer Prevention and Therapy, Tianjin, 300060, China.
- Key Laboratory of Breast Cancer Prevention and Therapy, Tianjin Medical University, Ministry of Education, Tianjin, 300060, China.
13
Li Y, Ma D, Chen D, Chen Y. ACP-GBDT: An improved anticancer peptide identification method with gradient boosting decision tree. Front Genet 2023; 14:1165765. [PMID: 37065496 PMCID: PMC10090421 DOI: 10.3389/fgene.2023.1165765] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2023] [Accepted: 03/09/2023] [Indexed: 03/31/2023] Open
Abstract
Cancer is one of the most dangerous diseases in the world, killing millions of people every year. Drugs composed of anticancer peptides have been used to treat cancer with low side effects in recent years. Therefore, identifying anticancer peptides has become a focus of research. In this study, an improved anticancer peptide predictor named ACP-GBDT, based on gradient boosting decision tree (GBDT) and sequence information, is proposed. To encode the peptide sequences included in the anticancer peptide dataset, ACP-GBDT uses a merged feature composed of AAIndex and SVMProt-188D. A GBDT is adopted to train the prediction model in ACP-GBDT. Independent testing and ten-fold cross-validation show that ACP-GBDT can effectively distinguish anticancer peptides from non-anticancer ones. The comparison results on the benchmark dataset show that ACP-GBDT is simpler and more effective than other existing anticancer peptide prediction methods.
Affiliation(s)
- Yanjuan Li
- College of Electrical and Information Engineering, Quzhou University, Quzhou, China
- Di Ma
- College of Computer, Hangzhou Dianzi University, Hangzhou, China
- Dong Chen
- College of Electrical and Information Engineering, Quzhou University, Quzhou, China
- *Correspondence: Dong Chen; Yu Chen
- Yu Chen
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- *Correspondence: Dong Chen; Yu Chen
14
Zhang ZD, Li RR, Chen JY, Huang HX, Cheng YW, Xu LY, Li EM. The post-translational modification of Fascin: impact on cell biology and its associations with inhibiting tumor metastasis. Amino Acids 2022; 54:1541-1552. [PMID: 35939077 DOI: 10.1007/s00726-022-03193-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2022] [Accepted: 07/28/2022] [Indexed: 02/05/2023]
Abstract
Post-translational modifications (PTMs), which are crucial in the regulation of protein functions, have great potential as biomarkers of cancer status. Fascin (Fascin actin-bundling protein 1, FSCN1), a key protein in the formation of filopodia that is structurally based on actin filaments (F-actin), is significantly associated with tumor invasion and metastasis. Studies have revealed various regulatory mechanisms of human Fascin, including PTMs. Although a number of Fascin PTM sites have been identified, their exact functions and clinical significance are much less explored. This review surveys studies on the functions of Fascin and briefly discusses its regulatory mechanisms. Next, to review the role of Fascin PTMs in cell biology and their associations with metastatic disease, we discuss advances in the characterization of Fascin PTMs, including phosphorylation, ubiquitination, sumoylation, and acetylation, together with the main regulatory mechanisms. Fascin PTMs may be potential targets for therapy for metastatic disease.
Affiliation(s)
- Zhi-Da Zhang
- The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Department of Biochemistry and Molecular Biology, Shantou University Medical College, No. 22, Xinling Road, Shantou, 515041, Guangdong, China
- Rong-Rong Li
- The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Department of Biochemistry and Molecular Biology, Shantou University Medical College, No. 22, Xinling Road, Shantou, 515041, Guangdong, China
- Jia-You Chen
- The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Department of Biochemistry and Molecular Biology, Shantou University Medical College, No. 22, Xinling Road, Shantou, 515041, Guangdong, China
- Hong-Xin Huang
- The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Department of Biochemistry and Molecular Biology, Shantou University Medical College, No. 22, Xinling Road, Shantou, 515041, Guangdong, China
- Yin-Wei Cheng
- The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Department of Biochemistry and Molecular Biology, Shantou University Medical College, No. 22, Xinling Road, Shantou, 515041, Guangdong, China
- Guangdong Provincial Key Laboratory of Infectious Diseases and Molecular Immunopathology, Institute of Oncologic Pathology, Shantou University Medical College, Shantou, 515041, Guangdong, China
- Institute of Basic Medical Science, Cancer Research Center, Shantou University Medical College, Shantou, 515041, Guangdong, China
- Li-Yan Xu
- The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Department of Biochemistry and Molecular Biology, Shantou University Medical College, No. 22, Xinling Road, Shantou, 515041, Guangdong, China
- Guangdong Provincial Key Laboratory of Infectious Diseases and Molecular Immunopathology, Institute of Oncologic Pathology, Shantou University Medical College, Shantou, 515041, Guangdong, China
- Institute of Basic Medical Science, Cancer Research Center, Shantou University Medical College, Shantou, 515041, Guangdong, China
- En-Min Li
- The Key Laboratory of Molecular Biology for High Cancer Incidence Coastal Chaoshan Area, Department of Biochemistry and Molecular Biology, Shantou University Medical College, No. 22, Xinling Road, Shantou, 515041, Guangdong, China
15
Zhao J, Jiang H, Zou G, Lin Q, Wang Q, Liu J, Ma L. CNNArginineMe: A CNN structure for training models for predicting arginine methylation sites based on the One-Hot encoding of peptide sequence. Front Genet 2022; 13:1036862. [PMID: 36324513 PMCID: PMC9618650 DOI: 10.3389/fgene.2022.1036862] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 09/06/2022] [Accepted: 10/04/2022] [Indexed: 11/30/2022] Open
Abstract
Protein arginine methylation (PRme), a post-translational modification, plays a critical role in numerous cellular processes and regulates critical cellular functions. Though several in silico models for predicting PRme sites have been reported, new models may need to be developed because of the significant increase in identified PRme sites. In this study, we constructed multiple machine-learning and deep-learning models. The deep-learning CNN model combined with One-Hot encoding, dubbed CNNArginineMe, showed the best performance. CNNArginineMe achieved the best AUC in comparisons with several reported predictors. Additionally, we employed CNNArginineMe to predict the arginine methylated proteome and performed functional analysis. The arginine methylated proteome is significantly enriched in the amyotrophic lateral sclerosis (ALS) pathway. CNNArginineMe is freely available at https://github.com/guoyangzou/CNNArginineMe.
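To illustrate the One-Hot encoding of a peptide sequence that this abstract refers to, here is a minimal sketch that maps each residue to a 20-dimensional indicator vector. The alphabet order, the function name `one_hot_encode`, the example peptide, and the choice to encode unknown or padding symbols as all-zero rows are assumptions made for illustration; they are not taken from the CNNArginineMe code.

```python
# The 20 standard amino acids in alphabetical one-letter order (assumed ordering).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(peptide):
    """Encode a peptide as a list of 20-element 0/1 rows, one row per residue.

    Symbols outside the 20 standard residues (e.g. 'X' or a padding character)
    become all-zero rows; this is an illustrative convention.
    """
    rows = []
    for residue in peptide.upper():
        row = [0] * len(AMINO_ACIDS)
        idx = INDEX.get(residue)
        if idx is not None:
            row[idx] = 1
        rows.append(row)
    return rows


# Hypothetical 5-residue window centred on an arginine (R).
encoded = one_hot_encode("GRGXA")
print(len(encoded), len(encoded[0]))
```

The resulting window-length by 20 matrix is the kind of fixed-size input a convolutional network can consume directly, without handcrafted physicochemical features.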
Affiliation(s)
- Jiaojiao Zhao
- Cancer Institute of the Affiliated Hospital of Qingdao University and Qingdao Cancer Institute, Qingdao University, Qingdao, China
- School of Basic Medicine, Qingdao University, Qingdao, China
- Haoqiang Jiang
- School of Basic Medicine, Qingdao University, Qingdao, China
- Guoyang Zou
- School of Basic Medicine, Qingdao University, Qingdao, China
- Qian Lin
- Cancer Institute of the Affiliated Hospital of Qingdao University and Qingdao Cancer Institute, Qingdao University, Qingdao, China
- Qiang Wang
- Oncology Department, Shandong Second Provincial General Hospital, Jinan, China
- Jia Liu
- Department of Pharmacology, School of Pharmacy, Qingdao University, Qingdao, China
- Leina Ma
- Cancer Institute of the Affiliated Hospital of Qingdao University and Qingdao Cancer Institute, Qingdao University, Qingdao, China
- *Correspondence: Leina Ma
16
Zhu Y, Liu Y, Chen Y, Li L. ResSUMO: A Deep Learning Architecture Based on Residual Structure for Prediction of Lysine SUMOylation Sites. Cells 2022; 11:2646. [PMID: 36078053 PMCID: PMC9454673 DOI: 10.3390/cells11172646] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 07/25/2022] [Revised: 08/18/2022] [Accepted: 08/22/2022] [Indexed: 12/26/2022] Open
Abstract
Lysine SUMOylation plays an essential role in various biological functions. Several approaches integrating various algorithms have been developed for predicting SUMOylation sites based on a limited dataset. Recently, the number of identified SUMOylation sites has significantly increased due to investigation at the proteomics scale. We collected modification data and found that the reported approaches performed poorly on our collected data. Therefore, it is essential to explore the characteristics of this modification and construct prediction models with improved performance based on an enlarged dataset. In this study, we constructed and compared 16 classifiers by integrating four different algorithms and four encoding features selected from 11 sequence-based or physicochemical features. We found that the convolutional neural network (CNN) model integrated with residual structure, dubbed ResSUMO, performed favorably when compared with the traditional machine learning and CNN models in both cross-validation and independent tests. The area under the receiver operating characteristic (ROC) curve for ResSUMO was around 0.80, superior to that of the reported predictors. We also found that increasing the depth of neural networks in the CNN models did not improve prediction performance due to the degradation problem, but the residual structure could be included to optimize the neural networks and improve performance. This indicates that residual neural networks have the potential to be broadly applied in the prediction of other types of modification sites with great effectiveness and robustness. Furthermore, the online ResSUMO service is freely accessible.
Affiliation(s)
- Yafei Zhu
- College of Computer Science and Technology, Qingdao University, Qingdao 266071, China
- Yuhai Liu
- Dawning International Information Industry, Co., Ltd., Qingdao 266101, China
- Yu Chen
- College of Computer Science and Technology, Qingdao University, Qingdao 266071, China
- Lei Li
- College of Computer Science and Technology, Qingdao University, Qingdao 266071, China
- Faculty of Biomedical and Rehabilitation Engineering, University of Health and Rehabilitation Sciences, Qingdao 266001, China
17
Zhang B, Fan T. Knowledge structure and emerging trends in the application of deep learning in genetics research: A bibliometric analysis [2000–2021]. Front Genet 2022; 13:951939. [PMID: 36081985 PMCID: PMC9445221 DOI: 10.3389/fgene.2022.951939] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2022] [Accepted: 07/13/2022] [Indexed: 11/13/2022] Open
Abstract
Introduction: Deep learning technology has been widely used in genetic research because of its characteristics of computability, statistical analysis, and predictability. Herein, we aimed to summarize standardized knowledge and potentially innovative approaches for deep learning applications in genetics by evaluating publications, to encourage more research. Methods: The Science Citation Index Expanded (SCIE) database was searched for publications on deep learning applications in genomics. Original articles and reviews were considered. In this study, we derived a clustered network from 69,806 references that were cited by the 1,754 related manuscripts identified. We used CiteSpace and VOSviewer to identify countries, institutions, journals, co-cited references, keywords, subject evolution, path, current characteristics, and emerging topics. Results: We assessed the rapidly increasing number of publications on deep learning applications in genomics and identified 1,754 articles focusing on this subject. A total of 101 countries and 2,487 institutes contributed publications. The United States of America had the most publications (728/1,754) and the highest h-index, and the US has collaborated closely with China and Germany. The reference clusters of SCI articles were clustered into seven categories: deep learning, logic regression, variant prioritization, random forests, scRNA-seq (single-cell RNA-seq), genomic regulation, and recombination. The keywords representing the research frontiers by year were prediction (2016–2021), sequence (2017–2021), mutation (2017–2021), and cancer (2019–2021). Conclusion: Here, we summarized the current literature on deep learning for genetics applications and analyzed the current research characteristics and future trajectories in this field. This work aims to provide resources for further intensive exploration and to encourage more researchers to advance research on deep learning applications in genetics.
Affiliation(s)
- Bijun Zhang
- Department of Clinical Genetics, Shengjing Hospital of China Medical University, Shenyang, China
- Ting Fan
- Department of Computer, School of Intelligent Medicine, China Medical University, Shenyang, China
- *Correspondence: Ting Fan
18
Chen Z, Liu X, Zhao P, Li C, Wang Y, Li F, Akutsu T, Bain C, Gasser RB, Li J, Yang Z, Gao X, Kurgan L, Song J. iFeatureOmega: an integrative platform for engineering, visualization and analysis of features from molecular sequences, structural and ligand data sets. Nucleic Acids Res 2022; 50:W434-W447. [PMID: 35524557 PMCID: PMC9252729 DOI: 10.1093/nar/gkac351] [Citation(s) in RCA: 34] [Impact Index Per Article: 11.3] [Received: 03/21/2022] [Revised: 04/22/2022] [Accepted: 04/25/2022] [Indexed: 01/07/2023] Open
Abstract
The rapid accumulation of molecular data motivates development of innovative approaches to computationally characterize sequences, structures and functions of biological and chemical molecules in an efficient, accessible and accurate manner. Notwithstanding several computational tools that characterize protein or nucleic acids data, there are no one-stop computational toolkits that comprehensively characterize a wide range of biomolecules. We address this vital need by developing a holistic platform that generates features from sequence and structural data for a diverse collection of molecule types. Our freely available and easy-to-use iFeatureOmega platform generates, analyzes and visualizes 189 representations for biological sequences, structures and ligands. To the best of our knowledge, iFeatureOmega provides the largest scope when directly compared to the current solutions, in terms of the number of feature extraction and analysis approaches and coverage of different molecules. We release three versions of iFeatureOmega including a webserver, command line interface and graphical interface to satisfy needs of experienced bioinformaticians and less computer-savvy biologists and biochemists. With the assistance of iFeatureOmega, users can encode their molecular data into representations that facilitate construction of predictive models and analytical studies. We highlight benefits of iFeatureOmega based on three research applications, demonstrating how it can be used to accelerate and streamline research in bioinformatics, computational biology, and cheminformatics areas. The iFeatureOmega webserver is freely available at http://ifeatureomega.erc.monash.edu and the standalone versions can be downloaded from https://github.com/Superzchen/iFeatureOmega-GUI/ and https://github.com/Superzchen/iFeatureOmega-CLI/.
Affiliation(s)
- Zhen Chen
- Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou 450046, China
- Center for Crop Genome Engineering, Henan Agricultural University, Zhengzhou 450046, China
- Xuhan Liu
- Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Einsteinweg 55, Leiden 2333 CC, The Netherlands
- Pei Zhao
- State Key Laboratory of Cotton Biology, Institute of Cotton Research of Chinese Academy of Agricultural Sciences (CAAS), Anyang 455000, China
- Chen Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
- Yanan Wang
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
- Fuyi Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
- Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0011, Japan
- Chris Bain
- Monash Data Future Institutes, Monash University, Melbourne, Victoria 3800, Australia
- Robin B Gasser
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
- Junzhou Li
- Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou 450046, China
- Zuoren Yang
- State Key Laboratory of Cotton Biology, Institute of Cotton Research of Chinese Academy of Agricultural Sciences (CAAS), Anyang 455000, China
- Xin Gao
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
- Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
- Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
- Monash Data Future Institutes, Monash University, Melbourne, Victoria 3800, Australia
19
Lo-Thong-Viramoutou O, Charton P, Cadet XF, Grondin-Perez B, Saavedra E, Damour C, Cadet F. Non-linearity of Metabolic Pathways Critically Influences the Choice of Machine Learning Model. Front Artif Intell 2022; 5:744755. [PMID: 35757298 PMCID: PMC9226554 DOI: 10.3389/frai.2022.744755] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2021] [Accepted: 04/29/2022] [Indexed: 11/13/2022] Open
Abstract
The use of machine learning (ML) in life sciences has gained wide interest over the past years, as it speeds up the development of high-performing models. Important modeling tools in biology, such as mechanistic models and metabolic networks, have proven their worth for pathway design, as they allow better understanding of the mechanisms involved in the functioning of organisms. However, little has been done on the use of ML to model metabolic pathways, and the degree of non-linearity associated with them is not clear. Here, we report the modeling of different metabolic pathways with several linear and non-linear ML models. Different types of data are used; they lead to the prediction of important biological data, such as pathway flux and final product concentration. A comparison reveals that the data features impact model performance and highlights the effectiveness of non-linear models (e.g., QRF: RMSE = 0.021 nmol·min⁻¹ and R² = 1 vs. Bayesian GLM: RMSE = 1.379 nmol·min⁻¹ and R² = 0.823). It turns out that the greater the degree of non-linearity of the pathway, the better suited a non-linear model will be. Therefore, a decision-making support for pathway modeling is established. These findings generally support the hypothesis that non-linear aspects predominate within the metabolic pathways. This must be taken into account when devising possible applications of these pathways for the identification of biomarkers of diseases (e.g., infections, cancer, neurodegenerative diseases) or the optimization of industrial production processes.
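The two regression scores quoted in this abstract, RMSE and R², can be computed as in the short sketch below. The function names and the example flux values are hypothetical and are not data from the study; R² is computed here as one minus the ratio of residual to total sum of squares, which is a common definition and assumed rather than confirmed for the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error, in the same unit as the target variable."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes y_true is not constant, so SS_tot is non-zero.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical observed and predicted values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(rmse(observed, predicted), r_squared(observed, predicted))
```

RMSE carries the unit of the predicted quantity, whereas R² is unitless, which is why the abstract can attach nmol·min⁻¹ to one score and not the other.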
Affiliation(s)
- Ophélie Lo-Thong-Viramoutou
- University of Paris, BIGR—Biologie Intégrée du Globule Rouge, Inserm, UMR_S1134, Paris, France
- Laboratory of Excellence GR-Ex, Paris, France
- Laboratory DSIMB, UMR_S1134, BIGR, Inserm, Faculty of Sciences and Technology, University of La Reunion, Saint-Denis, France
- Philippe Charton
- University of Paris, BIGR—Biologie Intégrée du Globule Rouge, Inserm, UMR_S1134, Paris, France
- Laboratory of Excellence GR-Ex, Paris, France
- Laboratory DSIMB, UMR_S1134, BIGR, Inserm, Faculty of Sciences and Technology, University of La Reunion, Saint-Denis, France
- Brigitte Grondin-Perez
- EnergyLab, EA 4079, Faculty of Sciences and Technology, University of La Reunion, Saint-Denis, France
- Emma Saavedra
- Departamento de Bioquímica, Instituto Nacional de Cardiología Ignacio Chávez, Mexico City, Mexico
- Cédric Damour
- EnergyLab, EA 4079, Faculty of Sciences and Technology, University of La Reunion, Saint-Denis, France
- Frédéric Cadet
- University of Paris, BIGR—Biologie Intégrée du Globule Rouge, Inserm, UMR_S1134, Paris, France
- Laboratory of Excellence GR-Ex, Paris, France
- Laboratory DSIMB, UMR_S1134, BIGR, Inserm, Faculty of Sciences and Technology, University of La Reunion, Saint-Denis, France
20
Sorkhi AG, Pirgazi J, Ghasemi V. A hybrid feature extraction scheme for efficient malonylation site prediction. Sci Rep 2022; 12:5756. [PMID: 35388017 PMCID: PMC8987080 DOI: 10.1038/s41598-022-08555-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/19/2021] [Accepted: 03/07/2022] [Indexed: 11/09/2022] Open
Abstract
Lysine malonylation is one of the most important post-translational modifications (PTMs). It affects the functionality of cells. Malonylation site prediction in proteins can help uncover the mechanisms of cellular functions. Experimental methods are one way to identify such sites, but they are typically costly and time-consuming to implement. Recently, methods based on machine-learning solutions have been proposed to tackle this problem. Such practices have been shown to reduce cost and time and to increase accuracy. However, these approaches also have specific shortcomings, including inappropriate feature extraction from protein sequences, high-dimensional features, and inefficient underlying classifiers. A machine learning-based method is proposed in this paper to cope with these problems. In the proposed approach, seven different features are extracted. Then, the extracted features are combined, ranked based on the Fisher's score (F-score), and the most efficient ones are selected. Afterward, malonylation sites are predicted using various classifiers. Simulation results show that the proposed method has acceptable performance compared with some state-of-the-art approaches. In addition, the XGBoost classifier, built on extracted features such as TFCRF, has a higher prediction rate than the other methods. The codes are publicly available at: https://github.com/jimy2020/Malonylation-site-prediction.
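The ranking step mentioned in this abstract can be sketched as follows. The abstract does not give the formula, so this example assumes one widely used two-class F-score definition (between-class separation of the feature means divided by the sum of within-class sample variances); the function names and the toy feature matrices are hypothetical and are not taken from the authors' repository.

```python
def f_score(pos, neg):
    """Two-class F-score of one feature (assumed definition, see lead-in).

    pos / neg are the feature's values in the positive / negative class.
    Assumes each class has at least two samples and non-zero total variance.
    """
    n_pos, n_neg = len(pos), len(neg)
    mean_pos = sum(pos) / n_pos
    mean_neg = sum(neg) / n_neg
    mean_all = (sum(pos) + sum(neg)) / (n_pos + n_neg)
    var_pos = sum((x - mean_pos) ** 2 for x in pos) / (n_pos - 1)
    var_neg = sum((x - mean_neg) ** 2 for x in neg) / (n_neg - 1)
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    return between / (var_pos + var_neg)


def rank_features(x_pos, x_neg):
    """Return (feature indices sorted by descending F-score, list of scores)."""
    n_features = len(x_pos[0])
    scores = [
        f_score([row[j] for row in x_pos], [row[j] for row in x_neg])
        for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


# Toy data: feature 0 separates the classes, feature 1 does not.
x_pos = [[1.0, 5.0], [2.0, 6.0], [3.0, 4.0]]
x_neg = [[7.0, 5.0], [8.0, 4.0], [9.0, 6.0]]
order, scores = rank_features(x_pos, x_neg)
print(order, scores)
```

Keeping only the top-ranked indices from `order` is one simple way to reduce a high-dimensional merged feature vector before training a classifier.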
Affiliation(s)
- Ali Ghanbari Sorkhi
  - Department of Computer Engineering, University of Science and Technology of Mazandaran, Behshahr, Iran
- Jamshid Pirgazi
  - Department of Computer Engineering, University of Science and Technology of Mazandaran, Behshahr, Iran
- Vahid Ghasemi
  - Department of Computer Engineering, Faculty of Information Technology, Kermanshah University of Technology, Kermanshah, Iran
|
21
|
Wang M, Song L, Zhang Y, Gao H, Yan L, Yu B. Malsite-Deep: Prediction of protein malonylation sites through deep learning and multi-information fusion based on NearMiss-2 strategy. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.108191] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 01/04/2023]
|
22
|
Chen Z, Liu X, Li F, Li C, Marquez-Lago T, Leier A, Webb GI, Xu D, Akutsu T, Song J. Systematic Characterization of Lysine Post-translational Modification Sites Using MUscADEL. Methods Mol Biol 2022; 2499:205-219. [PMID: 35696083 DOI: 10.1007/978-1-0716-2317-6_11] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Among various types of protein post-translational modifications (PTMs), lysine PTMs play an important role in regulating a wide range of functions and biological processes. Due to the generation and accumulation of enormous amount of protein sequence data by ongoing whole-genome sequencing projects, systematic identification of different types of lysine PTM substrates and their specific PTM sites in the entire proteome is increasingly important and has therefore received much attention. Accordingly, a variety of computational methods for lysine PTM identification have been developed based on the combination of various handcrafted sequence features and machine-learning techniques. In this chapter, we first briefly review existing computational methods for lysine PTM identification and then introduce a recently developed deep learning-based method, termed MUscADEL (Multiple Scalable Accurate Deep Learner for lysine PTMs). Specifically, MUscADEL employs bidirectional long short-term memory (BiLSTM) recurrent neural networks and is capable of predicting eight major types of lysine PTMs in both the human and mouse proteomes. The web server of MUscADEL is publicly available at http://muscadel.erc.monash.edu/ for the research community to use.
Affiliation(s)
- Zhen Chen
  - Key Laboratory of Rice Biology in Henan Province, Henan Agricultural University, Zhengzhou, China
- Xuhan Liu
  - Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden, The Netherlands
- Fuyi Li
  - Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC, Australia
- Chen Li
  - Biomedicine Discovery Institute, Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC, Australia
- Tatiana Marquez-Lago
  - Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
  - Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
- André Leier
  - Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
  - Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
- Geoffrey I Webb
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC, Australia
- Dakang Xu
  - Faculty of Medical Laboratory Science, Ruijin Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
  - Department of Molecular and Translational Science, Faculty of Medicine, Hudson Institute of Medical Research, Monash University, Melbourne, VIC, Australia
- Tatsuya Akutsu
  - Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto, Japan
- Jiangning Song
  - Biomedicine Discovery Institute, Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC, Australia
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC, Australia
|
23
|
Development of an experiment-split method for benchmarking the generalization of a PTM site predictor: Lysine methylome as an example. PLoS Comput Biol 2021; 17:e1009682. [PMID: 34879076 PMCID: PMC8687584 DOI: 10.1371/journal.pcbi.1009682] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2021] [Revised: 12/20/2021] [Accepted: 11/25/2021] [Indexed: 11/19/2022]
Abstract
Many computational classifiers have been developed to predict different types of post-translational modification sites. Their performance is measured using cross-validation or an independent test, in which experimental data from different sources are mixed and randomly split into training and test sets. However, the self-reported performance of most classifiers based on this measure is generally higher than their performance when applied to new experimental data. This suggests that the cross-validation method overestimates the generalization ability of a classifier. Here, we proposed a generalization estimate method, dubbed the experiment-split test, in which the experimental sources for the training set differ from those for the test set, so that the test set simulates data derived from a new experiment. We took the prediction of the lysine methylome (Kme) as an example and developed a deep learning-based Kme site predictor (called DeepKme) with outstanding performance. We assessed the experiment-split test by comparing it with the cross-validation method. We found that the performance measured using the experiment-split test is lower than that measured by cross-validation. As the test data of the experiment-split method were derived from an independent experimental source, this method could reflect the generalization of the predictor. Therefore, we believe that the experiment-split method can be applied to benchmark the practical performance of a given PTM model. DeepKme is freely accessible via https://github.com/guoyangzou/DeepKme.

The performance of a model for predicting post-translational modification sites is commonly evaluated using the cross-validation method, where the data derived from different experimental sources are mixed and randomly separated into a training dataset and a validation dataset. However, the performance measured through cross-validation is generally higher than the performance on new experimental data, indicating that the cross-validation method overestimates the generalization of a model. In this study, we proposed a generalization estimate method, dubbed the experiment-split test, in which the experimental sources for the training set differ from those for the test set, so that the test set simulates data derived from a new experiment. We took the prediction of the lysine methylome as an example and developed a deep learning-based Kme site predictor, DeepKme, with outstanding performance. We found that the performance measured by the experiment-split method is lower than that measured by cross-validation. As the test data of the experiment-split method were derived from an independent experimental source, this method could reflect the generalization of the prediction model. Therefore, the experiment-split method can be applied to benchmark practical prediction performance.
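The difference between the two evaluation schemes can be sketched in a few lines. This is an illustration only, not the DeepKme code: the record layout (a `source` key naming the experiment each site came from) and the function names `random_split` and `experiment_split` are assumptions made for the example.

```python
import random


def random_split(records, test_fraction, seed=0):
    """Conventional split: pool all sources, shuffle, cut off a test set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def experiment_split(records, test_sources):
    """Experiment-split: every record from a held-out source goes to the test set."""
    train = [r for r in records if r["source"] not in test_sources]
    test = [r for r in records if r["source"] in test_sources]
    return train, test


# Hypothetical sites drawn from three experiments (labels are arbitrary).
records = [
    {"source": "exp_A", "site": "P1:K10", "label": 1},
    {"source": "exp_A", "site": "P1:K42", "label": 0},
    {"source": "exp_B", "site": "P2:K7", "label": 1},
    {"source": "exp_B", "site": "P3:K15", "label": 0},
    {"source": "exp_C", "site": "P4:K3", "label": 1},
    {"source": "exp_C", "site": "P5:K88", "label": 0},
]

train, test = experiment_split(records, test_sources={"exp_C"})
train_sources = {r["source"] for r in train}
test_sources = {r["source"] for r in test}
```

With `experiment_split`, no experimental source appears on both sides, which is the property the abstract relies on; `random_split` gives no such guarantee.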
|
24
|
Lv H, Zhang Y, Wang JS, Yuan SS, Sun ZJ, Dao FY, Guan ZX, Lin H, Deng KJ. iRice-MS: An integrated XGBoost model for detecting multitype post-translational modification sites in rice. Brief Bioinform 2021; 23:6447435. [PMID: 34864888 DOI: 10.1093/bib/bbab486] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/12/2021] [Revised: 10/05/2021] [Accepted: 10/23/2021] [Indexed: 12/13/2022]
Abstract
Post-translational modification (PTM) refers to the covalent and enzymatic modification of proteins after protein biosynthesis, which orchestrates a variety of biological processes. Detecting PTM sites at proteome scale is one of the key steps toward an in-depth understanding of their regulatory mechanisms. In this study, we presented an integrated method based on eXtreme Gradient Boosting (XGBoost), called iRice-MS, to identify 2-hydroxyisobutyrylation, crotonylation, malonylation, ubiquitination, succinylation and acetylation in rice. For each PTM-specific model, we adopted eight feature encoding schemes, including sequence-based features, physicochemical property-based features and spatial mapping information-based features. The optimal feature set was identified from each encoding, and their respective models were established. Extensive experimental results show that iRice-MS consistently displays excellent performance on 5-fold cross-validation and on an independent dataset test. In addition, our approach is superior to other existing tools in terms of AUC value. Based on the proposed model, a web server named iRice-MS was established and is freely accessible at http://lin-group.cn/server/iRice-MS.
Affiliation(s)
- Hao Lv
  - Center for Informational Biology at University of Electronic Science and Technology of China, China
- Yang Zhang
  - Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, China
- Jia-Shu Wang
  - Center for Informational Biology at University of Electronic Science and Technology of China, China
- Shi-Shi Yuan
  - Center for Informational Biology at University of Electronic Science and Technology of China, China
- Zi-Jie Sun
  - Center for Informational Biology at University of Electronic Science and Technology of China, China
- Fu-Ying Dao
  - Center for Informational Biology at University of Electronic Science and Technology of China, China
- Zheng-Xing Guan
  - Center for Informational Biology at University of Electronic Science and Technology of China, China
- Hao Lin
  - Center for Informational Biology at University of Electronic Science and Technology of China, China
- Ke-Jun Deng
  - Center for Informational Biology at University of Electronic Science and Technology of China, China
|
25
|
Sha Y, Ma C, Wei X, Liu Y, Chen Y, Li L. DeepSADPr: A hybrid-learning architecture for serine ADP-ribosylation site prediction. Methods 2021; 203:575-583. [PMID: 34560250 DOI: 10.1016/j.ymeth.2021.09.008] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 07/14/2021] [Revised: 09/13/2021] [Accepted: 09/16/2021] [Indexed: 01/28/2023]
Abstract
Protein adenosine diphosphate-ribosylation (ADPr) is caused by the covalent binding of one or more ADP-ribose moieties to a target protein and regulates the biological functions of the target protein. To fully understand the regulatory mechanism of ADP-ribosylation, the essential step is the identification of the ADPr sites from the proteome. As the experimental approaches are costly and time-consuming, it is necessary to develop a computational tool to predict ADPr sites. Recently, serine has been found to be the major residue type for ADP-ribosylation but no predictor is available. In this study, we collected thousands of experimentally validated human ADPr sites on serine residue and constructed several different machine-learning classifiers. We found that the hybrid model, dubbed DeepSADPr, which integrated the one-dimensional convolutional neural network (CNN) with the One-Hot encoding approach and the word-embedding approach, compared favourably to other models in terms of both ten-fold cross-validation and independent test. Its AUC values reached 0.935 for ten-fold cross-validation. Its values of sensitivity, accuracy and Matthews's correlation coefficient reached 0.933, 0.867 and 0.740, respectively, with the fixed specificity value of 0.80. Overall, DeepSADPr is the first classifier for predicting Serine ADPr sites, which is available at http://www.bioinfogo.org/DeepSADPr.
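The reported sensitivity, accuracy and Matthews correlation coefficient are linked through the confusion matrix, which the sketch below makes explicit. It assumes a balanced test set (equal numbers of positive and negative sites); the abstract does not state the class ratio, so this is an illustrative assumption, and `metrics_from_rates` is a hypothetical helper name.

```python
from math import sqrt


def metrics_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Rebuild the confusion matrix from the two rates, then derive accuracy and MCC."""
    tp = sensitivity * n_pos   # true positives
    fn = n_pos - tp            # false negatives
    tn = specificity * n_neg   # true negatives
    fp = n_neg - tn            # false positives
    accuracy = (tp + tn) / (n_pos + n_neg)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


# Sensitivity and specificity as quoted in the abstract; class sizes are assumed.
acc, mcc = metrics_from_rates(0.933, 0.80, n_pos=1000, n_neg=1000)
```

Under the balanced-set assumption, `acc` and `mcc` come out within 0.001 of the reported 0.867 and 0.740; a different class ratio would shift both values.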
Affiliation(s)
- Yutong Sha
  - College of Computer Science & Technology, Qingdao University, Qingdao 266071, China
- Chenglong Ma
  - College of Life Sciences, Qingdao University, Qingdao 266071, China
- Xilin Wei
  - College of Computer Science & Technology, Qingdao University, Qingdao 266071, China
- Yuhai Liu
  - Dawning International Information Industry, Co., Ltd., Qingdao 266101, China
- Yu Chen
  - College of Computer Science & Technology, Qingdao University, Qingdao 266071, China
- Lei Li
  - College of Computer Science & Technology, Qingdao University, Qingdao 266071, China
  - School of Basic Medicine, Qingdao University, Qingdao 266071, China
  - College of Life Sciences, Qingdao University, Qingdao 266071, China
|
26
|
Nhamo L, Rwizi L, Mpandeli S, Botai J, Magidi J, Tazvinga H, Sobratee N, Liphadzi S, Naidoo D, Modi AT, Slotow R, Mabhaudhi T. Urban nexus and transformative pathways towards resilient cities: A case of the Gauteng City-Region, South Africa. Cities (London, England) 2021; 116:103266. [PMID: 37674556 PMCID: PMC7615023 DOI: 10.1016/j.cities.2021.103266] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 09/08/2023]
Abstract
Challenges emanating from rapid urbanisation require innovative strategies to transform cities into global climate action and adaptation centres. We provide an analysis of the impacts of rapid urbanisation in the Gauteng City-Region, South Africa, highlighting major challenges related to (i) land use management, (ii) service delivery (water, energy, food, and waste and sanitation), and (iii) social cohesion. Geospatial techniques were used to assess spatio-temporal changes in the urban landscapes, including variations in land surface temperatures. Massive impervious surfaces, rising temperatures, flooding and heatwaves are exacerbating the challenges associated with rapid urbanisation. An outline of the response pathways towards sustainable and resilient cities is given as a lens to formulate informed and coherent adaptation urban planning strategies. The assessment facilitated developing a contextualised conceptual framework, focusing on demographic, climatic, and environmental changes, and the risks associated with rapid urbanisation. If not well managed in an integrated manner, rapid urbanisation poses a huge environmental and human health risk and could retard progress towards sustainable cities by 2030. Nexus planning provides the lens and basis to achieve urban resilience, by integrating complex, but interlinked sectors, by considering both ecological and built infrastructures, in a balanced manner, as key to resilience and adaptation strategies.
Affiliation(s)
- Luxon Nhamo
  - Water Research Commission of South Africa (WRC), Lynnwood Manor, Pretoria 0081, South Africa
  - Centre for Transformative Agricultural and Food Systems (CTAFS), School of Agricultural, Earth and Environmental Sciences, University of KwaZulu-Natal, Scottsville, Pietermaritzburg 3209, South Africa
- Lameck Rwizi
  - College of Agriculture and Environmental Sciences (CAES), University of South Africa (UNISA), Florida, Johannesburg 1710, South Africa
- Sylvester Mpandeli
  - Water Research Commission of South Africa (WRC), Lynnwood Manor, Pretoria 0081, South Africa
  - School of Environmental Sciences, University of Venda, Thohoyandou 0950, South Africa
- Joel Botai
  - South Africa Weather Services (SAWS), Ecoglades, Centurion 0157, Pretoria, South Africa
  - Centre for Transformative Agricultural and Food Systems (CTAFS), School of Agricultural, Earth and Environmental Sciences, University of KwaZulu-Natal, Scottsville, Pietermaritzburg 3209, South Africa
- James Magidi
  - Geomatics Department, Tshwane University of Technology, Pretoria, 0001, South Africa
- Henerica Tazvinga
  - South Africa Weather Services (SAWS), Ecoglades, Centurion 0157, Pretoria, South Africa
- Nafiisa Sobratee
  - School of Life Sciences, University of KwaZulu-Natal, Scottsville, Pietermaritzburg 3209, South Africa
- Stanley Liphadzi
  - Water Research Commission of South Africa (WRC), Lynnwood Manor, Pretoria 0081, South Africa
  - School of Environmental Sciences, University of Venda, Thohoyandou 0950, South Africa
- Dhesigen Naidoo
  - Water Research Commission of South Africa (WRC), Lynnwood Manor, Pretoria 0081, South Africa
  - Centre for Transformative Agricultural and Food Systems (CTAFS), School of Agricultural, Earth and Environmental Sciences, University of KwaZulu-Natal, Scottsville, Pietermaritzburg 3209, South Africa
- Albert T. Modi
  - Centre for Transformative Agricultural and Food Systems (CTAFS), School of Agricultural, Earth and Environmental Sciences, University of KwaZulu-Natal, Scottsville, Pietermaritzburg 3209, South Africa
- Rob Slotow
  - School of Life Sciences, University of KwaZulu-Natal, Scottsville, Pietermaritzburg 3209, South Africa
  - Department of Genetics, Evolution and Environment, University College London, WC1E 6BT, United Kingdom
- Tafadzwanashe Mabhaudhi
  - Centre for Transformative Agricultural and Food Systems (CTAFS), School of Agricultural, Earth and Environmental Sciences, University of KwaZulu-Natal, Scottsville, Pietermaritzburg 3209, South Africa
  - Centre for Water Resources Research (CWRR), School of Agricultural, Earth and Environmental Sciences, University of KwaZulu-Natal, Scottsville, Pietermaritzburg 3209, South Africa
|
27
|
Jia C, Zhang M, Fan C, Li F, Song J. Formator: Predicting Lysine Formylation Sites Based on the Most Distant Undersampling and Safe-Level Synthetic Minority Oversampling. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:1937-1945. [PMID: 31804942 DOI: 10.1109/tcbb.2019.2957758] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 06/10/2023]
Abstract
Lysine formylation is a reversible type of protein post-translational modification and has been found to be involved in a myriad of biological processes, including modulation of chromatin conformation and gene expression in histones and other nuclear proteins. Accurate identification of lysine formylation sites is essential for elucidating the underlying molecular mechanisms of formylation. Traditional experimental methods are time-consuming and expensive. As such, it is desirable and necessary to develop computational methods for accurate prediction of formylation sites. In this study, we propose a novel predictor, termed Formator, for identifying lysine formylation sites from sequence information. Formator is developed using the ensemble learning (EL) strategy based on four individual support vector machine classifiers via a voting system. Moreover, the most distant undersampling and Safe-Level-SMOTE oversampling techniques were integrated to deal with the data imbalance problem of the training dataset. Four effective feature extraction methods, namely bi-profile Bayes (BPB), k-nearest neighbor (KNN), amino acid physicochemical properties (AAindex), and composition and transition (CTD), were employed to encode the surrounding sequence features of potential formylation sites. Extensive empirical studies show that Formator achieved accuracies of 87.24 and 74.96 percent on the jackknife test and the independent test, respectively. Performance comparison results on the independent test indicate that Formator outperforms the existing prediction tool LFPred, suggesting that it has great potential to serve as a useful tool in identifying novel lysine formylation sites and facilitating hypothesis-driven experimental efforts.
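A voting system over several base classifiers can be sketched as below. This is a generic hard majority vote, not Formator's implementation: the abstract does not say how a two-against-two tie among four classifiers is resolved, so the `tie_label` parameter is an assumption, and `majority_vote` is a hypothetical name.

```python
def majority_vote(votes_per_classifier, tie_label=0):
    """Combine binary predictions (0/1) from several classifiers by majority.

    votes_per_classifier: one list of 0/1 labels per classifier, all of equal length.
    tie_label: label returned when positives and negatives are equal (assumed rule).
    """
    n_classifiers = len(votes_per_classifier)
    n_samples = len(votes_per_classifier[0])
    combined = []
    for i in range(n_samples):
        positives = sum(votes[i] for votes in votes_per_classifier)
        negatives = n_classifiers - positives
        if positives > negatives:
            combined.append(1)
        elif negatives > positives:
            combined.append(0)
        else:
            combined.append(tie_label)
    return combined


# Four hypothetical base classifiers (e.g. one per feature encoding), three samples.
votes = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
]
ensemble = majority_vote(votes)
```

In this toy input the third sample receives two votes each way, so its label depends entirely on the chosen tie rule.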
|
28
|
Islam MKB, Rahman J, Hasan MAM, Ahmad S. predForm-Site: Formylation site prediction by incorporating multiple features and resolving data imbalance. Comput Biol Chem 2021; 94:107553. [PMID: 34384997 DOI: 10.1016/j.compbiolchem.2021.107553] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2020] [Revised: 06/22/2021] [Accepted: 07/28/2021] [Indexed: 10/20/2022]
Abstract
Formylation is one of the newly discovered post-translational modifications of lysine residues and is associated with different kinds of diseases. In this work, a novel predictor, named predForm-Site, has been developed to predict formylation sites with higher accuracy. We have integrated multiple sequence features to develop a more informative representation of formylation sites. Moreover, the decision function of the underlying classifier has been optimized on the skewed formylation dataset during prediction model training to improve prediction quality. On the dataset used by the LFPred and Formator predictors, predForm-Site achieved 99.5% sensitivity, 99.8% specificity and 99.8% overall accuracy with an AUC of 0.999 in the jackknife test. In the independent test, it also achieved more than 97% sensitivity and 99% specificity. Similarly, in benchmarking against the recent method CKSAAP_FormSite, the proposed predictor performed significantly better on all measures, particularly sensitivity by around 20%, specificity by nearly 30% and overall accuracy by more than 22%. These experimental results show that the proposed predForm-Site can be used as a complementary tool for the fast exploration of formylation sites. For the convenience of the scientific community, predForm-Site has been deployed as an online tool, accessible at http://103.99.176.239:8080/predForm-Site.
Affiliation(s)
- Md Khaled Ben Islam
  - Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia
  - Department of Computer Science & Engineering, Pabna University of Science and Technology, Pabna, Bangladesh
- Julia Rahman
  - Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia
  - Department of Computer Science & Engineering, Rajshahi University of Engineering and Technology, Rajshahi, Bangladesh
- Md Al Mehedi Hasan
  - Department of Computer Science & Engineering, Rajshahi University of Engineering and Technology, Rajshahi, Bangladesh
- Shamim Ahmad
  - Department of Computer Science & Engineering, Rajshahi University, Rajshahi, Bangladesh
|
29
|
Liu Z, Thapa N, Shaver A, Roy K, Siddula M, Yuan X, Yu A. Using Embedded Feature Selection and CNN for Classification on CCD-INID-V1-A New IoT Dataset. Sensors 2021; 21:s21144834. [PMID: 34300574 PMCID: PMC8309834 DOI: 10.3390/s21144834] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 05/27/2021] [Revised: 07/09/2021] [Accepted: 07/12/2021] [Indexed: 11/20/2022]
Abstract
As Internet of Things (IoT) networks expand globally with an annual increase of active devices, providing better safeguards against threats is becoming more important. An intrusion detection system (IDS) is the most viable solution that mitigates the threats of cyberattacks. Given the many constraints of the ever-changing network environment of IoT devices, an effective yet lightweight IDS is required to detect cyber anomalies and categorize various cyberattacks. Additionally, most publicly available datasets used for research do not reflect recent network behaviors, nor are they made from IoT networks. To address these issues, in this paper, we make the following contributions: (1) we create a dataset from IoT networks, namely, the Center for Cyber Defense (CCD) IoT Network Intrusion Dataset V1 (CCD-INID-V1); (2) we propose a hybrid lightweight form of IDS: an embedded model (EM) for feature selection and a convolutional neural network (CNN) for attack detection and classification. The proposed method has two models: (a) RCNN: Random Forest (RF) is combined with CNN and (b) XCNN: eXtreme Gradient Boosting (XGBoost) is combined with CNN. RF and XGBoost are the embedded models used to remove less impactful features. (3) We attempt anomaly (binary) classifications and attack-based (multiclass) classifications on CCD-INID-V1 and two other IoT datasets, the detection_of_IoT_botnet_attacks_N_BaIoT dataset (BaIoT) and the CIRA-CIC-DoHBrw-2020 dataset (DoH20), to explore the effectiveness of these learning-based security models. Using RCNN, we achieved an Area under the Receiver Operating Characteristic (ROC) Curve (AUC) score of 0.956 with a runtime of 32.28 s on CCD-INID-V1, 0.999 with a runtime of 71.46 s on BaIoT, and 0.986 with a runtime of 35.45 s on DoH20. Using XCNN, we achieved an AUC score of 0.998 with a runtime of 51.38 s for CCD-INID-V1, 0.999 with a runtime of 72.12 s for BaIoT, and 0.999 with a runtime of 72.91 s for DoH20. Compared to KNN, XCNN required 86.98% less computational time, and RCNN required 91.74% less computational time to achieve equally or more accurate anomaly detection. We find XCNN and RCNN are consistently efficient and handle scalability well; in particular, they are 1000 times faster than KNN when dealing with a relatively larger dataset, BaIoT. Finally, we highlight RCNN and XCNN's ability to accurately detect anomalies with a significant reduction in computational time. This advantage grants flexibility for the IDS placement strategy. Our IDS can be placed at a central server as well as on resource-constrained edge devices. Our lightweight IDS requires little training time and hence decreases reaction time to zero-day attacks.
|
30
|
Chen Z, Zhao P, Li C, Li F, Xiang D, Chen YZ, Akutsu T, Daly RJ, Webb GI, Zhao Q, Kurgan L, Song J. iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization. Nucleic Acids Res 2021; 49:e60. [PMID: 33660783 PMCID: PMC8191785 DOI: 10.1093/nar/gkab122] [Citation(s) in RCA: 150] [Impact Index Per Article: 37.5] [Received: 10/26/2020] [Revised: 02/05/2021] [Accepted: 02/25/2021] [Indexed: 12/14/2022]
Abstract
Sequence-based analysis and prediction are fundamental bioinformatic tasks that facilitate understanding of the sequence(-structure)-function paradigm for DNAs, RNAs and proteins. Rapid accumulation of sequences requires equally pervasive development of new predictive models, which depends on the availability of effective tools that support these efforts. We introduce iLearnPlus, the first machine-learning platform with graphical- and web-based interfaces for the construction of machine-learning pipelines for analysis and predictions using nucleic acid and protein sequences. iLearnPlus provides a comprehensive set of algorithms and automates sequence-based feature extraction and analysis, construction and deployment of models, assessment of predictive performance, statistical analysis, and data visualization; all without programming. iLearnPlus includes a wide range of feature sets which encode information from the input sequences and over twenty machine-learning algorithms that cover several deep-learning approaches, outnumbering the current solutions by a wide margin. Our solution caters to experienced bioinformaticians, given the broad range of options, and biologists with no programming background, given the point-and-click interface and easy-to-follow design process. We showcase iLearnPlus with two case studies concerning prediction of long noncoding RNAs (lncRNAs) from RNA transcripts and prediction of crotonylation sites in protein chains. iLearnPlus is an open-source platform available at https://github.com/Superzchen/iLearnPlus/ with the webserver at http://ilearnplus.erc.monash.edu/.
Affiliation(s)
- Zhen Chen
  - Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou 450046, China
- Pei Zhao
  - State Key Laboratory of Cotton Biology, Institute of Cotton Research of Chinese Academy of Agricultural Sciences (CAAS), Anyang 455000, China
- Chen Li
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Fuyi Li
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
  - Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, Victoria 3000, Australia
- Dongxu Xiang
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
- Yong-Zi Chen
  - Laboratory of Tumor Cell Biology, Key Laboratory of Cancer Prevention and Therapy, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin 300060, China
- Tatsuya Akutsu
  - Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0011, Japan
- Roger J Daly
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Geoffrey I Webb
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
- Quanzhi Zhao
  - Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou 450046, China
  - Key Laboratory of Rice Biology in Henan Province, Henan Agricultural University, Zhengzhou 450046, China
- Lukasz Kurgan
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
- Jiangning Song
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
|
31
|
Chen YZ, Wang ZZ, Wang Y, Ying G, Chen Z, Song J. nhKcr: a new bioinformatics tool for predicting crotonylation sites on human nonhistone proteins based on deep learning. Brief Bioinform 2021; 22:6277413. [PMID: 34002774 DOI: 10.1093/bib/bbab146] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 02/22/2021] [Revised: 03/18/2021] [Accepted: 03/25/2021] [Indexed: 12/20/2022]
Abstract
Lysine crotonylation (Kcr) is a newly discovered type of protein post-translational modification and has been reported to be involved in various pathophysiological processes. High-resolution mass spectrometry is the primary approach for identification of Kcr sites. However, experimental approaches for identifying Kcr sites are often time-consuming and expensive when compared with computational approaches. To date, several predictors for Kcr site prediction have been developed, most of which are capable of predicting crotonylation sites on either histones alone or mixed histone and nonhistone proteins together. These methods exhibit high diversity in their algorithms, encoding schemes, feature selection techniques and performance assessment strategies. However, none of them were designed for predicting Kcr sites on nonhistone proteins. Therefore, it is desirable to develop an effective predictor for identifying Kcr sites from the large amount of nonhistone sequence data. For this purpose, we first provide a comprehensive review on six methods for predicting crotonylation sites. Second, we develop a novel deep learning-based computational framework termed as CNNrgb for Kcr site prediction on nonhistone proteins by integrating different types of features. We benchmark its performance against multiple commonly used machine learning classifiers (including random forest, logitboost, naïve Bayes and logistic regression) by performing both 10-fold cross-validation and independent test. The results show that the proposed CNNrgb framework achieves the best performance with high computational efficiency on large datasets. Moreover, to facilitate users' efforts to investigate Kcr sites on human nonhistone proteins, we implement an online server called nhKcr and compare it with other existing tools to illustrate the utility and robustness of our method. The nhKcr web server and all the datasets utilized in this study are freely accessible at http://nhKcr.erc.monash.edu/.
Affiliation(s)
- Yong-Zi Chen
  - Laboratory of Tumor Cell Biology, Tianjin Medical University Cancer Institute and Hospital, Tianjin 300060, China
- Guoguang Ying
  - Laboratory of Tumor Cell Biology in Tianjin Medical University Cancer Institute and Hospital, Tianjin 300060, China
- Zhen Chen
  - Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, China
- Jiangning Song
  - Monash Biomedicine Discovery Institute, Monash University, Australia
|
32
|
New Dataset for Forecasting Realized Volatility: Is the Tokyo Stock Exchange Co-Location Dataset Helpful for Expansion of the Heterogeneous Autoregressive Model in the Japanese Stock Market? JOURNAL OF RISK AND FINANCIAL MANAGEMENT 2021. [DOI: 10.3390/jrfm14050215] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022]
Abstract
This study analyzes the importance of the Tokyo Stock Exchange Co-Location dataset (TSE Co-Location dataset) for forecasting the realized volatility (RV) of Tokyo stock price index futures. The heterogeneous autoregressive (HAR) model is a popular linear regression model used to forecast RV. This study extends the HAR model with the TSE Co-Location dataset, the stock full-board dataset and the market volume dataset, using the random forest method, a popular nonlinear machine learning algorithm. The TSE Co-Location dataset is a new dataset and the only source of information on the transaction status of high-frequency traders. In contrast, the stock full-board dataset shows the status of buying and selling dominance. The market volume dataset is used as a proxy for liquidity and is recognized as important information in finance. To the best of our knowledge, this study is the first to use the TSE Co-Location dataset. The experimental results show that our model yields higher out-of-sample forecast accuracy for RV than the HAR model. Moreover, we find that the TSE Co-Location dataset has become more important in recent years, along with the increasing importance of high-frequency trading.
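As a rough illustration of the baseline this abstract refers to, the sketch below builds the usual HAR regressors (daily value, weekly mean, monthly mean) from a series of daily realized volatilities. The window lengths of 5 and 22 trading days are the conventional choice and are an assumption here, as are the function name `har_features` and the toy series; the abstract does not give the authors' exact specification.

```python
from statistics import mean

def har_features(rv, weekly=5, monthly=22):
    """Build HAR regressors from a list of daily realized volatilities.

    Returns (X, y) where X[i] = [1.0, daily RV, weekly mean, monthly mean]
    observed at day t, and y[i] = rv[t + 1] is the next-day target.
    Window lengths are assumed defaults, not taken from the paper.
    """
    X, y = [], []
    # Start at the first day with a full monthly window and stop one day
    # early so that a next-day target always exists.
    for t in range(monthly - 1, len(rv) - 1):
        daily = rv[t]
        week = mean(rv[t - weekly + 1 : t + 1])
        month = mean(rv[t - monthly + 1 : t + 1])
        X.append([1.0, daily, week, month])
        y.append(rv[t + 1])
    return X, y

# Toy series 1.0 .. 30.0, used only to show the shapes involved.
rv = [float(i) for i in range(1, 31)]
X, y = har_features(rv)
```

The rows of `X` can be passed to ordinary least squares for the linear HAR model, or extended with further columns (for example order-book or volume features) and passed to a nonlinear learner such as a random forest.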
|
33
|
Xu H, Jia P, Zhao Z. DeepVISP: Deep Learning for Virus Site Integration Prediction and Motif Discovery. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2021; 8:2004958. [PMID: 33977077 PMCID: PMC8097320 DOI: 10.1002/advs.202004958] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 12/23/2020] [Indexed: 05/08/2023]
Abstract
Approximately 15% of human cancers are estimated to be attributed to viruses. Virus sequences can be integrated into the host genome, leading to genomic instability and carcinogenesis. Here, a new deep convolutional neural network (CNN) model with attention architecture, namely DeepVISP, is developed for accurately predicting oncogenic virus integration sites (VISs) in the human genome. Using the curated benchmark integration data of three viruses, hepatitis B virus (HBV), human papillomavirus (HPV), and Epstein-Barr virus (EBV), DeepVISP achieves high accuracy and robust performance for all three viruses by automatically learning informative features and essential genomic positions from the DNA sequences alone. In comparison, DeepVISP outperforms conventional machine learning methods by 8.43-34.33%, measured as the improvement in area under curve (AUC) value across the three viruses. Moreover, DeepVISP can decode cis-regulatory factors that are potentially involved in virus integration and tumorigenesis, such as HOXB7, IKZF1, and LHX6. These findings are supported by multiple lines of evidence in the literature. The clustering analysis of the informative motifs reveals that the representative k-mers in clusters could help guide virus recognition of the host genes. A user-friendly web server is developed for predicting putative oncogenic VISs in the human genome using DeepVISP.
Affiliation(s)
- Haodong Xu
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston (UTHealth), Houston, TX 77030, USA
- Peilin Jia
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston (UTHealth), Houston, TX 77030, USA
- Zhongming Zhao
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston (UTHealth), Houston, TX 77030, USA
  - MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX 77030, USA
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
|
34
|
Santamaría-García H, Baez S, Aponte-Canencio DM, Pasciarello GO, Donnelly-Kehoe PA, Maggiotti G, Matallana D, Hesse E, Neely A, Zapata JG, Chiong W, Levy J, Decety J, Ibáñez A. Uncovering social-contextual and individual mental health factors associated with violence via computational inference. PATTERNS (NEW YORK, N.Y.) 2021; 2:100176. [PMID: 33659906 PMCID: PMC7892360 DOI: 10.1016/j.patter.2020.100176] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/03/2020] [Revised: 10/21/2020] [Accepted: 11/30/2020] [Indexed: 01/13/2023]
Abstract
The identification of human violence determinants has sparked multiple questions from different academic fields. Innovative methodological assessments of the weight and interaction of multiple determinants are still required. Here, we examine multiple features potentially associated with confessed acts of violence in ex-members of illegal armed groups in Colombia (N = 26,349) through deep learning and feature-derived machine learning. We assessed 162 social-contextual and individual mental health potential predictors of historical data regarding consequentialist, appetitive, retaliative, and reactive domains of violence. Deep learning yields high accuracy using the full set of determinants. Progressive feature elimination revealed that contextual factors were more important than individual factors. Combined social network adversities, membership identification, and normalization of violence were among the more accurate social-contextual factors. To a lesser extent the best individual factors were personality traits (borderline, paranoid, and antisocial) and psychiatric symptoms. The results provide a population-based computational classification regarding historical assessments of violence in vulnerable populations.
Affiliation(s)
- Hernando Santamaría-García
  - Doctorado de Neurociencias, Departamentos de Psiquiatría y Fisiología, Pontificia Universidad Javeriana, Bogotá, Colombia
  - Centro de Memoria y Cognición Intellectus, Hospital Universitario San Ignacio, Bogotá, Colombia
- Diego Mauricio Aponte-Canencio
  - Universidad Externado de Colombia, Bogotá, Colombia
  - Agencia para la Reincorporación y la Normalización (ARN), Bogotá, Colombia
  - Department of Neuroscience and Biomedical Engineering, Aalto University, Finland
- Guido Orlando Pasciarello
  - Multimedia Signal Processing Group–Neuroimage Division, French-Argentine International Center for Information and Systems Sciences (CIFASIS)–National Scientific and Technical Research Council (CONICET), Rosario, Argentina
  - Laboratory of Neuroimaging and Neuroscience (LANEN), INECO Foundation Rosario, Rosario, Argentina
- Patricio Andrés Donnelly-Kehoe
  - Multimedia Signal Processing Group–Neuroimage Division, French-Argentine International Center for Information and Systems Sciences (CIFASIS)–National Scientific and Technical Research Council (CONICET), Rosario, Argentina
  - Laboratory of Neuroimaging and Neuroscience (LANEN), INECO Foundation Rosario, Rosario, Argentina
- Diana Matallana
  - Doctorado de Neurociencias, Departamentos de Psiquiatría y Fisiología, Pontificia Universidad Javeriana, Bogotá, Colombia
- Eugenia Hesse
  - Cognitive Neuroscience Center (CNC), Universidad de San Andrés, Buenos Aires, Argentina
  - National Scientific and Technical Research Council (CONICET), Argentina
- Alejandra Neely
  - Latin American Institute for Brain Health (BrainLat), Center for Social and Cognitive Neuroscience (CSCN), Universidad Adolfo Ibáñez, Santiago de Chile, Chile
- Winston Chiong
  - UCSF Weill Institute for Neurosciences, San Francisco, CA, USA
- Jonathan Levy
  - Baruch Ivcher School of Psychology, Interdisciplinary Center Herzliya (IDC), Israel
  - Department of Neuroscience and Biomedical Engineering, Aalto University, Finland
- Agustín Ibáñez
  - Cognitive Neuroscience Center (CNC), Universidad de San Andrés, Buenos Aires, Argentina
  - National Scientific and Technical Research Council (CONICET), Argentina
  - Latin American Institute for Brain Health (BrainLat), Center for Social and Cognitive Neuroscience (CSCN), Universidad Adolfo Ibáñez, Santiago de Chile, Chile
  - Universidad Autónoma del Caribe, Barranquilla, Colombia
  - Global Brain Health Institute (GBHI), University of California San Francisco (UCSF), San Francisco, CA, USA
|
35
|
Meng F, Liang Z, Zhao K, Luo C. Drug design targeting active posttranslational modification protein isoforms. Med Res Rev 2020; 41:1701-1750. [PMID: 33355944 DOI: 10.1002/med.21774] [Citation(s) in RCA: 38] [Impact Index Per Article: 7.6] [Received: 09/24/2020] [Revised: 11/29/2020] [Accepted: 12/03/2020] [Indexed: 12/11/2022]
Abstract
Modern drug design aims to discover novel lead compounds with attractive chemical profiles to enable further exploration of the intersection of chemical space and biological space. Identification of small molecules with good ligand efficiency, high activity, and selectivity is crucial to developing effective and safe drugs. However, finding this intersection is one of the most challenging tasks in the pharmaceutical industry, as chemical space is almost infinite and continuous, whereas biological space is very limited and discrete. This bottleneck potentially limits the discovery of molecules with desirable properties for lead optimization. Herein, we present a new direction that leverages the target space of posttranslational modification (PTM) protein isoforms to inspire drug design, termed "Post-translational Modification Inspired Drug Design (PTMI-DD)." PTMI-DD aims to extend the intersections of chemical space and biological space. We further rationalized and highlighted the importance of PTM protein isoforms and their roles in various diseases and biological functions. We then laid out a few directions to elaborate PTMI-DD in drug design, including discovering covalent binding inhibitors mimicking PTMs, targeting PTM protein isoforms with binding sites distinct from those of their wild-type counterparts, targeting protein-protein interactions involving PTMs, and hijacking ubiquitination-mediated protein degradation for PTM protein isoforms. These directions will lead to a significant expansion of the biological space and/or increase the tractability of compounds, primarily by precisely targeting PTM protein isoforms or complexes that are highly relevant to biological functions. Importantly, this new avenue will further enrich personalized treatment opportunities through precision medicine targeting PTM isoforms.
Affiliation(s)
- Fanwang Meng
  - Drug Discovery and Design Center, the Center for Chemical Biology, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
  - Department of Chemistry and Chemical Biology, McMaster University, Hamilton, Ontario, Canada
- Zhongjie Liang
  - Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, Suzhou, China
- Kehao Zhao
  - School of Pharmacy, Key Laboratory of Molecular Pharmacology and Drug Evaluation (Yantai University), Ministry of Education, Collaborative Innovation Center of Advanced Drug Delivery System and Biotech Drugs in Universities of Shandong, Yantai University, Yantai, China
- Cheng Luo
  - Drug Discovery and Design Center, the Center for Chemical Biology, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
|
36
|
Lyu X, Li S, Jiang C, He N, Chen Z, Zou Y, Li L. DeepCSO: A Deep-Learning Network Approach to Predicting Cysteine S-Sulphenylation Sites. Front Cell Dev Biol 2020; 8:594587. [PMID: 33335901 PMCID: PMC7736615 DOI: 10.3389/fcell.2020.594587] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 08/13/2020] [Accepted: 11/12/2020] [Indexed: 01/02/2023] Open
Abstract
Cysteine S-sulphenylation (CSO), as a novel post-translational modification (PTM), has emerged as a potential mechanism to regulate protein functions and affect signal networks. Because of its functional significance, several prediction approaches have been developed. Nevertheless, they are based on a limited dataset from Homo sapiens and there is a lack of prediction tools for the CSO sites of other species. Recently, this modification has been investigated at the proteomics scale for a few species and the number of identified CSO sites has significantly increased. Thus, it is essential to explore the characteristics of this modification across different species and construct prediction models with better performances based on the enlarged dataset. In this study, we constructed several classifiers and found that the long short-term memory model with the word-embedding encoding approach, dubbed LSTMWE, performs favorably to the traditional machine-learning models and other deep-learning models across different species, in terms of cross-validation and independent test. The area under the receiver operating characteristic (ROC) curve for LSTMWE ranged from 0.82 to 0.85 for different organisms, which was superior to the reported CSO predictors. Moreover, we developed the general model based on the integrated data from different species and it showed great universality and effectiveness. We provided the on-line prediction service called DeepCSO that included both species-specific and general models, which is accessible through http://www.bioinfogo.org/DeepCSO.
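For readers unfamiliar with the metric quoted above, the area under the ROC curve can be read as the probability that a randomly chosen positive site receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that pairwise definition; the function name `roc_auc` and the toy labels and scores are illustrative assumptions, not data from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: iterable of 0/1 class labels; scores: predicted scores (higher = more positive).
    A tie between a positive and a negative counts as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy example: one positive (0.4) is ranked below one negative (0.5).
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise form is quadratic in the number of examples, so it suits small checks like this one; rank-based implementations give the same value on large test sets.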
Affiliation(s)
- Xiaru Lyu
  - School of Basic Medicine, Qingdao University, Qingdao, China
- Shuhao Li
  - College of Life Sciences, Qingdao University, Qingdao, China
  - School of Basic Medicine, Qingdao University, Qingdao, China
- Chunyang Jiang
  - School of Basic Medicine, Qingdao University, Qingdao, China
- Ningning He
  - School of Basic Medicine, Qingdao University, Qingdao, China
- Zhen Chen
  - Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou, China
  - Key Laboratory of Rice Biology in Henan Province, Henan Agricultural University, Zhengzhou, China
- Yang Zou
  - School of Basic Medicine, Qingdao University, Qingdao, China
- Lei Li
  - School of Basic Medicine, Qingdao University, Qingdao, China
  - School of Data Science and Software Engineering, Qingdao University, Qingdao, China
|
37
|
Liu X, Wang L, Li J, Hu J, Zhang X. Mal-Prec: computational prediction of protein Malonylation sites via machine learning based feature integration: Malonylation site prediction. BMC Genomics 2020; 21:812. [PMID: 33225896 PMCID: PMC7682087 DOI: 10.1186/s12864-020-07166-w] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 04/23/2020] [Accepted: 10/20/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Malonylation is a recently discovered post-translational modification that is associated with a variety of diseases such as Type 2 Diabetes Mellitus and different types of cancers. Compared with experimental identification of malonylation sites, computational methods are time-efficient and comparatively low-cost. RESULTS: In this study, we proposed a novel computational model called Mal-Prec (Malonylation Prediction) for malonylation site prediction through the combination of Principal Component Analysis (PCA) and Support Vector Machine (SVM). One-hot encoding, physicochemical properties, and the composition of k-spaced amino acid pairs were first used to extract sequence features. PCA was then applied to select optimal feature subsets, while SVM was adopted to predict malonylation sites. Five-fold cross-validation results showed that Mal-Prec can achieve better prediction performance compared with other approaches. AUC (area under the receiver operating characteristic curves) analysis achieved 96.47% and 90.72% on 5-fold cross-validation of independent data sets, respectively. CONCLUSION: Mal-Prec is a computationally reliable method for identifying malonylation sites in protein sequences. It outperforms existing prediction tools and can serve as a useful tool for identifying and discovering novel malonylation sites in human proteins. Mal-Prec is coded in MATLAB and is publicly available at https://github.com/flyinsky6/Mal-Prec, together with the data sets used in this study.
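Two of the sequence encodings named in this abstract can be sketched in a few lines. The example below is written in Python rather than the authors' MATLAB and is not their implementation: the helper names, the flank size of 3, and the `X` padding symbol are all assumptions made for illustration. It extracts a fixed-width window around each lysine, one-hot encodes it, and counts residue pairs that have k residues between them. The PCA and SVM stages are not shown.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "X"  # hypothetical padding symbol for windows that run past the sequence ends

def lysine_windows(seq, flank=3):
    """Yield (position, window) for every lysine (K), padding with PAD at the ends."""
    padded = PAD * flank + seq + PAD * flank
    for i, aa in enumerate(seq):
        if aa == "K":
            # In the padded string the lysine sits at index i + flank,
            # so this slice is centred on it.
            yield i, padded[i : i + 2 * flank + 1]

def one_hot(window):
    """Flatten a window into a 0/1 vector with 20 slots per residue (PAD maps to all zeros)."""
    vec = []
    for aa in window:
        vec.extend(1 if aa == ref else 0 for ref in AMINO_ACIDS)
    return vec

def cksaap(window, k=1):
    """Count residue pairs with exactly k residues between them (raw counts, not normalised)."""
    counts = {}
    for i in range(len(window) - k - 1):
        pair = window[i] + window[i + k + 1]
        if PAD not in pair:
            counts[pair] = counts.get(pair, 0) + 1
    return counts

windows = list(lysine_windows("MAKVLKG"))
```

In a full pipeline the per-window vectors would be concatenated across encodings before dimensionality reduction and classification.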
Affiliation(s)
- Xin Liu
  - Department of Bioinformatics, School of Medical Informatics and Engineering, Xuzhou Medical University, Xuzhou, 221004 Jiangsu, China
- Liang Wang
  - Department of Bioinformatics, School of Medical Informatics and Engineering, Xuzhou Medical University, Xuzhou, 221004 Jiangsu, China
  - Jiangsu Key Laboratory of New Drug Research and Clinical Pharmacy, School of Pharmacy, Xuzhou Medical University, Xuzhou, 221000 Jiangsu, China
- Jian Li
  - School of Public Health and Tropical Medicine, Tulane University, New Orleans, LA 70118, USA
- Junfeng Hu
  - Department of Bioinformatics, School of Medical Informatics and Engineering, Xuzhou Medical University, Xuzhou, 221004 Jiangsu, China
- Xiao Zhang
  - Department of Bioinformatics, School of Medical Informatics and Engineering, Xuzhou Medical University, Xuzhou, 221004 Jiangsu, China
|
38
|
Wen B, Zeng W, Liao Y, Shi Z, Savage SR, Jiang W, Zhang B. Deep Learning in Proteomics. Proteomics 2020; 20:e1900335. [PMID: 32939979 PMCID: PMC7757195 DOI: 10.1002/pmic.201900335] [Citation(s) in RCA: 78] [Impact Index Per Article: 15.6] [Received: 05/27/2020] [Revised: 09/14/2020] [Indexed: 12/17/2022]
Abstract
Proteomics, the study of all the proteins in biological systems, is becoming a data-rich science. Protein sequences and structures are comprehensively catalogued in online databases. With recent advancements in tandem mass spectrometry (MS) technology, protein expression and post-translational modifications (PTMs) can be studied in a variety of biological systems at the global scale. Sophisticated computational algorithms are needed to translate the vast amount of data into novel biological insights. Deep learning automatically extracts data representations at high levels of abstraction from data, and it thrives in data-rich scientific research domains. Here, a comprehensive overview of deep learning applications in proteomics, including retention time prediction, MS/MS spectrum prediction, de novo peptide sequencing, PTM prediction, major histocompatibility complex-peptide binding prediction, and protein structure prediction, is provided. Limitations and the future directions of deep learning in proteomics are also discussed. This review will provide readers an overview of deep learning and how it can be used to analyze proteomics data.
Affiliation(s)
- Bo Wen
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Wen-Feng Zeng
  - Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Chinese Academy of Sciences, Institute of Computing Technology, Beijing 100190, China
- Yuxing Liao
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Zhiao Shi
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Sara R. Savage
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Wen Jiang
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Bing Zhang
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
|
39
|
Zhang L, Zou Y, He N, Chen Y, Chen Z, Li L. DeepKhib: A Deep-Learning Framework for Lysine 2-Hydroxyisobutyrylation Sites Prediction. Front Cell Dev Biol 2020; 8:580217. [PMID: 33015075 PMCID: PMC7509169 DOI: 10.3389/fcell.2020.580217] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 07/05/2020] [Accepted: 08/17/2020] [Indexed: 11/28/2022] Open
Abstract
As a novel type of post-translational modification, lysine 2-hydroxyisobutyrylation (Khib) plays an important role in gene transcription and signal transduction. To understand its regulatory mechanism, the essential step is the recognition of Khib sites. Thousands of Khib sites have been experimentally verified across five different species. However, only a couple of traditional machine-learning algorithms have been developed to predict Khib sites, and only for a limited number of species; a general prediction algorithm is lacking. We constructed a deep-learning algorithm based on a convolutional neural network with the one-hot encoding approach, dubbed CNNOH. It performs favorably compared with the traditional machine-learning models and other deep-learning models across different species, in terms of cross-validation and independent test. The area under the ROC curve (AUC) values for CNNOH ranged from 0.82 to 0.87 for different organisms, which is superior to the currently available Khib predictors. Moreover, we developed a general model based on the integrated data from multiple species, and it showed great universality and effectiveness, with AUC values in the range of 0.79-0.87. Accordingly, we constructed an online prediction tool dubbed DeepKhib for easily identifying Khib sites, which includes both species-specific and general models. DeepKhib is available at http://www.bioinfogo.org/DeepKhib.
Affiliation(s)
- Luna Zhang
  - School of Data Science and Software Engineering, Qingdao University, Qingdao, China
- Yang Zou
  - School of Basic Medicine, Qingdao University, Qingdao, China
- Ningning He
  - School of Basic Medicine, Qingdao University, Qingdao, China
- Yu Chen
  - School of Data Science and Software Engineering, Qingdao University, Qingdao, China
- Zhen Chen
  - Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou, China
  - Key Laboratory of Rice Biology in Henan Province, Henan Agricultural University, Zhengzhou, China
- Lei Li
  - School of Data Science and Software Engineering, Qingdao University, Qingdao, China
  - School of Basic Medicine, Qingdao University, Qingdao, China
|
40
|
Arafat ME, Ahmad MW, Shovan S, Dehzangi A, Dipta SR, Hasan MAM, Taherzadeh G, Shatabda S, Sharma A. Accurately Predicting Glutarylation Sites Using Sequential Bi-Peptide-Based Evolutionary Features. Genes (Basel) 2020; 11:E1023. [PMID: 32878321 PMCID: PMC7565944 DOI: 10.3390/genes11091023] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 07/01/2020] [Revised: 08/19/2020] [Accepted: 08/27/2020] [Indexed: 02/07/2023] Open
Abstract
Post-translational modification (PTM) is defined as the alteration of protein sequence upon interaction with different macromolecules after the translation process. Glutarylation is considered one of the most important PTMs and is associated with a wide range of cellular functions, including metabolism, translation, and specified separate subcellular localizations. During the past few years, a wide range of computational approaches has been proposed to predict glutarylation sites. However, despite these efforts, prediction performance for glutarylation sites has remained limited. One of the main challenges is to extract features with significant discriminatory information. To address this issue, we propose a new machine learning method called BiPepGlut, which uses a bi-peptide-based evolutionary method for feature extraction. To build this model, we use the Extra-Trees (ET) classifier, which, to the best of our knowledge, has never been used for this task. Our results demonstrate that BiPepGlut significantly outperforms previously proposed models. BiPepGlut achieves 92.0%, 84.8%, 95.6%, 0.82, and 0.88 in accuracy, sensitivity, specificity, Matthews correlation coefficient, and F1-score, respectively. BiPepGlut is implemented as a publicly available online predictor.
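The five figures reported above are all functions of the four cells of a binary confusion matrix. A minimal sketch of those standard definitions follows; the function name `binary_metrics` and the toy counts are assumptions for illustration and do not reproduce the paper's results.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    Assumes both classes are present and at least one positive prediction exists,
    so that none of the ratio denominators is zero.
    """
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
        "f1": f1,
    }

# Toy counts chosen only to make the arithmetic easy to follow.
m = binary_metrics(tp=40, fp=10, tn=40, fn=10)
```

On imbalanced site-prediction data, accuracy can look high while sensitivity stays low, which is one reason abstracts in this area tend to report the Matthews correlation coefficient alongside it.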
Affiliation(s)
- Md. Easin Arafat
  - Department of Computer Science and Engineering, United International University, Dhaka 1212, Bangladesh
- Md. Wakil Ahmad
  - Department of Computer Science and Engineering, United International University, Dhaka 1212, Bangladesh
- S.M. Shovan
  - Department of Computer Science and Engineering, Rajshahi University of Engineering and Technology, Rajshahi 6204, Bangladesh
- Abdollah Dehzangi
  - Department of Computer Science, Rutgers University, Camden, NJ 08102, USA
  - Center for Computational and Integrative Biology, Rutgers University, Camden, NJ 08102, USA
- Shubhashis Roy Dipta
  - Department of Computer Science and Engineering, United International University, Dhaka 1212, Bangladesh
- Md. Al Mehedi Hasan
  - Department of Computer Science and Engineering, Rajshahi University of Engineering and Technology, Rajshahi 6204, Bangladesh
- Ghazaleh Taherzadeh
  - Institute for Bioscience and Biotechnology Research, University of Maryland, College Park, MD 20742, USA
- Swakkhar Shatabda
  - Department of Computer Science and Engineering, United International University, Dhaka 1212, Bangladesh
- Alok Sharma
  - Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, QLD 4111, Australia
  - Department of Medical Science Mathematics, Tokyo Medical and Dental University (TMDU), Tokyo 113-8510, Japan
  - Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa 230-0045, Japan
  - School of Engineering and Physics, Faculty of Science Technology and Environment, University of the South Pacific, Suva, Fiji
|
41
|
Chung CR, Chang YP, Hsu YL, Chen S, Wu LC, Horng JT, Lee TY. Incorporating hybrid models into lysine malonylation sites prediction on mammalian and plant proteins. Sci Rep 2020; 10:10541. [PMID: 32601280 PMCID: PMC7324624 DOI: 10.1038/s41598-020-67384-w] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 01/06/2020] [Accepted: 06/03/2020] [Indexed: 12/22/2022] Open
Abstract
Protein malonylation, a reversible post-translational modification of lysine residues, is associated with various biological functions, such as cellular regulation and pathogenesis. In proteomics, to improve our understanding of the mechanisms of malonylation at the molecular level, the identification of malonylation sites via an efficient methodology is essential. However, experimental identification of malonylated substrates via mass spectrometry is time-consuming, labor-intensive, and expensive. Although numerous methods have been developed to predict malonylation sites in mammalian proteins, computational resources for identifying plant malonylation sites are very limited. In this study, a hybrid model incorporating multiple convolutional neural networks (CNNs) with physicochemical properties, evolutionary information, and sequence-based features was developed for identifying protein malonylation sites in mammals. For plant malonylation, multiple CNNs and random forests were integrated into a secondary modeling phase using a support vector machine. Independent testing demonstrated that the mammalian and plant malonylation models yield areas under the receiver operating characteristic curve (AUC) of 0.943 and 0.772, respectively. The proposed scheme has been implemented as a web-based tool, Kmalo (https://fdblab.csie.ncu.edu.tw/kmalo/home.html), which can help facilitate the functional investigation of protein malonylation in mammals and plants.
Affiliation(s)
- Chia-Ru Chung
  - Department of Computer Science and Information Engineering, National Central University, Taoyuan, 32001, Taiwan
- Ya-Ping Chang
  - Department of Computer Science and Information Engineering, National Central University, Taoyuan, 32001, Taiwan
- Yu-Lin Hsu
  - Department of Computer Science and Information Engineering, National Central University, Taoyuan, 32001, Taiwan
- Siyu Chen
  - School of Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, 518172, China
- Li-Ching Wu
  - Department of Biomedical Sciences and Engineering, National Central University, Taoyuan, 32001, Taiwan
- Jorng-Tzong Horng
  - Department of Computer Science and Information Engineering, National Central University, Taoyuan, 32001, Taiwan
  - Department of Bioinformatics and Medical Engineering, Asia University, Taichung, 41359, Taiwan
- Tzong-Yi Lee
  - School of Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, 518172, China
  - Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, 518172, China
|
42
|
Xu H, Jia P, Zhao Z. Deep4mC: systematic assessment and computational prediction for DNA N4-methylcytosine sites by deep learning. Brief Bioinform 2020; 22:5856341. [PMID: 32578842 DOI: 10.1093/bib/bbaa099] [Citation(s) in RCA: 47] [Impact Index Per Article: 9.4] [Received: 02/24/2020] [Revised: 04/16/2020] [Accepted: 05/02/2020] [Indexed: 12/11/2022] Open
Abstract
DNA N4-methylcytosine (4mC) modification represents a novel form of epigenetic regulation. It is involved in various cellular processes, including DNA replication, cell cycle and gene expression, among others. In addition to experimental identification of 4mC sites, in silico prediction of 4mC sites in the genome has emerged as an alternative and promising approach. In this study, we first reviewed the current progress in the computational prediction of 4mC sites and systematically evaluated the predictive capacity of eight conventional machine learning algorithms as well as 12 feature types commonly used in previous studies in six species. Using a representative benchmark dataset, we investigated the contribution of feature selection and the stacking approach to model construction, and found that feature optimization and proper reinforcement learning could improve the performance. We next collected newly added 4mC sites in the six species' genomes and developed a novel deep learning-based 4mC site predictor, namely Deep4mC. Deep4mC applies convolutional neural networks with four representative features. For species with small numbers of samples, we extended our deep learning framework with a bootstrapping method. Our evaluation indicated that Deep4mC could obtain high accuracy and robust performance, with average area under curve (AUC) values greater than 0.9 in all species (range: 0.9005-0.9722). In comparison with previous tools, Deep4mC achieved AUC value improvements of 10.14% to 46.21% in these six species. A user-friendly web server (https://bioinfo.uth.edu/Deep4mC) was built for predicting putative 4mC sites in a genome.
Affiliation(s)
- Haodong Xu
  - Center for Precision Health, School of Biomedical Informatics
- Peilin Jia
  - Center for Precision Health, School of Biomedical Informatics
- Zhongming Zhao
  - Center for Precision Health, School of Biomedical Informatics
|
43
|
Xu HD, Liang RP, Wang YG, Qiu JD. mUSP: a high-accuracy map of the in situ crosstalk of ubiquitylation and SUMOylation proteome predicted via the feature enhancement approach. Brief Bioinform 2020; 22:5831925. [PMID: 32382739 DOI: 10.1093/bib/bbaa050] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 12/18/2019] [Revised: 02/19/2020] [Indexed: 01/02/2023]
Abstract
Reversible post-translational modification (PTM) orchestrates various biological processes by changing the properties of proteins. Since many proteins are modified by multiple PTMs, the identification of PTM crosstalk sites has emerged as an intriguing topic and attracted much attention. In this study, we systematically deciphered the in situ crosstalk of ubiquitylation and SUMOylation that co-occur on the same lysine residue. We first collected 3363 ubiquitylation-SUMOylation (UBS) crosstalk sites on 1302 proteins and then investigated the prime sequence motifs, the local evolutionary degree and the distribution of structural annotations at the residue and sequence levels between the UBS crosstalk and the single modification sites. Given the properties of UBS crosstalk sites, we developed the mUSP classifier to predict UBS crosstalk sites by integrating different types of features with two-step feature optimization using a recursive feature elimination approach. Using various cross-validations, the mUSP model achieved an average area under the curve (AUC) value of 0.8416, indicating its promising accuracy and robustness. By comparison, mUSP performs significantly better, with improvements of 38.41 and 51.48% in AUC values over the cross-results of the previous single predictor. mUSP was implemented as a web server available at http://bioinfo.ncu.edu.cn/mUSP/index.html to facilitate the query of our high-accuracy UBS crosstalk results for experimental design and validation.
Affiliation(s)
- Hao-Dong Xu
  - Department of Chemistry, Nanchang University, 999 Xuefu Road, Nanchang, Jiangxi, China
- Ru-Ping Liang
  - Department of Chemistry, Nanchang University, 999 Xuefu Road, Nanchang, Jiangxi, China
- You-Gan Wang
  - Department of Chemistry, Nanchang University, 999 Xuefu Road, Nanchang, Jiangxi, China
- Jian-Ding Qiu
  - Department of Chemistry, Nanchang University, 999 Xuefu Road, Nanchang, Jiangxi, China
|
44
|
Ahmad W, Arafat E, Taherzadeh G, Sharma A, Dipta SR, Dehzangi A, Shatabda S. Mal-Light: Enhancing Lysine Malonylation Sites Prediction Problem Using Evolutionary-based Features. IEEE Access: Practical Innovations, Open Solutions 2020; 8:77888-77902. [PMID: 33354488 PMCID: PMC7751949 DOI: 10.1109/access.2020.2989713] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 05/11/2023]
Abstract
Post-translational modification (PTM) is considered an important biological process with a tremendous impact on the function of proteins in both eukaryotic and prokaryotic cells. During the past decades, a wide range of PTMs has been identified. Among them, malonylation is a recently identified PTM that plays a vital role in a wide range of biological interactions. Moreover, this modification potentially plays a role in energy metabolism in different species, including Homo sapiens. The identification of PTM sites using experimental methods is time-consuming and costly. Hence, there is a demand for fast and cost-effective computational methods. In this study, we propose a new machine learning method, called Mal-Light, to address this problem. To build this model, we extract local evolutionary-based information according to the interaction of neighboring amino acids using a bi-peptide based method. We then use Light Gradient Boosting (LightGBM) as our classifier to predict malonylation sites. Our results demonstrate that Mal-Light is able to significantly improve malonylation site prediction performance compared to previous studies found in the literature. Using Mal-Light we achieve a Matthews correlation coefficient (MCC) of 0.74 and 0.60, Accuracy of 86.66% and 79.51%, Sensitivity of 78.26% and 67.27%, and Specificity of 95.05% and 91.75%, for Homo sapiens and Mus musculus proteins, respectively. Mal-Light is implemented as an online predictor which is publicly available at: (http://brl.uiu.ac.bd/MalLight/).
Affiliation(s)
- Wakil Ahmad
  - Department of Computer Science and Engineering, United International University, United City, Madani Avenue, Dhaka 1212, Bangladesh
- Easin Arafat
  - Department of Computer Science and Engineering, United International University, United City, Madani Avenue, Dhaka 1212, Bangladesh
- Ghazaleh Taherzadeh
  - Institute for Bioscience and Biotechnology Research, University of Maryland, College Park, MD, 20742, USA
- Alok Sharma
  - Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, QLD-4111, Australia
  - Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University (TMDU), Tokyo, 113-8510, Japan
  - Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, 230-0045, Kanagawa, Japan
  - School of Engineering and Physics, Faculty of Science Technology and Environment, University of the South Pacific, Suva, Fiji
  - CREST, JST, Tokyo, 102-8666, Japan
- Shubhashis Roy Dipta
  - Department of Computer Science and Engineering, United International University, United City, Madani Avenue, Dhaka 1212, Bangladesh
- Abdollah Dehzangi
  - Department of Computer Science, Morgan State University, Baltimore, MD, 21251, USA
- Swakkhar Shatabda
  - Department of Computer Science and Engineering, United International University, United City, Madani Avenue, Dhaka 1212, Bangladesh
|
45
|
RF-MaloSite and DL-Malosite: Methods based on random forest and deep learning to identify malonylation sites. Comput Struct Biotechnol J 2020; 18:852-860. [PMID: 32322367 PMCID: PMC7160427 DOI: 10.1016/j.csbj.2020.02.012] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 10/16/2019] [Revised: 01/27/2020] [Accepted: 02/19/2020] [Indexed: 12/19/2022]
Abstract
Malonylation, which has recently emerged as an important lysine modification, regulates diverse biological activities and has been implicated in several pervasive disorders, including cardiovascular disease and cancer. However, conventional global proteomics analysis using tandem mass spectrometry can be time-consuming, expensive and technically challenging. Therefore, to complement and extend existing experimental methods for malonylation site identification, we developed two novel computational methods for malonylation site prediction based on random forest and deep learning machine learning algorithms, RF-MaloSite and DL-MaloSite, respectively. DL-MaloSite requires the primary amino acid sequence as an input and RF-MaloSite utilizes a diverse set of biochemical, physicochemical and sequence-based features. While systematic assessment of performance metrics suggests that both ‘RF-MaloSite’ and ‘DL-MaloSite’ perform well in all metrics tested, our methods perform particularly well in the areas of accuracy, sensitivity and overall method performance (assessed by the Matthews Correlation Coefficient, MCC). For instance, RF-MaloSite exhibited MCC scores of 0.42 and 0.40 using 10-fold cross-validation and an independent test set, respectively. Meanwhile, DL-MaloSite was characterized by MCC scores of 0.51 and 0.49 based on 10-fold cross-validation and an independent set, respectively. Importantly, both methods exhibited efficiency scores that were on par with or better than those achieved by existing malonylation site prediction methods. The identification of these sites may also provide important insights into the mechanisms of crosstalk between malonylation and other lysine modifications, such as acetylation, glutarylation and succinylation. To facilitate their use, both methods have been made freely available to the research community at https://github.com/dukkakc/DL-MaloSite-and-RF-MaloSite.
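The 10-fold cross-validation mentioned in the abstract above partitions the samples into ten disjoint folds and uses each fold once as the held-out set. The following is a minimal sketch of such a split under assumed settings (a seeded shuffle, no stratification by class); the function name `k_fold_indices` is hypothetical and this is not the splitting code used for RF-MaloSite or DL-MaloSite.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Return a list of (train_indices, test_indices) pairs for k-fold CV."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    # Deal the shuffled indices into k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


# Hypothetical dataset size, chosen only to illustrate uneven fold sizes.
splits = k_fold_indices(n_samples=25, k=10)
```

Each sample appears in exactly one test fold, so a metric averaged over the ten folds uses every sample once for evaluation; an independent test set, as in the abstract, is kept out of this procedure entirely.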
|
46
|
Chen Z, Zhao P, Li F, Wang Y, Smith AI, Webb GI, Akutsu T, Baggag A, Bensmail H, Song J. Comprehensive review and assessment of computational methods for predicting RNA post-transcriptional modification sites from RNA sequences. Brief Bioinform 2019; 21:1676-1696. [DOI: 10.1093/bib/bbz112] [Citation(s) in RCA: 57] [Impact Index Per Article: 9.5] [Received: 07/04/2019] [Revised: 07/31/2019] [Accepted: 08/07/2019] [Indexed: 12/14/2022]
Abstract
RNA post-transcriptional modifications play a crucial role in a myriad of biological processes and cellular functions. To date, more than 160 RNA modifications have been discovered; therefore, accurate identification of RNA-modification sites is fundamental for a better understanding of RNA-mediated biological functions and mechanisms. However, due to limitations in experimental methods, systematic identification of different types of RNA-modification sites remains a major challenge. Recently, more than 20 computational methods have been developed to identify RNA-modification sites in tandem with high-throughput experimental methods, with most of these capable of predicting only single types of RNA-modification sites. These methods show high diversity in their dataset size, data quality, core algorithms, features extracted and feature selection techniques and evaluation strategies. Therefore, there is an urgent need to revisit these methods and summarize their methodologies, in order to improve and further develop computational techniques to identify and characterize RNA-modification sites from the large amounts of sequence data. With this goal in mind, first, we provide a comprehensive survey on a large collection of 27 state-of-the-art approaches for predicting N1-methyladenosine and N6-methyladenosine sites. We cover a variety of important aspects that are crucial for the development of successful predictors, including the dataset quality, operating algorithms, sequence and genomic features, feature selection, model performance evaluation and software utility. In addition, we also provide our thoughts on potential strategies to improve the model performance. Second, we propose a computational approach called DeepPromise based on deep learning techniques for simultaneous prediction of N1-methyladenosine and N6-methyladenosine. 
To extract the sequence context surrounding the modification sites, three feature encodings, including enhanced nucleic acid composition, one-hot encoding, and RNA embedding, were used as the input to seven consecutive layers of convolutional neural networks (CNNs), respectively. Moreover, DeepPromise further combined the prediction score of the CNN-based models and achieved around 43% higher area under receiver-operating curve (AUROC) for m1A site prediction and 2–6% higher AUROC for m6A site prediction, respectively, when compared with several existing state-of-the-art approaches on the independent test. In-depth analyses of characteristic sequence motifs identified from the convolution-layer filters indicated that nucleotide presentation at proximal positions surrounding the modification sites contributed most to the classification, whereas those at distal positions also affected classification but to different extents. To maximize user convenience, a web server was developed as an implementation of DeepPromise and made publicly available at http://DeepPromise.erc.monash.edu/, with the server accepting both RNA sequences and genomic sequences to allow prediction of two types of putative RNA-modification sites.
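One of the three feature encodings named in the abstract above, one-hot encoding, maps each nucleotide to a binary vector with a single 1. The sketch below illustrates the general idea only; the alphabet order `ACGU`, the all-zero row for unknown symbols, and the function name `one_hot_encode` are assumptions, not details taken from DeepPromise.

```python
ALPHABET = "ACGU"  # assumed column order; the study may use a different one


def one_hot_encode(sequence):
    """Encode an RNA sequence as a list of 4-element 0/1 rows.

    Symbols outside ALPHABET (for example 'N') become all-zero rows.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = []
    for base in sequence.upper():
        row = [0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded


# Hypothetical sequence window, not data from the study.
matrix = one_hot_encode("GACUN")
```

A window of length L becomes an L x 4 matrix, which is the kind of fixed-size input a convolutional layer can slide its filters over.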
Affiliation(s)
- Zhen Chen
  - School of Basic Medical Science, Qingdao University, China
- Pei Zhao
  - State Key Laboratory of Cotton Biology, Institute of Cotton Research of Chinese Academy of Agricultural Sciences, Anyang 455000, Henan, China
- Fuyi Li
  - Northwest A&F University, China
- A Ian Smith
  - Prince Henrys Institute Melbourne and Monash University, Australia
- Abdelkader Baggag
  - Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha 34110, Qatar
- Halima Bensmail
  - Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha 34110, Qatar
- Jiangning Song
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Victoria 3800, Australia
|
47
|
Chen Z, Zhao P, Li F, Marquez-Lago TT, Leier A, Revote J, Zhu Y, Powell DR, Akutsu T, Webb GI, Chou KC, Smith AI, Daly RJ, Li J, Song J. iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. Brief Bioinform 2019; 21:1047-1057. [DOI: 10.1093/bib/bbz041] [Citation(s) in RCA: 189] [Impact Index Per Article: 31.5] [Received: 01/20/2019] [Revised: 02/28/2019] [Accepted: 03/13/2019] [Indexed: 12/13/2022]
Abstract
With the explosive growth of biological sequences generated in the post-genomic era, one of the most challenging problems in bioinformatics and computational biology is to computationally characterize sequences, structures and functions in an efficient, accurate and high-throughput manner. A number of online web servers and stand-alone tools have been developed to address this to date; however, all these tools have their limitations and drawbacks in terms of their effectiveness, user-friendliness and capacity. Here, we present iLearn, a comprehensive and versatile Python-based toolkit, integrating the functionality of feature extraction, clustering, normalization, selection, dimensionality reduction, predictor construction, best descriptor/model selection, ensemble learning and results visualization for DNA, RNA and protein sequences. iLearn was designed for users that only want to upload their data set and select the functions they need calculated from it, while all necessary procedures and optimal settings are completed automatically by the software. iLearn includes a variety of descriptors for DNA, RNA and proteins, and four feature output formats are supported so as to facilitate direct output usage or communication with other computational tools. In total, iLearn encompasses 16 different types of feature clustering, selection, normalization and dimensionality reduction algorithms, and five commonly used machine-learning algorithms, thereby greatly facilitating feature analysis and predictor construction. iLearn is made freely available via an online web server and a stand-alone toolkit.
Affiliation(s)
- Zhen Chen
  - School of Basic Medical Science, Qingdao University, 38 Dengzhou Road, Qingdao, 266021, Shandong, China
- Pei Zhao
  - State Key Laboratory of Cotton Biology, Institute of Cotton Research of Chinese Academy of Agricultural Sciences (CAAS), Anyang, 455000, China
- Fuyi Li
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Tatiana T Marquez-Lago
  - Department of Genetics, School of Medicine, University of Alabama at Birmingham, USA
  - Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA
- André Leier
  - Department of Genetics, School of Medicine, University of Alabama at Birmingham, USA
  - Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA
- Jerico Revote
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Yan Zhu
  - Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, VIC 3800, Australia
- David R Powell
  - Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, VIC 3800, Australia
- Tatsuya Akutsu
  - Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0011, Japan
- Geoffrey I Webb
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
- Kuo-Chen Chou
  - Gordon Life Science Institute, Boston, MA 02478, USA
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 610054, China
- A Ian Smith
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
  - ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC 3800, Australia
- Roger J Daly
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Jian Li
  - Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, VIC 3800, Australia
- Jiangning Song
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
  - ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC 3800, Australia
|