1
|
Liang Y, Cao M, Zhang S. NeuroPred-ResSE: Predicting neuropeptides by integrating residual block and squeeze-excitation attention mechanism. Anal Biochem 2024; 695:115648. [PMID: 39154878 DOI: 10.1016/j.ab.2024.115648] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/10/2024] [Revised: 07/31/2024] [Accepted: 08/15/2024] [Indexed: 08/20/2024]
Abstract
Neuropeptides play crucial roles in regulating neurological function acting as signaling molecules, which provide new opportunity for developing drugs for the treatment of neurological diseases. Therefore, it is very necessary to develop a rapid and accurate prediction model for neuropeptides. Although a few prediction tools have been developed, there is room for improvement in prediction accuracy by using deep learning approach. In this paper, we establish the NeuroPred-ResSE model based on residual block and squeeze-excitation attention mechanism. Firstly, we extract multi-features by using one-hot coding based on the NT5CT5 sequence, dipeptide deviation from expected mean and natural vector. Then, we integrate residual block and squeeze-excitation attention mechanism, which can capture and identify the most relevant attribute features. Finally, the accuracies of the training set and test set are 97.16 % and 96.60 % based on the 5-fold cross-validation and independent test, respectively, and other evaluation metrics have also obtained satisfactory results. The experimental results show that the performance of the NeuroPred-ResSE model outperforms those of existing state-of-the-art models, and our model is an effective, intelligent and robust prediction tool. The datasets and source codes are available at https://github.com/yunyunliang88/NeuroPred-ResSE.
Collapse
Affiliation(s)
- Yunyun Liang
- School of Science, Xi'an Polytechnic University, Xi'an, 710048, PR China.
| | - Mengyi Cao
- School of Science, Xi'an Polytechnic University, Xi'an, 710048, PR China
| | - Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| |
Collapse
|
2
|
Zhong G, Liu H, Deng L. Ensemble Machine Learning and Predicted Properties Promote Antimicrobial Peptide Identification. Interdiscip Sci 2024; 16:951-965. [PMID: 38972032 DOI: 10.1007/s12539-024-00640-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2024] [Revised: 06/04/2024] [Accepted: 06/07/2024] [Indexed: 07/08/2024]
Abstract
The emergence of antibiotic-resistant microbes raises a pressing demand for novel alternative treatments. One promising alternative is the antimicrobial peptides (AMPs), a class of innate immunity mediators within the therapeutic peptide realm. AMPs offer salient advantages such as high specificity, cost-effective synthesis, and reduced toxicity. Although some computational methodologies have been proposed to identify potential AMPs with the rapid development of artificial intelligence techniques, there is still ample room to improve their performance. This study proposes a predictive framework which ensembles deep learning and statistical learning methods to screen peptides with antimicrobial activity. We integrate multiple LightGBM classifiers and convolution neural networks which leverages various predicted sequential, structural and physicochemical properties from their residue sequences extracted by diverse machine learning paradigms. Comparative experiments exhibit that our method outperforms other state-of-the-art approaches on an independent test dataset, in terms of representative capability measures. Besides, we analyse the discrimination quality under different varieties of attribute information and it reveals that combination of multiple features could improve prediction. In addition, a case study is carried out to illustrate the exemplary favorable identification effect. We establish a web application at http://amp.denglab.org to provide convenient usage of our proposal and make the predictive framework, source code, and datasets publicly accessible at https://github.com/researchprotein/amp .
Collapse
Affiliation(s)
- Guolun Zhong
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China
| | - Hui Liu
- College of Computer and Information Engineering, Nanjing Tech University, Nanjing, 211816, China.
| | - Lei Deng
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China.
| |
Collapse
|
3
|
Charoenkwan P, Chumnanpuen P, Schaduangrat N, Shoombuatong W. Stack-AVP: a stacked ensemble predictor based on multi-view information for fast and accurate discovery of antiviral peptides. J Mol Biol 2024:168853. [PMID: 39510347 DOI: 10.1016/j.jmb.2024.168853] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2024] [Revised: 10/22/2024] [Accepted: 10/31/2024] [Indexed: 11/15/2024]
Abstract
AVPs, or antiviral peptides, are short chains of amino acids capable of inhibiting viral replication, preventing viral entry, or disrupting viral membranes. They represent a promising area of research for developing new antiviral therapies due to their potential to target a broad spectrum of viruses, incorporating those resistant to traditional antiviral drugs. However, traditional experimental methods for identifying AVPs are often costly and labour-intensive. Thus far, multiple computational methods have been introduced for the in silico identification of AVPs, but these methods still have certain shortcomings. In this study, we propose a novel stacked ensemble learning framework, termed Stack-AVP, for fast and accurate AVP identification. In Stack-AVP, we investigated heterogeneous prediction models, which were trained with 12 commonly used machine learning algorithms coupled with a wide range of multiple feature encoding schemes. Subsequently, these prediction models were adopted to generate multi-view features providing class information and probability information. Finally, we applied our feature selection method to determine the best feature subset for the construction of the final stacked model. Comparative assessments on the independent test dataset revealed that Stack-AVP surpassed the performance of current state-of-the-art methods, with an accuracy of 0.930, MCC of 0.860, and AUC of 0.975. Furthermore, it was found that our multi-view features exhibited a crucial mechanism to improve the prediction performance of AVPs. To facilitate experimental scientists in performing high-throughput identification of AVPs, the prediction sever Stack-AVP is publicly accessible at https://pmlabqsar.pythonanywhere.com/Stack-AVP.
Collapse
Affiliation(s)
- Phasit Charoenkwan
- Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai, Thailand, 50200
| | - Pramote Chumnanpuen
- Department of Zoology, Faculty of Science, Kasetsart University, Bangkok, Thailand, 10900; Omics Center for Agriculture, Bioresources, Food, and Health, Kasetsart University (OmiKU), Bangkok, Thailand, 10900
| | - Nalini Schaduangrat
- Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand, 10700
| | - Watshara Shoombuatong
- Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand, 10700.
| |
Collapse
|
4
|
Liu YX, Song JL, Li XM, Lin H, Cao YN. Identification of target genes co-regulated by four key histone modifications of five key regions in hepatocellular carcinoma. Methods 2024; 231:165-177. [PMID: 39349287 DOI: 10.1016/j.ymeth.2024.09.017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2024] [Revised: 08/27/2024] [Accepted: 09/27/2024] [Indexed: 10/02/2024] Open
Abstract
Hepatocellular carcinoma (HCC) is a cancer with high morbidity and mortality. Studies have shown that histone modification plays an important regulatory role in the occurrence and development of HCC. However, the specific regulatory effects of histone modifications on gene expression in HCC are still unclear. This study focuses on HepG2 cell lines and hepatocyte cell lines. First, the distribution of histone modification signals in the two cell lines was calculated and analyzed. Then, using the random forest algorithm, we analyzed the effects of different histone modifications and their modified regions on gene expression in the two cell lines, four key histone modifications (H3K36me3, H3K4me3, H3K79me2, and H3K9ac) and five key regions that co-regulate gene expression were obtained. Subsequently, target genes regulated by key histone modifications in key regions were screened. Combined with clinical data, Cox regression analysis and Kaplan-Meier survival analysis were performed on the target genes, and four key target genes (CBX2, CEBPZOS, LDHA, and UMPS) related to prognosis were identified. Finally, through immune infiltration analysis and drug sensitivity analysis of key target genes, the potential role of key target genes in HCC was confirmed. Our results provide a theoretical basis for exploring the occurrence of HCC and propose potential biomarkers associated with histone modifications, which may be potential drug targets for the clinical treatment of HCC.
Collapse
Affiliation(s)
- Yu-Xian Liu
- School of Artificial Intelligence, Anhui University of Science and Technology, Huainan 232001, China.
| | - Jia-Le Song
- School of Artificial Intelligence, Anhui University of Science and Technology, Huainan 232001, China
| | - Xiao-Ming Li
- School of Artificial Intelligence, Anhui University of Science and Technology, Huainan 232001, China
| | - Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, Center for Informational Biology, School of Life Sciences and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China.
| | - Yan-Ni Cao
- School of Artificial Intelligence, Anhui University of Science and Technology, Huainan 232001, China.
| |
Collapse
|
5
|
Chen Y, Wang J, Zou Q, Niu M, Ding Y, Song J, Wang Y. DrugDAGT: a dual-attention graph transformer with contrastive learning improves drug-drug interaction prediction. BMC Biol 2024; 22:233. [PMID: 39396972 PMCID: PMC11472440 DOI: 10.1186/s12915-024-02030-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/13/2024] [Accepted: 10/02/2024] [Indexed: 10/15/2024] Open
Abstract
BACKGROUND Drug-drug interactions (DDIs) can result in unexpected pharmacological outcomes, including adverse drug events, which are crucial for drug discovery. Graph neural networks have substantially advanced our ability to model molecular representations; however, the precise identification of key local structures and the capture of long-distance structural correlations for better DDI prediction and interpretation remain significant challenges. RESULTS Here, we present DrugDAGT, a dual-attention graph transformer framework with contrastive learning for predicting multiple DDI types. The dual-attention graph transformer incorporates attention mechanisms at both the bond and atomic levels, thereby enabling the integration of short and long-range dependencies within drug molecules to pinpoint key local structures essential for DDI discovery. Moreover, DrugDAGT further implements graph contrastive learning to maximize the similarity of representations across different views for better discrimination of molecular structures. Experiments in both warm-start and cold-start scenarios demonstrate that DrugDAGT outperforms state-of-the-art baseline models, achieving superior overall performance. Furthermore, visualization of the learned representations of drug pairs and the attention map provides interpretable insights instead of black-box results. CONCLUSIONS DrugDAGT provides an effective tool for accurately predicting multiple DDI types by identifying key local chemical structures, offering valuable insights for prescribing medications, and guiding drug development. All data and code of our DrugDAGT can be found at https://github.com/codejiajia/DrugDAGT .
Collapse
Affiliation(s)
- Yaojia Chen
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC, 3800, Australia
- Wenzhou Medical University-Monash Biomedicine Discovery Institute Alliance in Clinical and Experimental Biomedicine, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, 325035, China
| | - Jiacheng Wang
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Mengting Niu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic University, Shenzhen, China
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC, 3800, Australia.
- Wenzhou Medical University-Monash Biomedicine Discovery Institute Alliance in Clinical and Experimental Biomedicine, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, 325035, China.
| | - Yansu Wang
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China.
| |
Collapse
|
6
|
Zuo Y, Fang X, Wan J, He W, Liu X, Zeng X, Deng Z. PreMLS: The undersampling technique based on ClusterCentroids to predict multiple lysine sites. PLoS Comput Biol 2024; 20:e1012544. [PMID: 39436947 PMCID: PMC11530015 DOI: 10.1371/journal.pcbi.1012544] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2024] [Revised: 11/01/2024] [Accepted: 10/09/2024] [Indexed: 10/25/2024] Open
Abstract
The translated protein undergoes a specific modification process, which involves the formation of covalent bonds on lysine residues and the attachment of small chemical moieties. The protein's fundamental physicochemical properties undergo a significant alteration. The change significantly alters the proteins' 3D structure and activity, enabling them to modulate key physiological processes. The modulation encompasses inhibiting cancer cell growth, delaying ovarian aging, regulating metabolic diseases, and ameliorating depression. Consequently, the identification and comprehension of post-translational lysine modifications hold substantial value in the realms of biological research and drug development. Post-translational modifications (PTMs) at lysine (K) sites are among the most common protein modifications. However, research on K-PTMs has been largely centered on identifying individual modification types, with a relative scarcity of balanced data analysis techniques. In this study, a classification system is developed for the prediction of concurrent multiple modifications at a single lysine residue. Initially, a well-established multi-label position-specific triad amino acid propensity algorithm is utilized for feature encoding. Subsequently, PreMLS: a novel ClusterCentroids undersampling algorithm based on MiniBatchKmeans was introduced to eliminate redundant or similar major class samples, thereby mitigating the issue of class imbalance. A convolutional neural network architecture was specifically constructed for the analysis of biological sequences to predict multiple lysine modification sites. The model, evaluated through five-fold cross-validation and independent testing, was found to significantly outperform existing models such as iMul-kSite and predML-Site. The results presented here aid in prioritizing potential lysine modification sites, facilitating subsequent biological assays and advancing pharmaceutical research. To enhance accessibility, an open-access predictive script has been crafted for the multi-label predictive model developed in this study.
Collapse
Affiliation(s)
- Yun Zuo
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
| | - Xingze Fang
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
| | - Jiayong Wan
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
| | - Wenying He
- School of Artificial Intelligence, Hebei University of Technology, Tianjin, China
| | - Xiangrong Liu
- Department of Computer Science and Technology, National Institute for Data Science in Health and Medicine, Xiamen Key Laboratory of Intelligent Storage and Computing, Xiamen University, Xiamen, China
| | - Xiangxiang Zeng
- School of Information Science and Engineering, Hunan University, Changsha, China
| | - Zhaohong Deng
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
| |
Collapse
|
7
|
Shoombuatong W, Meewan I, Mookdarsanit L, Schaduangrat N. Stack-HDAC3i: A high-precision identification of HDAC3 inhibitors by exploiting a stacked ensemble-learning framework. Methods 2024; 230:147-157. [PMID: 39191338 DOI: 10.1016/j.ymeth.2024.08.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2024] [Revised: 08/07/2024] [Accepted: 08/17/2024] [Indexed: 08/29/2024] Open
Abstract
Epigenetics involves reversible modifications in gene expression without altering the genetic code itself. Among these modifications, histone deacetylases (HDACs) play a key role by removing acetyl groups from lysine residues on histones. Overexpression of HDACs is linked to the proliferation and survival of tumor cells. To combat this, HDAC inhibitors (HDACi) are commonly used in cancer treatments. However, pan-HDAC inhibition can lead to numerous side effects. Therefore, isoform-selective HDAC inhibitors, such as HDAC3i, could be advantageous for treating various medical conditions while minimizing off-target effects. To date, computational approaches that use only the SMILES notation without any experimental evidence have become increasingly popular and necessary for the initial discovery of novel potential therapeutic drugs. In this study, we develop an innovative and high-precision stacked-ensemble framework, called Stack-HDAC3i, which can directly identify HDAC3i using only the SMILES notation. Using an up-to-date benchmark dataset, we first employed both molecular descriptors and Mol2Vec embeddings to generate feature representations that cover multi-view information embedded in HDAC3i, such as structural and contextual information. Subsequently, these feature representations were used to train baseline models using nine popular ML algorithms. Finally, the probabilistic features derived from the selected baseline models were fused to construct the final stacked model. Both cross-validation and independent tests showed that Stack-HDAC3i is a high-accuracy prediction model with great generalization ability for identifying HDAC3i. Furthermore, in the independent test, Stack-HDAC3i achieved an accuracy of 0.926 and Matthew's correlation coefficient of 0.850, which are 0.44-6.11% and 0.83-11.90% higher than its constituent baseline models, respectively.
Collapse
Affiliation(s)
- Watshara Shoombuatong
- Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand.
| | - Ittipat Meewan
- Center for Advanced Therapeutics, Institute of Molecular Biosciences, Mahidol University, Nakhon Pathom 73170, Thailand
| | - Lawankorn Mookdarsanit
- Business Information System, Faculty of Management Science, Chandrakasem Rajabhat University, Bangkok 10900, Thailand
| | - Nalini Schaduangrat
- Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
| |
Collapse
|
8
|
Liu S, Shi T, Yu J, Li R, Lin H, Deng K. Research on Bitter Peptides in the Field of Bioinformatics: A Comprehensive Review. Int J Mol Sci 2024; 25:9844. [PMID: 39337334 PMCID: PMC11432553 DOI: 10.3390/ijms25189844] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2024] [Revised: 09/06/2024] [Accepted: 09/09/2024] [Indexed: 09/30/2024] Open
Abstract
Bitter peptides are small molecular peptides produced by the hydrolysis of proteins under acidic, alkaline, or enzymatic conditions. These peptides can enhance food flavor and offer various health benefits, with attributes such as antihypertensive, antidiabetic, antioxidant, antibacterial, and immune-regulating properties. They show significant potential in the development of functional foods and the prevention and treatment of diseases. This review introduces the diverse sources of bitter peptides and discusses the mechanisms of bitterness generation and their physiological functions in the taste system. Additionally, it emphasizes the application of bioinformatics in bitter peptide research, including the establishment and improvement of bitter peptide databases, the use of quantitative structure-activity relationship (QSAR) models to predict bitterness thresholds, and the latest advancements in classification prediction models built using machine learning and deep learning algorithms for bitter peptide identification. Future research directions include enhancing databases, diversifying models, and applying generative models to advance bitter peptide research towards deepening and discovering more practical applications.
Collapse
Affiliation(s)
| | | | | | | | - Hao Lin
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; (S.L.); (T.S.); (J.Y.); (R.L.)
| | - Kejun Deng
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; (S.L.); (T.S.); (J.Y.); (R.L.)
| |
Collapse
|
9
|
Pham NT, Zhang Y, Rakkiyappan R, Manavalan B. HOTGpred: Enhancing human O-linked threonine glycosylation prediction using integrated pretrained protein language model-based features and multi-stage feature selection approach. Comput Biol Med 2024; 179:108859. [PMID: 39029431 DOI: 10.1016/j.compbiomed.2024.108859] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2024] [Revised: 06/19/2024] [Accepted: 07/06/2024] [Indexed: 07/21/2024]
Abstract
O-linked glycosylation is a complex post-translational modification (PTM) in human proteins that plays a critical role in regulating various cellular metabolic and signaling pathways. In contrast to N-linked glycosylation, O-linked glycosylation lacks specific sequence features and maintains an unstable core structure. Identifying O-linked threonine glycosylation sites (OTGs) remains challenging, requiring extensive experimental tests. While bioinformatics tools have emerged for predicting OTGs, their reliance on limited conventional features and absence of well-defined feature selection strategies limit their effectiveness. To address these limitations, we introduced HOTGpred (Human O-linked Threonine Glycosylation predictor), employing a multi-stage feature selection process to identify the optimal feature set for accurately identifying OTGs. Initially, we assessed 25 different feature sets derived from various pretrained protein language model (PLM)-based embeddings and conventional feature descriptors using nine classifiers. Subsequently, we integrated the top five embeddings linearly and determined the most effective scoring function for ranking hybrid features, identifying the optimal feature set through a process of sequential forward search. Among the classifiers, the extreme gradient boosting (XGBT)-based model, using the optimal feature set (HOTGpred), achieved 92.03 % accuracy on the training dataset and 88.25 % on the balanced independent dataset. Notably, HOTGpred significantly outperformed the current state-of-the-art methods on both the balanced and imbalanced independent datasets, demonstrating its superior prediction capabilities. Additionally, SHapley Additive exPlanations (SHAP) and ablation analyses were conducted to identify the features contributing most significantly to HOTGpred. Finally, we developed an easy-to-navigate web server, accessible at https://balalab-skku.org/HOTGpred/, to support glycobiologists in their research on glycosylation structure and function.
Collapse
Affiliation(s)
- Nhat Truong Pham
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, 16419, Gyeonggi-do, Republic of Korea
| | - Ying Zhang
- Beidahuang Industry Group General Hospital, Harbin, China
| | - Rajan Rakkiyappan
- Department of Mathematics, Bharathiar University, Coimbatore, 641046, Tamil Nadu, India.
| | - Balachandran Manavalan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, 16419, Gyeonggi-do, Republic of Korea.
| |
Collapse
|
10
|
Zuo Y, Zhang B, Dong Y, He W, Bi Y, Liu X, Zeng X, Deng Z. Glypred: Lysine Glycation Site Prediction via CCU-LightGBM-BiLSTM Framework with Multi-Head Attention Mechanism. J Chem Inf Model 2024; 64:6699-6711. [PMID: 39121059 DOI: 10.1021/acs.jcim.4c01034] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/11/2024]
Abstract
Glycation, a type of posttranslational modification, preferentially occurs on lysine and arginine residues, impairing protein functionality and altering characteristics. This process is linked to diseases such as Alzheimer's, diabetes, and atherosclerosis. Traditional wet lab experiments are time-consuming, whereas machine learning has significantly streamlined the prediction of protein glycation sites. Despite promising results, challenges remain, including data imbalance, feature redundancy, and suboptimal classifier performance. This research introduces Glypred, a lysine glycation site prediction model combining ClusterCentroids Undersampling (CCU), LightGBM, and bidirectional long short-term memory network (BiLSTM) methodologies, with an additional multihead attention mechanism integrated into the BiLSTM. To achieve this, the study undertakes several key steps: selecting diverse feature types to capture comprehensive protein information, employing a cluster-based undersampling strategy to balance the data set, using LightGBM for feature selection to enhance model performance, and implementing a bidirectional LSTM network for accurate classification. Together, these approaches ensure that Glypred effectively identifies glycation sites with high accuracy and robustness. For feature encoding, five distinct feature types─AAC, KMER, DR, PWAA, and EBGW─were selected to capture a broad spectrum of protein sequence and biological information. These encoded features were integrated and validated to ensure comprehensive protein information acquisition. To address the issue of highly imbalanced positive and negative samples, various undersampling algorithms, including random undersampling, NearMiss, edited nearest neighbor rule, and CCU, were evaluated. CCU was ultimately chosen to remove redundant nonglycated training data, establishing a balanced data set that enhances the model's accuracy and robustness. For feature selection, the LightGBM ensemble learning algorithm was employed to reduce feature dimensionality by identifying the most significant features. This approach accelerates model training, enhances generalization capabilities, and ensures good transferability of the model. Finally, a bidirectional long short-term memory network was used as the classifier, with a network structure designed to capture glycation modification site features from both forward and backward directions. To prevent overfitting, appropriate regularization parameters and dropout rates were introduced, achieving efficient classification. Experimental results show that Glypred achieved optimal performance. This model provides new insights for bioinformatics and encourages the application of similar strategies in other fields. A lysine glycation site prediction software tool was also developed using the PyQt5 library, offering researchers an auxiliary screening tool to reduce workload and improve efficiency. The software and data sets are available on GitHub: https://github.com/ZBYnb/Glypred.
Collapse
Affiliation(s)
- Yun Zuo
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
| | - Bangyi Zhang
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
| | - Yinkang Dong
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
| | - Wenying He
- School of Artificial Intelligence, Hebei University of Technology, Tianjin 300130, China
| | - Yue Bi
- Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Clayton 3800, Australia
| | - Xiangrong Liu
- Department of Computer Science and Technology, National Institute for Data Science in Health and Medicine, Xiamen Key Laboratory of Intelligent Storage and Computing, Xiamen University, Xiamen 361005, China
| | - Xiangxiang Zeng
- School of Information Science and Engineering, Hunan University, Changsha 410012, China
| | - Zhaohong Deng
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
| |
Collapse
|
11
|
Kurata H, Harun-Or-Roshid M, Mehedi Hasan M, Tsukiyama S, Maeda K, Manavalan B. MLm5C: A high-precision human RNA 5-methylcytosine sites predictor based on a combination of hybrid machine learning models. Methods 2024; 227:37-47. [PMID: 38729455 DOI: 10.1016/j.ymeth.2024.05.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2023] [Revised: 04/22/2024] [Accepted: 05/06/2024] [Indexed: 05/12/2024] Open
Abstract
RNA modification serves as a pivotal component in numerous biological processes. Among the prevalent modifications, 5-methylcytosine (m5C) significantly influences mRNA export, translation efficiency and cell differentiation and are also associated with human diseases, including Alzheimer's disease, autoimmune disease, cancer, and cardiovascular diseases. Identification of m5C is critically responsible for understanding the RNA modification mechanisms and the epigenetic regulation of associated diseases. However, the large-scale experimental identification of m5C present significant challenges due to labor intensity and time requirements. Several computational tools, using machine learning, have been developed to supplement experimental methods, but identifying these sites lack accuracy and efficiency. In this study, we introduce a new predictor, MLm5C, for precise prediction of m5C sites using sequence data. Briefly, we evaluated eleven RNA sequence-derived features with four basic machine learning algorithms to generate baseline models. From these 44 models, we ranked them based on their performance and subsequently stacked the Top 20 baseline models as the best model, named MLm5C. The MLm5C outperformed the-state-of-the-art predictors. Notably, the optimization of the sequence length surrounding the modification sites significantly improved the prediction performance. MLm5C is an invaluable tool in accelerating the detection of m5C sites within the human genome, thereby facilitating in the characterization of their roles in post-transcriptional regulation.
Collapse
Affiliation(s)
- Hiroyuki Kurata
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan.
| | - Md Harun-Or-Roshid
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
| | - Md Mehedi Hasan
- Division of Biotetecnology and Molecular Medicine, Department of Pathobiological Science, School of Veterinary Medicine, Lousiana State University, Baton Rouge, LA 70803, USA
| | - Sho Tsukiyama
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
| | - Kazuhiro Maeda
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
| | - Balachandran Manavalan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Republic of Korea.
| |
Collapse
|
12
|
Basith S, Pham NT, Manavalan B, Lee G. SEP-AlgPro: An efficient allergen prediction tool utilizing traditional machine learning and deep learning techniques with protein language model features. Int J Biol Macromol 2024; 273:133085. [PMID: 38871100 DOI: 10.1016/j.ijbiomac.2024.133085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/24/2023] [Revised: 05/20/2024] [Accepted: 06/09/2024] [Indexed: 06/15/2024]
Abstract
Allergy is a hypersensitive condition in which individuals develop objective symptoms when exposed to harmless substances at a dose that would cause no harm to a "normal" person. Most current computational methods for allergen identification rely on homology or conventional machine learning using limited set of feature descriptors or validation on specific datasets, making them inefficient and inaccurate. Here, we propose SEP-AlgPro for the accurate identification of allergen protein from sequence information. We analyzed 10 conventional protein-based features and 14 different features derived from protein language models to gauge their effectiveness in differentiating allergens from non-allergens using 15 different classifiers. However, the final optimized model employs top 10 feature descriptors with top seven machine learning classifiers. Results show that the features derived from protein language models exhibit superior discriminative capabilities compared to traditional feature sets. This enabled us to select the most discriminatory baseline models, whose predicted outputs were aggregated and used as input to a deep neural network for the final allergen prediction. Extensive case studies showed that SEP-AlgPro outperforms state-of-the-art predictors in accurately identifying allergens. A user-friendly web server was developed and made freely available at https://balalab-skku.org/SEP-AlgPro/, making it a powerful tool for identifying potential allergens.
Collapse
Affiliation(s)
- Shaherin Basith
- Department of Physiology, Ajou University School of Medicine, Suwon 16499, Republic of Korea.
| | - Nhat Truong Pham
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Republic of Korea
| | - Balachandran Manavalan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Republic of Korea.
| | - Gwang Lee
- Department of Physiology, Ajou University School of Medicine, Suwon 16499, Republic of Korea; Department of Molecular Science and Technology, Ajou University, Suwon 16499, Republic of Korea.
| |
Collapse
|
13
|
Yao Z, Li F, Xie W, Chen J, Wu J, Zhan Y, Wu X, Wang Z, Zhang G. DeepSF-4mC: A deep learning model for predicting DNA cytosine 4mC methylation sites leveraging sequence features. Comput Biol Med 2024; 171:108166. [PMID: 38382385 DOI: 10.1016/j.compbiomed.2024.108166] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/06/2023] [Revised: 02/15/2024] [Accepted: 02/15/2024] [Indexed: 02/23/2024]
Abstract
N4-methylcytosine (4mC) is a DNA modification involving the addition of a methyl group to the fourth nitrogen atom of the cytosine base. This modification may influence gene regulation, providing potential insights into gene control mechanisms. Traditional laboratory methods for detecting 4mC DNA methylation have limitations, but the rise of artificial intelligence has introduced efficient computational strategies for 4mC site prediction. Despite this progress, challenges persist in terms of model performance and interpretability. To tackle these challenges, we propose DeepSF-4mC, a deep learning model specifically designed for predicting DNA cytosine 4mC methylation sites by leveraging sequence features. Our approach incorporates multiple encoding techniques to enhance prediction accuracy, increase model stability, and reduce the computational resources needed. Leveraging transfer learning, we harness existing models to enhance performance through learned representations or fine-tuning. Ensemble learning techniques combine predictions from multiple models, boosting robustness and accuracy. This research contributes to DNA methylation analysis and lays the groundwork for understanding 4mC's multifaceted role in biological processes. The web server for DeepSF-4mC is accessible at: http://deepsf-4mc.top/and the original code can be found at: https://github.com/754131799/DeepSF-4mC.
Collapse
Affiliation(s)
- Zhaomin Yao
- Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning, 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning, 110167, China
| | - Fei Li
- College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
| | - Weiming Xie
- Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning, 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning, 110167, China
| | - Jiaming Chen
- Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning, 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning, 110167, China
| | - Jiezhang Wu
- Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning, 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning, 110167, China
| | - Ying Zhan
- Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning, 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning, 110167, China
| | - Xiaodan Wu
- Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning, 110016, China
| | - Zhiguo Wang
- Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning, 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning, 110167, China.
| | - Guoxu Zhang
- Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning, 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning, 110167, China.
| |
Collapse
|
14
|
Jia J, Lv P, Wei X, Qiu W. SNO-DCA: A model for predicting S-nitrosylation sites based on densely connected convolutional networks and attention mechanism. Heliyon 2024; 10:e23187. [PMID: 38148797 PMCID: PMC10750070 DOI: 10.1016/j.heliyon.2023.e23187] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2023] [Revised: 11/22/2023] [Accepted: 11/29/2023] [Indexed: 12/28/2023] Open
Abstract
Protein S-nitrosylation is a reversible oxidative reduction post-translational modification that is widely present in the biological community. S-nitrosylation can regulate protein function and is closely associated with a variety of diseases, thus identifying S-nitrosylation sites are crucial for revealing the function of proteins and related drug discovery. Traditional experimental methods are time-consuming and expensive; therefore, it is necessary to explore more efficient computational methods. Deep learning algorithms perform well in the field of bioinformatics sites prediction, and many studies show that they outperform existing machine learning algorithms. In this work, we proposed a deep learning algorithm-based predictor SNO-DCA for distinguishing between S-nitrosylated and non-S-nitrosylated sequences. First, one-hot encoding of protein sequences was performed. Second, the dense convolutional blocks were used to capture feature information, and an attention module was added to weigh different features to improve the prediction ability of the model. The 10-fold cross-validation and independent testing experimental results show that our SNO-DCA model outperforms existing S-nitrosylation sites prediction models under imbalanced data. In this paper, a web server prediction website: https://sno.cangmang.xyz/SNO-DCA/was established to provide an online prediction service for users. SNO-DCA can be available at https://github.com/peanono/SNO-DCA.
Collapse
Affiliation(s)
- Jianhua Jia
- Computer Department, Jingdezhen Ceramic University, Jingdezhen, 330403, China
| | - Peinuo Lv
- Computer Department, Jingdezhen Ceramic University, Jingdezhen, 330403, China
| | - Xin Wei
- Business School, Jiangxi Institute of Fashion Technology, Nanchang, 330201, China
| | - Wangren Qiu
- Computer Department, Jingdezhen Ceramic University, Jingdezhen, 330403, China
| |
Collapse
|
15
|
Ma X, Liang Y, Zhang S. iAVPs-ResBi: Identifying antiviral peptides by using deep residual network and bidirectional gated recurrent unit. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:21563-21587. [PMID: 38124610 DOI: 10.3934/mbe.2023954] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/23/2023]
Abstract
Human history is also the history of the fight against viral diseases. From the eradication of viruses to coexistence, advances in biomedicine have led to a more objective understanding of viruses and a corresponding increase in the tools and methods to combat them. More recently, antiviral peptides (AVPs) have been discovered, which due to their superior advantages, have achieved great impact as antiviral drugs. Therefore, it is very necessary to develop a prediction model to accurately identify AVPs. In this paper, we develop the iAVPs-ResBi model using k-spaced amino acid pairs (KSAAP), encoding based on grouped weight (EBGW), enhanced grouped amino acid composition (EGAAC) based on the N5C5 sequence, composition, transition and distribution (CTD) based on physicochemical properties for multi-feature extraction. Then we adopt bidirectional long short-term memory (BiLSTM) to fuse features for obtaining the most differentiated information from multiple original feature sets. Finally, the deep model is built by combining improved residual network and bidirectional gated recurrent unit (BiGRU) to perform classification. The results obtained are better than those of the existing methods, and the accuracies are 95.07, 98.07, 94.29 and 97.50% on the four datasets, which show that iAVPs-ResBi can be used as an effective tool for the identification of antiviral peptides. The datasets and codes are freely available at https://github.com/yunyunliang88/iAVPs-ResBi.
Collapse
Affiliation(s)
- Xinyan Ma
- School of Science, Xi'an Polytechnic University, Xi'an 710048, China
| | - Yunyun Liang
- School of Science, Xi'an Polytechnic University, Xi'an 710048, China
| | - Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, China
| |
Collapse
|
16
|
Yin ZN, Lai FL, Gao F. Unveiling human origins of replication using deep learning: accurate prediction and comprehensive analysis. Brief Bioinform 2023; 25:bbad432. [PMID: 38008420 PMCID: PMC10676776 DOI: 10.1093/bib/bbad432] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2023] [Revised: 10/11/2023] [Accepted: 11/06/2023] [Indexed: 11/28/2023] Open
Abstract
Accurate identification of replication origins (ORIs) is crucial for a comprehensive investigation into the progression of human cell growth and cancer therapy. Here, we proposed a computational approach Ori-FinderH, which can efficiently and precisely predict the human ORIs of various lengths by combining the Z-curve method with deep learning approach. Compared with existing methods, Ori-FinderH exhibits superior performance, achieving an area under the receiver operating characteristic curve (AUC) of 0.9616 for K562 cell line in 10-fold cross-validation. In addition, we also established a cross-cell-line predictive model, which yielded a further improved AUC of 0.9706. The model was subsequently employed as a fitness function to support genetic algorithm for generating artificial ORIs. Sequence analysis through iORI-Euk revealed that a vast majority of the created sequences, specifically 98% or more, incorporate at least one ORI for three cell lines (Hela, MCF7 and K562). This innovative approach could provide more efficient, accurate and comprehensive information for experimental investigation, thereby further advancing the development of this field.
Collapse
Affiliation(s)
- Zhen-Ning Yin
- Department of Physics, School of Science, Tianjin University, Tianjin 300072, China
| | - Fei-Liao Lai
- Department of Physics, School of Science, Tianjin University, Tianjin 300072, China
| | - Feng Gao
- Department of Physics, School of Science, Tianjin University, Tianjin 300072, China
- Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin 300072, China
- SynBio Research Platform, Collaborative Innovation Center of Chemical Science and Engineering (Tianjin), Tianjin 300072, China
| |
Collapse
|
17
|
Abbas Z, Rehman MU, Tayara H, Chong KT. ORI-Explorer: a unified cell-specific tool for origin of replication sites prediction by feature fusion. Bioinformatics 2023; 39:btad664. [PMID: 37929975 PMCID: PMC10639035 DOI: 10.1093/bioinformatics/btad664] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2023] [Revised: 09/20/2023] [Accepted: 10/27/2023] [Indexed: 11/07/2023] Open
Abstract
MOTIVATION The origins of replication sites (ORIs) are precise regions inside the DNA sequence where the replication process begins. These locations are critical for preserving the genome's integrity during cell division and guaranteeing the faithful transfer of genetic data from generation to generation. The advent of experimental techniques has aided in the discovery of ORIs in many species. Experimentation, on the other hand, is often more time-consuming and pricey than computational approaches, and it necessitates specific equipment and knowledge. Recently, ORI sites have been predicted using computational techniques like motif-based searches and artificial intelligence algorithms based on sequence characteristics and chromatin states. RESULTS In this article, we developed ORI-Explorer, a unique artificial intelligence-based technique that combines multiple feature engineering techniques to train CatBoost Classifier for recognizing ORIs from four distinct eukaryotic species. ORI-Explorer was created by utilizing a unique combination of three traditional feature-encoding techniques and a feature set obtained from a deep-learning neural network model. The ORI-Explorer has significantly outperformed current predictors on the testing dataset. Furthermore, by employing the sophisticated SHapley Additive exPlanation method, we give crucial insights that aid in comprehending model success, highlighting the most relevant features vital for forecasting cell-specific ORIs. ORI-Explorer is also intended to aid community-wide attempts in discovering potential ORIs and developing innovative verifiable biological hypotheses. AVAILABILITY AND IMPLEMENTATION The used datasets along with the source code are made available through https://github.com/Z-Abbas/ORI-Explorer and https://zenodo.org/record/8358679.
Collapse
Affiliation(s)
- Zeeshan Abbas
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
| | - Mobeen Ur Rehman
- Khalifa University Center for Autonomous Robotic Systems (KUCARS), Khalifa University, Abu Dhabi, United Arab Emirates
| | - Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, South Korea
| | - Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
- Advances Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, South Korea
| |
Collapse
|
18
|
Zou X, Ren L, Cai P, Zhang Y, Ding H, Deng K, Yu X, Lin H, Huang C. Accurately identifying hemagglutinin using sequence information and machine learning methods. Front Med (Lausanne) 2023; 10:1281880. [PMID: 38020152 PMCID: PMC10644030 DOI: 10.3389/fmed.2023.1281880] [Citation(s) in RCA: 23] [Impact Index Per Article: 23.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2023] [Accepted: 10/16/2023] [Indexed: 12/01/2023] Open
Abstract
Introduction Hemagglutinin (HA) is responsible for facilitating viral entry and infection by promoting the fusion between the host membrane and the virus. Given its significance in the process of influenza virus infestation, HA has garnered attention as a target for influenza drug and vaccine development. Thus, accurately identifying HA is crucial for the development of targeted vaccine drugs. However, the identification of HA using in-silico methods is still lacking. This study aims to design a computational model to identify HA. Methods In this study, a benchmark dataset comprising 106 HA and 106 non-HA sequences were obtained from UniProt. Various sequence-based features were used to formulate samples. By perform feature optimization and inputting them four kinds of machine learning methods, we constructed an integrated classifier model using the stacking algorithm. Results and discussion The model achieved an accuracy of 95.85% and with an area under the receiver operating characteristic (ROC) curve of 0.9863 in the 5-fold cross-validation. In the independent test, the model exhibited an accuracy of 93.18% and with an area under the ROC curve of 0.9793. The code can be found from https://github.com/Zouxidan/HA_predict.git. The proposed model has excellent prediction performance. The model will provide convenience for biochemical scholars for the study of HA.
Collapse
Affiliation(s)
- Xidan Zou
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Liping Ren
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu, China
| | - Peiling Cai
- School of Basic Medical Sciences, Chengdu University, Chengdu, China
| | - Yang Zhang
- Innovative Institute of Chinese Medicine and Pharmacy, Academy for Interdiscipline, Chengdu University of Traditional Chinese Medicine, Chengdu, China
| | - Hui Ding
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Kejun Deng
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Xiaolong Yu
- School of Materials Science and Engineering, Hainan University, Haikou, China
| | - Hao Lin
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Chengbing Huang
- School of Computer Science and Technology, Aba Teachers University, Aba, China
| |
Collapse
|
19
|
Liu B, Yang Z, Liu Q, Zhang Y, Ding H, Lai H, Li Q. Computational prediction of allergenic proteins based on multi-feature fusion. Front Genet 2023; 14:1294159. [PMID: 37928245 PMCID: PMC10622758 DOI: 10.3389/fgene.2023.1294159] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2023] [Accepted: 10/11/2023] [Indexed: 11/07/2023] Open
Abstract
Allergy is an autoimmune disorder described as an undesirable response of the immune system to typically innocuous substance in the environment. Studies have shown that the ability of proteins to trigger allergic reactions in susceptible individuals can be evaluated by bioinformatics tools. However, developing computational methods to accurately identify new allergenic proteins remains a vital challenge. This work aims to propose a machine learning model based on multi-feature fusion for predicting allergenic proteins efficiently. Firstly, we prepared a benchmark dataset of allergenic and non-allergenic protein sequences and pretested on it with a machine-learning platform. Then, three preferable feature extraction methods, including amino acid composition (AAC), dipeptide composition (DPC) and composition of k-spaced amino acid pairs (CKSAAP) were chosen to extract protein sequence features. Subsequently, these features were fused and optimized by Pearson correlation coefficient (PCC) and principal component analysis (PCA). Finally, the most representative features were picked out to build the optimal predictor based on random forest (RF) algorithm. Performance evaluation results via 5-fold cross-validation showed that the final model, called iAller (https://github.com/laihongyan/iAller), could precisely distinguish allergenic proteins from non-allergenic proteins. The prediction accuracy and AUC value for validation dataset achieved 91.4% and 0.97%, respectively. This model will provide guide for users to identify more allergenic proteins.
Collapse
Affiliation(s)
- Bin Liu
- Department of Anesthesiology, The Fourth People’s Hospital of Sichuan Province, Chengdu, Sichuan, China
| | - Ziman Yang
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| | - Qing Liu
- Department of Pain, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou, Sichuan, China
| | - Ying Zhang
- Department of Anesthesiology, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou, Sichuan, China
| | - Hui Ding
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| | - Hongyan Lai
- Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, China
| | - Qun Li
- Department of Pain, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou, Sichuan, China
- Research Center of Integrated Traditional Chinese and Western Medicine, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou, Sichuan, China
| |
Collapse
|
20
|
Chen XG, Yang X, Li C, Lin X, Zhang W. Non-coding RNA identification with pseudo RNA sequences and feature representation learning. Comput Biol Med 2023; 165:107355. [PMID: 37639767 DOI: 10.1016/j.compbiomed.2023.107355] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2023] [Revised: 07/16/2023] [Accepted: 08/12/2023] [Indexed: 08/31/2023]
Abstract
Distinguishing non-coding RNAs (ncRNAs) from coding RNAs is very important in bioinformatics. Although many methods have been proposed for solving this task, it remains highly challenging to further improve the accuracy of ncRNA identification. In this paper, we propose a coding potential predictor using feature representation learning based on pseudo RNA sequences named CPPFLPS. In this method, we use the pseudo RNA sequences generated by simulating RNA sequence mutations as new samples for data augmentation, and six string operations simulating RNA sequence mutations are considered: base replacement, base insertion, base deletion, subsequence reversion, subsequence repetition and subsequence deletion. In the feature representation learning framework, different types of pseudo RNA sequences are added to the training set to form new training sets that can be used to train baseline classifiers, thus obtaining baseline models. The resulting labels of these baseline models are used as feature vectors to represent RNA sequences, and the resulting feature vectors acquired after feature selection are used to train a predictive model for distinguishing ncRNAs from coding RNAs. Our method achieves better performance compared with that of existing state-of-the-art methods. The implementation of the proposed method is available at https://github.com/chenxgscuec/CPPFLPS.
Collapse
Affiliation(s)
- Xian-Gan Chen
- School of Biomedical Engineering, South-Central Minzu University, Wuhan, 430074, China; Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central Minzu University, Wuhan, 430074, China; Key Laboratory of Cognitive Science(South-Central Minzu University), State Ethnic Affairs Commission, Wuhan, 430074, China.
| | - Xiaofei Yang
- School of Biomedical Engineering, South-Central Minzu University, Wuhan, 430074, China; Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central Minzu University, Wuhan, 430074, China; Key Laboratory of Cognitive Science(South-Central Minzu University), State Ethnic Affairs Commission, Wuhan, 430074, China.
| | - Chenhong Li
- School of Biomedical Engineering, South-Central Minzu University, Wuhan, 430074, China; Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central Minzu University, Wuhan, 430074, China; Key Laboratory of Cognitive Science(South-Central Minzu University), State Ethnic Affairs Commission, Wuhan, 430074, China.
| | - Xianguang Lin
- School of Biomedical Engineering, South-Central Minzu University, Wuhan, 430074, China; Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central Minzu University, Wuhan, 430074, China; Key Laboratory of Cognitive Science(South-Central Minzu University), State Ethnic Affairs Commission, Wuhan, 430074, China.
| | - Wen Zhang
- College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China.
| |
Collapse
|
21
|
Zhang X, Guo H, Zhang F, Wang X, Wu K, Qiu S, Liu B, Wang Y, Hu Y, Li J. HNetGO: protein function prediction via heterogeneous network transformer. Brief Bioinform 2023; 24:bbab556. [PMID: 37861172 PMCID: PMC10588005 DOI: 10.1093/bib/bbab556] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2021] [Revised: 11/18/2021] [Accepted: 12/04/2021] [Indexed: 10/21/2023] Open
Abstract
Protein function annotation is one of the most important research topics for revealing the essence of life at molecular level in the post-genome era. Current research shows that integrating multisource data can effectively improve the performance of protein function prediction models. However, the heavy reliance on complex feature engineering and model integration methods limits the development of existing methods. Besides, models based on deep learning only use labeled data in a certain dataset to extract sequence features, thus ignoring a large amount of existing unlabeled sequence data. Here, we propose an end-to-end protein function annotation model named HNetGO, which innovatively uses heterogeneous network to integrate protein sequence similarity and protein-protein interaction network information and combines the pretraining model to extract the semantic features of the protein sequence. In addition, we design an attention-based graph neural network model, which can effectively extract node-level features from heterogeneous networks and predict protein function by measuring the similarity between protein nodes and gene ontology term nodes. Comparative experiments on the human dataset show that HNetGO achieves state-of-the-art performance on cellular component and molecular function branches.
Collapse
Affiliation(s)
- Xiaoshuai Zhang
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China
| | - Huannan Guo
- General Hospital of Heilongjiang Province Land Reclamation Bureau, Harbin 150086, China
| | - Fan Zhang
- Center NHC Key Laboratory of Cell Transplantation, The First Affiliated Hospital of Harbin Medical University, Harbin 150086, China
| | - Xuan Wang
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China
| | - Kaitao Wu
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China
| | - Shizheng Qiu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
| | - Bo Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
| | - Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
| | - Yang Hu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
| | - Junyi Li
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China
| |
Collapse
|
22
|
Liu N, Zhang Z, Wu Y, Wang Y, Liang Y. CRBSP:Prediction of CircRNA-RBP Binding Sites Based on Multimodal Intermediate Fusion. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:2898-2906. [PMID: 37130249 DOI: 10.1109/tcbb.2023.3272400] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
Circular RNA (CircRNA) is widely expressed and has physiological and pathological significance, regulating post-transcriptional processes via its protein-binding activity. However, whereas much work has been done on linear RNA and RNA binding protein (RBP), little is known about the binding sites of CircRNA. The current report is on the development of a medium-term multimodal data fusion strategy, CRBSP, to predict CircRNA-RBP binding sites. CRBSP represents the CircRNA trinucleotide semantic, location, composition and frequency information as the corresponding coding methods of Word to vector (Word2vec), Position-specific trinucleotide propensity (PSTNP), Pseudo trinucleotide composition (PseTNC) and Trinucleotide nucleotide composition (TNC), respectively. CNN (Convolution Neural Networks) was used to extract global information and BiLSTM (bidirectional Long- and Short-Term Memory network) encoder and LSTM (Long- and Short-Term Memory network) decoder for local sequence information. Enhancement of the contributions of key features by the self-attention mechanism was followed by mid-term fusion of the four enhanced features. Logistic Regression (LR) classifier showed that CRBSP gives a mean AUC value of 0.9362 through 5-fold Cross Validation of all 37 datasets, a performance which is superior to five current state-of-the-art models. Similar evaluation of linear RNA-RBP binding sites gave an AUC value of 0.7615 which is also higher than other prediction methods, demonstrating the robustness of CRBSP.
Collapse
|
23
|
Fan R, Ding Y, Zou Q, Yuan L. Multi-view local hyperplane nearest neighbor model based on independence criterion for identifying vesicular transport proteins. Int J Biol Macromol 2023; 247:125774. [PMID: 37437677 DOI: 10.1016/j.ijbiomac.2023.125774] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/11/2023] [Revised: 06/30/2023] [Accepted: 07/07/2023] [Indexed: 07/14/2023]
Abstract
Vesicular transport proteins participate in various biological processes and play a significant role in the movement of substances within cells. These proteins are associated with numerous human diseases, making their identification particularly important. In this study, we developed a novel strategy for accurately identifying vesicular transport proteins. We developed a novel multi-view classifier called graph-regularized k-local hyperplane distance nearest neighbor model (HSIC-GHKNN), which combines the Hilbert-Schmidt independence criterion (HSIC)-based multi-view learning method with a local hyperplane distance nearest-neighbor classifier. We first extracted protein evolution information using two feature extraction methods, pseudo-position-specific scoring matrix (PsePSSM) and AATP, and addressed dataset imbalance using the Edited Nearest Neighbors (ENN) algorithm. Subsequently, we employed a local hyperplane distance nearest-neighbor classifier for each view identification and added an HSIC term to maintain independence between views. We then assessed the performance of our identification strategy and analyzed the PsePSSM and AATP feature sets to determine the influencing factors of the classification results. The experimental results demonstrate that the accurate and Matthew correlation coefficients of our strategy on the independent test set are 85.8 % and 0.548, respectively. Our approach outperformed existing methods in most evaluation metrics. In addition, the proposed multi-view classification model can easily be applied to similar identification tasks.
Collapse
Affiliation(s)
- Rui Fan
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China; Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang 324000, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang 324000, China.
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China; Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang 324000, China.
| | - Lei Yuan
- Department of Hepatobiliary Surgery, Quzhou People's Hospital, Quzhou, Zhejiang 324000, China.
| |
Collapse
|
24
|
Hu W, Guan L, Li M. Prediction of DNA Methylation based on Multi-dimensional feature encoding and double convolutional fully connected convolutional neural network. PLoS Comput Biol 2023; 19:e1011370. [PMID: 37639434 PMCID: PMC10461834 DOI: 10.1371/journal.pcbi.1011370] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2023] [Accepted: 07/18/2023] [Indexed: 08/31/2023] Open
Abstract
DNA methylation takes on critical significance to the regulation of gene expression by affecting the stability of DNA and changing the structure of chromosomes. DNA methylation modification sites should be identified, which lays a solid basis for gaining more insights into their biological functions. Existing machine learning-based methods of predicting DNA methylation have not fully exploited the hidden multidimensional information in DNA gene sequences, such that the prediction accuracy of models is significantly limited. Besides, most models have been built in terms of a single methylation type. To address the above-mentioned issues, a deep learning-based method was proposed in this study for DNA methylation site prediction, termed the MEDCNN model. The MEDCNN model is capable of extracting feature information from gene sequences in three dimensions (i.e., positional information, biological information, and chemical information). Moreover, the proposed method employs a convolutional neural network model with double convolutional layers and double fully connected layers while iteratively updating the gradient descent algorithm using the cross-entropy loss function to increase the prediction accuracy of the model. Besides, the MEDCNN model can predict different types of DNA methylation sites. As indicated by the experimental results,the deep learning method based on coding from multiple dimensions outperformed single coding methods, and the MEDCNN model was highly applicable and outperformed existing models in predicting DNA methylation between different species. As revealed by the above-described findings, the MEDCNN model can be effective in predicting DNA methylation sites.
Collapse
Affiliation(s)
- Wenxing Hu
- College of Physics and Electronic Information, Gannan Normal University, Ganzhou, Jiangxi, China
| | - Lixin Guan
- College of Physics and Electronic Information, Gannan Normal University, Ganzhou, Jiangxi, China
| | - Mengshan Li
- College of Physics and Electronic Information, Gannan Normal University, Ganzhou, Jiangxi, China
| |
Collapse
|
25
|
Yan K, Feng J, Huang J, Wu H. iDRPro-SC: identifying DNA-binding proteins and RNA-binding proteins based on subfunction classifiers. Brief Bioinform 2023:bbad251. [PMID: 37405873 DOI: 10.1093/bib/bbad251] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/27/2023] [Revised: 06/10/2023] [Accepted: 06/12/2023] [Indexed: 07/07/2023] Open
Abstract
Nucleic acid-binding proteins are proteins that interact with DNA and RNA to regulate gene expression and transcriptional control. The pathogenesis of many human diseases is related to abnormal gene expression. Therefore, recognizing nucleic acid-binding proteins accurately and efficiently has important implications for disease research. To address this question, some scientists have proposed the method of using sequence information to identify nucleic acid-binding proteins. However, different types of nucleic acid-binding proteins have different subfunctions, and these methods ignore their internal differences, so the performance of the predictor can be further improved. In this study, we proposed a new method, called iDRPro-SC, to predict the type of nucleic acid-binding proteins based on the sequence information. iDRPro-SC considers the internal differences of nucleic acid-binding proteins and combines their subfunctions to build a complete dataset. Additionally, we used an ensemble learning to characterize and predict nucleic acid-binding proteins. The results of the test dataset showed that iDRPro-SC achieved the best prediction performance and was superior to the other existing nucleic acid-binding protein prediction methods. We have established a web server that can be accessed online: http://bliulab.net/iDRPro-SC.
Collapse
Affiliation(s)
- Ke Yan
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
| | - Jiawei Feng
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
| | - Jing Huang
- Huajian Yutong Technology (Beijing) Co., Ltd
- State Key Laboratory of Media Convergence Production Technology and Systems, Beijing China,100803
- Xinhua New Media Culture Communication Co., Ltd
| | - Hao Wu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
| |
Collapse
|
26
|
Chen R, Li F, Guo X, Bi Y, Li C, Pan S, Coin LJM, Song J. ATTIC is an integrated approach for predicting A-to-I RNA editing sites in three species. Brief Bioinform 2023; 24:bbad170. [PMID: 37150785 PMCID: PMC10565902 DOI: 10.1093/bib/bbad170] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2023] [Revised: 04/12/2023] [Accepted: 04/14/2023] [Indexed: 05/09/2023] Open
Abstract
A-to-I editing is the most prevalent RNA editing event, which refers to the change of adenosine (A) bases to inosine (I) bases in double-stranded RNAs. Several studies have revealed that A-to-I editing can regulate cellular processes and is associated with various human diseases. Therefore, accurate identification of A-to-I editing sites is crucial for understanding RNA-level (i.e. transcriptional) modifications and their potential roles in molecular functions. To date, various computational approaches for A-to-I editing site identification have been developed; however, their performance is still unsatisfactory and needs further improvement. In this study, we developed a novel stacked-ensemble learning model, ATTIC (A-To-I ediTing predICtor), to accurately identify A-to-I editing sites across three species, including Homo sapiens, Mus musculus and Drosophila melanogaster. We first comprehensively evaluated 37 RNA sequence-derived features combined with 14 popular machine learning algorithms. Then, we selected the optimal base models to build a series of stacked ensemble models. The final ATTIC framework was developed based on the optimal models improved by the feature selection strategy for specific species. Extensive cross-validation and independent tests illustrate that ATTIC outperforms state-of-the-art tools for predicting A-to-I editing sites. We also developed a web server for ATTIC, which is publicly available at http://web.unimelb-bioinfortools.cloud.edu.au/ATTIC/. We anticipate that ATTIC can be utilized as a useful tool to accelerate the identification of A-to-I RNA editing events and help characterize their roles in post-transcriptional regulation.
Collapse
Affiliation(s)
- Ruyi Chen
- College of Information Engineering, Northwest A&F University, Shaanxi 712100, China
- The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, VIC 3000, Australia
| | - Fuyi Li
- College of Information Engineering, Northwest A&F University, Shaanxi 712100, China
- The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, VIC 3000, Australia
| | - Xudong Guo
- College of Information Engineering, Northwest A&F University, Shaanxi 712100, China
| | - Yue Bi
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, VIC 3800, Australia
| | - Chen Li
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, VIC 3800, Australia
| | - Shirui Pan
- School of Information and Communication Technology, Griffith University, QLD 4222, Australia
| | - Lachlan J M Coin
- The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, VIC 3000, Australia
| | - Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, VIC 3800, Australia
- Monash Data Futures Institute, Monash University, VIC 3800, Australia
| |
Collapse
|
27
|
Firoz A, Malik A, Ali HM, Akhter Y, Manavalan B, Kim CB. PRR-HyPred: A two-layer hybrid framework to predict pattern recognition receptors and their families by employing sequence encoded optimal features. Int J Biol Macromol 2023; 234:123622. [PMID: 36773859 DOI: 10.1016/j.ijbiomac.2023.123622] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2022] [Revised: 02/03/2023] [Accepted: 02/06/2023] [Indexed: 02/12/2023]
Abstract
Pattern recognition receptors (PRRs) recognize distinct features on the surface of pathogens or damaged cells and play key roles in the innate immune system. PRRs are divided into various families, including Toll-like receptors, retinoic acid-inducible gene-I-like receptors, nucleotide oligomerization domain-like receptors, and C-type lectin receptors. As these are implicated in host health and several diseases, their accurate identification is indispensable for their functional characterization and targeted therapeutic approaches. Here, we construct PRR-HyPred, a novel two-layer hybrid framework in which the first layer predicts whether a given sequence is PRR or non-PRR using a support vector machine, and in the second, the predicted PRR sequence is assigned to a specific family using a random forest-based classifier. Based on a 10-fold cross-validation test, PRR-HyPred achieved 83.4 % accuracy in the first layer and 95 % in the second, with Matthew's correlation coefficient values of 0.639 and 0.816, respectively. This is the first study that can simultaneously predict and classify PRRs into specific families. PRR-HyPred is available as a web portal at https://procarb.org/PRRHyPred/. We hope that it could be a valuable tool for the large-scale prediction and classification of PRRs and subsequently facilitate future studies.
Collapse
Affiliation(s)
- Ahmad Firoz
- Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia; Princess Dr. Najla Bint Saud Al- Saud Center for Excellence Research in Biotechnology, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Adeel Malik
- Institute of Intelligence Informatics Technology, Sangmyung University, Seoul, 03016, Republic of Korea.
| | - Hani Mohammed Ali
- Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia; Princess Dr. Najla Bint Saud Al- Saud Center for Excellence Research in Biotechnology, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Yusuf Akhter
- Department of Biotechnology, Babasaheb Bhimrao Ambedkar University, Vidya Vihar, Raebareli Road, Lucknow, Uttar Pradesh, 226025, India
| | - Balachandran Manavalan
- Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, 16419, Gyeonggi-do, Republic of Korea.
| | - Chang-Bae Kim
- Department of Biotechnology, Sangmyung University, Seoul, 03016, Republic of Korea.
| |
Collapse
|
28
|
Zulfiqar H, Ahmed Z, Kissanga Grace-Mercure B, Hassan F, Zhang ZY, Liu F. Computational prediction of promotors in Agrobacterium tumefaciens strain C58 by using the machine learning technique. Front Microbiol 2023; 14:1170785. [PMID: 37125199 PMCID: PMC10133480 DOI: 10.3389/fmicb.2023.1170785] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2023] [Accepted: 03/17/2023] [Indexed: 05/02/2023] Open
Abstract
Promotors are those genomic regions on the upstream of genes, which are bound by RNA polymerase for starting gene transcription. Because it is the most critical element of gene expression, the recognition of promoters is crucial to understand the regulation of gene expression. This study aimed to develop a machine learning-based model to predict promotors in Agrobacterium tumefaciens (A. tumefaciens) strain C58. In the model, promotor sequences were encoded by three different kinds of feature descriptors, namely, accumulated nucleotide frequency, k-mer nucleotide composition, and binary encodings. The obtained features were optimized by using correlation and the mRMR-based algorithm. These optimized features were inputted into a random forest (RF) classifier to discriminate promotor sequences from non-promotor sequences in A. tumefaciens strain C58. The examination of 10-fold cross-validation showed that the proposed model could yield an overall accuracy of 0.837. This model will provide help for the study of promoters in A. tumefaciens C58 strain.
Collapse
Affiliation(s)
- Hasan Zulfiqar
- Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, China
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Zahoor Ahmed
- Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, China
| | - Bakanina Kissanga Grace-Mercure
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Farwa Hassan
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Zhao-Yue Zhang
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Fen Liu
- Department of Radiation Oncology, Peking University Cancer Hospital (Inner Mongolia Campus), Affiliated Cancer Hospital of Inner Mongolia Medical University, Inner Mongolia Cancer Hospital, Hohhot, China
| |
Collapse
|
29
|
Zhang L, Li H, Zhang Z, Wang J, Chen G, Chen D, Shi W, Jia G, Liu M. Hybrid gMLP model for interaction prediction of MHC-peptide and TCR. Front Genet 2023; 13:1092822. [PMID: 36685858 PMCID: PMC9845249 DOI: 10.3389/fgene.2022.1092822] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2022] [Accepted: 12/01/2022] [Indexed: 01/05/2023] Open
Abstract
Understanding the interaction of T-cell receptor (TCR) with major histocompatibility-peptide (MHC-peptide) complex is extremely important in human immunotherapy and vaccine development. However, due to the limited available data, the performance of existing models for predicting the interaction of T-cell receptors (TCR) with major histocompatibility-peptide complexes is still unsatisfactory. Deep learning models have been applied to prediction tasks in various fields and have achieved better results compared with other traditional models. In this study, we leverage the gMLP model combined with attention mechanism to predict the interaction of MHC-peptide and TCR. Experiments show that our model can predict TCR-peptide interactions accurately and can handle the problems caused by different TCR lengths. Moreover, we demonstrate that the models trained with paired CDR3β-chain and CDR3α-chain data are better than those trained with only CDR3β-chain or with CDR3α-chain data. We also demonstrate that the hybrid model has greater potential than the traditional convolutional neural network.
Collapse
Affiliation(s)
- Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen, China
| | - Haojin Li
- School of Software, Shandong University, Jinan, China
| | - Zhenjiu Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen, China
| | - Jinjin Wang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen, China
| | | | | | - Wentao Shi
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen, China
| | - Gaozhi Jia
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen, China
| | - Mingjun Liu
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen, China
| |
Collapse
|
30
|
Ren J, Li HW, Chen L, Zhang M, Liu YX, Zhang BW, Xu R, Miao YY, Xu XM, Hua X, Sun XG, Yu RJ, Long YT, Hu SS. Mass Spectrometry Imaging-Based Single-Cell Lipidomics Profiles Metabolic Signatures of Heart Failure. RESEARCH 2023; 6:0019. [PMID: 37040505 PMCID: PMC10076023 DOI: 10.34133/research.0019] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/17/2022] [Accepted: 11/14/2022] [Indexed: 01/13/2023]
Abstract
Heart failure (HF), leading as one of the main causes of mortality, has become a serious public health issue with high prevalence around the world. Single cardiomyocyte (CM) metabolomics promises to revolutionize the understanding of HF pathogenesis since the metabolic remodeling in the human hearts plays a vital role in the disease progression. Unfortunately, current metabolic analysis is often limited by the dynamic features of metabolites and the critical needs for high-quality isolated CMs. Here, high-quality CMs were directly isolated from transgenic HF mice biopsies and further employed in the cellular metabolic analysis. The lipids landscape in individual CMs was profiled with a delayed extraction mode in time-of-flight secondary ion mass spectrometry. Specific metabolic signatures were identified to distinguish HF CMs from the control subjects, presenting as possible single-cell biomarkers. The spatial distributions of these signatures were imaged in single cells, and those were further found to be strongly associated with lipoprotein metabolism, transmembrane transport, and signal transduction. Taken together, we systematically studied the lipid metabolism of single CMs with a mass spectrometry imaging method, which directly benefited the identification of HF-associated signatures and a deeper understanding of HF-related metabolic pathways.
Collapse
Affiliation(s)
- Jie Ren
- State Key Laboratory of Cardiovascular Disease, Fuwai Hospital, National Center for Cardiovascular Disease, Chinese Academy of Medicine Science (CAMS) and Perking Union Medical College (PUMC), Beijing, 100037, P. R. China
| | - Hao-Wen Li
- Institute of Molecular Medicine, Shanghai Key Laboratory for Nucleic Acid Chemistry and Nanomedicine, State Key Laboratory of Oncogenes and Related Genes, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, P. R. China
- School of Chemistry and Molecular Engineering, East China University of Science and Technology, Shanghai, 200237, P. R. China
| | - Liang Chen
- State Key Laboratory of Cardiovascular Disease, Fuwai Hospital, National Center for Cardiovascular Disease, Chinese Academy of Medicine Science (CAMS) and Perking Union Medical College (PUMC), Beijing, 100037, P. R. China
| | - Min Zhang
- School of Chemistry and Molecular Engineering, East China University of Science and Technology, Shanghai, 200237, P. R. China
| | - Yan-Xiang Liu
- State Key Laboratory of Cardiovascular Disease, Fuwai Hospital, National Center for Cardiovascular Disease, Chinese Academy of Medicine Science (CAMS) and Perking Union Medical College (PUMC), Beijing, 100037, P. R. China
| | - Bo-Wen Zhang
- State Key Laboratory of Cardiovascular Disease, Fuwai Hospital, National Center for Cardiovascular Disease, Chinese Academy of Medicine Science (CAMS) and Perking Union Medical College (PUMC), Beijing, 100037, P. R. China
| | - Rui Xu
- Institute of Molecular Medicine, Shanghai Key Laboratory for Nucleic Acid Chemistry and Nanomedicine, State Key Laboratory of Oncogenes and Related Genes, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, P. R. China
| | - Yan-Yan Miao
- Institute of Molecular Medicine, Shanghai Key Laboratory for Nucleic Acid Chemistry and Nanomedicine, State Key Laboratory of Oncogenes and Related Genes, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, P. R. China
| | - Xue-Mei Xu
- Institute of Molecular Medicine, Shanghai Key Laboratory for Nucleic Acid Chemistry and Nanomedicine, State Key Laboratory of Oncogenes and Related Genes, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, P. R. China
| | - Xin Hua
- School of Chemistry and Molecular Engineering, East China University of Science and Technology, Shanghai, 200237, P. R. China
- Jiangsu Engineering Laboratory of Smart Carbon-Rich Materials and Device, School of Chemistry and Chemical Engineering, Southeast University, Nanjing, 211189, P. R. China
| | - Xiao-Gang Sun
- State Key Laboratory of Cardiovascular Disease, Fuwai Hospital, National Center for Cardiovascular Disease, Chinese Academy of Medicine Science (CAMS) and Perking Union Medical College (PUMC), Beijing, 100037, P. R. China
| | - Ru-Jia Yu
- School of Chemistry and Molecular Engineering, East China University of Science and Technology, Shanghai, 200237, P. R. China
- State Key Laboratory of Analytical Chemistry for Life Science, School of Chemistry and Chemical Engineering, Nanjing University, Nanjing, 210023, P. R. China
| | - Yi-Tao Long
- School of Chemistry and Molecular Engineering, East China University of Science and Technology, Shanghai, 200237, P. R. China
- State Key Laboratory of Analytical Chemistry for Life Science, School of Chemistry and Chemical Engineering, Nanjing University, Nanjing, 210023, P. R. China
| | - Sheng-Shou Hu
- State Key Laboratory of Cardiovascular Disease, Fuwai Hospital, National Center for Cardiovascular Disease, Chinese Academy of Medicine Science (CAMS) and Perking Union Medical College (PUMC), Beijing, 100037, P. R. China
| |
Collapse
|
31
|
Yan K, Lv H, Guo Y, Peng W, Liu B. sAMPpred-GAT: prediction of antimicrobial peptide by graph attention network and predicted peptide structure. Bioinformatics 2023; 39:btac715. [PMID: 36342186 PMCID: PMC9805557 DOI: 10.1093/bioinformatics/btac715] [Citation(s) in RCA: 53] [Impact Index Per Article: 53.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2022] [Revised: 10/24/2022] [Accepted: 11/04/2022] [Indexed: 11/09/2022] Open
Abstract
MOTIVATION Antimicrobial peptides (AMPs) are essential components of therapeutic peptides for innate immunity. Researchers have developed several computational methods to predict the potential AMPs from many candidate peptides. With the development of artificial intelligent techniques, the protein structures can be accurately predicted, which are useful for protein sequence and function analysis. Unfortunately, the predicted peptide structure information has not been applied to the field of AMP prediction so as to improve the predictive performance. RESULTS In this study, we proposed a computational predictor called sAMPpred-GAT for AMP identification. To the best of our knowledge, sAMPpred-GAT is the first approach based on the predicted peptide structures for AMP prediction. The sAMPpred-GAT predictor constructs the graphs based on the predicted peptide structures, sequence information and evolutionary information. The Graph Attention Network (GAT) is then performed on the graphs to learn the discriminative features. Finally, the full connection networks are utilized as the output module to predict whether the peptides are AMP or not. Experimental results show that sAMPpred-GAT outperforms the other state-of-the-art methods in terms of AUC, and achieves better or highly comparable performance in terms of the other metrics on the eight independent test datasets, demonstrating that the predicted peptide structure information is important for AMP prediction. AVAILABILITY AND IMPLEMENTATION A user-friendly webserver of sAMPpred-GAT can be accessed at http://bliulab.net/sAMPpred-GAT and the source code is available at https://github.com/HongWuL/sAMPpred-GAT/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Ke Yan
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
| | - Hongwu Lv
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
| | - Yichen Guo
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
| | - Wei Peng
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
| | - Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
- Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing 100081, China
| |
Collapse
|
32
|
Dao FY, Lv H, Fullwood MJ, Lin H. Accurate Identification of DNA Replication Origin by Fusing Epigenomics and Chromatin Interaction Information. RESEARCH (WASHINGTON, D.C.) 2022; 2022:9780293. [PMID: 36405252 PMCID: PMC9667886 DOI: 10.34133/2022/9780293] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/20/2022] [Accepted: 09/30/2022] [Indexed: 07/29/2023]
Abstract
DNA replication initiation is a complex process involving various genetic and epigenomic signatures. The correct identification of replication origins (ORIs) could provide important clues for the study of a variety of diseases caused by replication. Here, we design a computational approach named iORI-Epi to recognize ORIs by incorporating epigenome-based features, sequence-based features, and 3D genome-based features. The iORI-Epi displays excellent robustness and generalization ability on both training datasets and independent datasets of K562 cell line. Further experiments confirm that iORI-Epi is highly scalable in other cell lines (MCF7 and HCT116). We also analyze and clarify the regulatory role of epigenomic marks, DNA motifs, and chromatin interaction in DNA replication initiation of eukaryotic genomes. Finally, we discuss gene enrichment pathways from the perspective of ORIs in different replication timing states and heuristically dissect the effect of promoters on replication initiation. Our computational methodology is worth extending to ORI identification in other eukaryotic species.
Collapse
Affiliation(s)
- Fu-Ying Dao
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- School of Biological Sciences, Nanyang Technological University, Singapore 639798, Singapore
- Cancer Science Institute of Singapore, National University of Singapore, 14 Medical Dr, Singapore 117599, Singapore
| | - Hao Lv
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Department of Molecular Life Sciences, University of Zurich, Winterthurerstrasse 190, 8057 Zurich, Switzerland
| | - Melissa J. Fullwood
- School of Biological Sciences, Nanyang Technological University, Singapore 639798, Singapore
- Cancer Science Institute of Singapore, National University of Singapore, 14 Medical Dr, Singapore 117599, Singapore
- Institute of Molecular and Cell Biology, Agency for Science, Technology and Research (A∗STAR), Singapore 138673, Singapore
| | - Hao Lin
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| |
Collapse
|
33
|
Liu S, Cui C, Chen H, Liu T. Ensemble learning-based feature selection for phosphorylation site detection. Front Genet 2022; 13:984068. [PMID: 36338976 PMCID: PMC9634105 DOI: 10.3389/fgene.2022.984068] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2022] [Accepted: 10/05/2022] [Indexed: 11/18/2022] Open
Abstract
SARS-COV-2 is prevalent all over the world, causing more than six million deaths and seriously affecting human health. At present, there is no specific drug against SARS-COV-2. Protein phosphorylation is an important way to understand the mechanism of SARS -COV-2 infection. It is often expensive and time-consuming to identify phosphorylation sites with specific modified residues through experiments. A method that uses machine learning to make predictions about them is proposed. As all the methods of extracting protein sequence features are knowledge-driven, these features may not be effective for detecting phosphorylation sites without a complete understanding of the mechanism of protein. Moreover, redundant features also have a great impact on the fitting degree of the model. To solve these problems, we propose a feature selection method based on ensemble learning, which firstly extracts protein sequence features based on knowledge, then quantifies the importance score of each feature based on data, and finally uses the subset of important features as the final features to predict phosphorylation sites.
Collapse
Affiliation(s)
- Songbo Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Chengmin Cui
- Beijing Institute of Control Engineering, China Academy of Space Technology, Beijing, China
| | - Huipeng Chen
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Tong Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| |
Collapse
|
34
|
Yuan SS, Gao D, Xie XQ, Ma CY, Su W, Zhang ZY, Zheng Y, Ding H. IBPred: A sequence-based predictor for identifying ion binding protein in phage. Comput Struct Biotechnol J 2022; 20:4942-4951. [PMID: 36147670 PMCID: PMC9474292 DOI: 10.1016/j.csbj.2022.08.053] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/03/2022] [Revised: 08/23/2022] [Accepted: 08/24/2022] [Indexed: 11/16/2022] Open
Abstract
Ion binding proteins (IBPs) can selectively and non-covalently interact with ions. IBPs in phages also play an important role in biological processes. Therefore, accurate identification of IBPs is necessary for understanding their biological functions and molecular mechanisms that involve binding to ions. Since molecular biology experimental methods are still labor-intensive and cost-ineffective in identifying IBPs, it is helpful to develop computational methods to identify IBPs quickly and efficiently. In this work, a random forest (RF)-based model was constructed to quickly identify IBPs. Based on the protein sequence information and residues' physicochemical properties, the dipeptide composition combined with the physicochemical correlation between two residues were proposed for the extraction of features. A feature selection technique called analysis of variance (ANOVA) was used to exclude redundant information. By comparing with other classified methods, we demonstrated that our method could identify IBPs accurately. Based on the model, a Python package named IBPred was built with the source code which can be accessed at https://github.com/ShishiYuan/IBPred.
Collapse
Affiliation(s)
- Shi-Shi Yuan
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Dong Gao
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Xue-Qin Xie
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Cai-Yi Ma
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Wei Su
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Zhao-Yue Zhang
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu 611844, China
| | - Yan Zheng
- Baotou Medical College, Baotou 014040, China
| | - Hui Ding
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| |
Collapse
|
35
|
Malik A, Mahajan N, Dar TA, Kim CB. C10Pred: A First Machine Learning Based Tool to Predict C10 Family Cysteine Peptidases Using Sequence-Derived Features. Int J Mol Sci 2022; 23:ijms23179518. [PMID: 36076915 PMCID: PMC9455582 DOI: 10.3390/ijms23179518] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2022] [Revised: 08/17/2022] [Accepted: 08/20/2022] [Indexed: 12/02/2022] Open
Abstract
Streptococcus pyogenes, or group A Streptococcus (GAS), a gram-positive bacterium, is implicated in a wide range of clinical manifestations and life-threatening diseases. One of the key virulence factors of GAS is streptopain, a C10 family cysteine peptidase. Since its discovery, various homologs of streptopain have been reported from other bacterial species. With the increased affordability of sequencing, a significant increase in the number of potential C10 family-like sequences in the public databases is anticipated, posing a challenge in classifying such sequences. Sequence-similarity-based tools are the methods of choice to identify such streptopain-like sequences. However, these methods depend on some level of sequence similarity between the existing C10 family and the target sequences. Therefore, in this work, we propose a novel predictor, C10Pred, for the prediction of C10 peptidases using sequence-derived optimal features. C10Pred is a support vector machine (SVM) based model which is efficient in predicting C10 enzymes with an overall accuracy of 92.7% and Matthews’ correlation coefficient (MCC) value of 0.855 when tested on an independent dataset. We anticipate that C10Pred will serve as a handy tool to classify novel streptopain-like proteins belonging to the C10 family and offer essential information.
Collapse
Affiliation(s)
- Adeel Malik
- Institute of Intelligence Informatics Technology, Sangmyung University, Seoul 03016, Korea
- Correspondence: (A.M.); (C.-B.K.)
| | - Nitin Mahajan
- Department of Pediatrics, Washington University in St. Louis, St. Louis, MO 63110, USA
| | - Tanveer Ali Dar
- Department of Clinical Biochemistry, University of Kashmir, Srinagar 190006, India
| | - Chang-Bae Kim
- Department of Biotechnology, Sangmyung University, Seoul 03016, Korea
- Correspondence: (A.M.); (C.-B.K.)
| |
Collapse
|
36
|
FRTpred: A novel approach for accurate prediction of protein folding rate and type. Comput Biol Med 2022; 149:105911. [DOI: 10.1016/j.compbiomed.2022.105911] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2022] [Revised: 07/08/2022] [Accepted: 07/23/2022] [Indexed: 11/20/2022]
|
37
|
Thi Phan L, Woo Park H, Pitti T, Madhavan T, Jeon YJ, Manavalan B. MLACP 2.0: An updated machine learning tool for anticancer peptide prediction. Comput Struct Biotechnol J 2022; 20:4473-4480. [PMID: 36051870 PMCID: PMC9421197 DOI: 10.1016/j.csbj.2022.07.043] [Citation(s) in RCA: 33] [Impact Index Per Article: 16.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2022] [Revised: 07/25/2022] [Accepted: 07/25/2022] [Indexed: 12/24/2022] Open
Abstract
We present a novel meta-approach, MLACP 2.0, and implement it as a user-friendly webserver for the accurate identification of ACPs. MLACP 2.0 employed 11 different encoding schemes and eight different classifiers, including convolutional neural networks, to create a stable meta-model. Benchmarking study has demonstrated that MLACP 2.0 achieves superior performance in ACP prediction compared to publicly available state-of-the-art predictors.
Anticancer peptides are emerging anticancer drug that offers fewer side effects and is more effective than chemotherapy and targeted therapy. Predicting anticancer peptides from sequence information is one of the most challenging tasks in immunoinformatics. In the past ten years, machine learning-based approaches have been proposed for identifying ACP activity from peptide sequences. These methods include our previous method MLACP (developed in 2017) which made a significant impact on anticancer research. MLACP tool has been widely used by the research community, however, its robustness must be improved significantly for its continued practical application. In this study, the first large non-redundant training and independent datasets were constructed for ACP research. Using the training dataset, the study explored a wide range of feature encodings and developed their respective models using seven different conventional classifiers. Subsequently, a subset of encoding-based models was selected for each classifier based on their performance, whose predicted scores were concatenated and trained through a convolutional neural network (CNN), whose corresponding predictor is named MLACP 2.0. The evaluation of MLACP 2.0 with a very diverse independent dataset showed excellent performance and significantly outperformed the recent ACP prediction tools. Additionally, MLACP 2.0 exhibits superior performance during cross-validation and independent assessment when compared to CNN-based embedding models and conventional single models. Consequently, we anticipate that our proposed MLACP 2.0 will facilitate the design of hypothesis-driven experiments by making it easier to discover novel ACPs. The MLACP 2.0 is freely available at https://balalab-skku.org/mlacp2.
Collapse
|
38
|
Sun Z, Huang Q, Yang Y, Li S, Lv H, Zhang Y, Lin H, Ning L. PSnoD: identifying potential snoRNA-disease associations based on bounded nuclear norm regularization. Brief Bioinform 2022; 23:6640008. [PMID: 35817303 DOI: 10.1093/bib/bbac240] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/26/2022] [Revised: 05/16/2022] [Accepted: 05/24/2022] [Indexed: 12/19/2022] Open
Abstract
Many studies have proved that small nucleolar RNAs (snoRNAs) play critical roles in the development of various human complex diseases. Discovering the associations between snoRNAs and diseases is an important step toward understanding the pathogenesis and characteristics of diseases. However, uncovering associations via traditional experimental approaches is costly and time-consuming. This study proposed a bounded nuclear norm regularization-based method, called PSnoD, to predict snoRNA-disease associations. Benchmark experiments showed that compared with the state-of-the-art methods, PSnoD achieved a superior performance in the 5-fold stratified shuffle split. PSnoD produced a robust performance with an area under receiver-operating characteristic of 0.90 and an area under precision-recall of 0.55, highlighting the effectiveness of our proposed method. In addition, the computational efficiency of PSnoD was also demonstrated by comparison with other matrix completion techniques. More importantly, the case study further elucidated the ability of PSnoD to screen potential snoRNA-disease associations. The code of PSnoD has been uploaded to https://github.com/linDing-groups/PSnoD. Based on PSnoD, we established a web server that is freely accessed via http://psnod.lin-group.cn/.
Collapse
Affiliation(s)
- Zijie Sun
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China.,School of Healthcare Technology, Chengdu Neusoft University, Chengdu 611844, China
| | - Qinlai Huang
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China.,School of Healthcare Technology, Chengdu Neusoft University, Chengdu 611844, China
| | - Yuhe Yang
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Shihao Li
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Hao Lv
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Yang Zhang
- Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
| | - Hao Lin
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Lin Ning
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu 611844, China
| |
Collapse
|
39
|
Fan Y, Peng B. StackEPI: identification of cell line-specific enhancer-promoter interactions based on stacking ensemble learning. BMC Bioinformatics 2022; 23:272. [PMID: 35820811 PMCID: PMC9277947 DOI: 10.1186/s12859-022-04821-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/04/2022] [Accepted: 07/01/2022] [Indexed: 11/10/2022] Open
Abstract
Background Understanding the regulatory role of enhancer–promoter interactions (EPIs) on specific gene expression in cells contributes to the understanding of gene regulation, cell differentiation, etc., and its identification has been a challenging task. On the one hand, using traditional wet experimental methods to identify EPIs often means a lot of human labor and time costs. On the other hand, although the currently proposed computational methods have good recognition effects, they generally require a long training time. Results In this study, we studied the EPIs of six human cell lines and designed a cell line-specific EPIs prediction method based on a stacking ensemble learning strategy, which has better prediction performance and faster training speed, called StackEPI. Specifically, by combining different encoding schemes and machine learning methods, our prediction method can extract the cell line-specific effective information of enhancer and promoter gene sequences comprehensively and in many directions, and make accurate recognition of cell line-specific EPIs. Ultimately, the source code to implement StackEPI and experimental data involved in the experiment are available at https://github.com/20032303092/StackEPI.git. Conclusions The comparison results show that our model can deliver better performance on the problem of identifying cell line-specific EPIs and outperform other state-of-the-art models. In addition, our model also has a more efficient computation speed. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04821-9.
Collapse
Affiliation(s)
- Yongxian Fan
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541004, China.
| | - Binchao Peng
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541004, China
| |
Collapse
|
40
|
Jeon YJ, Hasan MM, Park HW, Lee KW, Manavalan B. TACOS: a novel approach for accurate prediction of cell-specific long noncoding RNAs subcellular localization. Brief Bioinform 2022; 23:6618237. [PMID: 35753698 PMCID: PMC9294414 DOI: 10.1093/bib/bbac243] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2022] [Revised: 05/23/2022] [Accepted: 05/24/2022] [Indexed: 11/14/2022] Open
Abstract
Long noncoding RNAs (lncRNAs) are primarily regulated by their cellular localization, which is responsible for their molecular functions, including cell cycle regulation and genome rearrangements. Accurately identifying the subcellular location of lncRNAs from sequence information is crucial for a better understanding of their biological functions and mechanisms. In contrast to traditional experimental methods, bioinformatics or computational methods can be applied for the annotation of lncRNA subcellular locations in humans more effectively. In the past, several machine learning-based methods have been developed to identify lncRNA subcellular localization, but relevant work for identifying cell-specific localization of human lncRNA remains limited. In this study, we present the first application of the tree-based stacking approach, TACOS, which allows users to identify the subcellular localization of human lncRNA in 10 different cell types. Specifically, we conducted comprehensive evaluations of six tree-based classifiers with 10 different feature descriptors, using a newly constructed balanced training dataset for each cell type. Subsequently, the strengths of the AdaBoost baseline models were integrated via a stacking approach, with an appropriate tree-based classifier for the final prediction. TACOS displayed consistent performance in both the cross-validation and independent assessments compared with the other two approaches employed in this study. The user-friendly online TACOS web server can be accessed at https://balalab-skku.org/TACOS.
Collapse
Affiliation(s)
- Young-Jun Jeon
- Department of Integrative Biotechnology, College of Bioengineering and Biotechnology, Sungkyunkwan University, Suwon 16419, Korea
| | - Md Mehedi Hasan
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA 70112, USA
| | - Hyun Woo Park
- Department of Integrative Biotechnology, College of Bioengineering and Biotechnology, Sungkyunkwan University, Suwon 16419, Korea
| | - Ki Wook Lee
- Department of Integrative Biotechnology, College of Bioengineering and Biotechnology, Sungkyunkwan University, Suwon 16419, Korea
| | - Balachandran Manavalan
- Computational Biology and Bioinformatics laboratory, Department of Integrative Biotechnology, College of Bioengineering and Biotechnology, Sungkyunkwan University, Suwon 16419, Korea
| |
Collapse
|
41
|
Yao Y, Zhang S, Xue T. Integrating LASSO Feature Selection and Soft Voting Classifier to Identify Origins of Replication Sites. Curr Genomics 2022; 23:83-93. [PMID: 36778978 PMCID: PMC9878833 DOI: 10.2174/1389202923666220214122506] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2021] [Revised: 12/11/2021] [Accepted: 01/18/2022] [Indexed: 11/22/2022] Open
Abstract
Background: DNA replication plays an indispensable role in the transmission of genetic information. It is considered to be the basis of biological inheritance and the most fundamental process in all biological life. Considering that DNA replication initiates with a special location, namely the origin of replication, a better and accurate prediction of the origins of replication sites (ORIs) is essential to gain insight into the relationship with gene expression. Objective: In this study, we have developed an efficient predictor called iORI-LAVT for ORIs identification. Methods: This work focuses on extracting feature information from three aspects, including mono-nucleotide encoding, k-mer and ring-function-hydrogen-chemical properties. Subsequently, least absolute shrinkage and selection operator (LASSO) as a feature selection is applied to select the optimal features. Comparing the different combined soft voting classifiers results, the soft voting classifier based on GaussianNB and Logistic Regression is employed as the final classifier. Results: Based on 10-fold cross-validation test, the prediction accuracies of two benchmark datasets are 90.39% and 95.96%, respectively. As for the independent dataset, our method achieves high accuracy of 91.3%. Conclusion: Compared with previous predictors, iORI-LAVT outperforms the existing methods. It is believed that iORI-LAVT predictor is a promising alternative for further research on identifying ORIs.
Collapse
Affiliation(s)
- Yingying Yao
- School of Mathematics and Statistics, Xidian University, Xi’an 710071, P.R. China
| | - Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi’an 710071, P.R. China,Address correspondence to this author at the School of Mathematics and Statistics, Xidian University, Xi’an 710071, P.R. China; Tel/Fax: +86-29- 88202860; E-mail:
| | - Tian Xue
- School of Mathematics and Statistics, Xidian University, Xi’an 710071, P.R. China
| |
Collapse
|
42
|
Yan J, Jiang T, Liu J, Lu Y, Guan S, Li H, Wu H, Ding Y. DNA-binding protein prediction based on deep transfer learning. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2022; 19:7719-7736. [PMID: 35801442 DOI: 10.3934/mbe.2022362] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Abstract
The study of DNA binding proteins (DBPs) is of great importance in the biomedical field and plays a key role in this field. At present, many researchers are working on the prediction and detection of DBPs. Traditional DBP prediction mainly uses machine learning methods. Although these methods can obtain relatively high pre-diction accuracy, they consume large quantities of human effort and material resources. Transfer learning has certain advantages in dealing with such prediction problems. Therefore, in the present study, two features were extracted from a protein sequence, a transfer learning method was used, and two classical transfer learning algorithms were compared to transfer samples and construct data sets. In the final step, DBPs are detected by building a deep learning neural network model in a way that uses attention mechanisms.
Collapse
Affiliation(s)
- Jun Yan
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Tengsheng Jiang
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Junkai Liu
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Yaoyao Lu
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Shixuan Guan
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Haiou Li
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Hongjie Wu
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
- Suzhou Smart City Research Institute, Suzhou University of Science and Technology, Suzhou, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| |
Collapse
|
43
|
Wu H, Liang Q, Zhang W, Zou Q, El-Latif Hesham A, Liu B. iLncDA-LTR: Identification of lncRNA-disease associations by learning to rank. Comput Biol Med 2022; 146:105605. [PMID: 35594681 DOI: 10.1016/j.compbiomed.2022.105605] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2022] [Revised: 04/27/2022] [Accepted: 05/09/2022] [Indexed: 12/12/2022]
Abstract
Identifying the associations between lncRNAs and diseases is helpful for the treatment and diagnosis of complex diseases. The existing computational methods mainly focus on the identification of associations between known lncRNAs and known diseases. However, with the application of high-throughput sequencing in lncRNA research, more and more lncRNAs have been detected. Predicting diseases related with newly detected lncRNAs has not been fully explored. Therefore, there is an urgent need for developing powerful computational methods to predict diseases related with newly detected lncRNAs. In this paper, we propose a Learning to Rank (LTR)-based method called iLncDA-LTR to predict diseases related with newly detected lncRNAs. iLncDA-LTR treats this task as an information retrieval task. The newly detected lncRNAs and diseases are considered as queries and documents, respectively. For a given newly detected lncRNA (query), iLncDA-LTR integrates multiple relevant information into LTR for predicting candidate diseases associated with query lncRNA. Experimental results show that iLncDA-LTR outperforms the other exiting state-of-the-art predictors on independent dataset. The corresponding web server of iLncDA-LTR has been constructed as well (http://bliulab.net/iLncDA-LTR/).
Collapse
Affiliation(s)
- Hao Wu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China.
| | - Qi Liang
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China.
| | - Wenxiang Zhang
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China.
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China.
| | - Abd El-Latif Hesham
- Genetics Department, Faculty of Agriculture, Beni-Suef University, Beni-Suef, 62511, Egypt.
| | - Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China; Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China.
| |
Collapse
|
44
|
Hasan MM, Tsukiyama S, Cho JY, Kurata H, Alam MA, Liu X, Manavalan B, Deng HW. Deepm5C: A deep learning-based hybrid framework for identifying human RNA N5-methylcytosine sites using a stacking strategy. Mol Ther 2022; 30:2856-2867. [PMID: 35526094 PMCID: PMC9372321 DOI: 10.1016/j.ymthe.2022.05.001] [Citation(s) in RCA: 46] [Impact Index Per Article: 23.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2021] [Revised: 04/25/2022] [Accepted: 05/03/2022] [Indexed: 11/30/2022] Open
Abstract
As one of the most prevalent post-transcriptional epigenetic modifications, N5-methylcytosine (m5C), plays an essential role in various cellular processes and disease pathogenesis. Therefore, it is important accurately identify m5C modifications in order to gain a deeper understanding of cellular processes and other possible functional mechanisms. Although a few computational methods have been proposed, their respective models have been developed using small training datasets. Hence, their practical application is quite limited in genome-wide detection. To overcome the existing limitations, we propose Deepm5C, a bioinformatics method to identify RNA m5C sites in the throughout human genome. To develop Deepm5C, we constructed a novel benchmarking dataset and investigated a mixture of three conventional feature encoding algorithms and a feature derived from word embedding approaches. Afterwards, four variants of deep learning classifiers and four commonly used conventional classifiers were employed and trained with the four encodings, ultimately obtaining 32 baseline models. A stacking strategy is effectively utilized by integrating the predicted output of the optimal baseline models and trained with a 1-D convolutional neural network. As a result, the Deepm5C predictor achieved excellent performance during cross-validation with a Matthews correlation coefficient and accuracy of 0.697 and 0.855, respectively. The corresponding metrics during the independent test were 0.691 and 0.852, respectively. Overall, Deepm5C achieved a more accurate and stable performance than the baseline models and significantly outperformed the existing predictors, demonstrating the effectiveness of our proposed hybrid framework. Furthermore, Deepm5C is expected to assist community-wide efforts in identifying putative m5Cs and formulate the novel testable biological hypothesis.
Collapse
Affiliation(s)
- Md Mehedi Hasan
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, 70112 USA.
| | - Sho Tsukiyama
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
| | - Jae Youl Cho
- Molecular Immunology Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Korea
| | - Hiroyuki Kurata
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
| | - Md Ashad Alam
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, 70112 USA
| | - Xiaowen Liu
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, 70112 USA
| | - Balachandran Manavalan
- Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Korea.
| | - Hong-Wen Deng
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, 70112 USA.
| |
Collapse
|
45
|
Lv H, Yan K, Guo Y, Zou Q, Hesham AEL, Liu B. AMPpred-EL: An effective antimicrobial peptide prediction model based on ensemble learning. Comput Biol Med 2022; 146:105577. [PMID: 35576825 DOI: 10.1016/j.compbiomed.2022.105577] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2022] [Revised: 04/22/2022] [Accepted: 04/28/2022] [Indexed: 11/15/2022]
Abstract
Antimicrobial peptides (AMPs) are important for the human immune system and are currently applied in clinical trials. AMPs have been received much attention for accurate recognition. Recently, several computational methods for identifying AMPs have been proposed. However, existing methods have difficulty in accurately predicting AMPs. In this paper, we propose a novel AMP prediction method called AMPpred-EL based on an ensemble learning strategy. AMPred-EL is constructed based on ensemble learning combined with LightGBM and logistic regression. Experimental results demonstrate that AMPpred-EL outperforms several state-of-the-art methods on the benchmark datasets and then improves the efficiency performance.
Collapse
Affiliation(s)
- Hongwu Lv
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China.
| | - Ke Yan
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China.
| | - Yichen Guo
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China.
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China.
| | - Abd El-Latif Hesham
- Genetics Department, Faculty of Agriculture, Beni-Suef University, Beni-Suef, 62511, Egypt.
| | - Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China; Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China.
| |
Collapse
|
46
|
Wu X, Zeng W, Lin F, Xu P, Li X. Anticancer Peptide Prediction via Multi-Kernel CNN and Attention Model. Front Genet 2022; 13:887894. [PMID: 35571059 PMCID: PMC9092594 DOI: 10.3389/fgene.2022.887894] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/02/2022] [Accepted: 03/25/2022] [Indexed: 11/13/2022] Open
Abstract
Background: Modern lifestyles mean that people are more likely to suffer from some form of cancer. As anticancer peptides can effectively kill cancer cells and play an important role in fighting cancer, they have been a subject of increasing research interest. Methods: This study presents a useful tool to identify the anticancer peptides based on a multi-kernel CNN and attention model, called ACP-MCAM. This model can automatically learn adaptive embedding and the context sequence features of ACP. In addition, to obtain better interpretability and integrity, we visualized the model. Results: Benchmarking comparison shows that ACP-MCAM significantly outperforms several state-of-the-art models. Different encoding schemes have different impacts on the performance of the model. We also studied tmethod parameter optimization. Conclusion: The ACP-MCAM can integrate multi-kernel CNN and self-attention mechanism, which outperforms the previous model in identifying anticancer peptides. It is expected that the work will provide new research ideas for anticancer peptide prediction in the future. In addition, this work will promote the development of the interdisciplinary field of artificial intelligence and biomedicine.
Collapse
Affiliation(s)
- Xiujin Wu
- School of Informatics, Xiamen University, Xiamen, China
| | - Wenhua Zeng
- School of Informatics, Xiamen University, Xiamen, China
| | - Fan Lin
- School of Informatics, Xiamen University, Xiamen, China
- Boston Children’s Hospital, Boston, MA, United States
| | - Peng Xu
- Chongqing Michong Technology Co., Ltd., Chongqing, China
| | | |
Collapse
|
47
|
MLCPP 2.0: An updated cell-penetrating peptides and their uptake efficiency predictor. J Mol Biol 2022; 434:167604. [DOI: 10.1016/j.jmb.2022.167604] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2021] [Revised: 04/03/2022] [Accepted: 04/19/2022] [Indexed: 12/12/2022]
|
48
|
Shoombuatong W, Basith S, Pitti T, Lee G, Manavalan B. THRONE: a new approach for accurate prediction of human RNA N7-methylguanosine sites. J Mol Biol 2022; 434:167549. [DOI: 10.1016/j.jmb.2022.167549] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2021] [Revised: 03/08/2022] [Accepted: 03/10/2022] [Indexed: 12/30/2022]
|
49
|
Han YM, Yang H, Huang QL, Sun ZJ, Li ML, Zhang JB, Deng KJ, Chen S, Lin H. Risk prediction of diabetes and pre-diabetes based on physical examination data. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2022; 19:3597-3608. [PMID: 35341266 DOI: 10.3934/mbe.2022166] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/14/2023]
Abstract
Diabetes is a metabolic disorder caused by insufficient insulin secretion and insulin secretion disorders. From health to diabetes, there are generally three stages: health, pre-diabetes and type 2 diabetes. Early diagnosis of diabetes is the most effective way to prevent and control diabetes and its complications. In this work, we collected the physical examination data from Beijing Physical Examination Center from January 2006 to December 2017, and divided the population into three groups according to the WHO (1999) Diabetes Diagnostic Standards: normal fasting plasma glucose (NFG) (FPG < 6.1 mmol/L), mildly impaired fasting plasma glucose (IFG) (6.1 mmol/L ≤ FPG < 7.0 mmol/L) and type 2 diabetes (T2DM) (FPG > 7.0 mmol/L). Finally, we obtained1,221,598 NFG samples, 285,965 IFG samples and 387,076 T2DM samples, with a total of 15 physical examination indexes. Furthermore, taking eXtreme Gradient Boosting (XGBoost), random forest (RF), Logistic Regression (LR), and Fully connected neural network (FCN) as classifiers, four models were constructed to distinguish NFG, IFG and T2DM. The comparison results show that XGBoost has the best performance, with AUC (macro) of 0.7874 and AUC (micro) of 0.8633. In addition, based on the XGBoost classifier, three binary classification models were also established to discriminate NFG from IFG, NFG from T2DM, IFG from T2DM. On the independent dataset, the AUCs were 0.7808, 0.8687, 0.7067, respectively. Finally, we analyzed the importance of the features and identified the risk factors associated with diabetes.
Collapse
Affiliation(s)
- Yu-Mei Han
- Beijing Physical Examination Center, Beijing, China
| | - Hui Yang
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Qin-Lai Huang
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Zi-Jie Sun
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | | | | | - Ke-Jun Deng
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Shuo Chen
- Beijing Physical Examination Center, Beijing, China
| | - Hao Lin
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| |
Collapse
|
50
|
Li F, Guo X, Xiang D, Pitt ME, Bainomugisa A, Coin LJ. Computational analysis and prediction of PE_PGRS proteins using machine learning. Comput Struct Biotechnol J 2022; 20:662-674. [PMID: 35140886 PMCID: PMC8804200 DOI: 10.1016/j.csbj.2022.01.019] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2021] [Revised: 01/09/2022] [Accepted: 01/18/2022] [Indexed: 12/18/2022] Open
Abstract
Mycobacterium tuberculosis genome comprises approximately 10% of two families of poorly characterised genes due to their high GC content and highly repetitive nature. The largest sub-group, the proline-glutamic acid polymorphic guanine-cytosine-rich sequence (PE_PGRS) family, is thought to be involved in host response and disease pathogenicity. Due to their high genetic variability and complexity of analysis, they are typically disregarded for further research in genomic studies. There are currently limited online resources and homology computational tools that can identify and analyse PE_PGRS proteins. In addition, they are computational-intensive and time-consuming, and lack sensitivity. Therefore, computational methods that can rapidly and accurately identify PE_PGRS proteins are valuable to facilitate the functional elucidation of the PE_PGRS family proteins. In this study, we developed the first machine learning-based bioinformatics approach, termed PEPPER, to allow users to identify PE_PGRS proteins rapidly and accurately. PEPPER was built upon a comprehensive evaluation of 13 popular machine learning algorithms with various sequence and physicochemical features. Empirical studies demonstrated that PEPPER achieved significantly better performance than alignment-based approaches, BLASTP and PHMMER, in both prediction accuracy and speed. PEPPER is anticipated to facilitate community-wide efforts to conduct high-throughput identification and analysis of PE_PGRS proteins.
Collapse
Affiliation(s)
- Fuyi Li
- Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, VIC 3000, Australia
| | - Xudong Guo
- School of Information Engineering, Ningxia University, Yinchuan, Ningxia 750021, China
| | - Dongxu Xiang
- Faculty of Engineering and Information Technology, The University of Melbourne, VIC 3000, Australia
| | - Miranda E. Pitt
- Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, VIC 3000, Australia
| | | | - Lachlan J.M. Coin
- Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, VIC 3000, Australia
| |
Collapse
|