1
Sabir MJ, Kamli MR, Atef A, Alhibshi AM, Edris S, Hajarah NH, Bahieldin A, Manavalan B, Sabir JSM. Computational prediction of phosphorylation sites of SARS-CoV-2 infection using feature fusion and optimization strategies. Methods 2024; 229:1-8. [PMID: 38768932 DOI: 10.1016/j.ymeth.2024.04.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2023] [Revised: 03/15/2024] [Accepted: 04/30/2024] [Indexed: 05/22/2024]
Abstract
SARS-CoV-2's global spread has instigated a critical health and economic emergency, impacting countless individuals. Understanding the virus's phosphorylation sites is vital to unravel the molecular intricacies of the infection and subsequent changes in host cellular processes. Several computational methods have been proposed to identify phosphorylation sites, typically focusing on either serine/threonine (S/T) or tyrosine (Y) phosphorylation sites. Unfortunately, current predictive tools perform best on these specific residues and may not extend their efficacy to other residues, emphasizing the urgent need for enhanced methodologies. In this study, we developed a novel predictor that integrates phosphorylation site information for all three residues (S, T, and Y). We extracted ten different feature descriptors, primarily derived from composition, evolutionary, and position-specific information, and assessed their discriminative power through five classifiers. Our results indicated that Light Gradient Boosting (LGB) showed superior performance, and five descriptors displayed excellent discriminative capabilities. Subsequently, we identified the top two integrated features with high discriminative capability and trained them with LGB to develop the final prediction model, LGB-IPs. The proposed approach shows excellent performance on 10-fold cross-validation, with ACC, MCC, and AUC values of 0.831, 0.662, and 0.907, respectively. Notably, this performance is replicated in the independent evaluation. Consequently, our approach may provide valuable insights into the phosphorylation mechanisms in SARS-CoV-2 infection for biomedical researchers.
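The ACC, MCC, and AUC figures quoted in this abstract are standard binary-classification metrics. As a minimal sketch of how they are defined (the function names are illustrative assumptions, not code from the paper), they can be computed from labels and predictions as follows:

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """ACC: fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of samples, which is fine for illustration; a rank-based formulation would be preferable for large datasets.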
Affiliation(s)
- Mumdooh J Sabir
- Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
- Majid Rasool Kamli
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
- Ahmed Atef
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
- Alawiah M Alhibshi
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
- Sherif Edris
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
- Nahid H Hajarah
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
- Ahmed Bahieldin
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
- Balachandran Manavalan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea.
- Jamal S M Sabir
- Centre of Excellence in Bionanoscience Research, King Abdulaziz University, Jeddah, Saudi Arabia; Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia.
2
Androutsos L, Pallante L, Bompotas A, Stojceski F, Grasso G, Piga D, Di Benedetto G, Alexakos C, Kalogeras A, Theofilatos K, Deriu MA, Mavroudi S. Predicting multiple taste sensations with a multiobjective machine learning method. NPJ Sci Food 2024; 8:47. [PMID: 39054312 PMCID: PMC11272927 DOI: 10.1038/s41538-024-00287-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Accepted: 07/05/2024] [Indexed: 07/27/2024]
Abstract
Taste perception plays a pivotal role in guiding nutrient intake and aiding in the avoidance of potentially harmful substances through five basic tastes - sweet, bitter, umami, salty, and sour. Taste perception originates from molecular interactions in the oral cavity between taste receptors and chemical tastants. Hence, the recognition of taste receptors and the subsequent perception of taste heavily rely on the physicochemical properties of food ingredients. In recent years, several advances have been made towards the development of machine learning-based algorithms to classify chemical compounds' tastes using their molecular structures. Despite the great efforts, there remains significant room for improvement in developing multi-class models to predict the entire spectrum of basic tastes. Here, we present a multi-class predictor aimed at distinguishing bitter, sweet, and umami, from other taste sensations. The development of a multi-class taste predictor paves the way for a comprehensive understanding of the chemical attributes associated with each fundamental taste. It also opens the potential for integration into the evolving realm of multi-sensory perception, which encompasses visual, tactile, and olfactory sensations to holistically characterize flavour perception. This concept holds promise for introducing innovative methodologies in the rational design of foods, including pre-determining specific tastes and engineering complementary diets to augment traditional pharmacological treatments.
Affiliation(s)
- Lorenzo Pallante
- PolitoBIOMedLab, Department of Mechanical and Aerospace Engineering, Politecnico di Torino, Torino, 10129, Italy
- Agorakis Bompotas
- Industrial Systems Institute, Athena Research Center, 265 04, Patras, Greece
- Filip Stojceski
- Department of Innovative Technologies, Dalle Molle Institute for Artificial Intelligence, Lugano-Viganello, 6962, Switzerland
- Gianvito Grasso
- Department of Innovative Technologies, Dalle Molle Institute for Artificial Intelligence, Lugano-Viganello, 6962, Switzerland
- Dario Piga
- Department of Innovative Technologies, Dalle Molle Institute for Artificial Intelligence, Lugano-Viganello, 6962, Switzerland
- Christos Alexakos
- Industrial Systems Institute, Athena Research Center, 265 04, Patras, Greece
- Marco A Deriu
- PolitoBIOMedLab, Department of Mechanical and Aerospace Engineering, Politecnico di Torino, Torino, 10129, Italy
- Seferina Mavroudi
- InSyBio PC, Patras, 265 04, Greece
- Department of Nursing, University of Patras, 265 04, Patras, Greece
3
Kurata H, Harun-Or-Roshid M, Mehedi Hasan M, Tsukiyama S, Maeda K, Manavalan B. MLm5C: A high-precision human RNA 5-methylcytosine sites predictor based on a combination of hybrid machine learning models. Methods 2024; 227:37-47. [PMID: 38729455 DOI: 10.1016/j.ymeth.2024.05.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/22/2023] [Revised: 04/22/2024] [Accepted: 05/06/2024] [Indexed: 05/12/2024]
Abstract
RNA modification serves as a pivotal component in numerous biological processes. Among the prevalent modifications, 5-methylcytosine (m5C) significantly influences mRNA export, translation efficiency, and cell differentiation, and is also associated with human diseases, including Alzheimer's disease, autoimmune disease, cancer, and cardiovascular diseases. Identification of m5C is critical for understanding the RNA modification mechanisms and the epigenetic regulation of associated diseases. However, the large-scale experimental identification of m5C presents significant challenges due to labor intensity and time requirements. Several computational tools using machine learning have been developed to supplement experimental methods, but they lack accuracy and efficiency in identifying these sites. In this study, we introduce a new predictor, MLm5C, for precise prediction of m5C sites using sequence data. Briefly, we evaluated eleven RNA sequence-derived features with four basic machine learning algorithms to generate baseline models. We ranked the resulting 44 models by performance and stacked the top 20 baseline models into the best model, named MLm5C. MLm5C outperformed the state-of-the-art predictors. Notably, the optimization of the sequence length surrounding the modification sites significantly improved the prediction performance. MLm5C is an invaluable tool for accelerating the detection of m5C sites within the human genome, thereby facilitating the characterization of their roles in post-transcriptional regulation.
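The model-selection step described in this abstract (eleven features crossed with four algorithms, ranked, top 20 stacked) can be sketched roughly as below. This is only an illustration under stated assumptions: the feature and algorithm names are hypothetical placeholders, and `dummy_scores` stands in for real cross-validated performance, which the abstract does not report per model.

```python
from itertools import product

def build_model_grid(features, algorithms):
    """Pair every feature encoding with every learning algorithm."""
    return list(product(features, algorithms))

def select_top_models(scores, k):
    """Rank baseline models by score (higher is better) and keep the best k."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

def meta_features(selected, base_probabilities):
    """Build one sample's input to the stacking model from the selected models' outputs."""
    return [base_probabilities[name] for name in selected]

# Hypothetical placeholder names; the actual features and algorithms are not listed in the abstract.
FEATURES = [f"feature_{i}" for i in range(1, 12)]     # eleven encodings
ALGORITHMS = [f"algorithm_{j}" for j in range(1, 5)]  # four learners
GRID = build_model_grid(FEATURES, ALGORITHMS)         # 11 x 4 = 44 baseline models

# Dummy scores for illustration only; real scores would come from cross-validation.
dummy_scores = {pair: index / len(GRID) for index, pair in enumerate(GRID)}
TOP_20 = select_top_models(dummy_scores, 20)
```

In a real pipeline, the vectors returned by `meta_features` for each sample would be the training input of the final (stacked) classifier.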
Affiliation(s)
- Hiroyuki Kurata
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan.
- Md Harun-Or-Roshid
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
- Md Mehedi Hasan
- Division of Biotechnology and Molecular Medicine, Department of Pathobiological Science, School of Veterinary Medicine, Louisiana State University, Baton Rouge, LA 70803, USA
- Sho Tsukiyama
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
- Kazuhiro Maeda
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
- Balachandran Manavalan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Republic of Korea.
4
Zhang X, Liu M, Li Z, Zhuo L, Fu X, Zou Q. Fusion of multi-source relationships and topology to infer lncRNA-protein interactions. Molecular Therapy. Nucleic Acids 2024; 35:102187. [PMID: 38706631 PMCID: PMC11066462 DOI: 10.1016/j.omtn.2024.102187] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/18/2023] [Accepted: 04/03/2024] [Indexed: 05/07/2024]
Abstract
Long non-coding RNAs (lncRNAs) are important factors involved in biological regulatory networks. Accurately predicting lncRNA-protein interactions (LPIs) is vital for clarifying lncRNA's functions and pathogenic mechanisms. Existing deep learning models have yet to yield satisfactory results in LPI prediction. Recently, graph autoencoders (GAEs) have seen rapid development, excelling in tasks like link prediction and node classification. We employed GAE technology for LPI prediction, devising the FMSRT-LPI model based on path masking and degree regression strategies and thereby achieving satisfactory outcomes. This represents the first known integration of path masking and degree regression strategies into the GAE framework for potential LPI inference. The effectiveness of our FMSRT-LPI model primarily relies on four key aspects. First, within the GAE framework, our model integrates multi-source relationships of lncRNAs and proteins with LPN's topological data. Second, the implemented masking strategy efficiently identifies LPN's key paths, reconstructs the network, and reduces the impact of redundant or incorrect data. Third, the integrated degree decoder balances degree and structural information, enhancing node representation. Fourth, the PolyLoss function we introduced is more appropriate for LPI prediction tasks. The results on multiple public datasets further demonstrate our model's potential in LPI prediction.
Affiliation(s)
- Xinyu Zhang
- School of Data Science and Artificial Intelligence, Wenzhou University of Technology, Wenzhou 325027, China
- Mingzhe Liu
- School of Data Science and Artificial Intelligence, Wenzhou University of Technology, Wenzhou 325027, China
- Zhen Li
- Institute of Computational Science and Technology, Guangzhou University, Guangzhou 510000, China
- Linlin Zhuo
- School of Data Science and Artificial Intelligence, Wenzhou University of Technology, Wenzhou 325027, China
- Xiangzheng Fu
- College of Computer Science and Electronic Engineering, Hunan University, Changsha 410012, China
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 611730, China
5
Yu X, Ren J, Long H, Zeng R, Zhang G, Bilal A, Cui Y. iDNA-OpenPrompt: OpenPrompt learning model for identifying DNA methylation. Front Genet 2024; 15:1377285. [PMID: 38689652 PMCID: PMC11058834 DOI: 10.3389/fgene.2024.1377285] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/27/2024] [Accepted: 03/07/2024] [Indexed: 05/02/2024]
Abstract
Introduction: DNA methylation is a critical epigenetic modification involving the addition of a methyl group to the DNA molecule, playing a key role in regulating gene expression without changing the DNA sequence. The main difficulty in identifying DNA methylation sites lies in the subtle and complex nature of methylation patterns, which may vary across different tissues, developmental stages, and environmental conditions. Traditional methods for methylation site identification, such as bisulfite sequencing, are typically labor-intensive, costly, and require large amounts of DNA, hindering high-throughput analysis. Moreover, these methods may not always provide the resolution needed to detect methylation at specific sites, especially in genomic regions that are rich in repetitive sequences or have low levels of methylation. Furthermore, current deep learning approaches generally lack sufficient accuracy. Methods: This study introduces the iDNA-OpenPrompt model, leveraging the novel OpenPrompt learning framework. The model combines a prompt template, prompt verbalizer, and Pre-trained Language Model (PLM) to construct the prompt-learning framework for DNA methylation sequences. Moreover, a DNA vocabulary library, BERT tokenizer, and specific label words are also introduced into the model to enable accurate identification of DNA methylation sites. Results and Discussion: An extensive analysis is conducted to evaluate the predictive capability, reliability, and consistency of the iDNA-OpenPrompt model. The experimental outcomes, covering 17 benchmark datasets that include various species and three DNA methylation modifications (4mC, 5hmC, 6mA), consistently indicate that our model surpasses other strong approaches in both performance and robustness.
Affiliation(s)
- Xia Yu
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
- Jia Ren
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
- Haixia Long
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
- Rao Zeng
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
- Guoqiang Zhang
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
- Anas Bilal
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
- Yani Cui
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
6
Shen C, Mao D, Tang J, Liao Z, Chen S. Prediction of LncRNA-Protein Interactions Based on Kernel Combinations and Graph Convolutional Networks. IEEE J Biomed Health Inform 2024; 28:1937-1948. [PMID: 37327093 DOI: 10.1109/jbhi.2023.3286917] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/18/2023]
Abstract
The complexes of long non-coding RNAs bound to proteins can be involved in regulating life activities at various stages of organisms. However, in the face of the growing number of lncRNAs and proteins, verifying LncRNA-Protein Interactions (LPI) based on traditional biological experiments is time-consuming and laborious. Therefore, with the improvement of computing power, LPI prediction has met new development opportunities. Building on state-of-the-art works, a framework called LncRNA-Protein Interactions based on Kernel Combinations and Graph Convolutional Networks (LPI-KCGCN) is proposed in this article. We first construct kernel matrices for both lncRNAs and proteins from sequence features, sequence similarity features, expression features, and gene ontology. We then reconstruct the existing kernel matrices as the input of the next step. Combined with known LPI interactions, the reconstructed similarity matrices, which can be used as features of the topology map of the LPI network, are exploited to extract potential representations in the lncRNA and protein space using a two-layer Graph Convolutional Network. The predicted matrix is finally obtained by training the network to produce scoring matrices w.r.t. lncRNAs and proteins. Different LPI-KCGCN variants are ensembled to derive the final prediction results and are tested on balanced and unbalanced datasets. The 5-fold cross-validation shows that the optimal feature information combination on a dataset with 15.5% positive samples has an AUC value of 0.9714 and an AUPR value of 0.9216. On another highly unbalanced dataset with only 5% positive samples, LPI-KCGCN also outperformed the state-of-the-art works, achieving an AUC value of 0.9907 and an AUPR value of 0.9267.
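For readers unfamiliar with the two-layer Graph Convolutional Network mentioned in this abstract, the forward pass can be sketched as below. This is a generic illustration that assumes the commonly used symmetric normalization with self-loops and a ReLU between layers; the abstract does not specify these details, and the helper names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def normalize_adjacency(adj):
    """Compute D^{-1/2} (A + I) D^{-1/2}; assumes non-negative edge weights."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]  # always > 0 because of the self-loop
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def gcn_two_layer(adj, x, w1, w2):
    """Two graph-convolution layers: A_norm * ReLU(A_norm * X * W1) * W2."""
    a_norm = normalize_adjacency(adj)
    hidden = relu(matmul(matmul(a_norm, x), w1))
    return matmul(matmul(a_norm, hidden), w2)
```

In a setting like the one described, `adj` would hold the known interaction network and `x` the similarity-derived node features; the weights `w1` and `w2` would be learned during training rather than fixed by hand.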
7
Yin Z, Lyu J, Zhang G, Huang X, Ma Q, Jiang J. SoftVoting6mA: An improved ensemble-based method for predicting DNA N6-methyladenine sites in cross-species genomes. Mathematical Biosciences and Engineering: MBE 2024; 21:3798-3815. [PMID: 38549308 DOI: 10.3934/mbe.2024169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/02/2024]
Abstract
The DNA N6-methyladenine (6mA) is an epigenetic modification, which plays a pivotal role in biological processes encompassing gene expression, DNA replication, repair, and recombination. Therefore, the precise identification of 6mA sites is fundamental for better understanding its function, but challenging. We proposed an improved ensemble-based method for predicting DNA N6-methyladenine sites in cross-species genomes called SoftVoting6mA. The SoftVoting6mA selected four (electron-ion-interaction pseudo potential, One-hot encoding, Kmer, and pseudo dinucleotide composition) codes from 15 types of encoding to represent DNA sequences by comparing their performances. Similarly, the SoftVoting6mA combined four learning algorithms using the soft voting strategy. The 5-fold cross-validation and the independent tests showed that SoftVoting6mA reached the state-of-the-art performance. To enhance accessibility, a user-friendly web server is provided at http://www.biolscience.cn/SoftVoting6mA/.
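The soft voting strategy named in this abstract averages the class probabilities of several classifiers instead of counting their hard votes. A minimal sketch, assuming equal weights by default and a 0.5 decision threshold (neither is stated in the abstract; the function names are illustrative):

```python
def soft_vote(probabilities, weights=None):
    """Average positive-class probabilities across models.

    probabilities[m][i] is model m's probability for sample i.
    """
    n_models = len(probabilities)
    if weights is None:
        weights = [1.0] * n_models  # assumption: all models count equally
    total = sum(weights)
    n_samples = len(probabilities[0])
    return [sum(w * probs[i] for w, probs in zip(weights, probabilities)) / total
            for i in range(n_samples)]

def predict_labels(avg_probs, threshold=0.5):
    """Turn averaged probabilities into 0/1 labels."""
    return [1 if p >= threshold else 0 for p in avg_probs]
```

For example, four models scoring two sequences as `[[0.9, 0.2], [0.6, 0.4], [0.7, 0.1], [0.8, 0.5]]` average to roughly 0.75 and 0.3, so only the first sequence would be labelled as a 6mA site.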
Affiliation(s)
- Zhaoting Yin
- College of Information Science and Engineering, Shaoyang University, Shaoyang 422000, China
- Jianyi Lyu
- College of Information Science and Engineering, Shaoyang University, Shaoyang 422000, China
- Guiyang Zhang
- College of Information Science and Engineering, Shaoyang University, Shaoyang 422000, China
- Xiaohong Huang
- College of Information Science and Engineering, Shaoyang University, Shaoyang 422000, China
- Qinghua Ma
- College of Information Science and Engineering, Hohai University, Nanjing 210000, China
- Faculty of Information Technology, University of Jyvaskyla, Jyvaskyla, Finland
- Jinyun Jiang
- College of Information Science and Engineering, Shaoyang University, Shaoyang 422000, China
8
Harun-Or-Roshid M, Maeda K, Phan LT, Manavalan B, Kurata H. Stack-DHUpred: Advancing the accuracy of dihydrouridine modification sites detection via stacking approach. Comput Biol Med 2024; 169:107848. [PMID: 38145601 DOI: 10.1016/j.compbiomed.2023.107848] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/26/2023] [Revised: 11/14/2023] [Accepted: 12/11/2023] [Indexed: 12/27/2023]
Abstract
Dihydrouridine (DHU, D) is one of the most abundant post-transcriptional uridine modifications found in tRNA, mRNA, and snoRNA, closely associated with disease pathogenesis and various biological processes in eukaryotes. Identifying D sites is important for understanding the modification mechanisms and/or epigenetic regulation. However, biological experiments for detecting D sites are time-consuming and expensive. Given these challenges, computational methods have been developed for accurately identifying the D sites in genome-wide datasets. However, existing methods have some limitations, and their prediction performance needs to be improved. In this work, we have developed a new computational predictor for accurately identifying D sites called Stack-DHUpred. Briefly, we trained 66 baseline models or single-feature models by connecting six machine learning classifiers with eleven different feature encoding methods and stacked different baseline models to build stacked ensemble learning models. Subsequently, the optimal combination of the baseline models was identified for the construction of the final stacked model. Remarkably, the Stack-DHUpred outperformed the existing predictors on our new independent dataset, indicating that the stacking approach significantly improved the prediction performance. We have made Stack-DHUpred available to the public through a web server (http://kurata35.bio.kyutech.ac.jp/Stack-DHUpred) and a standalone program (https://github.com/kuratahiroyuki/Stack-DHUpred). We believe that Stack-DHUpred will be a valuable tool for accelerating the discovery of D modifications and understanding their role in post-transcriptional regulation.
Affiliation(s)
- Md Harun-Or-Roshid
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
- Kazuhiro Maeda
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
- Le Thi Phan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, 16419, Republic of Korea
- Balachandran Manavalan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, 16419, Republic of Korea.
- Hiroyuki Kurata
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan.
9
Yan W, Tan L, Mengshan L, Weihong Z, Sheng S, Jun W, Fu-An W. Time series-based hybrid ensemble learning model with multivariate multidimensional feature coding for DNA methylation prediction. BMC Genomics 2023; 24:758. [PMID: 38082253 PMCID: PMC10712061 DOI: 10.1186/s12864-023-09866-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2023] [Accepted: 12/02/2023] [Indexed: 12/18/2023]
Abstract
BACKGROUND: DNA methylation is a form of epigenetic modification that impacts gene expression without modifying the DNA sequence, thereby exerting control over gene function and cellular development. The prediction of DNA methylation is vital for understanding and exploring gene regulatory mechanisms. Currently, machine learning algorithms are primarily used for model construction. However, several challenges remain to be addressed, including limited prediction accuracy, constrained generalization capability, and insufficient learning capacity. RESULTS: In response to the aforementioned challenges, this paper leverages the similarities between DNA sequences and time series to introduce a time series-based hybrid ensemble learning model, called Multi2-Con-CAPSO-LSTM. The model utilizes a multivariate and multidimensional encoding approach, combining three types of time series encodings with three kinds of genetic feature encodings, resulting in a total of nine types of feature encoding matrices. Convolutional Neural Networks are utilized to extract features from DNA sequences, including temporal, positional, physicochemical, and genetic information, thereby creating a comprehensive feature matrix. The Long Short-Term Memory model is then optimized using the Chaotic Accelerated Particle Swarm Optimization algorithm for predicting DNA methylation. CONCLUSIONS: Through cross-validation experiments conducted on 17 species involving three types of DNA methylation (6mA, 5hmC, and 4mC), the results demonstrate the robust predictive capabilities of the Multi2-Con-CAPSO-LSTM model in DNA methylation prediction across various types and species. Compared with other benchmark models, the Multi2-Con-CAPSO-LSTM model demonstrates significant advantages in sensitivity, specificity, accuracy, and correlation. The model proposed in this paper provides valuable insights and inspiration across various disciplines, including sequence alignment, genetic evolution, time series analysis, and structure-activity relationships.
Affiliation(s)
- Wu Yan
- School of Biotechnology, Jiangsu University of Science and Technology, Zhenjiang, Jiangsu, 212018, China.
- School of Mathematics and Computer Science, Gannan Normal University, Ganzhou, Jiangxi, 341000, China.
- Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, 212018, China.
- Li Tan
- College of Physics and Electronic Information, Gannan Normal University, Ganzhou, Jiangxi, 341000, China
- Li Mengshan
- College of Physics and Electronic Information, Gannan Normal University, Ganzhou, Jiangxi, 341000, China.
- Zhou Weihong
- School of Biotechnology, Jiangsu University of Science and Technology, Zhenjiang, Jiangsu, 212018, China
- Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, 212018, China
- Sheng Sheng
- School of Biotechnology, Jiangsu University of Science and Technology, Zhenjiang, Jiangsu, 212018, China
- Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, 212018, China
- Wang Jun
- School of Biotechnology, Jiangsu University of Science and Technology, Zhenjiang, Jiangsu, 212018, China
- Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, 212018, China
- Wu Fu-An
- School of Biotechnology, Jiangsu University of Science and Technology, Zhenjiang, Jiangsu, 212018, China.
- Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, 212018, China.
10
Huang G, Huang X, Luo W. 6mA-StackingCV: an improved stacking ensemble model for predicting DNA N6-methyladenine site. BioData Min 2023; 16:34. [PMID: 38012796 PMCID: PMC10680251 DOI: 10.1186/s13040-023-00348-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/13/2023] [Accepted: 11/04/2023] [Indexed: 11/29/2023]
Abstract
DNA N6-adenine methylation (N6-methyladenine, 6mA) plays a key regulatory role in cellular processes. Precisely recognizing 6mA sites is important for further exploring its biological functions. Although many computational methods for 6mA site prediction have been developed over the past decades, there is still much room for improvement. We presented a cross validation-based stacking ensemble model for 6mA site prediction, called 6mA-StackingCV. The 6mA-StackingCV is a type of meta-learning algorithm, which uses the output of cross validation as input to the final classifier. The 6mA-StackingCV reached state-of-the-art performance in the Rosaceae independent test. Extensive tests demonstrated the stability and the flexibility of the 6mA-StackingCV. We implemented the 6mA-StackingCV as a user-friendly web application, which allows one to restrictively choose representations or learning algorithms. This application is freely available at http://www.biolscience.cn/6mA-stackingCV/. The source code and experimental data are available at https://github.com/Xiaohong-source/6mA-stackingCV.
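The idea of feeding cross-validation output into a final classifier is usually implemented with out-of-fold predictions: each sample is scored by a base model that never saw it during training. The sketch below illustrates that general technique only; the fold layout, the toy nearest-centroid learner, and all function names are assumptions for illustration and are not taken from the 6mA-StackingCV code.

```python
def kfold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def out_of_fold(X, y, fit, predict, k):
    """Predict every sample with a model trained on the other k-1 folds."""
    n = len(X)
    oof = [None] * n
    for held_out in kfold_indices(n, k):
        held = set(held_out)
        train_idx = [i for i in range(n) if i not in held]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in held_out:
            oof[i] = predict(model, X[i])
    return oof

# Toy base learner (hypothetical): nearest class mean on a single scalar feature.
# It requires both classes to be present in every training split.
def fit_centroid(xs, ys):
    zeros = [x for x, label in zip(xs, ys) if label == 0]
    ones = [x for x, label in zip(xs, ys) if label == 1]
    return sum(zeros) / len(zeros), sum(ones) / len(ones)

def predict_centroid(model, x):
    mean0, mean1 = model
    return 1 if abs(x - mean1) < abs(x - mean0) else 0
```

Running `out_of_fold` once per base learner yields one column of meta-features each; those columns, paired with the true labels, would then train the final classifier.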
Affiliation(s)
- Guohua Huang
- School of Information Technology and Administration, Hunan University of Finance and Economics, Changsha, China.
- College of Information Science and Engineering, Shaoyang University, Shaoyang, Hunan, 422000, China.
- Xiaohong Huang
- College of Information Science and Engineering, Shaoyang University, Shaoyang, Hunan, 422000, China
- Wei Luo
- College of Information Science and Engineering, Shaoyang University, Shaoyang, Hunan, 422000, China
11
Jia J, Cao X, Wei Z. DLC-ac4C: A Prediction Model for N4-acetylcytidine Sites in Human mRNA Based on DenseNet and Bidirectional LSTM Methods. Curr Genomics 2023; 24:171-186. [PMID: 38178985 PMCID: PMC10761336 DOI: 10.2174/0113892029270191231013111911] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2023] [Revised: 09/13/2023] [Accepted: 09/21/2023] [Indexed: 01/06/2024]
Abstract
Introduction: N4-acetylcytidine (ac4C) is a highly conserved nucleoside modification that is essential for the regulation of immune functions in organisms. Currently, ac4C is identified primarily by biological methods, which can be time-consuming and labor-intensive. In contrast, computational methods have become a more efficient way to identify ac4C accurately.
Aim: To the best of our knowledge, although several computational methods exist for ac4C site prediction, the models they produce perform poorly, and the network structures they use are relatively simple and suffer from network degradation. This study aims to address these limitations by proposing a predictive model based on integrated deep learning to better identify ac4C sites.
Methods: In this study, we propose a new integrated deep learning prediction framework, DLC-ac4C. First, we encode RNA sequences with three feature encoding schemes, namely C2 encoding, nucleotide chemical property (NCP) encoding, and nucleotide density (ND) encoding. Second, one-dimensional convolutional layers and densely connected convolutional networks (DenseNet) are used to learn local features, and bi-directional long short-term memory networks (Bi-LSTM) are used to learn global features. Third, a channel attention mechanism is introduced to determine the importance of sequence characteristics. Finally, a homomorphic integration strategy is used to limit the generalization error of the model, which further improves its performance.
Results: On the independent test data, the DLC-ac4C model achieved a sensitivity (Sn), specificity (Sp), accuracy (Acc), Matthews correlation coefficient (MCC), and area under the curve (AUC) of 86.23%, 79.71%, 82.97%, 66.08%, and 90.42%, respectively, which was significantly better than the prediction accuracy of the existing methods.
Conclusion: Our model not only combines DenseNet and Bi-LSTM, but also uses the channel attention mechanism to better capture hidden information features from a sequence perspective, and can identify ac4C sites more effectively.
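The NCP and ND encodings named in the Methods can be illustrated with a short Python sketch. It assumes the commonly tabulated NCP triplets (ring structure, functional group, hydrogen bonding) and defines ND at position i as the fraction of the first i nucleotides that match the nucleotide at i; the abstract does not give these details, so the table and the function name `ncp_nd_encode` are illustrative assumptions, not the authors' code.

```python
# Assumed NCP table: (ring structure, functional group, hydrogen bonding).
NCP = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "U": (0, 0, 1)}


def ncp_nd_encode(seq):
    """Return one (ncp_1, ncp_2, ncp_3, density) tuple per nucleotide."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    counts = {}
    rows = []
    for i, base in enumerate(seq, start=1):
        counts[base] = counts.get(base, 0) + 1
        density = counts[base] / i  # share of the prefix equal to this base
        # Unknown symbols (e.g. N) get an all-zero chemical-property triplet.
        rows.append(NCP.get(base, (0, 0, 0)) + (density,))
    return rows


features = ncp_nd_encode("ACGUA")
for row in features:
    print(row)
```

Each sequence position becomes a four-value row, so a window of length L yields an L x 4 matrix that a convolutional layer could consume.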
Affiliation(s)
- Jianhua Jia
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
- Xiaojing Cao
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
- Zhangying Wei
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
|
12
|
Sinha D, Dasmandal T, Paul K, Yeasin M, Bhattacharjee S, Murmu S, Mishra DC, Pal S, Rai A, Archak S. MethSemble-6mA: an ensemble-based 6mA prediction server and its application on promoter region of LBD gene family in Poaceae. Frontiers in Plant Science 2023; 14:1256186. [PMID: 37877081 PMCID: PMC10591185 DOI: 10.3389/fpls.2023.1256186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2023] [Accepted: 09/01/2023] [Indexed: 10/26/2023]
Abstract
Lateral Organ Boundaries Domain (LBD)-containing genes encode a set of plant-specific transcription factors that are crucial for controlling organ development and defense mechanisms, as well as anthocyanin synthesis and nitrogen metabolism. It is imperative to understand how methylation regulates gene expression by predicting methylation sites in their promoters, particularly in major crop species. In this study, we developed a user-friendly server for accurate prediction of 6mA sites that incorporates a robust feature set, viz., Binary Encoding of Mono-nucleotide DNA. Our model, MethSemble-6mA, outperformed other state-of-the-art tools in terms of accuracy (93.12%). Furthermore, we used the tool to investigate the pattern of probable 6mA sites in the upstream promoter regions of the LBD-containing genes in Triticum aestivum and its allied species. On average, each selected species had four 6mA sites, and we found that, with speciation and the course of evolution in wheat, the frequency of methylation has decreased while a few sites remain conserved. This points to gene birth and altered gene expression through methylation over time in a species, and reflects functional conservation throughout evolution. Since DNA methylation is a vital event in almost all plant developmental processes (e.g., genomic imprinting and gametogenesis) along with other life processes, our findings on epigenetic regulation of LBD-containing genes have implications for basic and applied research. Additionally, MethSemble-6mA (http://cabgrid.res.in:5799/) will serve as a useful resource for plant breeders interested in pursuing epigenetics-based crop improvement research.
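As a rough illustration of binary encoding of mono-nucleotides, the sketch below maps each base to a 4-bit indicator vector and concatenates the vectors along the sequence. The column order (A, C, G, T), the all-zero fallback for unknown symbols, and the function name `binary_encode` are assumptions made for this example; the abstract does not specify them.

```python
# Assumed bit layout: one position per base in the order A, C, G, T.
BINARY = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def binary_encode(seq):
    """Flatten a DNA sequence into a 4 * len(seq) vector of 0/1 values."""
    vector = []
    for base in seq.upper():
        # Ambiguous symbols such as N are encoded as four zeros.
        vector.extend(BINARY.get(base, (0, 0, 0, 0)))
    return vector


encoded = binary_encode("GATC")
print(encoded)
```

A fixed-length window around a candidate adenine would therefore give every sample the same feature dimension, which is what a conventional classifier ensemble needs.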
Affiliation(s)
- Dipro Sinha
  - ICAR-Indian Agricultural Statistics Research Institute, Delhi, India
  - Graduate School, ICAR-Indian Agricultural Research Institute, Delhi, India
- Tanwy Dasmandal
  - ICAR-Indian Agricultural Statistics Research Institute, Delhi, India
  - Graduate School, ICAR-Indian Agricultural Research Institute, Delhi, India
  - ICAR-National Bureau of Fish Genetic Resources, Lucknow, India
- Krishnayan Paul
  - Graduate School, ICAR-Indian Agricultural Research Institute, Delhi, India
  - ICAR-National Institute for Plant Biotechnology, Delhi, India
- Md Yeasin
  - ICAR-Indian Agricultural Statistics Research Institute, Delhi, India
- Sougata Bhattacharjee
  - Graduate School, ICAR-Indian Agricultural Research Institute, Delhi, India
  - ICAR-National Institute for Plant Biotechnology, Delhi, India
  - ICAR-Indian Agricultural Research Institute, Hazaribagh, Jharkhand, India
- Sneha Murmu
  - ICAR-Indian Agricultural Statistics Research Institute, Delhi, India
- Soumen Pal
  - ICAR-Indian Agricultural Statistics Research Institute, Delhi, India
- Anil Rai
  - Indian Council of Agricultural Research, Delhi, India
- Sunil Archak
  - ICAR-National Bureau of Plant Genetic Resources, Delhi, India
|
13
|
He W, Ye Z, Li M, Yan Y, Lu W, Xing G. Extraction of soybean plant trait parameters based on SfM-MVS algorithm combined with GRNN. Frontiers in Plant Science 2023; 14:1181322. [PMID: 37560031 PMCID: PMC10407792 DOI: 10.3389/fpls.2023.1181322] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/09/2023] [Accepted: 07/06/2023] [Indexed: 08/11/2023]
Abstract
Soybean is an important grain and oil crop worldwide and is rich in nutritional value. Phenotypic morphology plays an important role in the selection and breeding of excellent soybean varieties to achieve high yield. At present, mainstream manual phenotypic measurement suffers from strong subjectivity, high labor intensity, and slow speed. To address these problems, a three-dimensional (3D) reconstruction method for soybean plants based on structure from motion (SfM) was proposed. First, the 3D point cloud of a soybean plant was reconstructed with the SfM algorithm from multi-view images captured by a smartphone. Second, low-pass filtering, Gaussian filtering, Ordinary Least Squares (OLS) plane fitting, and Laplacian smoothing were combined to automatically segment the point cloud data into individual plants, stems, and leaves. Finally, eleven morphological traits, such as plant height, minimum bounding box volume per plant, leaf projection area, leaf projection length and width, and leaf tilt information, were measured accurately and nondestructively by the proposed leaf phenotype measurement (LPM) algorithm. Moreover, Support Vector Machine (SVM), Back Propagation Neural Network (BP), and Generalized Regression Neural Network (GRNN) prediction models were established to predict and identify soybean plant varieties. The results indicated that, compared with manual measurement, the root mean square errors (RMSE) of plant height, leaf length, and leaf width were 0.9997, 0.2357, and 0.2666 cm, the mean absolute percentage errors (MAPE) were 2.7013%, 1.4706%, and 1.8669%, and the coefficients of determination (R2) were 0.9775, 0.9785, and 0.9487, respectively. The accuracy of predicting plant species from the six leaf parameters was highest with GRNN, reaching 0.9211, and the RMSE was 18.3263.
Based on the phenotypic traits of plants, the differences between C3, 47-6 and W82 soybeans were analyzed genetically, and because C3 was an insect-resistant line, the trait parameters (minimum box volume per plant, number of leaves, minimum size of single leaf box, leaf projection area). The results show that the proposed method can effectively extract the 3D phenotypic structure information of soybean plants and leaves without loss, and that it has potential for use in other plants with dense leaves.
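The three agreement measures reported above (RMSE, MAPE, R2) can be computed as in the following Python sketch. The measurement values are hypothetical, and R2 is taken here as 1 minus the ratio of residual to total sum of squares; the abstract does not say whether the authors used this form or the squared correlation coefficient.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error, in the units of the measurement."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must be non-zero)."""
    n = len(y_true)
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / n


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical plant heights in cm: manual reference vs. point-cloud estimate.
manual = [10.0, 20.0, 30.0]
estimated = [11.0, 19.0, 30.0]
print(rmse(manual, estimated), mape(manual, estimated), r_squared(manual, estimated))
```

RMSE keeps the unit of the trait (cm here), while MAPE and R2 are unit-free, which is why the abstract can compare them across plant height and leaf dimensions.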
Affiliation(s)
- Wei He
  - College of Engineering, Nanjing Agricultural University, Nanjing, China
- Zhihao Ye
  - Soybean Research Institute, Ministry of Agriculture and Rural Affairs (MARA) National Center for Soybean Improvement, Ministry of Agriculture and Rural Affairs (MARA) Key Laboratory of Biology and Genetic Improvement of Soybean, National Key Laboratory for Crop Genetics & Germplasm Enhancement and Utilization, Jiangsu Collaborative Innovation Center for Modern Crop Production, College of Agriculture, Nanjing Agricultural University, Nanjing, China
- Mingshuang Li
  - College of Artificial Intelligence, Nanjing Agricultural University, Nanjing, China
- Yulu Yan
  - Soybean Research Institute, Ministry of Agriculture and Rural Affairs (MARA) National Center for Soybean Improvement, Ministry of Agriculture and Rural Affairs (MARA) Key Laboratory of Biology and Genetic Improvement of Soybean, National Key Laboratory for Crop Genetics & Germplasm Enhancement and Utilization, Jiangsu Collaborative Innovation Center for Modern Crop Production, College of Agriculture, Nanjing Agricultural University, Nanjing, China
- Wei Lu
  - College of Artificial Intelligence, Nanjing Agricultural University, Nanjing, China
- Guangnan Xing
  - Soybean Research Institute, Ministry of Agriculture and Rural Affairs (MARA) National Center for Soybean Improvement, Ministry of Agriculture and Rural Affairs (MARA) Key Laboratory of Biology and Genetic Improvement of Soybean, National Key Laboratory for Crop Genetics & Germplasm Enhancement and Utilization, Jiangsu Collaborative Innovation Center for Modern Crop Production, College of Agriculture, Nanjing Agricultural University, Nanjing, China
|
14
|
Jia J, Wei Z, Cao X. EMDL-ac4C: identifying N4-acetylcytidine based on ensemble two-branch residual connection DenseNet and attention. Front Genet 2023; 14:1232038. [PMID: 37519885 PMCID: PMC10372626 DOI: 10.3389/fgene.2023.1232038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2023] [Accepted: 06/29/2023] [Indexed: 08/01/2023] Open
Abstract
Introduction: N4-acetylcytidine (ac4C) is a critical acetylation modification that has an essential function in protein translation and is associated with a number of human diseases.
Methods: Identifying ac4C sites by biological experiments is cumbersome and costly, and the performance of the existing computational models needs to be improved. Therefore, we propose a new deep learning tool, EMDL-ac4C, to predict ac4C sites. It applies simple one-hot encoding to an unbalanced dataset and uses a downsampled ensemble deep learning network to extract the important features that identify ac4C sites. The base learner of this ensemble model consists of a modified DenseNet and Squeeze-and-Excitation Networks. In addition, we add a convolutional residual structure in parallel with the dense block to achieve two-layer feature extraction.
Results: The average accuracy (Acc), Matthews correlation coefficient (MCC), and area under the curve (AUC) of EMDL-ac4C on ten independent testing sets are 80.84%, 61.77%, and 87.94%, respectively.
Discussion: Multiple experimental comparisons indicate that EMDL-ac4C outperforms existing predictors and greatly improves the predictive performance for ac4C sites. At the same time, EMDL-ac4C could provide a valuable reference for the next part of the study. The source code and experimental data are available at: https://github.com/13133989982/EMDLac4C.
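One common way to build a downsampled ensemble on an unbalanced dataset is to split the majority class into disjoint chunks, each the size of the minority class, and train one base learner per balanced subset. The sketch below shows only that data-splitting step in Python; it is an assumption about how such an ensemble might be set up, not the published EMDL-ac4C procedure, and the sample identifiers are placeholders.

```python
import random


def balanced_subsets(positives, negatives, seed=0):
    """Pair all positives with disjoint, equally sized chunks of negatives."""
    if not positives:
        raise ValueError("at least one positive sample is required")
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    size = len(positives)
    subsets = []
    # Leftover negatives that do not fill a complete chunk are dropped.
    for start in range(0, len(shuffled) - size + 1, size):
        subsets.append((list(positives), shuffled[start:start + size]))
    return subsets


pos = ["p1", "p2"]                     # placeholder minority-class samples
neg = ["n%d" % i for i in range(7)]    # placeholder majority-class samples
subsets = balanced_subsets(pos, neg)
print(len(subsets), subsets[0])
```

Each subset would then train its own base learner, and the ensemble prediction could be the average of the learners' outputs.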
Affiliation(s)
- Jianhua Jia
  - Correspondence: Jianhua Jia; Zhangying Wei
|
15
|
Arora P, Periwal N, Goyal Y, Sood V, Kaur B. iIL13Pred: improved prediction of IL-13 inducing peptides using popular machine learning classifiers. BMC Bioinformatics 2023; 24:141. [PMID: 37041520 PMCID: PMC10088697 DOI: 10.1186/s12859-023-05248-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/27/2022] [Accepted: 03/22/2023] [Indexed: 04/13/2023] Open
Abstract
BACKGROUND: Inflammatory mediators wreak havoc in several diseases, including the novel Coronavirus disease 2019 (COVID-19), and generally correlate with the severity of the disease. Interleukin-13 (IL-13) is a pleiotropic cytokine known to be associated with airway inflammation in asthma and reactive airway diseases, and with neoplastic and autoimmune diseases. The recent association of IL-13 with COVID-19 severity has sparked interest in this cytokine. Characterizing new molecules that can regulate IL-13 induction might therefore lead to novel therapeutics.
RESULTS: Here, we present an improved prediction of IL-13-inducing peptides. The positive and negative datasets were obtained from a recent study (IL13Pred), and the Pfeature algorithm was used to compute features for the peptides. Whereas the state-of-the-art method used a regularization-based feature selection technique (linear support vector classifier with the L1 penalty), we used a multivariate feature selection technique (minimum redundancy maximum relevance, mRMR) to obtain non-redundant and highly relevant features. In the proposed study (improved IL-13 prediction, iIL13Pred), the mRMR feature selection method is instrumental in choosing the most discriminatory features of IL-13-inducing peptides with improved performance. We investigated seven common machine learning classifiers, including Decision Tree, Gaussian Naïve Bayes, k-Nearest Neighbour, Logistic Regression, Support Vector Machine, Random Forest, and extreme gradient boosting, to efficiently classify IL-13-inducing peptides. We report improved AUC and MCC scores of 0.83 and 0.33, respectively, on validation data compared with the current method.
CONCLUSIONS: Extensive benchmarking experiments suggest that the proposed method (iIL13Pred) could provide better sensitivity, specificity, accuracy, area under the receiver operating characteristic curve (AUCROC), and Matthews correlation coefficient (MCC) than the existing state-of-the-art approach (IL13Pred) on the validation dataset and on an external dataset comprising experimentally validated IL-13-inducing peptides. Additionally, the experiments were performed with an increased number of experimentally validated training datasets to obtain a more robust model. A user-friendly web server (www.soodlab.com/iil13pred) is also provided to facilitate rapid screening of IL-13-inducing peptides.
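The greedy idea behind minimum redundancy maximum relevance selection can be sketched in plain Python. This is a simplified difference-form variant in which absolute Pearson correlation stands in for mutual information, so it is not the exact estimator used by any particular mRMR package or by the authors; the feature names and values are made up for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def mrmr(features, labels, k):
    """Greedily pick k features maximising relevance minus mean redundancy."""
    relevance = {name: abs(pearson(col, labels)) for name, col in features.items()}
    selected = []
    remaining = set(features)
    while remaining and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                abs(pearson(features[name], features[s])) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy
        best = max(sorted(remaining), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected


labels = [0, 0, 1, 1, 0, 1]
features = {
    "f1": [0.1, 0.2, 0.9, 0.8, 0.1, 0.9],  # strongly related to the label
    "f2": [0.1, 0.2, 0.9, 0.8, 0.2, 0.9],  # near-duplicate of f1
    "f3": [1, 0, 1, 1, 0, 0],              # weaker but less redundant
}
chosen = mrmr(features, labels, k=2)
print(chosen)
```

In this toy case the near-duplicate `f2` is skipped in favour of `f3`, which is the behaviour that distinguishes a multivariate criterion from ranking features by relevance alone.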
Affiliation(s)
- Pooja Arora
  - Department of Zoology, Hansraj College, University of Delhi, Delhi, India
- Neha Periwal
  - Department of Biochemistry, Jamia Hamdard, Delhi, India
- Yash Goyal
  - Department of Computer Science, Hansraj College, University of Delhi, Delhi, India
- Vikas Sood
  - Department of Biochemistry, Jamia Hamdard, Delhi, India
- Baljeet Kaur
  - Department of Computer Science, Hansraj College, University of Delhi, Delhi, India
|
16
|
Teimouri H, Medvedeva A, Kolomeisky AB. Bacteria-Specific Feature Selection for Enhanced Antimicrobial Peptide Activity Predictions Using Machine-Learning Methods. J Chem Inf Model 2023; 63:1723-1733. [PMID: 36912047 DOI: 10.1021/acs.jcim.2c01551] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Indexed: 03/14/2023]
Abstract
There are several classes of short peptide molecules, known as antimicrobial peptides (AMPs), which are produced during the immune responses of living organisms against various infections. In recent years, substantial progress has been achieved in applying machine-learning methods to predict the activities of AMPs against bacteria. In most investigated cases, however, the outcome is not bacterium-specific, since the specific features of bacteria, such as the chemical composition and structure of membranes, are not considered. To overcome this problem, we developed a new computational approach that allowed us to train several supervised machine-learning models using a specific set of data associated with peptides targeting E. coli bacteria. LASSO regression and Support Vector Machine techniques were used to select, from more than 1500 physicochemical descriptors, the most important features for classifying a peptide as antimicrobial or ineffective against E. coli. We then classified active versus inactive AMPs using Support Vector Classifier, Logistic Regression, and Random Forest methods. This computational study allows us to make recommendations on how to design more efficient antibacterial drug therapies.
Affiliation(s)
- Hamid Teimouri
  - Department of Chemistry, Rice University, Houston, Texas 77005, United States
  - Center for Theoretical Biological Physics, Rice University, Houston, Texas 77005, United States
- Angela Medvedeva
  - Department of Chemistry, Rice University, Houston, Texas 77005, United States
  - Center for Theoretical Biological Physics, Rice University, Houston, Texas 77005, United States
- Anatoly B Kolomeisky
  - Department of Chemistry, Rice University, Houston, Texas 77005, United States
  - Center for Theoretical Biological Physics, Rice University, Houston, Texas 77005, United States
  - Department of Chemical and Biomolecular Engineering, Rice University, Houston, Texas 77005, United States
  - Department of Physics and Astronomy, Rice University, Houston, Texas 77005, United States
|
17
|
Zulfiqar H, Guo Z, Grace-Mercure BK, Zhang ZY, Gao H, Lin H, Wu Y. Empirical Comparison and Recent Advances of Computational Prediction of Hormone Binding Proteins Using Machine Learning Methods. Comput Struct Biotechnol J 2023; 21:2253-2261. [PMID: 37035551 PMCID: PMC10073991 DOI: 10.1016/j.csbj.2023.03.024] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 09/21/2022] [Revised: 03/15/2023] [Accepted: 03/16/2023] [Indexed: 03/19/2023] Open
Abstract
Hormone binding proteins (HBPs) belong to the group of soluble carrier proteins. These proteins selectively and non-covalently interact with hormones and promote growth hormone signaling in humans and other animals. HBPs are useful in many medical and commercial fields, so identifying them is important because it can reveal more details about how they work. However, experimental methods for recognizing hormone binding proteins are time-consuming and expensive. Computational prediction methods that use sequence information and ML algorithms have played significant roles in the correct recognition of hormone binding proteins. In this review, we compared and assessed the implementation of ML-based tools for recognizing HBPs. We hope that this study will provide sufficient awareness and knowledge for research on HBPs.
|
18
|
Fan XQ, Lin B, Hu J, Guo ZY. I-DNAN6mA: Accurate Identification of DNA N6-Methyladenine Sites Using the Base-Pairing Map and Deep Learning. J Chem Inf Model 2023; 63:1076-1086. [PMID: 36722621 DOI: 10.1021/acs.jcim.2c01465] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/02/2023]
Abstract
The recent discovery of numerous DNA N6-methyladenine (6mA) sites has transformed our perception of the roles of 6mA in living organisms. However, our ability to understand them is hampered by our inability to identify 6mA sites rapidly and cost-efficiently with existing experimental methods. Developing a novel method to quickly and accurately identify 6mA sites is critical for speeding up the detection and understanding of their function. In this study, we propose a novel computational method, called I-DNAN6mA, that identifies 6mA sites and complements experimental methods by leveraging the base-pairing rules and a well-designed three-stage deep learning model with pairwise inputs. The performance of the proposed method is benchmarked and evaluated on four species, i.e., Arabidopsis thaliana, Drosophila melanogaster, Rice, and Rosaceae. The experimental results demonstrate that I-DNAN6mA achieves area under the receiver operating characteristic curve values of 0.967, 0.963, 0.947, 0.976, and 0.990, accuracies of 91.5, 92.7, 88.2, 93.8, and 96.2%, and Matthews correlation coefficient values of 0.855, 0.831, 0.763, 0.877, and 0.924 on five benchmark data sets, respectively, and outperforms several existing state-of-the-art methods. To our knowledge, I-DNAN6mA is the first approach to identify 6mA sites using a novel image-like representation of DNA sequences and a deep learning model with pairwise inputs. I-DNAN6mA is expected to be useful for locating functional regions of DNA.
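For readers comparing the accuracy and Matthews correlation coefficient figures quoted in these abstracts, the sketch below computes the usual threshold-based metrics from a binary confusion matrix. The counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, accuracy and MCC from raw counts."""
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    acc = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Hypothetical counts: 110 true sites and 90 non-sites.
metrics = binary_metrics(tp=90, fp=10, tn=80, fn=20)
print(metrics)
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which is why it is often reported next to accuracy when the two classes differ in size.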
Affiliation(s)
- Xue-Qiang Fan
  - School of Computer and Information, Hefei University of Technology, Hefei 230009, China
- Bing Lin
  - School of Computer and Information, Hefei University of Technology, Hefei 230009, China
- Jun Hu
  - College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Zhong-Yi Guo
  - School of Computer and Information, Hefei University of Technology, Hefei 230009, China
|
19
|
Zhang YF, Wang YH, Gu ZF, Pan XR, Li J, Ding H, Zhang Y, Deng KJ. Bitter-RF: A random forest machine model for recognizing bitter peptides. Front Med (Lausanne) 2023; 10:1052923. [PMID: 36778738 PMCID: PMC9909039 DOI: 10.3389/fmed.2023.1052923] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/24/2022] [Accepted: 01/05/2023] [Indexed: 01/27/2023] Open
Abstract
Introduction: Bitter peptides are short peptides with potential medical applications. The huge potential behind their bitter taste remains to be tapped. To better explore the value of bitter peptides in practice, we need a more effective classification method for identifying them.
Methods: In this study, we developed a random forest (RF)-based model, called Bitter-RF, using sequence information of the bitter peptides. Bitter-RF covers more comprehensive and extensive information by integrating 10 features extracted from the bitter peptides, and it achieves better results than the latest-generation model on an independent validation set.
Results: The proposed model improves the classification of bitter peptides (AUROC = 0.98 on the independent test set) and extends the practical application of the RF method in protein classification tasks, as RF had not previously been used to build a prediction model for bitter peptides.
Discussion: We hope that Bitter-RF will make bitter peptide research more convenient for scholars.
Affiliation(s)
- Yu-Fei Zhang
  - School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Yu-Hao Wang
  - School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Zhi-Feng Gu
  - School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Xian-Run Pan
  - Innovative Institute of Chinese Medicine and Pharmacy, Academy for Interdiscipline, Chengdu University of Traditional Chinese Medicine, Chengdu, China
- Jian Li
  - School of Basic Medical Sciences, Chengdu University, Chengdu, China
- Hui Ding
  - School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
  - Correspondence: Hui Ding; Yang Zhang; Ke-Jun Deng
- Yang Zhang
  - Innovative Institute of Chinese Medicine and Pharmacy, Academy for Interdiscipline, Chengdu University of Traditional Chinese Medicine, Chengdu, China
- Ke-Jun Deng
  - School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
|
20
|
Nabeel Asim M, Ali Ibrahim M, Fazeel A, Dengel A, Ahmed S. DNA-MP: a generalized DNA modifications predictor for multiple species based on powerful sequence encoding method. Brief Bioinform 2023; 24:6931721. [PMID: 36528802 DOI: 10.1093/bib/bbac546] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/14/2022] [Revised: 11/06/2022] [Accepted: 11/12/2022] [Indexed: 12/23/2022] Open
Abstract
Accurate prediction of deoxyribonucleic acid (DNA) modifications is essential to explore and discern the process of cell differentiation, gene expression and epigenetic regulation. Several computational approaches have been proposed for type-specific DNA modification prediction. Two recent generalized computational predictors are capable of detecting three different types of DNA modifications; however, both type-specific and generalized modification predictors show limited performance across multiple species, mainly because they use ineffective sequence encoding methods. This paper presents a generalized computational approach, "DNA-MP", that predicts three different DNA modifications across multiple species more precisely. The proposed DNA-MP approach uses a powerful encoding method, "position specific nucleotides occurrence based on modification and non-modification class densities normalized difference" (POCD-ND), to generate statistical representations of DNA sequences, and a deep forest classifier for modification prediction. The POCD-ND encoder generates statistical representations by extracting position-specific distributional information of nucleotides in the DNA sequences. We perform a comprehensive intrinsic and extrinsic evaluation of the proposed encoder and compare its performance with the 32 most widely used encoding methods on 17 benchmark DNA modification prediction datasets from 12 different species using 10 different machine learning classifiers. Overall, with all classifiers, the proposed POCD-ND encoder outperforms the 32 existing encoders. Furthermore, taken together over the 5-fold cross-validation benchmark datasets and independent test sets, the proposed DNA-MP predictor outperforms state-of-the-art type-specific and generalized modification predictors by an average accuracy of 7% across 4mC datasets, 1.35% across 5hmC datasets and 10% across 6mA datasets.
To facilitate the scientific community, the DNA-MP web application is available at https://sds_genetic_analysis.opendfki.de/DNA_Modifications/.
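The abstract does not give the POCD-ND formula, so the Python sketch below shows only one plausible reading of its name: at every position, the density of each nucleotide in the modified class and in the unmodified class is computed from training sequences, and a sequence is then represented by the normalized difference of the two densities for the nucleotide it carries at each position. The function names, the normalization by the sum of densities, and the toy sequences are all assumptions for illustration.

```python
from collections import Counter


def fit_position_scores(pos_seqs, neg_seqs, alphabet="ACGT"):
    """Per-position normalized density difference between the two classes."""
    length = len(pos_seqs[0])
    scores = []
    for p in range(length):
        pos_counts = Counter(s[p] for s in pos_seqs)
        neg_counts = Counter(s[p] for s in neg_seqs)
        table = {}
        for base in alphabet:
            d_pos = pos_counts[base] / len(pos_seqs)
            d_neg = neg_counts[base] / len(neg_seqs)
            total = d_pos + d_neg
            # Assumed normalization: difference divided by the sum of densities.
            table[base] = (d_pos - d_neg) / total if total else 0.0
        scores.append(table)
    return scores


def encode(seq, scores):
    """Replace each nucleotide with its class-discriminative score."""
    return [scores[p].get(base, 0.0) for p, base in enumerate(seq)]


# Toy equal-length training windows (hypothetical).
modified = ["ACG", "ACT"]
unmodified = ["TCG", "TGG"]
scores = fit_position_scores(modified, unmodified)
vector = encode("ACG", scores)
print(vector)
```

Under this reading, values near 1 mark nucleotides seen mostly in modified windows at that position and values near -1 mark those seen mostly in unmodified ones, giving one number per position for a downstream classifier.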
Affiliation(s)
- Muhammad Nabeel Asim
  - Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern 67663, Germany
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
- Muhammad Ali Ibrahim
  - Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern 67663, Germany
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
- Ahtisham Fazeel
  - Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern 67663, Germany
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
- Andreas Dengel
  - Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern 67663, Germany
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
- Sheraz Ahmed
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
|
21
|
Ahmed B, Haque MA, Iquebal MA, Jaiswal S, Angadi UB, Kumar D, Rai A. DeepAProt: Deep learning based abiotic stress protein sequence classification and identification tool in cereals. Frontiers in Plant Science 2023; 13:1008756. [PMID: 36714750 PMCID: PMC9877618 DOI: 10.3389/fpls.2022.1008756] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/01/2022] [Accepted: 11/14/2022] [Indexed: 06/18/2023]
Abstract
The impact of climate change on crop growth has been alarming. Extreme weather conditions can stress crops and reduce the yield of major crops, including those of the Poaceae family, which supplies 50% of the world's food calories and 20% of its protein intake. Computational approaches, such as artificial intelligence-based techniques, have come to the forefront of prediction-based data interpretation and plant stress responses. In this study, we proposed a novel activation function, namely Gaussian Error Linear Unit with Sigmoid (SIELU), which was implemented, along with other hyperparameters, in the development of a Deep Learning (DL) model for classifying unknown abiotic stress protein sequences from crops of the Poaceae family. To develop these models, data on proteins responsive to four different abiotic stresses (namely cold, drought, heat and salinity) in crops of the Poaceae family were retrieved from the public domain. The DL models with our proposed SIELU activation function outperformed models using the GeLU activation function, SVM and RF, with 95.11%, 80.78%, 94.97%, and 81.69% accuracy for cold, drought, heat and salinity, respectively. In addition, a web-based tool named DeepAProt (http://login1.cabgrid.res.in:5500/) was developed using the Flask API, along with a mobile app. This server/app will provide researchers with a convenient, rapid and economical tool for identifying proteins for abiotic stress management in crops of the Poaceae family, in the endeavour of higher production for food security and combating hunger, supporting UN SDG goal 2.0.
Affiliation(s)
- Bulbul Ahmed
  - Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
- Md Ashraful Haque
  - Division of Computer Application, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
- Mir Asif Iquebal
  - Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
- Sarika Jaiswal
  - Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
- U. B. Angadi
  - Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
- Dinesh Kumar
  - Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
  - Department of Biotechnology, School of Interdisciplinary and Applied Sciences, Central University of Haryana, Mahendergarh, Haryana, India
- Anil Rai
  - Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
|
22
|
Ding Y, He W, Tang J, Zou Q, Guo F. Laplacian Regularized Sparse Representation Based Classifier for Identifying DNA N4-Methylcytosine Sites via L2,1/2-Matrix Norm. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:500-511. [PMID: 34882559 DOI: 10.1109/tcbb.2021.3133309] [Citation(s) in RCA: 11] [Impact Index Per Article: 11.0] [Indexed: 05/04/2023]
Abstract
N4-methylcytosine (4mC) is one of the important epigenetic modifications in DNA sequences. Detecting 4mC sites is time-consuming. Computational methods based on machine learning have provided effective help for identifying 4mC. To further improve prediction performance, we propose a Laplacian Regularized Sparse Representation based Classifier with L2,1/2-matrix norm (LapRSRC). We also use the kernel trick to derive the kernel LapRSRC for nonlinear modeling. Matrix factorization is employed to solve for the sparse representation coefficients of all test samples in the training set, and an efficient iterative algorithm is proposed to solve the objective function. We apply our model to six benchmark datasets of 4mC and eight UCI datasets to evaluate performance. The results show that the performance of our method is better or comparable.
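For orientation, the L2,1/2 matrix norm named above is usually understood as the p = 1/2 member of the L2,p family. The abstract does not spell out the definition, so the form below is the standard one from the sparse-learning literature, written for a coefficient matrix X with n rows x^i; the symbols X, n, and x^i are introduced here for illustration only.

```latex
\|X\|_{2,p} = \Bigl( \sum_{i=1}^{n} \|x^{i}\|_{2}^{\,p} \Bigr)^{1/p},
\qquad
\|X\|_{2,1/2} = \Bigl( \sum_{i=1}^{n} \|x^{i}\|_{2}^{\,1/2} \Bigr)^{2}.
```

For p < 1 this is a quasi-norm rather than a true norm, since the triangle inequality fails, and it tends to push whole rows of X to zero more aggressively than the L2,1 norm does.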
|
23
|
Zeng W, Gautam A, Huson DH. MuLan-Methyl-multiple transformer-based language models for accurate DNA methylation prediction. Gigascience 2022; 12:giad054. [PMID: 37489753 PMCID: PMC10367125 DOI: 10.1093/gigascience/giad054] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 02/19/2023] [Revised: 05/09/2023] [Accepted: 07/18/2023] [Indexed: 07/26/2023] Open
Abstract
Transformer-based language models are successfully used to address massive text-related tasks. DNA methylation is an important epigenetic mechanism, and its analysis provides valuable insights into gene regulation and biomarker identification. Several deep learning-based methods have been proposed to identify DNA methylation, and each seeks to strike a balance between computational effort and accuracy. Here, we introduce MuLan-Methyl, a deep learning framework for predicting DNA methylation sites, which is based on 5 popular transformer-based language models. The framework identifies methylation sites for 3 different types of DNA methylation: N6-adenine, N4-cytosine, and 5-hydroxymethylcytosine. Each of the employed language models is adapted to the task using the "pretrain and fine-tune" paradigm. Pretraining is performed on a custom corpus of DNA fragments and taxonomy lineages using self-supervised learning. Fine-tuning aims at predicting the DNA methylation status of each type. The 5 models are used to collectively predict the DNA methylation status. We report excellent performance of MuLan-Methyl on a benchmark dataset. Moreover, we argue that the model captures characteristic differences between different species that are relevant for methylation. This work demonstrates that language models can be successfully adapted to applications in biological sequence analysis and that joint utilization of different language models improves model performance. Mulan-Methyl is open source, and we provide a web server that implements the approach.
Affiliation(s)
- Wenhuan Zeng
- Algorithms in Bioinformatics, Institute for Bioinformatics and Medical Informatics, University of Tübingen, 72076 Tübingen, Germany
- Anupam Gautam
- Algorithms in Bioinformatics, Institute for Bioinformatics and Medical Informatics, University of Tübingen, 72076 Tübingen, Germany
- International Max Planck Research School “From Molecules to Organisms”, Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany
- Cluster of Excellence: EXC 2124: Controlling Microbes to Fight Infection, University of Tübingen, 72076 Tübingen, Germany
- Daniel H Huson
- Algorithms in Bioinformatics, Institute for Bioinformatics and Medical Informatics, University of Tübingen, 72076 Tübingen, Germany
- International Max Planck Research School “From Molecules to Organisms”, Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany
- Cluster of Excellence: EXC 2124: Controlling Microbes to Fight Infection, University of Tübingen, 72076 Tübingen, Germany
24
Pallante L, Korfiati A, Androutsos L, Stojceski F, Bompotas A, Giannikos I, Raftopoulos C, Malavolta M, Grasso G, Mavroudi S, Kalogeras A, Martos V, Amoroso D, Piga D, Theofilatos K, Deriu MA. Toward a general and interpretable umami taste predictor using a multi-objective machine learning approach. Sci Rep 2022; 12:21735. [PMID: 36526644 PMCID: PMC9758219 DOI: 10.1038/s41598-022-25935-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2022] [Accepted: 12/07/2022] [Indexed: 12/23/2022]
Abstract
The umami taste is one of the five basic taste modalities normally linked to the protein content in food. The implementation of fast and cost-effective tools for the prediction of the umami taste of a molecule remains extremely interesting to understand the molecular basis of this taste and to effectively rationalise the production and consumption of specific foods and ingredients. However, the only examples of umami predictors available in the literature rely on the amino acid sequence of the analysed peptides, limiting the applicability of the models. In the present study, we developed a novel ML-based algorithm, named VirtuousUmami, able to predict the umami taste of a query compound starting from its SMILES representation, thus opening up the possibility of potentially using such a model on any database through a standard and more general molecular description. Herein, we have tested our model on five databases related to foods or natural compounds. The proposed tool will pave the way toward the rationalisation of the molecular features underlying the umami taste and toward the design of specific peptide-inspired compounds with specific taste properties.
Affiliation(s)
- Lorenzo Pallante
- Department of Mechanical and Aerospace Engineering, Politecnico di Torino, PolitoBIOMedLab, 10129 Torino, Italy
- Filip Stojceski
- Department of Innovative Technologies, Dalle Molle Institute for Artificial Intelligence, 6962 Lugano-Viganello, Switzerland
- Agorakis Bompotas
- Industrial Systems Institute, Athena Research Center, 265 04 Patras, Greece
- Ioannis Giannikos
- Industrial Systems Institute, Athena Research Center, 265 04 Patras, Greece
- Christos Raftopoulos
- Industrial Systems Institute, Athena Research Center, 265 04 Patras, Greece
- Marta Malavolta
- Faculty of Computer and Information Science, University of Ljubljana, 1000 Ljubljana, Slovenia
- Gianvito Grasso
- Department of Innovative Technologies, Dalle Molle Institute for Artificial Intelligence, 6962 Lugano-Viganello, Switzerland
- Seferina Mavroudi
- InSyBio PC, 265 04 Patras, Greece; Department of Nursing, University of Patras, 265 04 Patras, Greece
- Athanasios Kalogeras
- Industrial Systems Institute, Athena Research Center, 265 04 Patras, Greece
- Vanessa Martos
- Department of Plant Physiology, Institute of Biotechnology, University of Granada, 18011 Granada, Spain
- Dario Piga
- Department of Innovative Technologies, Dalle Molle Institute for Artificial Intelligence, 6962 Lugano-Viganello, Switzerland
- Marco A. Deriu
- Department of Mechanical and Aerospace Engineering, Politecnico di Torino, PolitoBIOMedLab, 10129 Torino, Italy
25
Identification of adaptor proteins using the ANOVA feature selection technique. Methods 2022; 208:42-47. [DOI: 10.1016/j.ymeth.2022.10.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/28/2022] [Revised: 10/01/2022] [Accepted: 10/24/2022] [Indexed: 11/06/2022]
26
Huang Z, Lou K, Liu H. A novel prognostic signature based on N7-methylguanosine-related long non-coding RNAs in breast cancer. Front Genet 2022; 13:1030275. [PMID: 36313442 PMCID: PMC9608183 DOI: 10.3389/fgene.2022.1030275] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 08/28/2022] [Accepted: 10/03/2022] [Indexed: 12/24/2022]
Abstract
Long non-coding RNAs (lncRNAs) are closely associated with the occurrence and progression of tumors. However, research on N7-methylguanosine (m7G)-related lncRNAs in breast cancer is lacking. Therefore, the present study explored the prognostic value, gene expression characteristics, and effects of m7G-related lncRNAs on tumor immune cell infiltration and tumor mutational burden (TMB) in breast cancer. lncRNA expression matrices and clinical follow-up data of patients with breast cancer were obtained from The Cancer Genome Atlas, revealing eight significantly differentially expressed and prognostically relevant m7G-related lncRNAs in breast cancer tissues: BAIAP2-DT, COL4A2-AS1, FARP1-AS1, RERE-AS1, NDUFA6-DT, TFAP2A-AS1, LINC00115, and MIR302CHG. A breast cancer prognostic signature was created based on these m7G-related lncRNAs using least absolute shrinkage and selection operator Cox regression. The prognostic signature combined with potential prognostic factors showed independent prognostic value, reliability, and specificity. Meanwhile, we constructed a risk score-based nomogram to assist clinical decision-making. Gene set enrichment analysis revealed that the low- and high-risk groups were associated with metabolism-related pathways. Analyses with the CIBERSORT algorithm demonstrated an association between tumor immune cell infiltration and the prognostic signature. We also assessed the correlation between the prognostic signature and TMB. Lastly, quantitative real-time polymerase chain reaction analysis was performed to validate the differentially expressed lncRNAs. The prognostic signature based on m7G-related lncRNAs has the potential to predict the survival prognosis of patients with breast cancer. The eight m7G-related lncRNAs identified in this study might represent potential biomarkers and therapeutic targets of breast cancer.
27
Rehman MU, Tayara H, Zou Q, Chong KT. i6mA-Caps: a CapsuleNet-based framework for identifying DNA N6-methyladenine sites. Bioinformatics 2022; 38:3885-3891. [PMID: 35771648 DOI: 10.1093/bioinformatics/btac434] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 02/14/2022] [Revised: 05/19/2022] [Accepted: 06/28/2022] [Indexed: 12/24/2022]
Abstract
MOTIVATION: Recent research has demonstrated that DNA N6-methyladenine (6mA) has an essential function in epigenetic modification in eukaryotic species. 6mA has been linked to various biological processes. It is critical to create a new algorithm that can rapidly and reliably detect 6mA sites in genomes to investigate their biological roles. The identification of 6mA marks in the genome is the first and most important step in understanding the underlying molecular processes, as well as their regulatory functions. RESULTS: In this article, we propose a novel computational tool called i6mA-Caps, a CapsuleNet-based framework for identifying DNA N6-methyladenine sites. The proposed framework uses a single encoding scheme for numerical representation of the DNA sequence. The numerical data is then used by a set of convolution layers to extract low-level features. These features are then used by the capsule network to extract intermediate-level and later high-level features to classify the 6mA sites. The proposed network is evaluated on three datasets belonging to three genomes: Rosaceae, Rice and Arabidopsis thaliana. The proposed method attained an accuracy of 96.71%, 94% and 86.83% on the independent Rosaceae, Rice and A. thaliana datasets, respectively. The proposed framework exhibited improved results when compared with the existing top-of-the-line methods. AVAILABILITY AND IMPLEMENTATION: A user-friendly web server is made available for biological experts and can be accessed at: http://nsclbio.jbnu.ac.kr/tools/i6mA-Caps/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mobeen Ur Rehman
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
- Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, South Korea
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea; Advances Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, South Korea
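The i6mA-Caps abstract in entry 27 mentions a single encoding scheme that turns a DNA sequence into numbers before the convolution layers, but it does not say which scheme is used. The sketch below assumes one-hot encoding, a common choice for this kind of input; the function name `one_hot_encode` and the handling of unknown bases are hypothetical and are not taken from the paper.

```python
# Assumption: one-hot encoding over the alphabet A, C, G, T.
# The abstract only says "a single encoding scheme", so this is illustrative.
NUCLEOTIDES = "ACGT"


def one_hot_encode(sequence):
    """Return one 4-element row per base, with a 1 in the column of that base."""
    rows = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        if base in NUCLEOTIDES:
            row[NUCLEOTIDES.index(base)] = 1
        # Symbols outside the alphabet (for example N) stay as an all-zero row.
        rows.append(row)
    return rows


encoded = one_hot_encode("ACGTN")
```

Each row of `encoded` could then serve as one position of the input matrix passed to a convolutional layer.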
28
Al Taweraqi N, King RD. Improved prediction of gene expression through integrating cell signalling models with machine learning. BMC Bioinformatics 2022; 23:323. [PMID: 35933367 PMCID: PMC9356471 DOI: 10.1186/s12859-022-04787-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2020] [Accepted: 04/13/2022] [Indexed: 11/24/2022]
Abstract
Background: A key problem in bioinformatics is that of predicting gene expression levels. There are two broad approaches: use of mechanistic models that aim to directly simulate the underlying biology, and use of machine learning (ML) to empirically predict expression levels from descriptors of the experiments. There are advantages and disadvantages to both approaches: mechanistic models more directly reflect the underlying biological causation, but do not directly utilize the available empirical data, while ML methods do not fully utilize existing biological knowledge. Results: Here, we investigate overcoming these disadvantages by integrating mechanistic cell signalling models with ML. Our approach to integration is to augment ML with similarity features (attributes) computed from cell signalling models. Seven different sets of similarity features were generated using graph theory. Each set of features was in turn used to learn multi-target regression models. All the feature sets significantly improved accuracy over the baseline model without the similarity features. Finally, the seven multi-target regression models were stacked together to form an overall prediction model that was significantly better than the baseline on 95% of genes on an independent test set. The similarity features enable this stacking model to provide interpretable knowledge about cancer, e.g. the role of ERBB3 in the MCF7 breast cancer cell line. Conclusion: Integrating mechanistic models as graphs helps both to improve the predictive results of machine learning models and to provide biological knowledge about genes that can help in building state-of-the-art mechanistic models. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-022-04787-8.
Affiliation(s)
- Nada Al Taweraqi
- Department of Computer Science, University of Manchester, Manchester, UK; Department of Computer Science, Taif University, Taif, Saudi Arabia
- Ross D King
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge, UK; Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden; Alan Turing Institute, London, UK
29
Abbas Z, Tayara H, Chong KT. ZayyuNet - A Unified Deep Learning Model for the Identification of Epigenetic Modifications Using Raw Genomic Sequences. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2533-2544. [PMID: 34038365 DOI: 10.1109/tcbb.2021.3083789] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Indexed: 06/12/2023]
Abstract
Epigenetic modifications have a vital role in gene expression and are linked to cellular processes such as differentiation, development, and tumorigenesis. Thus, the availability of reliable and accurate methods for identifying and defining these changes facilitates greater insights into the regulatory mechanisms that rely on epigenetic modifications. The current experimental methods provide a genome-wide identification of epigenetic modifications; however, they are expensive and time-consuming. To date, several machine learning methods have been proposed for identifying modifications such as DNA N6-Methyladenine (6mA), RNA N6-Methyladenosine (m6A), DNA N4-methylcytosine (4mC), and RNA pseudouridine (Ψ). However, these methods are task-specific computational tools and require different encoding representations of DNA/RNA sequences. In this study, we propose a unified deep learning model, called ZayyuNet, for the identification of various epigenetic modifications. The proposed model is based on an architecture called SpinalNet, inspired by the human somatosensory system, that can efficiently receive large inputs and achieve better performance. The proposed model has been evaluated on various epigenetic modifications such as 6mA, m6A, 4mC, and Ψ, and the results achieved outperform current state-of-the-art models. A user-friendly web server has been built and made freely available at http://nsclbio.jbnu.ac.kr/tools/ZayyuNet/.
30
Li H, Zhang N, Wang Y, Xia S, Zhu Y, Xing C, Tian X, Du Y. DNA N6-Methyladenine Modification in Eukaryotic Genome. Front Genet 2022; 13:914404. [PMID: 35812743 PMCID: PMC9263368 DOI: 10.3389/fgene.2022.914404] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 04/06/2022] [Accepted: 06/08/2022] [Indexed: 11/18/2022]
Abstract
DNA methylation is regarded as an important epigenetic mark in various biological activities. In the past, a large number of articles focused on 5mC while paying little attention to N6-methyladenine (6mA). The presence of 6mA modification was previously discovered only in prokaryotes. Recently, with the development of detection technologies, 6mA has been found in several eukaryotes, including protozoans, metazoans, plants, and fungi. The importance of 6mA in prokaryotes and single-celled eukaryotes has been widely accepted. However, due to the incredibly low density of 6mA and restrictions on detection technologies, the prevalence of 6mA and its role in biological processes in eukaryotic organisms are highly debated. In this review, we first summarize the advantages and disadvantages of 6mA detection methods. Then, we summarize existing reports on the prevalence of 6mA in eukaryotic organisms. Next, we highlight possible methyltransferases, demethylases, and the recognition proteins of 6mA. In addition, we summarize the functions of 6mA in eukaryotes. Last but not least, we present our point of view and put forward the problems that need further research.
Affiliation(s)
- Hao Li
- School of Basic Medical Sciences, Anhui Medical University, Hefei, China
- First School of Clinical Medicine, Anhui Medical University, Hefei, China
- First Affiliated Hospital of Anhui Medical University, Hefei, China
- Ning Zhang
- School of Basic Medical Sciences, Anhui Medical University, Hefei, China
- First School of Clinical Medicine, Anhui Medical University, Hefei, China
- First Affiliated Hospital of Anhui Medical University, Hefei, China
- Yuechen Wang
- School of Basic Medical Sciences, Anhui Medical University, Hefei, China
- Second School of Clinical Medicine, Anhui Medical University, Hefei, China
- Siyuan Xia
- School of Basic Medical Sciences, Anhui Medical University, Hefei, China
- Second School of Clinical Medicine, Anhui Medical University, Hefei, China
- Yating Zhu
- School of Basic Medical Sciences, Anhui Medical University, Hefei, China
- Chen Xing
- School of Basic Medical Sciences, Anhui Medical University, Hefei, China
- Xuefeng Tian
- School of Basic Medical Sciences, Anhui Medical University, Hefei, China
- First School of Clinical Medicine, Anhui Medical University, Hefei, China
- Yinan Du
- School of Basic Medical Sciences, Anhui Medical University, Hefei, China
- *Correspondence: Yinan Du,
31
Charoenkwan P, Schaduangrat N, Hasan MM, Moni MA, Lió P, Shoombuatong W. Empirical comparison and analysis of machine learning-based predictors for predicting and analyzing of thermophilic proteins. EXCLI JOURNAL 2022; 21:554-570. [PMID: 35651661 PMCID: PMC9150013 DOI: 10.17179/excli2022-4723] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 01/28/2022] [Accepted: 02/21/2022] [Indexed: 12/15/2022]
Abstract
Thermophilic proteins (TPPs) are critical for basic research and in the food industry due to their ability to maintain a thermodynamically stable fold at extremely high temperatures. Thus, the expeditious identification of novel TPPs through computational models from protein sequences is very desirable. Over the last few decades, a number of computational methods, especially machine learning (ML)-based methods, for in silico prediction of TPPs have been developed. Therefore, it is desirable to revisit these methods and summarize their advantages and disadvantages in order to further develop new computational approaches to achieve more accurate and improved prediction of TPPs. With this goal in mind, we comprehensively investigate a large collection of fourteen state-of-the-art TPP predictors in terms of their dataset size, feature encoding schemes, feature selection strategies, ML algorithms, evaluation strategies and web server/software usability. To the best of our knowledge, this article represents the first comprehensive review on the development of ML-based methods for in silico prediction of TPPs. These TPP predictors can be classified into two groups according to the interpretability of the ML algorithms employed (i.e., computational black-box methods and computational white-box methods). We also conducted a comparative study of several currently available TPP predictors on two benchmark datasets. Finally, we provide future perspectives for the design and development of new computational models for TPP prediction. We hope that this comprehensive review will help researchers select the TPP predictor best suited to their purposes and provide useful perspectives for the development of more effective and accurate TPP predictors.
Affiliation(s)
- Phasit Charoenkwan
- Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai, Thailand, 50200
- Nalini Schaduangrat
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand, 10700
- Md Mehedi Hasan
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA 70112, USA
- Mohammad Ali Moni
- School of Health and Rehabilitation Sciences, Faculty of Health and Behavioural Sciences, the University of Queensland, St Lucia, QLD 4072, Australia
- Pietro Lió
- Department of Computer Science and Technology, University of Cambridge, Cambridge, CB3 0FD, UK
- Watshara Shoombuatong
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand, 10700
32
Hosen MF, Mahmud SH, Ahmed K, Chen W, Moni MA, Deng HW, Shoombuatong W, Hasan MM. DeepDNAbP: A deep learning-based hybrid approach to improve the identification of deoxyribonucleic acid-binding proteins. Comput Biol Med 2022; 145:105433. [DOI: 10.1016/j.compbiomed.2022.105433] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2021] [Revised: 03/11/2022] [Accepted: 03/20/2022] [Indexed: 11/03/2022]
33
Tang X, Zheng P, Li X, Wu H, Wei DQ, Liu Y, Huang G. Deep6mAPred: A CNN and Bi-LSTM-based deep learning method for predicting DNA N6-methyladenosine sites across plant species. Methods 2022; 204:142-150. [PMID: 35477057 DOI: 10.1016/j.ymeth.2022.04.011] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 03/18/2022] [Revised: 04/16/2022] [Accepted: 04/20/2022] [Indexed: 12/11/2022]
Abstract
DNA N6-methyladenine (6mA) is a key DNA modification, which plays versatile roles in the cellular processes, including regulation of gene expression, DNA repair, and DNA replication. DNA 6mA is closely associated with many diseases in the mammals and with growth as well as development of plants. Precisely detecting DNA 6mA sites is of great importance to exploration of 6mA functions. Although many computational methods have been presented for DNA 6mA prediction, there is still a wide gap in the practical application. We presented a convolution neural network (CNN) and bi-directional long-short term memory (Bi-LSTM)-based deep learning method (Deep6mAPred) for predicting DNA 6mA sites across plant species. The Deep6mAPred stacked the CNNs and the Bi-LSTMs in a paralleling manner instead of a series-connection manner. The Deep6mAPred also employed the attention mechanism for improving the representations of sequences. The Deep6mAPred reached an accuracy of 0.9556 over the independent rice dataset, far outperforming the state-of-the-art methods. The tests across plant species showed that the Deep6mAPred is of a remarkable advantage over the state of the art methods. We developed a user-friendly web application for DNA 6mA prediction, which is freely available at http://106.13.196.152:7001/ for all the scientific researchers. The Deep6mAPred would enrich tools to predict DNA 6mA sites and speed up the exploration of DNA modification.
Affiliation(s)
- Xingyu Tang
- School of Electrical Engineering, Shaoyang University, Shaoyang, Hunan 422000, China
- Peijie Zheng
- School of Electrical Engineering, Shaoyang University, Shaoyang, Hunan 422000, China
- Xueyong Li
- School of Electrical Engineering, Shaoyang University, Shaoyang, Hunan 422000, China
- Hongyan Wu
- The Joint Engineering Research Center for Health Big Data Intelligent Analysis Technology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
- Dong-Qing Wei
- The Joint Engineering Research Center for Health Big Data Intelligent Analysis Technology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China; State Key Laboratory of Microbial Metabolism, and School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China.
- Yuewu Liu
- College of Information and Intelligence, Hunan Agricultural University, Changsha, Hunan 410081, China
- Guohua Huang
- School of Electrical Engineering, Shaoyang University, Shaoyang, Hunan 422000, China.
34
MLCPP 2.0: An updated cell-penetrating peptides and their uptake efficiency predictor. J Mol Biol 2022; 434:167604. [DOI: 10.1016/j.jmb.2022.167604] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 11/30/2021] [Revised: 04/03/2022] [Accepted: 04/19/2022] [Indexed: 12/12/2022]
35
Liu M, Sun ZL, Zeng Z, Lam KM. MGF6mARice: prediction of DNA N6-methyladenine sites in rice by exploiting molecular graph feature and residual block. Brief Bioinform 2022; 23:6553606. [PMID: 35325050 DOI: 10.1093/bib/bbac082] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2021] [Revised: 02/13/2022] [Accepted: 02/16/2022] [Indexed: 11/12/2022]
Abstract
DNA N6-methyladenine (6mA) is produced by the N6 position of the adenine being methylated, which occurs at the molecular level, and is involved in numerous vital biological processes in the rice genome. Given the shortcomings of biological experiments, researchers have developed many computational methods to predict 6mA sites and achieved good performance. However, the existing methods do not consider the occurrence mechanism of 6mA to extract features from the molecular structure. In this paper, a novel deep learning method is proposed by devising DNA molecular graph feature and residual block structure for 6mA sites prediction in rice, named MGF6mARice. Firstly, the DNA sequence is changed into a simplified molecular input line entry system (SMILES) format, which reflects chemical molecular structure. Secondly, for the molecular structure data, we construct the DNA molecular graph feature based on the principle of graph convolutional network. Then, the residual block is designed to extract higher level, distinguishable features from molecular graph features. Finally, the prediction module is used to obtain the result of whether it is a 6mA site. By means of 10-fold cross-validation, MGF6mARice outperforms the state-of-the-art approaches. Multiple experiments have shown that the molecular graph feature and residual block can promote the performance of MGF6mARice in 6mA prediction. To the best of our knowledge, it is the first time to derive a feature of DNA sequence by considering the chemical molecular structure. We hope that MGF6mARice will be helpful for researchers to analyze 6mA sites in rice.
Affiliation(s)
- Mengya Liu
- School of Computer Science and Technology, Anhui University, Hefei, 230601, China
- Zhan-Li Sun
- School of Artificial Intelligence, Anhui University, Hefei, 230601, China
- Zhigang Zeng
- School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, 430074, China
- Kin-Man Lam
- Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong, China
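The MGF6mARice abstract in entry 35 reports results obtained by 10-fold cross-validation. As a rough illustration of what that protocol involves, the sketch below partitions sample indices into k folds and yields each fold once as the held-out set. The function name `k_fold_indices`, the shuffling seed, and the sample count are hypothetical; this is not the authors' evaluation code.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Shuffle once with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [
            idx
            for fold_no, fold in enumerate(folds)
            if fold_no != held_out
            for idx in fold
        ]
        yield train_idx, test_idx


# Example with 25 made-up samples.
splits = list(k_fold_indices(25, k=10))
```

A model would be trained on `train_idx` and scored on `test_idx` for each pair, with the k scores averaged to give the cross-validated estimate.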
36
Wang H, Zhao H, Zhang J, Han J, Liu Z. A parallel model of DenseCNN and ordered-neuron LSTM for generic and species-specific succinylation site prediction. Biotechnol Bioeng 2022; 119:1755-1767. [PMID: 35320585 DOI: 10.1002/bit.28091] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/13/2021] [Revised: 03/12/2022] [Accepted: 03/19/2022] [Indexed: 11/07/2022]
Abstract
Lysine succinylation (Ksucc) regulates various metabolic processes, participates in vital life processes, and is involved in the occurrence and development of numerous diseases. Accurate recognition of succinylation sites can reveal underlying functional mechanisms and pathogenesis. However, most remain undetected. Moreover, a deep learning architecture focusing on generic and species-specific predictions is still lacking. Thus, we proposed a deep learning-based framework named Deep-Ksucc, combining a dense convolutional network (DenseCNN) and ordered-neuron long short-term memory (OnLSTM) in parallel, which took the cascading characteristics of sequence information and physicochemical properties as the input. The results of the generic and species-specific predictions indicated that Deep-Ksucc can identify sequence patterns of different organisms and recognize plenty of succinylation sites. The case study showed that Deep-Ksucc can serve as a reliable tool for biology verification and computer-aided recognition of succinylation sites.
Affiliation(s)
- Huiqing Wang
- College of Information and Computer, Taiyuan University of Technology, Taiyuan, 030024, China
- Hong Zhao
- College of Information and Computer, Taiyuan University of Technology, Taiyuan, 030024, China
- Jing Zhang
- Engineering Training Center, Taiyuan University of Technology, Taiyuan, 030024, China
- Jiale Han
- College of Information and Computer, Taiyuan University of Technology, Taiyuan, 030024, China
- Zhihao Liu
- College of Information and Computer, Taiyuan University of Technology, Taiyuan, 030024, China
| |
|
37
|
Ahmad S, Charoenkwan P, Quinn JMW, Moni MA, Hasan MM, Lio' P, Shoombuatong W. SCORPION is a stacking-based ensemble learning framework for accurate prediction of phage virion proteins. Sci Rep 2022; 12:4106. [PMID: 35260777 PMCID: PMC8904530 DOI: 10.1038/s41598-022-08173-5] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 01/24/2022] [Accepted: 03/03/2022] [Indexed: 12/30/2022] Open
Abstract
Fast and accurate identification of phage virion proteins (PVPs) would greatly facilitate antibacterial drug discovery and development. Although several research efforts based on machine learning (ML) methods have been made for in silico identification of PVPs, these methods have certain limitations. Therefore, in this study, we propose a new computational approach, termed SCORPION (StaCking-based Predictor fOR Phage VIrion PrOteiNs), to accurately identify PVPs using only protein primary sequences. Specifically, we explored a comprehensive set of 13 feature descriptors from different aspects (i.e., compositional information, composition-transition-distribution information, position-specific information and physicochemical properties) with 10 popular ML algorithms to construct a pool of optimal baseline models. These optimal baseline models were then used to generate probabilistic features (PFs), which were treated as a new feature vector. Finally, we utilized a two-step feature selection strategy to determine the optimal PF feature vector and used this feature vector to develop a stacked model (SCORPION). Both tenfold cross-validation and independent test results indicate that SCORPION achieves better predictive performance than its constituent baseline models and existing methods. We anticipate SCORPION will serve as a useful tool for the cost-effective and large-scale screening of new PVPs. The source codes and datasets for this work are available for downloading in the GitHub repository (https://github.com/saeed344/SCORPION).
Affiliation(s)
- Saeed Ahmad
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
| | - Phasit Charoenkwan
- Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai, 50200, Thailand
| | - Julian M W Quinn
- Bone Biology Division, Garvan Institute of Medical Research, 384 Victoria Street, Darlinghurst, NSW, 2010, Australia
| | - Mohammad Ali Moni
- Faculty of Health and Behavioural Sciences, School of Health and Rehabilitation Sciences, The University of Queensland, St Lucia, QLD, 4072, Australia
| | - Md Mehedi Hasan
- Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane Center for Biomedical Informatics and Genomics, Tulane University, New Orleans, LA, 70112, USA
| | - Pietro Lio'
- Department of Computer Science and Technology, University of Cambridge, Cambridge, CB3 0FD, UK
| | - Watshara Shoombuatong
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand.
| |
|
38
|
Tsukiyama S, Hasan MM, Deng HW, Kurata H. BERT6mA: prediction of DNA N6-methyladenine site using deep learning-based approaches. Brief Bioinform 2022; 23:6539171. [PMID: 35225328 PMCID: PMC8921755 DOI: 10.1093/bib/bbac053] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Received: 12/08/2021] [Revised: 01/28/2022] [Accepted: 01/31/2022] [Indexed: 01/29/2023] Open
Abstract
N6-methyladenine (6mA) is associated with important roles in DNA replication, DNA repair, transcription, and regulation of gene expression. Several experimental methods were used to identify DNA modifications. However, these experimental methods are costly and time-consuming. To detect 6mA and complement these shortcomings of experimental methods, we proposed a novel deep learning approach called BERT6mA. To compare BERT6mA with other deep learning approaches, we used benchmark datasets including 11 species. BERT6mA presented the highest AUCs in eight species in independent tests. Furthermore, BERT6mA showed higher or comparable performance relative to the state-of-the-art models, although it performed poorly in a few species with a small sample size. To overcome this issue, pretraining and fine-tuning between two species were applied to BERT6mA. The pretrained and fine-tuned models on specific species presented higher performances than other models even for the species with a small sample size. In addition to the prediction, we analyzed the attention weights generated by BERT6mA to reveal how the model extracts critical features responsible for the 6mA prediction. To facilitate biological sciences, the BERT6mA online web server and its source codes are freely accessible at https://github.com/kuratahiroyuki/BERT6mA.git, respectively.
Affiliation(s)
- Sho Tsukiyama
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
| | - Md Mehedi Hasan
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA 70112, USA
| | - Hong-Wen Deng
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA 70112, USA
| | - Hiroyuki Kurata
- Corresponding author: Hiroyuki Kurata, Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan. Tel: 81-948-29-7828; E-mail:
| |
|
39
|
Banna HU, Zanabli A, McMillan B, Lehmann M, Gupta S, Gerbo M, Palko J. Evaluation of machine learning algorithms for trabeculectomy outcome prediction in patients with glaucoma. Sci Rep 2022; 12:2473. [PMID: 35169235 PMCID: PMC8847459 DOI: 10.1038/s41598-022-06438-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 04/07/2021] [Accepted: 01/18/2022] [Indexed: 02/04/2023] Open
Abstract
The purpose of this study was to evaluate the performance of machine learning algorithms to predict trabeculectomy surgical outcomes. Preoperative systemic, demographic and ocular data from consecutive trabeculectomy surgeries from a single academic institution between January 2014 and December 2018 were incorporated into models using random forest, support vector machine, artificial neural networks and multivariable logistic regression. Mean area under the receiver operating characteristic curve (AUC) and accuracy were used to evaluate the discrimination of each model to predict complete success of trabeculectomy surgery at 1 year. The top performing model was optimized using recursive feature selection and hyperparameter tuning. Calibration and net benefit of the final models were assessed. Among the 230 trabeculectomy surgeries performed on 184 patients, 104 (45.2%) were classified as complete success. Random forest was found to be the top performing model with an accuracy of 0.68 and AUC of 0.74 using 5-fold cross-validation to evaluate the final optimized model. These results provide evidence that machine learning models offer value in predicting trabeculectomy outcomes in patients with refractory glaucoma.
Affiliation(s)
- Hasan Ul Banna
- Department of Ophthalmology and Visual Sciences, West Virginia University School of Medicine, Morgantown, WV, 26506, USA
| | - Ahmed Zanabli
- Department of Ophthalmology and Visual Sciences, West Virginia University School of Medicine, Morgantown, WV, 26506, USA
| | - Brian McMillan
- Department of Ophthalmology and Visual Sciences, West Virginia University School of Medicine, Morgantown, WV, 26506, USA
| | - Maria Lehmann
- Department of Ophthalmology and Visual Sciences, West Virginia University School of Medicine, Morgantown, WV, 26506, USA
| | - Sumeet Gupta
- Department of Ophthalmology and Visual Sciences, West Virginia University School of Medicine, Morgantown, WV, 26506, USA
| | - Michael Gerbo
- Department of Ophthalmology and Visual Sciences, West Virginia University School of Medicine, Morgantown, WV, 26506, USA
| | - Joel Palko
- Department of Ophthalmology and Visual Sciences, West Virginia University School of Medicine, Morgantown, WV, 26506, USA.
| |
|
40
|
Kabir M, Nantasenamat C, Kanthawong S, Charoenkwan P, Shoombuatong W. Large-scale comparative review and assessment of computational methods for phage virion proteins identification. EXCLI JOURNAL 2022; 21:11-29. [PMID: 35145365 PMCID: PMC8822302 DOI: 10.17179/excli2021-4411] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/12/2021] [Accepted: 11/29/2021] [Indexed: 12/11/2022]
Abstract
Phage virion proteins (PVPs) are effective at recognizing and binding to host cell receptors while having no deleterious effects on human or animal cells. Understanding their functional mechanisms is regarded as a critical goal that will aid in rational antibacterial drug discovery and development. Although high-throughput experimental methods for identifying PVPs are considered the gold standard for exploring crucial PVP features, these procedures are frequently time-consuming and labor-intensive. Thus far, more than ten sequence-based predictors have been established for the in silico identification of PVPs in conjunction with traditional experimental approaches. As a result, a revised and more thorough assessment is extremely desirable. With this purpose in mind, we first conduct a thorough survey and evaluation of 13 state-of-the-art PVP predictors. These PVP predictors can be classified into three groups according to the types of machine learning (ML) algorithms employed (i.e. traditional ML-based methods, ensemble-based methods and deep learning-based methods). Subsequently, we explored which factors are important for building more accurate and stable predictors, including training/independent datasets, feature encoding algorithms, feature selection methods, core algorithms, performance evaluation metrics/strategies and web servers. Finally, we provide insights and future perspectives for the design and development of new and more effective computational approaches for the detection and characterization of PVPs.
Affiliation(s)
- Muhammad Kabir
- School of Systems and Technology, Department of Computer Science, University of Management and Technology, Lahore, Pakistan, 54770
| | - Chanin Nantasenamat
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand, 10700
| | - Sakawrat Kanthawong
- Department of Microbiology, Faculty of Medicine, Khon Kaen University, Khon Kaen, Thailand, 40002
| | - Phasit Charoenkwan
- Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai, Thailand, 50200
| | - Watshara Shoombuatong
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand, 10700
| |
|
41
|
Wang H, Joshi P, Hong SH, Maye PF, Rowe DW, Shin DG. Predicting the targets of IRF8 and NFATc1 during osteoclast differentiation using the machine learning method framework cTAP. BMC Genomics 2022; 23:14. [PMID: 34991467 PMCID: PMC8740472 DOI: 10.1186/s12864-021-08159-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/10/2021] [Accepted: 10/26/2021] [Indexed: 01/14/2023] Open
Abstract
BACKGROUND Interferon regulatory factor-8 (IRF8) and nuclear factor-activated T cells c1 (NFATc1) are two transcription factors that have an important role in osteoclast differentiation. Thanks to ChIP-seq technology, scientists can now estimate potential genome-wide target genes of IRF8 and NFATc1. However, finding target genes that are consistently up-regulated or down-regulated across different studies is hard because it requires analysis of a large number of high-throughput expression studies from a comparable context. METHOD We have developed a machine learning-based method, called the Cohort-based TF target prediction system (cTAP), to overcome this problem. This method assumes that the pathway involving the transcription factors of interest features multiple "functional groups" of marker genes pertaining to the biological process concerned. It uses two notions, Gene-Present Sufficiently (GP) and Gene-Absent Insufficiently (GA), in addition to log2 fold changes of differentially expressed genes for the prediction. Target prediction is made by applying multiple machine-learning models, which learn the patterns of GP and GA from log2 fold changes and four types of Z scores from the normalized cohort's gene expression data. The learned patterns are then associated with the putative transcription factor targets to identify genes that consistently exhibit Up/Down gene regulation patterns within the cohort. We applied this method to 11 publicly available GEO data sets related to osteoclastogenesis. RESULT Our experiment identified a small number of Up/Down IRF8 and NFATc1 target genes as relevant to osteoclast differentiation. The machine learning models using GP and GA produced NFATc1 and IRF8 target genes different from those obtained using log2 fold change alone. Our literature survey revealed that all predicted target genes have known roles in bone remodeling, specifically related to the immune system and osteoclast formation and functions, suggesting confidence and validity in our method. CONCLUSION cTAP was motivated by recognizing that biologists tend to use Z score values present in data sets for the analysis. However, using cTAP effectively presupposes assembling a sizable cohort of gene expression data sets within a comparable context. As public gene expression data repositories grow, the need to use cohort-based analysis methods like cTAP will become increasingly important.
Affiliation(s)
- Honglin Wang
- Computer Science and Engineering Department, University of Connecticut, Storrs, USA
| | - Pujan Joshi
- Computer Science and Engineering Department, University of Connecticut, Storrs, USA
| | - Seung-Hyun Hong
- Computer Science and Engineering Department, University of Connecticut, Storrs, USA
| | - Peter F. Maye
- Department of Reconstructive Sciences, University of Connecticut Health Center, Farmington, USA
| | - David W. Rowe
- Center for Regenerative Medicine and Skeletal Development, University of Connecticut Health Center, Farmington, USA
| | - Dong-Guk Shin
- Computer Science and Engineering Department, University of Connecticut, Storrs, USA
| |
|
42
|
Teng Z, Zhao Z, Li Y, Tian Z, Guo M, Lu Q, Wang G. i6mA-Vote: Cross-Species Identification of DNA N6-Methyladenine Sites in Plant Genomes Based on Ensemble Learning With Voting. FRONTIERS IN PLANT SCIENCE 2022; 13:845835. [PMID: 35237293 PMCID: PMC8882731 DOI: 10.3389/fpls.2022.845835] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 12/30/2021] [Accepted: 01/24/2022] [Indexed: 05/17/2023]
Abstract
DNA N6-Methyladenine (6mA) is a common epigenetic modification that plays significant roles in the growth and development of plants. Identifying 6mA sites is crucial for elucidating the functions of 6mA. In this article, a novel model named i6mA-vote is developed to predict 6mA sites in plants. Firstly, DNA sequences were coded into six feature vectors with diverse strategies based on density, physicochemical properties, and position of nucleotides, respectively. To find the best coding strategy, the feature vectors were compared on several machine learning classifiers. The results suggested that the position of nucleotides has a significant positive effect on 6mA site identification. Thus, the dinucleotide one-hot strategy, which describes the position characteristics of nucleotides well, was employed to extract DNA features in our method. Secondly, DNA sequences of Rosaceae were randomly divided into a training dataset and a test dataset. Finally, i6mA-vote was constructed by combining five different base classifiers under a majority voting strategy and trained on the Rosaceae training dataset. The i6mA-vote was evaluated on the task of predicting 6mA sites from the genomes of Rosaceae, Rice, and Arabidopsis separately. In Rosaceae, the performance of i6mA-vote was 0.955 for accuracy (ACC), 0.909 for Matthews correlation coefficient (MCC), 0.955 for sensitivity (SN), and 0.954 for specificity (SP). Those indicators, in the order of ACC, MCC, SN, SP, were 0.882, 0.774, 0.961, and 0.803 on Rice, while they were 0.798, 0.617, 0.666, and 0.929 on Arabidopsis. According to these indicators, our method is effective and outperforms the other methods compared. The results also illustrate that i6mA-vote performs well in 6mA site prediction not only within species but also across plant species. Moreover, the specificity is distinctly lower than the sensitivity in Rice, while the opposite holds in Arabidopsis. This may result from sequence similarity among Rosaceae, Rice and Arabidopsis.
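The two building blocks named in this abstract, dinucleotide one-hot encoding and majority voting over base classifiers, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the fixed AA..TT ordering, and the all-zero handling of ambiguous bases are assumptions made here.

```python
from collections import Counter
from itertools import product

# All 16 dinucleotides in a fixed order (AA, AC, AG, AT, CA, ...).
# The ordering is an assumption made for this sketch.
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]
INDEX = {d: i for i, d in enumerate(DINUCLEOTIDES)}


def dinucleotide_one_hot(seq):
    """Encode each overlapping dinucleotide as a 16-dimensional one-hot block."""
    seq = seq.upper()
    encoded = []
    for i in range(len(seq) - 1):
        block = [0] * 16
        pair = seq[i:i + 2]
        if pair in INDEX:  # pairs with ambiguous bases (e.g. N) stay all-zero
            block[INDEX[pair]] = 1
        encoded.extend(block)
    return encoded


def majority_vote(predictions):
    """Hard voting: return the label predicted by most base classifiers."""
    return Counter(predictions).most_common(1)[0][0]
```

With an odd number of base classifiers and binary labels, as in the five-classifier setup described above, the vote can never tie.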
Affiliation(s)
- Zhixia Teng
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Zhengnan Zhao
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Yanjuan Li
- College of Electrical and Information Engineering, Quzhou University, Quzhou, China
| | - Zhen Tian
- College of Information Engineering, Zhengzhou University, Zhengzhou, China
| | - Maozu Guo
- College of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
| | - Qianzi Lu
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- *Correspondence: Qianzi Lu,
| | - Guohua Wang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Guohua Wang,
| |
|
43
|
Le NQK, Ho QT. Deep transformers and convolutional neural network in identifying DNA N6-methyladenine sites in cross-species genomes. Methods 2021; 204:199-206. [PMID: 34915158 DOI: 10.1016/j.ymeth.2021.12.004] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Received: 11/13/2021] [Revised: 11/30/2021] [Accepted: 12/09/2021] [Indexed: 12/19/2022] Open
Abstract
As one of the most common post-transcriptional epigenetic modifications, N6-methyladenine (6mA) plays an essential role in various cellular processes and disease pathogenesis. Therefore, accurately identifying 6mA modifications is necessary for a deep understanding of cellular processes and other possible functional mechanisms. Although a few computational methods have been proposed, their respective models were developed with small training datasets. Hence, their practical application is quite limited in genome-wide detection. To overcome the existing limitations, we present a novel model based on transformer architecture and deep learning to identify DNA 6mA sites from the cross-species genome. The model was constructed on a benchmark dataset and uses features derived from pre-trained transformer word embedding approaches. Subsequently, a convolutional neural network was employed to learn the generated features and generate the prediction outcomes. As a result, our predictor achieved excellent performance on the independent test, with an accuracy and Matthews correlation coefficient (MCC) of 79.3% and 0.58, respectively. Overall, it achieved better accuracy than the baseline models and significantly outperformed the existing predictors, demonstrating the effectiveness of our proposed hybrid framework. Furthermore, our model is expected to assist biologists in accurately identifying 6mA sites and formulating novel testable biological hypotheses. We also release source codes and datasets freely at https://github.com/khanhlee/bert-dna for front-end users.
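The accuracy and MCC quoted in this abstract are standard confusion-matrix metrics. As a reminder of how they are computed, here is a small sketch; the function names and the convention of returning 0.0 when the MCC denominator vanishes are assumptions of this example, and any counts passed in are hypothetical rather than taken from the paper.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: report 0.0 when a row or column is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

Unlike accuracy, MCC stays near zero for a classifier that only predicts the majority class, which is why the two are usually reported together.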
Affiliation(s)
- Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei 106, Taiwan; Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei 106, Taiwan; Translational Imaging Research Center, Taipei Medical University Hospital, Taipei 110, Taiwan.
| | - Quang-Thai Ho
- College of Information & Communication Technology, Can Tho University, Viet Nam; Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan
| |
|
44
|
Nguyen TTD, Trinh VN, Le NQK, Ou YY. Using k-mer embeddings learned from a Skip-gram based neural network for building a cross-species DNA N6-methyladenine site prediction model. PLANT MOLECULAR BIOLOGY 2021; 107:533-542. [PMID: 34843033 DOI: 10.1007/s11103-021-01204-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2021] [Accepted: 09/25/2021] [Indexed: 06/13/2023]
Abstract
This study used k-mer embeddings as an effective feature to identify DNA N6-Methyladenine sites in plant genomes and obtained improved performance without substantial effort in feature extraction, combination and selection. Identification of DNA N6-methyladenine sites has been a very active topic of computational biology due to the unavailability of suitable methods to identify them accurately, especially in plants. Previous results were obtained with great effort put into extracting, heuristically searching, or fusing diverse types of features, not to mention a feature selection step. In this study, we regarded DNA sequences as textual information and employed natural language processing techniques to decipher hidden biological meanings from those sequences. In other words, we considered DNA, the human life book, as a book corpus for training DNA language models. K-mer embeddings were then generated from these language models to be used in machine learning prediction models. Skip-gram neural networks were the base of the language models and ensemble tree-based algorithms were the machine learning algorithms for prediction models. We trained the prediction model on a Rosaceae genome dataset and performed a comprehensive test on 3 plant genome datasets. Our proposed method shows promising performance with AUC performance approaching an ideal value on Rosaceae dataset (0.99), a high score on Rice dataset (0.95) and improved performance on Rice dataset while enjoying an elegant, yet efficient feature extraction process.
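Treating DNA as text, as this abstract describes, starts by splitting a sequence into overlapping k-mer "words" and forming the (center, context) pairs on which a Skip-gram model is trained. The sketch below illustrates only that preprocessing step; the function names, `k=3`, and the window size are illustrative assumptions, not values taken from the paper.

```python
def kmers(seq, k=3):
    """Split a sequence into overlapping k-mers (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def skipgram_pairs(tokens, window=2):
    """List the (center, context) training pairs a Skip-gram model sees."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is never its own context
                pairs.append((center, tokens[j]))
    return pairs
```

The resulting pairs would then be fed to an embedding trainer, and the learned k-mer vectors used as input features for a downstream classifier.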
Affiliation(s)
| | - Van Ngu Trinh
- Soonchunhyang Institute of Medi-Bio Science, Soonchunhyang University, Cheonan, 31151, South Korea
| | - Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei City, 106, Taiwan
- Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei City, 106, Taiwan
| | - Yu-Yen Ou
- Department of Computer Science and Engineering, Yuan Ze University, Taoyuan, 32003, Taiwan.
| |
|
45
|
Combining machine learning and conventional statistical approaches for risk factor discovery in a large cohort study. Sci Rep 2021; 11:22997. [PMID: 34837000 PMCID: PMC8626442 DOI: 10.1038/s41598-021-02476-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/01/2021] [Accepted: 11/16/2021] [Indexed: 11/08/2022] Open
Abstract
We present a simple and efficient hypothesis-free machine learning pipeline for risk factor discovery that accounts for non-linearity and interaction in large biomedical databases with minimal variable pre-processing. In this study, mortality models were built using gradient boosting decision trees (GBDT) and important predictors were identified using a Shapley values-based feature attribution method, SHAP values. Cox models controlled for false discovery rate were used for confounder adjustment, interpretability, and further validation. The pipeline was tested using information from 502,506 UK Biobank participants, aged 37–73 years at recruitment and followed over seven years for mortality registrations. From the 11,639 predictors included in GBDT, 193 potential risk factors had SHAP values ≥ 0.05, passed the correlation test, and were selected for further modelling. Of the total variable importance summed up, 60% was directly health related, and baseline characteristics, sociodemographics, and lifestyle factors each contributed about 10%. Cox models adjusted for baseline characteristics, showed evidence for an association with mortality for 166 out of the 193 predictors. These included mostly well-known risk factors (e.g., age, sex, ethnicity, education, material deprivation, smoking, physical activity, self-rated health, BMI, and many disease outcomes). For 19 predictors we saw evidence for an association in the unadjusted but not adjusted analyses, suggesting bias by confounding. Our GBDT-SHAP pipeline was able to identify relevant predictors ‘hidden’ within thousands of variables, providing an efficient and pragmatic solution for the first stage of hypothesis free risk factor identification.
|
46
|
Huang Z, Li J, Chen J, Chen D. Construction of Prognostic Risk Model of 5-Methylcytosine-Related Long Non-Coding RNAs and Evaluation of the Characteristics of Tumor-Infiltrating Immune Cells in Breast Cancer. Front Genet 2021; 12:748279. [PMID: 34777473 PMCID: PMC8585929 DOI: 10.3389/fgene.2021.748279] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 07/27/2021] [Accepted: 10/12/2021] [Indexed: 11/28/2022] Open
Abstract
Purpose: The role of 5-methylcytosine-related long non-coding RNAs (m5C-lncRNAs) in breast cancer (BC) remains unclear. Here, we aimed to investigate the prognostic value, gene expression characteristics, and correlation between m5C-lncRNA risk model and tumor immune cell infiltration in BC. Methods: The expression matrix of m5C-lncRNAs in BC was obtained from The Cancer Genome Atlas database, and the lncRNAs were analyzed using differential expression analysis as well as univariate and multivariate Cox regression analysis to eventually obtain BC-specific m5C-lncRNAs. A risk model was developed based on three lncRNAs using multivariate Cox regression and the prognostic value, accuracy, as well as reliability were verified. Gene set enrichment analysis (GSEA) was used to analyze the Kyoto Encyclopedia of Genes and Genomes signaling pathway enrichment of the risk model. CIBERSORT algorithm and correlation analysis were used to explore the characteristics of the BC tumor-infiltrating immune cells. Finally, reverse transcription-quantitative polymerase chain reaction was performed to detect the expression level of three lncRNA in clinical samples. Results: A total of 334 differential m5C-lncRNAs were identified, and three BC-specific m5C-lncRNAs were selected, namely AP005131.2, AL121832.2, and LINC01152. Based on these three lncRNAs, a highly reliable and specific risk model was constructed, which was proven to be closely related to the prognosis of patients with BC. Therefore, a nomogram based on the risk score was built to assist clinical decisions. GSEA revealed that the risk model was significantly enriched in metabolism-related pathways and was associated with tumor immune cell infiltration based on the analysis with the CIBERSORT algorithm. 
Conclusion: The efficient risk model based on m5C-lncRNAs associated with cancer metabolism and tumor immune cell infiltration could predict the survival prognosis of patients, and AP005131.2, AL121832.2, and LINC01152 could be novel biomarkers and therapeutic targets for BC.
Affiliation(s)
| | | | | | - Debo Chen
- Department of Breast Surgery, Quanzhou First Hospital of Fujian Medical University, Quanzhou, China
| |
|
47
|
Yu K, Xie W, Wang L, Li W. ILRC: a hybrid biomarker discovery algorithm based on improved L1 regularization and clustering in microarray data. BMC Bioinformatics 2021; 22:514. [PMID: 34686127 PMCID: PMC8532312 DOI: 10.1186/s12859-021-04443-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/04/2021] [Accepted: 10/13/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Finding significant genes or proteins from gene chip data for disease diagnosis and drug development is an important task. However, the challenge comes from the curse of dimensionality. It is of great significance to use machine learning methods to find important features from the data and build an accurate classification model. RESULTS The proposed method proved superior to a published advanced hybrid feature selection method and to traditional feature selection methods on different public microarray data sets. In addition, the biomarkers selected using our method match those provided by the cooperating hospital in a set of clinical cleft lip and palate data. METHOD In this paper, a feature selection algorithm, ILRC, based on clustering and improved L1 regularization is proposed. The features are first clustered, and the redundant features in the sub-clusters are deleted. Then all the remaining features are iteratively evaluated using ILR. The final result is given according to the cumulative weight reordering. CONCLUSION The proposed method can effectively remove redundant features. The algorithm's output has high stability and classification accuracy, and it can potentially identify candidate biomarkers.
Affiliation(s)
- Kun Yu
- College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, China
| | - Weidong Xie
- School of Computer Science and Engineering, Northeastern University, Shenyang, China
| | - Linjie Wang
- School of Computer Science and Engineering, Northeastern University, Shenyang, China
| | - Wei Li
- Key Laboratory of Intelligent Computing in Medical Image (MIIC), Northeastern University, Ministry of Education, Shenyang, China
| |
|
48
|
Liu T, Chen J, Zhang Q, Hippe K, Hunt C, Le T, Cao R, Tang H. The Development of Machine Learning Methods in discriminating Secretory Proteins of Malaria Parasite. Curr Med Chem 2021; 29:807-821. [PMID: 34636289 DOI: 10.2174/0929867328666211005140625] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/24/2021] [Revised: 07/28/2021] [Accepted: 08/15/2021] [Indexed: 11/22/2022]
Abstract
Malaria caused by Plasmodium falciparum is one of the major infectious diseases in the world. An effective method for predicting the secretory proteins of malaria parasites is essential for developing effective cures and treatments. Biochemical assays can provide details for accurate identification of the secretory proteins, but these methods are expensive and time-consuming. In this paper, we summarize the machine learning-based identification algorithms and compare the construction strategies of the different computational methods. We also discuss the use of machine learning to improve the ability of algorithms to identify proteins secreted by malaria parasites.
Affiliation(s)
- Ting Liu
  - School of Basic Medical Sciences, Southwest Medical University, Luzhou, China
- Jiamao Chen
  - School of Basic Medical Sciences, Southwest Medical University, Luzhou, China
- Qian Zhang
  - School of Basic Medical Sciences, Southwest Medical University, Luzhou, China
- Kyle Hippe
  - Department of Computer Science, Pacific Lutheran University, United States
- Cassandra Hunt
  - Department of Computer Science, Pacific Lutheran University, United States
- Thu Le
  - Department of Computer Science, Pacific Lutheran University, United States
- Renzhi Cao
  - Department of Computer Science, Pacific Lutheran University, United States
- Hua Tang
  - School of Basic Medical Sciences, Southwest Medical University, Luzhou, China
49
Auwul MR, Zhang C, Rahman MR, Shahjaman M, Alyami SA, Moni MA. Network-based transcriptomic analysis identifies the genetic effect of COVID-19 to chronic kidney disease patients: A bioinformatics approach. Saudi J Biol Sci 2021; 28:5647-5656. [PMID: 34127904 PMCID: PMC8190333 DOI: 10.1016/j.sjbs.2021.06.015] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 01/05/2021] [Revised: 06/01/2021] [Accepted: 06/02/2021] [Indexed: 12/15/2022] Open
Abstract
COVID-19 has emerged as a global health threat. Chronic kidney disease (CKD) patients are immune-compromised and may have a high risk of infection by SARS-CoV-2. We aimed to detect common transcriptomic signatures and pathways between COVID-19 and CKD by systems biology analysis. We analyzed transcriptomic data obtained from peripheral blood mononuclear cells (PBMC) infected with SARS-CoV-2 and PBMC of CKD patients. We identified 49 differentially expressed genes (DEGs) that were common between COVID-19 and CKD. The gene ontology and pathway analysis showed that the DEGs were associated with "platelet degranulation", "regulation of wound healing", "platelet activation", "focal adhesion", "regulation of actin cytoskeleton" and "PI3K-Akt signalling pathway". The protein-protein interaction (PPI) network encoded by the common DEGs showed ten hub proteins (EPHB2, PRKAR2B, CAV1, ARHGEF12, HSP90B1, ITGA2B, BCL2L1, E2F1, TUBB1, and C3). In addition, we identified significant transcription factors and microRNAs that may regulate the common DEGs. We performed a protein-drug interaction analysis and identified potential drugs, namely aspirin, estradiol, rapamycin, and nebivolol. The identified common gene signature and pathways between COVID-19 and CKD may be therapeutic targets in COVID-19 patients with CKD comorbidity.
Affiliation(s)
- Md. Rabiul Auwul
  - School of Economics and Statistics, Guangzhou University, Guangzhou 510006, China
- Chongqi Zhang
  - School of Economics and Statistics, Guangzhou University, Guangzhou 510006, China
- Md Rezanur Rahman
  - Department of Biotechnology and Genetic Engineering, Faculty of Biological Sciences, Islamic University, Kushtia 7003, Bangladesh
  - Department of Biochemistry and Biotechnology, School of Biomedical Science, Khwaja Yunus Ali University, Enayetpur, Sirajganj 6751, Bangladesh
- Md. Shahjaman
  - Department of Statistics, Begum Rokeya University, Rangpur 5400, Bangladesh
- Salem A. Alyami
  - Department of Mathematics and Statistics, Imam Mohammad Ibn Saud Islamic University, Saudi Arabia
- Mohammad Ali Moni
  - WHO Collaborating Centre on eHealth, UNSW Digital Health, School of Public Health and Community Medicine, Faculty of Medicine, UNSW Sydney, Australia
  - The Garvan Institute of Medical Research, Healthy Ageing Theme, Darlinghurst, NSW 2010, Australia
50
Malik AA, Chotpatiwetchkul W, Phanus-Umporn C, Nantasenamat C, Charoenkwan P, Shoombuatong W. StackHCV: a web-based integrative machine-learning framework for large-scale identification of hepatitis C virus NS5B inhibitors. J Comput Aided Mol Des 2021; 35:1037-1053. [PMID: 34622387 DOI: 10.1007/s10822-021-00418-1] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 07/07/2021] [Accepted: 09/17/2021] [Indexed: 01/07/2023]
Abstract
Fast and accurate identification of inhibitors with potency against HCV NS5B polymerase is currently a challenging task. Although conventional experimental methods are the gold standard for the design and development of new HCV inhibitors, they often require a costly investment of time and resources. In this study, we develop a novel machine learning-based meta-predictor (termed StackHCV) for accurate and large-scale identification of HCV inhibitors. Unlike the existing method, which is based on a single-feature approach, we first constructed a pool of various baseline models by employing a wide range of heterogeneous molecular fingerprints with five popular machine learning algorithms (k-nearest neighbor, multi-layer perceptron, partial least squares, random forest and support vector machine). Secondly, we integrated these baseline models to develop the final meta-based model by means of the stacking strategy. Extensive benchmarking experiments showed that StackHCV achieved a more accurate and stable performance than its constituent baseline models on the training dataset and also outperformed the existing predictor on the independent test dataset. To facilitate the high-throughput identification of HCV inhibitors, we built a web server that can be freely accessed at http://camt.pythonanywhere.com/StackHCV. It is expected that StackHCV could be a useful tool for fast and precise identification of potential drugs against HCV NS5B, particularly for liver cancer therapy and other clinical applications.
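The stacking strategy mentioned in this abstract can be illustrated with a small, dependency-free sketch. It is not the StackHCV code: for brevity it uses only a k-nearest-neighbor base learner on each feature "view" (standing in for the different molecular fingerprints), omits the other four algorithms the abstract lists, and uses a plain logistic regression as the meta-model. All function names, the fold scheme, and the hyperparameters are assumptions for illustration.

```python
import math


def sigmoid(z):
    """Logistic function with clamping to avoid overflow."""
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))


def knn_proba(train_x, train_y, query, k=3):
    """Base learner: fraction of positive labels among the k nearest points."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    return sum(train_y[i] for i in order[:k]) / k


def out_of_fold(view, labels, folds=5, k=3):
    """Predict each sample with a base model that never saw it in training,
    so the meta-model is not fitted on leaked labels."""
    n = len(labels)
    preds = [0.0] * n
    for f in range(folds):
        train = [i for i in range(n) if i % folds != f]
        train_x = [view[i] for i in train]
        train_y = [labels[i] for i in train]
        for i in range(f, n, folds):  # held-out indices of this fold
            preds[i] = knn_proba(train_x, train_y, view[i], k)
    return preds


def fit_logistic(rows, labels, lr=0.5, epochs=300):
    """Meta-model: unregularized logistic regression by gradient descent."""
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            for i in range(d):
                grad_w[i] += err * x[i] / n
            grad_b += err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def fit_stack(views, labels):
    """Build one meta-feature per view from out-of-fold base predictions."""
    meta_rows = list(zip(*(out_of_fold(v, labels) for v in views)))
    return fit_logistic(meta_rows, labels)


def predict_stack(views, labels, model, query_views, k=3):
    """Score a new sample: base predictions per view, then the meta-model."""
    w, b = model
    feats = [knn_proba(v, labels, q, k) for v, q in zip(views, query_views)]
    return sigmoid(b + sum(wi * fi for wi, fi in zip(w, feats)))
```

In this sketch `views` is a list of feature matrices (one per descriptor type, each a list of numeric vectors for the same samples) and `labels` holds 0/1 classes. The key design point is the out-of-fold step: the meta-model learns how much to trust each baseline model from predictions on data that baseline model did not train on.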
Affiliation(s)
- Aijaz Ahmad Malik
  - Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
- Warot Chotpatiwetchkul
  - Applied Computational Chemistry Research Unit, Department of Chemistry, School of Science, King Mongkut's Institute of Technology Ladkrabang, Bangkok, 10520, Thailand
- Chuleeporn Phanus-Umporn
  - Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
- Chanin Nantasenamat
  - Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
- Phasit Charoenkwan
  - Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai, 50200, Thailand
- Watshara Shoombuatong
  - Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand