1
|
Sha M, Parveen Rahamathulla M. Splice site recognition - deciphering Exon-Intron transitions for genetic insights using Enhanced integrated Block-Level gated LSTM model. Gene 2024; 915:148429. [PMID: 38575098 DOI: 10.1016/j.gene.2024.148429] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2023] [Revised: 03/26/2024] [Accepted: 04/01/2024] [Indexed: 04/06/2024]
Abstract
Bioinformatics is an interdisciplinary field focused on analyzing the growing number of genome sequences. Gene variants are differences in DNA sequence among individuals within a population. Splice site recognition is a crucial step in gene expression, in which the coding sequences of genes are joined to form mature messenger RNA (mRNA). Genetic variants that disrupt genes are believed to be a primary cause of neurodevelopmental disorders such as autism spectrum disorder (ASD), which affects individuals, families, and society and in which developmental delay is linked to one of about a hundred genes associated with these disorders. Missense variants, premature stop codons, or deletions alter both the quality and quantity of encoded proteins. Predicting genes from exons and introns presents major challenges, including sequencing errors, short reads, incomplete genes, and overlaps. Although many traditional techniques have been used to build exon prediction systems, the primary challenge lies in accurately identifying exon length and classifying the spliced strand location of exons relative to introns. The proposed approach therefore uses a deep learning algorithm to analyze complex and extensive genomic datasets. M-LSTM is used to classify spliced DNA strands into three classes (EI as 1, IE as 2, and neither as 3). The M-LSTM can process long sequences, retaining long-range information without affecting the current input or output. This enables it to capture long-term dependencies and to mitigate exploding gradients. The proposed model is compared against Naïve Bayes and Random Forest to assess its efficacy. Its performance is evaluated using recall, F1-score, precision, and accuracy.
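The evaluation metrics named in this abstract can be illustrated with a minimal sketch. It assumes the three-class coding given above (1 = EI, 2 = IE, 3 = neither); the function name `classification_report` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter

LABELS = {1: "EI", 2: "IE", 3: "neither"}  # class coding from the abstract

def classification_report(y_true, y_pred, labels=(1, 2, 3)):
    """Return overall accuracy plus per-class precision, recall and F1."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = Counter(zip(y_true, y_pred))  # counts of (true, predicted) pairs
    report = {}
    for c in labels:
        tp = pairs[(c, c)]
        fp = sum(n for (t, p), n in pairs.items() if p == c and t != c)
        fn = sum(n for (t, p), n in pairs.items() if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[c] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(pairs[(c, c)] for c in labels) / len(y_true)
    return accuracy, report

# Toy labels invented purely for illustration (not data from the paper).
y_true = [1, 1, 2, 2, 3, 3, 3, 1]
y_pred = [1, 2, 2, 2, 3, 3, 1, 1]
accuracy, report = classification_report(y_true, y_pred)
for c, scores in report.items():
    print(LABELS[c], scores)
```

Per-class scores are reported separately here because a single accuracy figure can hide poor performance on a rare class; how the paper aggregates them is not stated in the abstract.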
Affiliation(s)
- Mohemmed Sha
  - Department of Software Engineering, College of Computer Engineering and Sciences, Prince Sattam bin Abdulaziz University, Al Kharj 11942, Kingdom of Saudi Arabia.
- Mohamudha Parveen Rahamathulla
  - Department of Basic Medical Sciences, College of Medicine, Prince Sattam bin Abdulaziz University, Al Kharj 11942, Kingdom of Saudi Arabia.
|
2
|
Liu X, Zhang H, Zeng Y, Zhu X, Zhu L, Fu J. DRANetSplicer: A Splice Site Prediction Model Based on Deep Residual Attention Networks. Genes (Basel) 2024; 15:404. [PMID: 38674339 PMCID: PMC11048956 DOI: 10.3390/genes15040404] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2024] [Revised: 03/20/2024] [Accepted: 03/23/2024] [Indexed: 04/28/2024]
Abstract
The precise identification of splice sites is essential for unraveling the structure and function of genes, constituting a pivotal step in the gene annotation process. In this study, we developed a novel deep learning model, DRANetSplicer, that integrates residual learning and attention mechanisms for enhanced accuracy in capturing the intricate features of splice sites. We constructed multiple datasets using the most recent versions of genomic data from three different organisms, Oryza sativa japonica, Arabidopsis thaliana and Homo sapiens. This approach allows us to train models with a richer set of high-quality data. DRANetSplicer outperformed benchmark methods on donor and acceptor splice site datasets, achieving average accuracies of 96.57% (donor) and 95.82% (acceptor) across the three organisms. Comparative analyses with benchmark methods, including SpliceFinder, Splice2Deep, Deep Splicer, EnsembleSplice, and DNABERT, revealed DRANetSplicer's superior predictive performance, with a relative reduction in average error rate of at least 4.2% (donor) and 11.6% (acceptor). We used the DRANetSplicer model trained on O. sativa japonica data to predict splice sites in A. thaliana, achieving accuracies of 94.89% for donor sites and 94.25% for acceptor sites. These results indicate that DRANetSplicer possesses excellent cross-organism predictive capabilities, with its performance in cross-organism predictions even surpassing that of benchmark methods in non-cross-organism predictions. Cross-organism validation showcased DRANetSplicer's excellence in predicting splice sites across similar organisms, supporting its applicability in gene annotation for understudied organisms. We employed multiple methods to visualize the decision-making process of the model. The visualization results indicate that DRANetSplicer can learn and interpret well-known biological features, further validating its overall performance. Our study systematically examined and confirmed the predictive ability of DRANetSplicer from various levels and perspectives, indicating that its practical application in gene annotation is justified.
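The "relative reduction in average error rate" reported in this abstract is computed from accuracies as the share of a baseline's error that the new model removes. The sketch below shows the arithmetic; the function name and the 96.00% baseline accuracy are hypothetical placeholders, since the abstract does not give the benchmark accuracies.

```python
def relative_error_reduction(model_accuracy, baseline_accuracy):
    """Fraction by which the model lowers the baseline's error rate.

    Both arguments are accuracies in percent (0-100).
    """
    baseline_error = 100.0 - baseline_accuracy
    model_error = 100.0 - model_accuracy
    if baseline_error <= 0:
        raise ValueError("baseline has no error to reduce")
    return (baseline_error - model_error) / baseline_error

# 96.57 is the donor-site accuracy quoted in the abstract;
# 96.00 is an assumed baseline used only to illustrate the formula.
reduction = relative_error_reduction(96.57, 96.00)
print(f"relative error reduction: {reduction:.2%}")
```

Because the denominator is the baseline's error rather than its accuracy, a gain of about half a percentage point in accuracy near 96% corresponds to a double-digit relative reduction in error.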
Affiliation(s)
- Xueyan Liu
  - College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
- Hongyan Zhang
  - College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
- Ying Zeng
  - School of Computer and Communication, Hunan Institute of Engineering, Xiangtan 411104, China
- Xinghui Zhu
  - College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
- Lei Zhu
  - College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
- Jiahui Fu
  - College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
|
3
|
Alsenan S, Al-Turaiki I, Aldayel M, Tounsi M. Role of Optimization in RNA-Protein-Binding Prediction. Curr Issues Mol Biol 2024; 46:1360-1373. [PMID: 38392205 PMCID: PMC11154364 DOI: 10.3390/cimb46020087] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2023] [Revised: 01/25/2024] [Accepted: 01/31/2024] [Indexed: 02/24/2024]
Abstract
RNA-binding proteins (RBPs) play an important role in regulating biological processes, such as gene regulation. Understanding their behavior, for example their binding sites, can help in understanding RBP-related diseases. Studies have focused on predicting RNA binding by means of machine learning algorithms, including deep convolutional neural network models. One of the integral parts of deep learning modeling is tuning hyperparameters well and minimizing a loss function using optimization algorithms. In this paper, we investigate the role of optimization in the RBP classification problem using the CLIP-Seq 21 dataset. Three optimization methods are applied to the RNA-protein binding CNN prediction model: grid search, random search, and Bayesian optimization. The empirical results show an AUC of 94.42%, 93.78%, 93.23% and 92.68% on the ELAVL1C, ELAVL1B, ELAVL1A, and HNRNPC datasets, respectively, and a mean AUC of 85.30% on 24 datasets. This paper's findings provide evidence on the role of optimizers in improving the performance of RNA-protein binding prediction.
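Two of the three tuning strategies named in this abstract, grid search and random search, can be sketched in a few lines. This is a minimal illustration under stated assumptions: `toy_validation_loss` is a hypothetical stand-in for training the CNN and measuring validation loss, and the hyperparameter names and ranges are invented for the example. Bayesian optimization is omitted because it needs a surrogate model that does not fit in a short sketch.

```python
import itertools
import random

def toy_validation_loss(learning_rate, dropout):
    """Hypothetical stand-in for training a model and returning its loss."""
    return (learning_rate - 0.01) ** 2 * 1e4 + (dropout - 0.3) ** 2

def grid_search(objective, grid):
    """Evaluate every combination in the grid; return (best_loss, best_params)."""
    names = list(grid)
    best = None
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        loss = objective(**params)
        if best is None or loss < best[0]:
            best = (loss, params)
    return best

def random_search(objective, space, n_trials, seed=0):
    """Sample n_trials points uniformly from (low, high) ranges."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {n: rng.uniform(lo, hi) for n, (lo, hi) in space.items()}
        loss = objective(**params)
        if best is None or loss < best[0]:
            best = (loss, params)
    return best

# Assumed search spaces, for illustration only.
grid = {"learning_rate": [0.001, 0.01, 0.1], "dropout": [0.1, 0.3, 0.5]}
space = {"learning_rate": (0.001, 0.1), "dropout": (0.1, 0.5)}
grid_best = grid_search(toy_validation_loss, grid)
random_best = random_search(toy_validation_loss, space, n_trials=20)
print("grid:", grid_best)
print("random:", random_best)
```

Grid search costs one evaluation per combination, so its cost grows multiplicatively with each added hyperparameter, while random search keeps a fixed budget of `n_trials` regardless of dimensionality.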
Affiliation(s)
- Shrooq Alsenan
  - Information Systems Department, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
- Isra Al-Turaiki
  - Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh 11653, Saudi Arabia
- Mashael Aldayel
  - Information Technology Department, College of Computer and Information Sciences, King Saud University, Riyadh 11451, Saudi Arabia
- Mohamed Tounsi
  - Department of Computer Science, College of Computer and Information Sciences, Prince Sultan University, P.O. Box 66833, Riyadh 12435, Saudi Arabia
|
4
|
Shen F, Hu C, Huang X, He H, Yang D, Zhao J, Yang X. Advances in alternative splicing identification: deep learning and pantranscriptome. Front Plant Sci 2023; 14:1232466. [PMID: 37790793 PMCID: PMC10544900 DOI: 10.3389/fpls.2023.1232466] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2023] [Accepted: 08/28/2023] [Indexed: 10/05/2023]
Abstract
In plants, alternative splicing is a crucial mechanism for regulating gene expression at the post-transcriptional level: it produces diverse proteins by generating multiple mature mRNA isoforms and diversifies gene regulation. Due to the complexity and variability of this process, accurate identification of splicing events is a vital step in studying alternative splicing. This article presents the application of alternative splicing algorithms with or without reference genomes in plants, as well as the integration of advanced deep learning techniques for improved detection accuracy. We also discuss alternative splicing studies in the pan-genomic background and the usefulness of integrated strategies for fully profiling alternative splicing.
Affiliation(s)
- Fei Shen
  - Institute of Biotechnology, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China
- Chenyang Hu
  - Institute of Biotechnology, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China
  - Shanxi Key Lab of Chinese Jujube, College of Life Science, Yan’an University, Yan’an, Shanxi, China
- Xin Huang
  - Institute of Biotechnology, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China
- Hao He
  - Institute of Biotechnology, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China
- Deng Yang
  - Institute of Biotechnology, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China
- Jirong Zhao
  - Shanxi Key Lab of Chinese Jujube, College of Life Science, Yan’an University, Yan’an, Shanxi, China
- Xiaozeng Yang
  - Institute of Biotechnology, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China
|
5
|
Bhandari N, Walambe R, Kotecha K, Khare SP. A comprehensive survey on computational learning methods for analysis of gene expression data. Front Mol Biosci 2022; 9:907150. [PMID: 36458095 PMCID: PMC9706412 DOI: 10.3389/fmolb.2022.907150] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 03/29/2022] [Accepted: 09/28/2022] [Indexed: 09/19/2023]
Abstract
Computational analysis methods including machine learning have a significant impact in the fields of genomics and medicine. High-throughput gene expression analysis methods such as microarray technology and RNA sequencing produce enormous amounts of data. Traditionally, statistical methods are used for comparative analysis of gene expression data. However, more complex analysis for classification of sample observations, or discovery of feature genes requires sophisticated computational approaches. In this review, we compile various statistical and computational tools used in analysis of expression microarray data. Even though the methods are discussed in the context of expression microarrays, they can also be applied for the analysis of RNA sequencing and quantitative proteomics datasets. We discuss the types of missing values, and the methods and approaches usually employed in their imputation. We also discuss methods of data normalization, feature selection, and feature extraction. Lastly, methods of classification and class discovery along with their evaluation parameters are described in detail. We believe that this detailed review will help the users to select appropriate methods for preprocessing and analysis of their data based on the expected outcome.
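Two of the preprocessing steps this review covers, missing-value imputation and normalization, can be illustrated with a minimal sketch. It assumes the simplest variants (per-gene mean imputation and z-score scaling); the function names and the toy expression values are hypothetical, and the review itself surveys many alternatives.

```python
import statistics

def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a gene with no observed values")
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

def z_score(values):
    """Centre to mean 0 and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]  # constant profile carries no signal
    return [(v - mean) / sd for v in values]

# Toy expression profile for one gene across four samples (invented values).
expression = [2.0, None, 4.0, 6.0]
imputed = impute_mean(expression)
normalized = z_score(imputed)
print(imputed)
print(normalized)
```

Imputation runs before normalization here so that the mean and standard deviation are computed over a complete profile; mean imputation shrinks the variance of the gene, which is one reason more elaborate methods exist.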
Affiliation(s)
- Nikita Bhandari
  - Computer Science Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
- Rahee Walambe
  - Electronics and Telecommunication Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
  - Symbiosis Center for Applied AI (SCAAI), Symbiosis International (Deemed University), Pune, India
- Ketan Kotecha
  - Computer Science Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
  - Symbiosis Center for Applied AI (SCAAI), Symbiosis International (Deemed University), Pune, India
- Satyajeet P. Khare
  - Symbiosis School of Biological Sciences, Symbiosis International (Deemed University), Pune, India
|