1
|
Hamouda E, Tarek M. A hybrid approach of ensemble learning and grey wolf optimizer for DNA splice junction prediction. PLoS One 2024; 19:e0310698. [PMID: 39312561 PMCID: PMC11419377 DOI: 10.1371/journal.pone.0310698] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2024] [Accepted: 09/05/2024] [Indexed: 09/25/2024] Open
Abstract
DNA splice junction classification is a crucial job in computational biology. The challenge is to predict the junction type (IE, EI, or N) from a given DNA sequence. Predicting junction type is crucial for understanding gene expression patterns, disease causes, splicing regulation, and gene structure. The location of the regions where exons are joined, and introns are removed during RNA splicing is very difficult to determine because no universal rule guides this process. This study presents a two-layer hybrid approach inspired by ensemble learning to overcome this challenge. The first layer applies the grey wolf optimizer (GWO) for feature selection. GWO's exploration ability allows it to efficiently search a vast feature space, while its exploitation ability refines promising areas, thus leading to a more reliable feature selection. The selected features are then fed into the second layer, which employs a classification model trained on the retrieved features. Using cross-validation, the proposed method divides the DNA splice junction dataset into training and test sets, allowing for a thorough examination of the classifier's generalization ability. The ensemble model is trained on various partitions of the training set and tested on the remaining held-out fold. This process is performed for each fold, comprehensively evaluating the classifier's performance. We tested our method using the StatLog DNA dataset. Compared to various machine learning models for DNA splice junction prediction, the proposed GWO+SVM ensemble method achieved an accuracy of 96%. This finding suggests that the proposed ensemble hybrid approach is promising for DNA splice junction classification. The implementation code for the proposed approach is available at https://github.com/EFHamouda/DNA-splice-junction-prediction.
Collapse
Affiliation(s)
- Eslam Hamouda
- Computer Science Department, Faculty of Computers & Information, Mansoura University, Mansoura, Egypt
- Computer Science Department, Faculty of Computers & Information, Jouf University, Jouf, Saudi Arabi
| | - Mayada Tarek
- Computer Science Department, Faculty of Computers & Information, Mansoura University, Mansoura, Egypt
- Computer Science Department, Faculty of Computers & Information, Jouf University, Jouf, Saudi Arabi
| |
Collapse
|
2
|
Zabardast A, Tamer EG, Son YA, Yılmaz A. An automated framework for evaluation of deep learning models for splice site predictions. Sci Rep 2023; 13:10221. [PMID: 37353532 PMCID: PMC10290104 DOI: 10.1038/s41598-023-34795-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/02/2022] [Accepted: 05/08/2023] [Indexed: 06/25/2023] Open
Abstract
A novel framework for the automated evaluation of various deep learning-based splice site detectors is presented. The framework eliminates time-consuming development and experimenting activities for different codebases, architectures, and configurations to obtain the best models for a given RNA splice site dataset. RNA splicing is a cellular process in which pre-mRNAs are processed into mature mRNAs and used to produce multiple mRNA transcripts from a single gene sequence. Since the advancement of sequencing technologies, many splice site variants have been identified and associated with the diseases. So, RNA splice site prediction is essential for gene finding, genome annotation, disease-causing variants, and identification of potential biomarkers. Recently, deep learning models performed highly accurately for classifying genomic signals. Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM) and its bidirectional version (BLSTM), Gated Recurrent Unit (GRU), and its bidirectional version (BGRU) are promising models. During genomic data analysis, CNN's locality feature helps where each nucleotide correlates with other bases in its vicinity. In contrast, BLSTM can be trained bidirectionally, allowing sequential data to be processed from forward and reverse directions. Therefore, it can process 1-D encoded genomic data effectively. Even though both methods have been used in the literature, a performance comparison was missing. To compare selected models under similar conditions, we have created a blueprint for a series of networks with five different levels. As a case study, we compared CNN and BLSTM models' learning capabilities as building blocks for RNA splice site prediction in two different datasets. Overall, CNN performed better with [Formula: see text] accuracy ([Formula: see text] improvement), [Formula: see text] F1 score ([Formula: see text] improvement), and [Formula: see text] AUC-PR ([Formula: see text] improvement) in human splice site prediction. Likewise, an outperforming performance with [Formula: see text] accuracy ([Formula: see text] improvement), [Formula: see text] F1 score ([Formula: see text] improvement), and [Formula: see text] AUC-PR ([Formula: see text] improvement) is achieved in C. elegans splice site prediction. Overall, our results showed that CNN learns faster than BLSTM and BGRU. Moreover, CNN performs better at extracting sequence patterns than BLSTM and BGRU. To our knowledge, no other framework is developed explicitly for evaluating splice detection models to decide the best possible model in an automated manner. So, the proposed framework and the blueprint would help selecting different deep learning models, such as CNN vs. BLSTM and BGRU, for splice site analysis or similar classification tasks and in different problems.
Collapse
Affiliation(s)
- Amin Zabardast
- Department of Health Informatics, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
| | - Elif Güney Tamer
- Department of Health Informatics, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
| | - Yeşim Aydın Son
- Department of Health Informatics, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
| | - Arif Yılmaz
- Institute of Data Science, Maastricht University, Maastricht, The Netherlands.
| |
Collapse
|
3
|
Das L, Das JK, Mohapatra S, Nanda S. DNA numerical encoding schemes for exon prediction: a recent history. NUCLEOSIDES NUCLEOTIDES & NUCLEIC ACIDS 2021; 40:985-1017. [PMID: 34455915 DOI: 10.1080/15257770.2021.1966797] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
Abstract
Bioinformatics in the present day has been firmly established as a regulator in genomics. In recent times, applications of Signal processing in exon prediction have gained a lot of attention. The exons carry protein information. Proteins are composed of connected constituents known as amino acids that characterize the specific function. Conversion of the nucleotide character string into a numerical sequence is the gateway before analyzing it through signal processing methods. This numeric encoding is the mathematical descriptor of nucleotides and is based on some statistical properties of the structure of nucleic acids. Since the type of encoding extremely affects the exon detection accuracy, this paper is devised for the review of existing encoding (mapping) schemes. The comparative analysis is formulated to emphasize the importance of the genetic code setting of amino acids considered for application related to computational elucidation for exon detection. This work covers much helpful information for future applications.
Collapse
Affiliation(s)
- Lopamudra Das
- School of Electronics Engineering, KIIT, Bhubaneswar, India
| | - J K Das
- School of Electronics Engineering, KIIT, Bhubaneswar, India
| | - S Mohapatra
- School of Electronics Engineering, KIIT, Bhubaneswar, India
| | - Sarita Nanda
- School of Electronics Engineering, KIIT, Bhubaneswar, India
| |
Collapse
|
4
|
Moosa S, Amira PA, Boughorbel DS. DASSI: differential architecture search for splice identification from DNA sequences. BioData Min 2021; 14:15. [PMID: 33588916 PMCID: PMC7885202 DOI: 10.1186/s13040-021-00237-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2020] [Accepted: 01/05/2021] [Indexed: 11/28/2022] Open
Abstract
Background The data explosion caused by unprecedented advancements in the field of genomics is constantly challenging the conventional methods used in the interpretation of the human genome. The demand for robust algorithms over the recent years has brought huge success in the field of Deep Learning (DL) in solving many difficult tasks in image, speech and natural language processing by automating the manual process of architecture design. This has been fueled through the development of new DL architectures. Yet genomics possesses unique challenges that requires customization and development of new DL models. Methods We proposed a new model, DASSI, by adapting a differential architecture search method and applying it to the Splice Site (SS) recognition task on DNA sequences to discover new high-performance convolutional architectures in an automated manner. We evaluated the discovered model against state-of-the-art tools to classify true and false SS in Homo sapiens (Human), Arabidopsis thaliana (Plant), Caenorhabditis elegans (Worm) and Drosophila melanogaster (Fly). Results Our experimental evaluation demonstrated that the discovered architecture outperformed baseline models and fixed architectures and showed competitive results against state-of-the-art models used in classification of splice sites. The proposed model - DASSI has a compact architecture and showed very good results on a transfer learning task. The benchmarking experiments of execution time and precision on architecture search and evaluation process showed better performance on recently available GPUs making it feasible to adopt architecture search based methods on large datasets. Conclusions We proposed the use of differential architecture search method (DASSI) to perform SS classification on raw DNA sequences, and discovered new neural network models with low number of tunable parameters and competitive performance compared with manually engineered architectures. We have extensively benchmarked DASSI model with other state-of-the-art models and assessed its computational efficiency. The results have shown a high potential of using automated architecture search mechanism for solving various problems in the field of genomics.
Collapse
Affiliation(s)
- Shabir Moosa
- Department of Systems Biology, SIDRA Medicine, Doha, 26999, Qatar. .,Dept. of Computer Science and Engineering, Qatar University, Doha, 2713, Qatar.
| | - Prof Abbes Amira
- Dept. of Computer Science and Engineering, Qatar University, Doha, 2713, Qatar
| | | |
Collapse
|
5
|
Amilpur S, Bhukya R. EDeepSSP: Explainable deep neural networks for exact splice sites prediction. J Bioinform Comput Biol 2020; 18:2050024. [PMID: 32696716 DOI: 10.1142/s0219720020500249] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Splice site prediction is crucial for understanding underlying gene regulation, gene function for better genome annotation. Many computational methods exist for recognizing the splice sites. Although most of the methods achieve a competent performance, their interpretability remains challenging. Moreover, all traditional machine learning methods manually extract features, which is tedious job. To address these challenges, we propose a deep learning-based approach (EDeepSSP) that employs convolutional neural networks (CNNs) architecture for automatic feature extraction and effectively predicts splice sites. Our model, EDeepSSP, divulges the opaque nature of CNN by extracting significant motifs and explains why these motifs are vital for predicting splice sites. In this study, experiments have been conducted on six benchmark acceptors and donor datasets of humans, cress, and fly. The results show that EDeepSSP has outperformed many state-of-the-art approaches. EDeepSSP achieves the highest area under the receiver operating characteristic curve (AUC_ROC) and area under the precision-recall curve (AUC_PR) of 99.32% and 99.26% on human donor datasets, respectively. We also analyze various filter activities, feature activations, and extracted significant motifs responsible for the splice site prediction. Further, we validate the learned motifs of our model against known motifs of JASPAR splice site database.
Collapse
Affiliation(s)
- Santhosh Amilpur
- Computer Science and Engineering, National Institute of Technology Warangal, Warangal, Telangana 506004, India
| | - Raju Bhukya
- Computer Science and Engineering, National Institute of Technology Warangal, Warangal, Telangana 506004, India
| |
Collapse
|
6
|
Thanapattheerakul T, Engchuan W, Chan JH. Predicting the effect of variants on splicing using Convolutional Neural Networks. PeerJ 2020; 8:e9470. [PMID: 32704450 PMCID: PMC7346860 DOI: 10.7717/peerj.9470] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/25/2019] [Accepted: 06/11/2020] [Indexed: 11/23/2022] Open
Abstract
Mutations that cause an error in the splicing of a messenger RNA (mRNA) can lead to diseases in humans. Various computational models have been developed to recognize the sequence pattern of the splice sites. In recent studies, Convolutional Neural Network (CNN) architectures were shown to outperform other existing models in predicting the splice sites. However, an insufficient effort has been put into extending the CNN model to predict the effect of the genomic variants on the splicing of mRNAs. This study proposes a framework to elaborate on the utility of CNNs to assess the effect of splice variants on the identification of potential disease-causing variants that disrupt the RNA splicing process. Five models, including three CNN-based and two non-CNN machine learning based, were trained and compared using two existing splice site datasets, Genome Wide Human splice sites (GWH) and a dataset provided at the Deep Learning and Artificial Intelligence winter school 2018 (DLAI). The donor sites were also used to test on the HSplice tool to evaluate the predictive models. To improve the effectiveness of predictive models, two datasets were combined. The CNN model with four convolutional layers showed the best splice site prediction performance with an AUPRC of 93.4% and 88.8% for donor and acceptor sites, respectively. The effects of variants on splicing were estimated by applying the best model on variant data from the ClinVar database. Based on the estimation, the framework could effectively differentiate pathogenic variants from the benign variants (p = 5.9 × 10−7). These promising results support that the proposed framework could be applied in future genetic studies to identify disease causing loci involving the splicing mechanism. The datasets and Python scripts used in this study are available on the GitHub repository at https://github.com/smiile8888/rna-splice-sites-recognition.
Collapse
Affiliation(s)
| | - Worrawat Engchuan
- Genetics and Genome Biology, The Hospital for Sick Children, Toronto, Ontario, Canada.,The Centre for Applied Genomics, The Hospital of Sick Children, Toronto, Ontario, Canada
| | - Jonathan H Chan
- School of Information Technology, King Mongkut's University of Technology Thonburi, Bangkok, Thailand.,IC2-DLab, School of Information Technology, King Mongkut's University of Technology Thonburi, Bangkok, Thailand
| |
Collapse
|
7
|
Albaradei S, Magana-Mora A, Thafar M, Uludag M, Bajic VB, Gojobori T, Essack M, Jankovic BR. Splice2Deep: An ensemble of deep convolutional neural networks for improved splice site prediction in genomic DNA. Gene 2020; 763S:100035. [PMID: 32550561 PMCID: PMC7285987 DOI: 10.1016/j.gene.2020.100035] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2020] [Accepted: 05/06/2020] [Indexed: 12/21/2022]
Abstract
Background The accurate identification of the exon/intron boundaries is critical for the correct annotation of genes with multiple exons. Donor and acceptor splice sites (SS) demarcate these boundaries. Therefore, deriving accurate computational models to predict the SS are useful for functional annotation of genes and genomes, and for finding alternative SS associated with different diseases. Although various models have been proposed for the in silico prediction of SS, improving their accuracy is required for reliable annotation. Moreover, models are often derived and tested using the same genome, providing no evidence of broad application, i.e. to other poorly studied genomes. Results With this in mind, we developed the Splice2Deep models for SS detection. Each model is an ensemble of deep convolutional neural networks. We evaluated the performance of the models based on the ability to detect SS in Homo sapiens, Oryza sativa japonica, Arabidopsis thaliana, Drosophila melanogaster, and Caenorhabditis elegans. Results demonstrate that the models efficiently detect SS in other organisms not considered during the training of the models. Compared to the state-of-the-art tools, Splice2Deep models achieved significantly reduced average error rates of 41.97% and 28.51% for acceptor and donor SS, respectively. Moreover, the Splice2Deep cross-organism validation demonstrates that models correctly identify conserved genomic elements enabling annotation of SS in new genomes by choosing the taxonomically closest model. Conclusions The results of our study demonstrated that Splice2Deep both achieved a considerably reduced error rate compared to other state-of-the-art models and the ability to accurately recognize SS in other organisms for which the model was not trained, enabling annotation of poorly studied or newly sequenced genomes. Splice2Deep models are implemented in Python using Keras API; the models and the data are available at https://github.com/SomayahAlbaradei/Splice_Deep.git.
Collapse
Key Words
- AUC, area under curve
- AcSS, acceptor splice site
- Acc, accuracy
- Bioinformatics
- CNN, convolutional neural network
- CONV, convolutional layers
- DL, deep learning
- DNA, deoxyribonucleic acid
- DT, decision trees
- Deep-learning
- DoSS, donor splice site
- FC, fully connected layer
- ML, machine learning
- NB, naive Bayes
- NN, neural network
- POOL, pooling layer
- Prediction
- RF, random forest
- RNA, ribonucleic acid
- ReLU, rectified linear unit layer
- SS, splice site
- SVM, support vector machine
- Sn, sensitivity
- Sp, specificity
- Splice sites
- Splicing
Collapse
Affiliation(s)
- Somayah Albaradei
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia.,Faculty of Computing and Information Technology, King Abdulaziz University, Saudi Arabia
| | - Arturo Magana-Mora
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia.,Saudi Aramco, EXPEC-ARC, Drilling Technology Team, Dhahran 31311, Saudi Arabia
| | - Maha Thafar
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia.,Faculty of Computers and Information Systems, Taif University, Saudi Arabia
| | - Mahmut Uludag
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
| | - Vladimir B Bajic
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
| | - Takashi Gojobori
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia.,Biological and Environmental Sciences and Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
| | - Magbubah Essack
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
| | - Boris R Jankovic
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center, Computer (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
| |
Collapse
|
8
|
Splice sites detection using chaos game representation and neural network. Genomics 2019; 112:1847-1852. [PMID: 31704313 DOI: 10.1016/j.ygeno.2019.10.018] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2018] [Revised: 03/18/2019] [Accepted: 10/29/2019] [Indexed: 11/23/2022]
Abstract
A novel method is proposed to detect the acceptor and donor splice sites using chaos game representation and artificial neural network. In order to achieve high accuracy, inputs to the neural network, or feature vector, shall reflect the true nature of the DNA segments. Therefore it is important to have one-to-one numerical representation, i.e. a feature vector should be able to represent the original data. Chaos game representation (CGR) is an iterative mapping technique that assigns each nucleotide in a DNA sequence to a respective position on the plane in a one-to-one manner. Using CGR, a DNA sequence can be mapped to a numerical sequence that reflects the true nature of the original sequence. In this research, we propose to use CGR as feature input to a neural network to detect splice sites on the NN269 dataset. Computational experiments indicate that this approach gives good accuracy while being simpler than other methods in the literature, with only one neural network component. The code and data for our method can be accessed from this link: https://github.com/thoang3/portfolio/tree/SpliceSites_ANN_CGR.
Collapse
|
9
|
Meher PK, Sahu TK, Gahoi S, Satpathy S, Rao AR. Evaluating the performance of sequence encoding schemes and machine learning methods for splice sites recognition. Gene 2019; 705:113-126. [PMID: 31009682 DOI: 10.1016/j.gene.2019.04.047] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2018] [Revised: 03/27/2019] [Accepted: 04/17/2019] [Indexed: 02/02/2023]
Abstract
Identification of splice sites is imperative for prediction of gene structure. Machine learning-based approaches (MLAs) have been reported to be more successful than the rule-based methods for identification of splice sites. However, the strings of alphabets should be transformed into numeric features through sequence encoding before using them as input in MLAs. In this study, we evaluated the performances of 8 different sequence encoding schemes i.e., Bayes kernel, density and sparse (DS), distribution of tri-nucleotide and 1st order Markov model (DM), frequency difference distance measure (FDDM), paired-nucleotide frequency difference between true and false sites (FDTF), 1st order Markov model (MM1), combination of both 1st and 2nd order Markov model (MM1 + MM2) and 2nd order Markov model (MM2) in respect of predicting donor and acceptor splice sites using 5 supervised learning methods (ANN, Bagging, Boosting, RF and SVM). The encoding schemes and machine learning methods were first evaluated in 4 species i.e., A. thaliana, C. elegans, D. melanogaster and H. sapiens, and then performances were validated with another four species i.e., Ciona intestinalis, Dictyostelium discoideum, Phaeodactylum tricornutum and Trypanosoma brucei. In terms of ROC (receiver-operating-characteristics) and PR (precision-recall) curves, FDTF encoding approach achieved higher accuracy followed by either MM2 or FDDM. Further, SVM was found to achieve higher accuracy (in terms of ROC and PR curves) followed by RF across encoding schemes and species. In terms of prediction accuracy across species, the SVM-FDTF combination was optimum than other combinations of classifiers and encoding schemes. Further, splice site prediction accuracies were observed higher for the species with low intron density. To our limited knowledge, this is the first attempt as far as comprehensive evaluation of sequence encoding schemes for prediction of splice sites is concerned. We have also developed an R-package EncDNA (https://cran.r-project.org/web/packages/EncDNA/index.html) for encoding of splice site motifs with different encoding schemes, which is expected to supplement the existing nucleotide sequence encoding approaches. This study is believed to be useful for the computational biologists for predicting different functional elements on the genomic DNA.
Collapse
Affiliation(s)
- Prabina Kumar Meher
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
| | - Tanmaya Kumar Sahu
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
| | - Shachi Gahoi
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
| | - Subhrajit Satpathy
- ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
| | | |
Collapse
|
10
|
Abstract
Accurate splice-site prediction is essential to delineate gene structures from sequence data. Several computational techniques have been applied to create a system to predict canonical splice sites. For classification tasks, deep neural networks (DNNs) have achieved record-breaking results and often outperformed other supervised learning techniques. In this study, a new method of splice-site prediction using DNNs was proposed. The proposed system receives an input sequence data and returns an answer as to whether it is splice site. The length of input is 140 nucleotides, with the consensus sequence (i.e., "GT" and "AG" for the donor and acceptor sites, respectively) in the middle. Each input sequence model is applied to the pretrained DNN model that determines the probability that an input is a splice site. The model consists of convolutional layers and bidirectional long short-term memory network layers. The pretraining and validation were conducted using the data set tested in previously reported methods. The performance evaluation results showed that the proposed method can outperform the previous methods. In addition, the pattern learned by the DNNs was visualized as position frequency matrices (PFMs). Some of PFMs were very similar to the consensus sequence. The trained DNN model and the brief source code for the prediction system are uploaded. Further improvement will be achieved following the further development of DNNs.
Collapse
Affiliation(s)
- Tatsuhiko Naito
- Department of Neurology, Graduate School of Medicine, The University of Tokyo , Tokyo, Japan
| |
Collapse
|
11
|
Xu ZC, Wang P, Qiu WR, Xiao X. iSS-PC: Identifying Splicing Sites via Physical-Chemical Properties Using Deep Sparse Auto-Encoder. Sci Rep 2017; 7:8222. [PMID: 28811565 PMCID: PMC5557945 DOI: 10.1038/s41598-017-08523-8] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/15/2017] [Accepted: 07/10/2017] [Indexed: 12/13/2022] Open
Abstract
Gene splicing is one of the most significant biological processes in eukaryotic gene expression, such as RNA splicing, which can cause a pre-mRNA to produce one or more mature messenger RNAs containing the coded information with multiple biological functions. Thus, identifying splicing sites in DNA/RNA sequences is significant for both the bio-medical research and the discovery of new drugs. However, it is expensive and time consuming based only on experimental technique, so new computational methods are needed. To identify the splice donor sites and splice acceptor sites accurately and quickly, a deep sparse auto-encoder model with two hidden layers, called iSS-PC, was constructed based on minimum error law, in which we incorporated twelve physical-chemical properties of the dinucleotides within DNA into PseDNC to formulate given sequence samples via a battery of cross-covariance and auto-covariance transformations. In this paper, five-fold cross-validation test results based on the same benchmark data-sets indicated that the new predictor remarkably outperformed the existing prediction methods in this field. Furthermore, it is expected that many other related problems can be also studied by this approach. To implement classification accurately and quickly, an easy-to-use web-server for identifying slicing sites has been established for free access at: http://www.jci-bioinfo.cn/iSS-PC.
Collapse
Affiliation(s)
- Zhao-Chun Xu
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen, 333403, China.
| | - Peng Wang
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen, 333403, China
| | - Wang-Ren Qiu
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen, 333403, China.
- Department of Computer Science and Bond Life Science Center, University of Missouri, Columbia, MO, USA.
| | - Xuan Xiao
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen, 333403, China.
- Gordon Life Science Institute, Boston, Massachusetts, 02478, United States of America.
| |
Collapse
|