1
Ali M, Shah D, Qazi S, Khan IA, Abrar M, Zahir S. An effective deep learning-based approach for splice site identification in gene expression. Sci Prog 2024; 107:368504241266588. [PMID: 39051530 PMCID: PMC11273556 DOI: 10.1177/00368504241266588] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/27/2024]
Abstract
A crucial stage in eukaryote gene expression involves mRNA splicing by a protein assembly known as the spliceosome. This step significantly contributes to generating and properly operating the ultimate gene product. Since non-coding introns disrupt eukaryotic genes, splicing entails the elimination of introns and joining exons to create a functional mRNA molecule. Nevertheless, accurately finding splice sequence sites using various molecular biology techniques and other biological approaches is complex and time-consuming. This paper presents a precise and reliable computer-aided diagnosis (CAD) technique for the rapid and correct identification of splice site sequences. The proposed deep learning-based framework uses long short-term memory (LSTM) to extract distinct patterns from RNA sequences, enabling rapid and accurate point mutation sequence mapping. The proposed network employs one-hot encodings to find sequential patterns that effectively identify splicing sites. A thorough ablation study of traditional machine learning, one-dimensional convolutional neural networks (1D-CNNs), and recurrent neural networks (RNNs) models was conducted. The proposed LSTM network outperformed existing state-of-the-art approaches, improving accuracy by 3% and 2% for the acceptor and donor sites datasets.
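The one-hot encoding mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the channel order `ACGU`, the function name `one_hot_encode`, and the all-zero row for unknown symbols are assumptions made for illustration only.

```python
# Illustrative one-hot encoding of an RNA sequence (not the authors' code).
NUCLEOTIDES = "ACGU"  # assumed channel order


def one_hot_encode(sequence):
    """Return one 4-element row per nucleotide.

    Symbols outside A/C/G/U (for example 'N') map to an all-zero row.
    """
    seq = sequence.upper().replace("T", "U")  # accept DNA-style input too
    rows = []
    for base in seq:
        row = [0, 0, 0, 0]
        idx = NUCLEOTIDES.find(base)
        if idx >= 0:
            row[idx] = 1
        rows.append(row)
    return rows


encoded = one_hot_encode("GUAAGN")
```

Each row could then serve as one time step of input to a recurrent model such as an LSTM; the network itself is omitted here because the abstract does not specify its layers.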
Affiliation(s)
- Mohsin Ali
- Department of Computer Science, Bacha Khan University, Charsadda, KP, Pakistan
- Dilawar Shah
- Department of Computer Science, Bacha Khan University, Charsadda, KP, Pakistan
- Shahid Qazi
- Department of Computer Science, Bacha Khan University, Charsadda, KP, Pakistan
- Izaz Ahmad Khan
- Department of Computer Science, Bacha Khan University, Charsadda, KP, Pakistan
- Mohammad Abrar
- Faculty of Computer Science, Arab Open University, Muscat, Oman, Sultanate of Oman
- Sana Zahir
- Institute of Computer Sciences and Information Technology, The University of Agriculture Peshawar, Peshawar, KP, Pakistan
2
Luo Z, Lou L, Qiu W, Xu Z, Xiao X. Predicting N6-Methyladenosine Sites in Multiple Tissues of Mammals through Ensemble Deep Learning. Int J Mol Sci 2022; 23:ijms232415490. [PMID: 36555143 PMCID: PMC9778682 DOI: 10.3390/ijms232415490] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 11/03/2022] [Revised: 12/03/2022] [Accepted: 12/05/2022] [Indexed: 12/13/2022] Open
Abstract
N6-methyladenosine (m6A) is the most abundant modification of eukaryotic messenger RNA and plays an essential regulatory role in the control of cellular functions and gene expression. However, it remains an outstanding challenge to detect mRNA m6A transcriptome-wide at base resolution via experimental approaches, which are generally time-consuming and expensive. Developing computational methods is a good strategy for accurate in silico detection of m6A modification sites from the large amount of RNA sequence data. Unfortunately, the existing computational models usually predict m6A sites in a single species only, without considering the tissue level, and most of them are built on low-confidence data generated by an m6A antibody immunoprecipitation (IP)-based sequencing method, which restricts the reliability and generalizability of the proposed models. Here, we review recent advances in computational prediction of m6A sites and construct a new computational approach named im6APred using ensemble deep learning to accurately identify m6A sites based on high-confidence level data in multiple tissues of mammals. Our model im6APred builds upon a comprehensive evaluation of multiple classification methods, including four traditional classification algorithms and three deep learning methods and their ensembles. The optimal base-classifier combinations are then chosen by five-fold cross-validation to achieve an effective stacked model. Our model im6APred achieves an area under the receiver operating characteristic curve (AUROC) in the range of 0.82-0.91 on independent tests, indicating that it can learn general methylation rules on RNA bases and generalize to transcriptome-wide m6A identification.
Moreover, AUROCs in the range of 0.77-0.96 were achieved using cross-species/tissues validation on the benchmark dataset, demonstrating differences in predictive performance at the tissue level and the need for constructing tissue-specific models for m6A site prediction.
3
Zhu L, Li W. Roles of Physicochemical and Structural Properties of RNA-Binding Proteins in Predicting the Activities of Trans-Acting Splicing Factors with Machine Learning. Int J Mol Sci 2022; 23:ijms23084426. [PMID: 35457243 PMCID: PMC9030803 DOI: 10.3390/ijms23084426] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2022] [Revised: 04/13/2022] [Accepted: 04/14/2022] [Indexed: 02/06/2023] Open
Abstract
Trans-acting splicing factors play a pivotal role in modulating alternative splicing by specifically binding to cis-elements in pre-mRNAs. There are approximately 1500 RNA-binding proteins (RBPs) in the human genome, but the activities of these RBPs in alternative splicing are unknown. Since determining RBP activities through experimental methods is expensive and time consuming, the development of an efficient computational method for predicting the activities of RBPs in alternative splicing from their sequences is of great practical importance. Recently, a machine learning model for predicting the activities of splicing factors was built based on features of single and dual amino acid compositions. Here, we explored the role of physicochemical and structural properties in predicting their activities in alternative splicing using machine learning approaches and found that the prediction performance is significantly improved by including these properties. By combining the minimum redundancy–maximum relevance (mRMR) method and forward feature searching strategy, a promising feature subset with 24 features was obtained to predict the activities of RBPs. The feature subset consists of 16 dual amino acid compositions, 5 physicochemical features, and 3 structural features. The physicochemical and structural properties were as important as the sequence composition features for an accurate prediction of the activities of splicing factors. The hydrophobicity and distribution of coil are suggested to be the key physicochemical and structural features, respectively.
Affiliation(s)
- Wenjin Li
- Correspondence: Tel.: +86-0755-26942336
4
Identification of D Modification Sites Using a Random Forest Model Based on Nucleotide Chemical Properties. Int J Mol Sci 2022; 23:ijms23063044. [PMID: 35328461 PMCID: PMC8950657 DOI: 10.3390/ijms23063044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2022] [Revised: 02/25/2022] [Accepted: 03/09/2022] [Indexed: 12/03/2022] Open
Abstract
Dihydrouridine (D) is an abundant post-transcriptional modification present in transfer RNA from eukaryotes, bacteria, and archaea. D has contributed to treatments for cancerous diseases. Therefore, the precise detection of D modification sites can enable further understanding of its functional roles. Traditional experimental techniques to identify D are laborious and time-consuming. In addition, there are few computational tools for such analysis. In this study, we utilized eleven sequence-derived feature extraction methods and implemented five popular machine learning algorithms to identify an optimal model. During data preprocessing, data were partitioned for training and testing. Oversampling was also adopted to reduce the effect of the imbalance between positive and negative samples. The best-performing model was obtained through a combination of random forest and nucleotide chemical property modeling. The optimized model achieved high sensitivity and specificity values of 0.9688 and 0.9706, respectively, in independent tests. Our proposed model surpassed published tools in independent tests. Furthermore, a series of validations across several aspects was conducted to demonstrate the robustness and reliability of our model.
5
Wang H, Wang S, Zhang Y, Bi S, Zhu X. A brief review of machine learning methods for RNA methylation sites prediction. Methods 2022; 203:399-421. [DOI: 10.1016/j.ymeth.2022.03.001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/31/2021] [Revised: 02/15/2022] [Accepted: 03/01/2022] [Indexed: 02/07/2023] Open
6
BERT-m7G: A Transformer Architecture Based on BERT and Stacking Ensemble to Identify RNA N7-Methylguanosine Sites from Sequence Information. Computational and Mathematical Methods in Medicine 2021; 2021:7764764. [PMID: 34484416 PMCID: PMC8413034 DOI: 10.1155/2021/7764764] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 07/16/2021] [Accepted: 08/13/2021] [Indexed: 01/19/2023]
Abstract
As one of the most prevalent posttranscriptional modifications of RNA, N7-methylguanosine (m7G) plays an essential role in the regulation of gene expression. Accurate identification of m7G sites in the transcriptome is invaluable for better revealing their potential functional mechanisms. Although high-throughput experimental methods can locate m7G sites precisely, they are overpriced and time-consuming. Hence, it is imperative to design an efficient computational method that can accurately identify the m7G sites. In this study, we propose a novel method via incorporating BERT-based multilingual model in bioinformatics to represent the information of RNA sequences. Firstly, we treat RNA sequences as natural sentences and then employ bidirectional encoder representations from transformers (BERT) model to transform them into fixed-length numerical matrices. Secondly, a feature selection scheme based on the elastic net method is constructed to eliminate redundant features and retain important features. Finally, the selected feature subset is input into a stacking ensemble classifier to predict m7G sites, and the hyperparameters of the classifier are tuned with tree-structured Parzen estimator (TPE) approach. By 10-fold cross-validation, the performance of BERT-m7G is measured with an ACC of 95.48% and an MCC of 0.9100. The experimental results indicate that the proposed method significantly outperforms state-of-the-art prediction methods in the identification of m7G modifications.
7
Zhang L, Qin X, Liu M, Xu Z, Liu G. DNN-m6A: A Cross-Species Method for Identifying RNA N6-Methyladenosine Sites Based on Deep Neural Network with Multi-Information Fusion. Genes (Basel) 2021; 12:354. [PMID: 33670877 PMCID: PMC7997228 DOI: 10.3390/genes12030354] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 01/22/2021] [Revised: 02/22/2021] [Accepted: 02/25/2021] [Indexed: 12/16/2022] Open
Abstract
As a prevalent post-transcriptional modification of RNA, N6-methyladenosine (m6A) plays a crucial role in various biological processes. To reveal its regulatory mechanism more thoroughly and provide new insights for drug design, the accurate genome-wide identification of m6A sites is vital. As the traditional experimental methods are time-consuming and cost-prohibitive, it is necessary to design a more efficient computational method to detect m6A sites. In this study, we propose a novel cross-species computational method, DNN-m6A, based on the deep neural network (DNN) to identify m6A sites in multiple tissues of human, mouse and rat. Firstly, binary encoding (BE), tri-nucleotide composition (TNC), enhanced nucleic acid composition (ENAC), K-spaced nucleotide pair frequencies (KSNPFs), nucleotide chemical property (NCP), pseudo dinucleotide composition (PseDNC), position-specific nucleotide propensity (PSNP) and position-specific dinucleotide propensity (PSDP) are employed to extract RNA sequence features, which are subsequently fused to construct the initial feature vector set. Secondly, we use elastic net to eliminate redundant features while building the optimal feature subset. Finally, the hyper-parameters of the DNN are tuned with Bayesian hyper-parameter optimization based on the selected feature subset. The five-fold cross-validation test on the training datasets shows that the proposed DNN-m6A method outperformed the state-of-the-art method for predicting m6A sites, with an accuracy (ACC) of 73.58%-83.38% and an area under the curve (AUC) of 81.39%-91.04%. Furthermore, on the independent datasets the method achieved an ACC of 72.95%-83.04% and an AUC of 80.79%-91.09%, which shows the excellent generalization ability of our proposed method.
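As an illustration of one of the sequence descriptors listed in this abstract, the sketch below computes a tri-nucleotide composition (TNC) vector. It is a generic version under stated assumptions, not the DNN-m6A implementation: the function name `trinucleotide_composition`, the lexicographic ordering of the 64 trinucleotides, and the choice to skip windows containing unknown symbols are all illustrative.

```python
from itertools import product

# All 64 trinucleotides in lexicographic order over A, C, G, U (assumed order).
TRINUCLEOTIDES = ["".join(p) for p in product("ACGU", repeat=3)]


def trinucleotide_composition(sequence):
    """Return a 64-element list of trinucleotide frequencies.

    Overlapping windows of length 3 are counted; windows containing
    symbols other than A/C/G/U are skipped (an assumption of this sketch).
    """
    seq = sequence.upper().replace("T", "U")
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    total = 0
    for i in range(len(seq) - 2):
        kmer = seq[i:i + 3]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    # Normalise counts to frequencies; an empty input yields all zeros.
    return [counts[k] / total if total else 0.0 for k in TRINUCLEOTIDES]


tnc = trinucleotide_composition("AAAAC")
```

In a fused representation of the kind the abstract describes, a vector like this would be concatenated with the other descriptors before feature selection.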
Affiliation(s)
- Lu Zhang
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Xinyi Qin
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Min Liu
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Ziwei Xu
- Polytech Nantes, Bâtiment Ireste, 44300 Nantes, France
- Guangzhong Liu
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
8
Moosa S, Amira PA, Boughorbel DS. DASSI: differential architecture search for splice identification from DNA sequences. BioData Min 2021; 14:15. [PMID: 33588916 PMCID: PMC7885202 DOI: 10.1186/s13040-021-00237-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/01/2020] [Accepted: 01/05/2021] [Indexed: 11/28/2022] Open
Abstract
Background: The data explosion caused by unprecedented advancements in the field of genomics is constantly challenging the conventional methods used in the interpretation of the human genome. The demand for robust algorithms over recent years has brought huge success in the field of Deep Learning (DL) in solving many difficult tasks in image, speech and natural language processing by automating the manual process of architecture design. This has been fueled by the development of new DL architectures. Yet genomics poses unique challenges that require customization and development of new DL models. Methods: We proposed a new model, DASSI, by adapting a differential architecture search method and applying it to the Splice Site (SS) recognition task on DNA sequences to discover new high-performance convolutional architectures in an automated manner. We evaluated the discovered model against state-of-the-art tools to classify true and false SS in Homo sapiens (Human), Arabidopsis thaliana (Plant), Caenorhabditis elegans (Worm) and Drosophila melanogaster (Fly). Results: Our experimental evaluation demonstrated that the discovered architecture outperformed baseline models and fixed architectures and showed competitive results against state-of-the-art models used in classification of splice sites. The proposed model, DASSI, has a compact architecture and showed very good results on a transfer learning task. The benchmarking experiments of execution time and precision on the architecture search and evaluation process showed better performance on recently available GPUs, making it feasible to adopt architecture search-based methods on large datasets. Conclusions: We proposed the use of a differential architecture search method (DASSI) to perform SS classification on raw DNA sequences, and discovered new neural network models with a low number of tunable parameters and competitive performance compared with manually engineered architectures.
We have extensively benchmarked DASSI model with other state-of-the-art models and assessed its computational efficiency. The results have shown a high potential of using automated architecture search mechanism for solving various problems in the field of genomics.
Affiliation(s)
- Shabir Moosa
- Department of Systems Biology, SIDRA Medicine, Doha, 26999, Qatar; Dept. of Computer Science and Engineering, Qatar University, Doha, 2713, Qatar
- Prof Abbes Amira
- Dept. of Computer Science and Engineering, Qatar University, Doha, 2713, Qatar
9
Ahmed S, Kabir M, Arif M, Khan ZU, Yu DJ. DeepPPSite: A deep learning-based model for analysis and prediction of phosphorylation sites using efficient sequence information. Anal Biochem 2020; 612:113955. [PMID: 32949607 DOI: 10.1016/j.ab.2020.113955] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 02/20/2020] [Revised: 08/30/2020] [Accepted: 09/11/2020] [Indexed: 12/29/2022]
Abstract
Phosphorylation is a ubiquitous type of post-translational modification (PTM) that occurs in both eukaryotic and prokaryotic cells, wherein a phosphate group binds to amino acid residues. These specific residues, i.e., serine (S), threonine (T), and tyrosine (Y), exhibit diverse functions at the molecular level. Recent studies have determined that some diseases such as cancer, diabetes, and neurodegenerative diseases are caused by abnormal phosphorylation. Based on its potential applications in biological research and drug development, the large-scale identification of phosphorylation sites has attracted interest. Existing wet-lab technologies for targeting phosphorylation sites are expensive and time-consuming. Thus, computational algorithms that can efficiently accelerate the annotation of phosphorylation sites from massive protein sequences are needed. Numerous machine learning-based methods have been implemented for phosphorylation site prediction. However, despite extensive efforts, existing computational approaches continue to have inadequate performance, particularly in terms of overall ACC, MCC, and AUC. In this paper, we report a novel deep learning-based predictor to overcome these performance hurdles, DeepPPSite, which was constructed using a stacked long short-term memory recurrent network for predicting phosphorylation sites. The proposed technique learns the protein representations from conjoint protein descriptors. The experimental results indicated that our model achieved superior performance on the training dataset for S, T and Y, with MCC values of 0.608, 0.602, and 0.558, respectively, using a 10-fold cross-validation test. We further determined the generalization efficacy of the proposed predictor DeepPPSite by conducting a rigorous independent test. The predictive MCC values were 0.358, 0.356, and 0.350 for the S, T, and Y phosphorylation sites, respectively.
Rigorous cross-validation and independent validation tests for the three types of phosphorylation sites demonstrated that the designed DeepPPSite tool significantly outperforms state-of-the-art methods.
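For readers unfamiliar with the MCC values reported in this abstract, the sketch below shows how the Matthews correlation coefficient is computed from a binary confusion matrix. It uses the standard textbook definition; the function names and the convention of returning 0.0 when the denominator is zero are choices of this sketch, and the example counts are made up rather than taken from the paper.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero (a common convention,
    adopted here as an assumption).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts, for illustration only.
mcc = matthews_corrcoef(tp=90, tn=80, fp=20, fn=10)
acc = accuracy(tp=90, tn=80, fp=20, fn=10)
```

MCC ranges from -1 to 1 and, unlike accuracy, stays informative when positive and negative classes are imbalanced, which is why site-prediction studies often report both.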
Affiliation(s)
- Saeed Ahmed
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China.
- Muhammad Kabir
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China.
- Muhammad Arif
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China.
- Zaheer Ullah Khan
- School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China.
- Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China.
10
Xu ZC, Feng PM, Yang H, Qiu WR, Chen W, Lin H. iRNAD: a computational tool for identifying D modification sites in RNA sequence. Bioinformatics 2020; 35:4922-4929. [PMID: 31077296 DOI: 10.1093/bioinformatics/btz358] [Citation(s) in RCA: 71] [Impact Index Per Article: 17.8] [Received: 11/28/2018] [Revised: 03/01/2019] [Accepted: 04/27/2019] [Indexed: 12/19/2022] Open
Abstract
MOTIVATION: Dihydrouridine (D) is a common RNA post-transcriptional modification found in eukaryotes, bacteria and a few archaea. The modification can promote the conformational flexibility of individual nucleotide bases, and its levels are increased in cancerous tissues. Therefore, it is necessary to detect D in RNA to further understand its functional roles. Since wet-experimental techniques for this purpose are time-consuming and laborious, it is urgent to develop computational models to identify D modification sites in RNA. RESULTS: We constructed a predictor, called iRNAD, for identifying D modification sites in RNA sequence. In this predictor, the RNA samples derived from five species were encoded by nucleotide chemical property and nucleotide density. A support vector machine was utilized to perform the classification. The final model produced an overall accuracy of 96.18% with an area under the receiver operating characteristic curve of 0.9839 in the jackknife cross-validation test. Furthermore, we performed a series of validations from several aspects and demonstrated the robustness and reliability of the proposed model. AVAILABILITY AND IMPLEMENTATION: A user-friendly web-server called iRNAD can be freely accessed at http://lin-group.cn/server/iRNAD, which will provide convenience and guidance to users for further studying D modification.
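The combination of nucleotide chemical property (NCP) and nucleotide density encoding named in this abstract can be sketched as follows. This is not the iRNAD code. The three-bit table below is the widely used NCP scheme (ring structure, functional group, hydrogen bonding) and is assumed here to match the paper; density at position i is taken as the number of occurrences of that nucleotide up to and including position i, divided by i. The function name `encode_ncp_nd` and the all-zero code for unknown symbols are illustrative.

```python
# Assumed NCP table: (purine ring, amino group, weak hydrogen bonding).
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "U": (0, 0, 1),
}


def encode_ncp_nd(sequence):
    """Encode each position as a 4-tuple: three NCP bits plus density.

    Symbols outside A/C/G/U get (0, 0, 0) for the NCP part
    (an assumption of this sketch).
    """
    seq = sequence.upper().replace("T", "U")
    seen = {}
    features = []
    for i, base in enumerate(seq, start=1):
        seen[base] = seen.get(base, 0) + 1
        density = seen[base] / i  # cumulative frequency of this base so far
        features.append(NCP.get(base, (0, 0, 0)) + (density,))
    return features


features = encode_ncp_nd("AUAC")
```

Flattening the per-position tuples gives a fixed-length vector for fixed-length sequence windows, which is the kind of input a support vector machine expects.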
Affiliation(s)
- Zhao-Chun Xu
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China; Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Peng-Mian Feng
- Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu, China
- Hui Yang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Wang-Ren Qiu
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China
- Wei Chen
- Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu, China
- Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
11
|
|
12
Sequence and Evolutionary Features for the Alternatively Spliced Exons of Eukaryotic Genes. Int J Mol Sci 2019; 20:ijms20153834. [PMID: 31390737 PMCID: PMC6695735 DOI: 10.3390/ijms20153834] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 07/16/2019] [Revised: 07/25/2019] [Accepted: 07/31/2019] [Indexed: 12/22/2022] Open
Abstract
Alternative splicing of pre-mRNAs is a crucial mechanism for maintaining protein diversity in eukaryotes without requiring a considerable increase of genes in the number. Due to rapid advances in high-throughput sequencing technologies and computational algorithms, it is anticipated that alternative splicing events will be more intensively studied to address different kinds of biological questions. The occurrences of alternative splicing mean that all exons could be classified to be either constitutively or alternatively spliced depending on whether they are virtually included into all mature mRNAs. From an evolutionary point of view, therefore, the alternatively spliced exons would have been associated with distinctive biological characteristics in comparison with constitutively spliced exons. In this paper, we first outline the representative types of alternative splicing events and exon classification, and then review sequence and evolutionary features for the alternatively spliced exons. The main purpose is to facilitate understanding of the biological implications of alternative splicing in eukaryotes. This knowledge is also helpful to establish computational approaches for predicting the splicing pattern of exons.
13
Khan S, Khan M, Iqbal N, Hussain T, Khan SA, Chou KC. A Two-Level Computation Model Based on Deep Learning Algorithm for Identification of piRNA and Their Functions via Chou’s 5-Steps Rule. Int J Pept Res Ther 2019. [DOI: 10.1007/s10989-019-09887-3] [Citation(s) in RCA: 45] [Impact Index Per Article: 9.0] [Indexed: 12/15/2022]
14
Albalawi F, Chahid A, Guo X, Albaradei S, Magana-Mora A, Jankovic BR, Uludag M, Van Neste C, Essack M, Laleg-Kirati TM, Bajic VB. Hybrid model for efficient prediction of poly(A) signals in human genomic DNA. Methods 2019; 166:31-39. [PMID: 30991099 DOI: 10.1016/j.ymeth.2019.04.001] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 11/15/2018] [Revised: 03/12/2019] [Accepted: 04/01/2019] [Indexed: 12/15/2022] Open
Abstract
Polyadenylation signals (PAS) are found in most protein-coding and some non-coding genes in eukaryotes. Their accurate recognition improves understanding gene regulation mechanisms and recognition of the 3'-end of transcribed gene regions where premature or alternate transcription ends may lead to various diseases. Although different methods and tools for in-silico prediction of genomic signals have been proposed, the correct identification of PAS in genomic DNA remains challenging due to a vast number of non-relevant hexamers identical to PAS hexamers. In this study, we developed a novel method for PAS recognition. The method is implemented in a hybrid PAS recognition model (HybPAS), which is based on deep neural networks (DNNs) and logistic regression models (LRMs). One of such models is developed for each of the 12 most frequent human PAS hexamers. DNN models appeared the best for eight PAS types (including the two most frequent PAS hexamers), while LRM appeared best for the remaining four PAS types. The new models use different combinations of signal processing-based, statistical, and sequence-based features as input. The results obtained on human genomic data show that HybPAS outperforms the well-tuned state-of-the-art Omni-PolyA models, reducing the classification error for different PAS hexamers by up to 57.35% for 10 out of 12 PAS types, with Omni-PolyA models being better for two PAS types. For the most frequent PAS types, 'AATAAA' and 'ATTAAA', HybPAS reduced the error rate by 35.14% and 34.48%, respectively. On average, HybPAS reduces the error by 30.29%. HybPAS is implemented partly in Python and in MATLAB available at https://github.com/EMANG-KAUST/PolyA_Prediction_LRM_DNN.
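Two ideas in this abstract lend themselves to a short illustration: locating candidate PAS hexamers in genomic DNA (most of which are false positives, which is why a classifier is needed), and expressing an improvement as a relative reduction in classification error. The sketch below is not part of HybPAS; the function names are hypothetical, only the two hexamers named in the abstract are included, and the error rates in the example are invented.

```python
# The two most frequent PAS hexamers named in the abstract.
PAS_HEXAMERS = ("AATAAA", "ATTAAA")


def find_pas_candidates(dna, hexamers=PAS_HEXAMERS):
    """Return (position, hexamer) pairs for every exact hexamer match.

    Positions are 0-based. Every hit is only a candidate: a downstream
    model would have to decide whether it is a true PAS.
    """
    dna = dna.upper()
    hits = []
    for i in range(len(dna) - 5):
        window = dna[i:i + 6]
        if window in hexamers:
            hits.append((i, window))
    return hits


def relative_error_reduction(baseline_error, new_error):
    """Percentage by which new_error improves on baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error


hits = find_pas_candidates("GGAATAAACCATTAAAT")
# Invented error rates, for illustration only.
reduction = relative_error_reduction(0.20, 0.13)
```

Under this definition, a model that lowers the error rate from 20% to 13% achieves a 35% relative reduction, which is how percentage figures of the kind quoted in the abstract are usually read.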
Affiliation(s)
- Fahad Albalawi
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia; Taif University, Electrical Engineering, Taif 21944, Saudi Arabia
- Abderrazak Chahid
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
- Xingang Guo
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
- Somayah Albaradei
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
- Arturo Magana-Mora
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia; Saudi Aramco, EXPEC-ARC, Drilling Technology Team, Dhahran 31311, Saudi Arabia
- Boris R Jankovic
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
- Mahmut Uludag
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
- Christophe Van Neste
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia; Ghent University, Center for Medical Genetics Ghent (CMGG), B-9000 Ghent, Belgium
- Magbubah Essack
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia
- Taous-Meriem Laleg-Kirati
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia.
- Vladimir B Bajic
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal 23955-6900, Saudi Arabia.
15
Xu ZC, Xiao X, Qiu WR, Wang P, Fang XZ. iAI-DSAE: A Computational Method for Adenosine to Inosine Editing Site Prediction. Lett Org Chem 2019. [DOI: 10.2174/1570178615666181016112546] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
As an important post-transcriptional modification, adenosine-to-inosine RNA editing generally occurs in both coding and noncoding RNA transcripts, in which adenosines are converted to inosines. Accordingly, this modification can diversify the transcriptome. It is important to accurately identify adenosine-to-inosine editing sites in order to further understand their biological functions. Currently, adenosine-to-inosine editing sites are determined by experimental methods, which unfortunately can be costly and time-consuming. Furthermore, there are only a few existing computational prediction models in this field. Therefore, this study sets out to develop further computational methods to address these problems. Given an uncharacterized RNA sequence that contains many adenosine residues, can we identify which of them can be converted to inosine and which cannot? To deal with this problem, a novel predictor called iAI-DSAE is proposed in the current study. There are two key issues to address: one is ‘what feature extraction methods should be adopted to formulate the given sample sequence?’ The other is ‘what classification algorithms should be used to construct the classification model?’ For the former, a 540-dimensional feature vector is extracted to formulate the sample sequence by dinucleotide-based auto-cross covariance, pseudo dinucleotide composition, and nucleotide density methods. For the latter, we use a currently popular method, the deep sparse autoencoder, to construct the classification model. Generally, ACC and MCC are considered two of the most important performance indicators of a predictor. In this study, in comparison with those of the predictor PAI, they are higher by 2.46% and 4.14%, respectively. The two other indicators, Sn and Sp, also rise to some degree. This indicates that our predictor can serve as an important complementary tool to identify adenosine-to-inosine RNA editing sites.
For the convenience of most experimental scientists, an easy-to-use web-server for identifying adenosine-to-inosine editing sites has been established at: http://www.jci-bioinfo.cn/iAI-DSAE, by which users can easily obtain their desired results without the need to go through the complicated mathematical equations involved. Identifying adenosine-to-inosine editing sites in RNA sequences is important for the intensive study of RNA function and the development of new medicines. In the current study, a novel predictor, called iAI-DSAE, was proposed using three feature extraction methods: dinucleotide-based auto-cross covariance, pseudo dinucleotide composition, and nucleotide density. The jackknife test results of the iAI-DSAE predictor, based on a deep sparse auto-encoder model, show that our predictor is more stable and reliable. It has not escaped our notice that the methods proposed in the current paper can be used to solve many other problems in genome analysis.
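To make one of the three feature families named in the abstract concrete, the sketch below computes a nucleotide density profile. It assumes a commonly used definition (the density at position i is the number of occurrences of that position's nucleotide in the prefix ending at i, divided by i); the abstract does not give the exact formulation used in iAI-DSAE, and the function name `nucleotide_density` is hypothetical.

```python
def nucleotide_density(seq):
    """Return one density value per position of an RNA sequence.

    Assumed definition (not confirmed by the abstract): for position i
    (1-based), density = count of seq[i] within seq[1..i], divided by i.
    """
    counts = {}      # running count of each nucleotide seen so far
    densities = []
    for i, base in enumerate(seq, start=1):
        counts[base] = counts.get(base, 0) + 1
        densities.append(counts[base] / i)
    return densities


if __name__ == "__main__":
    # Toy sequence chosen only for illustration.
    print(nucleotide_density("AAGA"))
```

For a fixed-length sample window, this yields one value per position; such values would then be concatenated with the other feature families to form the full feature vector.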
Affiliation(s)
- Zhao-Chun Xu
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China
| | - Xuan Xiao
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China
| | - Wang-Ren Qiu
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China
| | - Peng Wang
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China
| | - Xin-Zhu Fang
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China
| |
|
16
|
Xiao X, Wang P, Xu Z, Qiu W, Fang X. PAI-SAE: Predicting Adenosine To Inosine Editing Sites Based On Hybrid Features By Using Spare Auto-Encoder. IOP Conf Ser Earth Environ Sci 2018; 170:052018. [DOI: 10.1088/1755-1315/170/5/052018] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/12/2022]
|
17
|
Du PF, Zhao W, Miao YY, Wei LY, Wang L. UltraPse: A Universal and Extensible Software Platform for Representing Biological Sequences. Int J Mol Sci 2017; 18:2400. [PMID: 29135934 PMCID: PMC5713368 DOI: 10.3390/ijms18112400] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 10/10/2017] [Revised: 11/01/2017] [Accepted: 11/03/2017] [Indexed: 01/12/2023]
Abstract
With the avalanche of biological sequences in public databases, one of the most challenging problems in computational biology is to predict their biological functions and cellular attributes. Most of the existing prediction algorithms can only handle fixed-length numerical vectors. Therefore, it is important to be able to represent biological sequences of various lengths using fixed-length numerical vectors. Although several algorithms, as well as software implementations, have been developed to address this problem, these existing programs can only provide a fixed number of representation modes. Every time a new sequence representation mode is developed, a new program is needed. In this paper, we propose UltraPse as a universal software platform for this problem. The function of UltraPse is not only to generate various existing sequence representation modes, but also to simplify all future programming work in developing novel representation modes. The extensibility of UltraPse is particularly enhanced. It allows users to define their own representation modes, their own physicochemical properties, or even their own types of biological sequences. Moreover, UltraPse is also the fastest software of its kind. The source code package, as well as the executables for both Linux and Windows platforms, can be downloaded from the GitHub repository.
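The core idea in the abstract above, mapping sequences of different lengths to vectors of one fixed length, can be illustrated with a plain k-mer composition. This is a minimal sketch of one simple representation mode, not UltraPse's implementation or interface; the function name `kmer_composition` and its parameters are hypothetical.

```python
from itertools import product


def kmer_composition(seq, k=2, alphabet="ACGU"):
    """Map a sequence of any length to a vector of len(alphabet)**k frequencies.

    Illustrative only: k-mers containing symbols outside `alphabet`
    are skipped rather than raising an error.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    total = 0
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
            total += 1
    # Normalise to frequencies; an all-zero vector if no valid k-mer exists.
    return [c / total for c in counts] if total else [0.0] * len(kmers)


if __name__ == "__main__":
    short = kmer_composition("ACGU")
    longer = kmer_composition("ACGUACGUAAGGCU")
    # Both vectors have the same length despite different input lengths.
    print(len(short), len(longer))
```

Because the output length depends only on the alphabet size and k, vectors from sequences of any length can be fed to the same fixed-input classifier.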
Affiliation(s)
- Pu-Feng Du
- School of Computer Science and Technology, Tianjin University, Tianjin 300350, China.
| | - Wei Zhao
- School of Computer Science and Technology, Tianjin University, Tianjin 300350, China.
| | - Yang-Yang Miao
- School of Computer Science and Technology, Tianjin University, Tianjin 300350, China.
- School of Chemical Engineering, Tianjin University, Tianjin 300350, China.
| | - Le-Yi Wei
- School of Computer Science and Technology, Tianjin University, Tianjin 300350, China.
| | - Likun Wang
- Institute of Systems Biomedicine, Beijing Key Laboratory of Tumor Systems Biology, Department of Pathology, School of Basic Medical Sciences, Peking University Health Science Center, Beijing 100191, China.
| |
|