51
Wang R, Wang Z, Wang J, Li S. SpliceFinder: ab initio prediction of splice sites using convolutional neural network. BMC Bioinformatics 2019; 20:652. [PMID: 31881982 PMCID: PMC6933889 DOI: 10.1186/s12859-019-3306-3] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Indexed: 12/11/2022] Open
Abstract
Background Identifying splice sites is a necessary step in analyzing the location and structure of genes. Two dinucleotides, GT and AG, are highly frequent at splice sites, and many other patterns with important biological functions also occur at splice sites. However, these dinucleotides also occur frequently in sequences without splice sites, which makes prediction prone to false positives. Most existing tools select all sequences containing the two dimers and then focus on distinguishing true splice sites from pseudo ones. Such an approach reduces false positives but misses non-canonical splice sites. Results We designed SpliceFinder, a tool based on a convolutional neural network (CNN), to predict splice sites. To achieve ab initio prediction, we used human genomic data to train the network. An iterative approach is adopted to reconstruct the dataset, which tackles the data imbalance problem and forces the model to learn more features of splice sites. The proposed CNN achieves a classification accuracy of 90.25%, which is 10% higher than existing algorithms. The method outperforms other existing methods in terms of area under the receiver operating characteristic curve (AUC), recall, precision, and F1 score. Furthermore, SpliceFinder can find the exact position of splice sites on long genomic sequences with a sliding window. Compared with other state-of-the-art splice site prediction tools, SpliceFinder produces about half as many false positives while keeping recall higher than 0.8. SpliceFinder also captures non-canonical splice sites. In addition, SpliceFinder performs well on the genomic sequences of Drosophila melanogaster, Mus musculus, Rattus, and Danio rerio without retraining. Conclusion Based on a CNN, we have proposed a new ab initio splice site prediction tool, SpliceFinder, which generates fewer false positives and can detect non-canonical splice sites. Additionally, SpliceFinder is transferable to other species without retraining. The source code and additional materials are available at https://gitlab.deepomics.org/wangruohan/SpliceFinder.
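The sliding-window scan mentioned in this abstract can be illustrated with a minimal Python sketch. The window width, the helper names (`one_hot`, `sliding_windows`, `scan`), and the placeholder `score_window` function are assumptions made for illustration; none of them come from the SpliceFinder source code. In particular, the placeholder scorer only checks for a GT dinucleotide at the window centre, where a real pipeline would call a trained CNN on the one-hot encoded window.

```python
from typing import Callable, Iterator, List, Tuple

BASES = "ACGT"


def one_hot(seq: str) -> List[List[int]]:
    """Encode a DNA string as one 4-element row per base (unknown bases -> all zeros)."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def sliding_windows(seq: str, width: int) -> Iterator[Tuple[int, str]]:
    """Yield (centre_position, window) for every full-length window of the sequence."""
    half = width // 2
    for start in range(len(seq) - width + 1):
        yield start + half, seq[start:start + width]


def score_window(window: str) -> float:
    """Hypothetical stand-in for a trained classifier: 1.0 if GT starts at the centre."""
    centre = len(window) // 2
    return 1.0 if window[centre:centre + 2] == "GT" else 0.0


def scan(seq: str, width: int,
         scorer: Callable[[str], float] = score_window,
         threshold: float = 0.5) -> List[int]:
    """Return the centre positions of all windows whose score reaches the threshold."""
    return [pos for pos, win in sliding_windows(seq, width) if scorer(win) >= threshold]


if __name__ == "__main__":
    print(scan("AACAGGTAAGTC", width=5))
    print(one_hot("AC"))
```

Swapping `score_window` for a model's prediction function would leave the scanning loop unchanged, which is the main reason the scorer is passed in as a parameter.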
Affiliation(s)
- Ruohan Wang
  - Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong, China
- Zishuai Wang
  - Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong, China
- Jianping Wang
  - Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong, China
- Shuaicheng Li
  - Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong, China
52
Wang L, Zhang J. Prediction of sgRNA on-target activity in bacteria by deep learning. BMC Bioinformatics 2019; 20:517. [PMID: 31651233 PMCID: PMC6814057 DOI: 10.1186/s12859-019-3151-4] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 05/26/2019] [Accepted: 10/04/2019] [Indexed: 12/26/2022] Open
Abstract
BACKGROUND One of the main challenges for the CRISPR-Cas9 system is selecting optimal single-guide RNAs (sgRNAs). Recently, deep learning has enhanced sgRNA prediction in eukaryotes. However, prokaryotic chromatin structure differs from that of eukaryotes, so models trained on eukaryotes may not apply to prokaryotes. RESULTS We designed and implemented a convolutional neural network to predict sgRNA activity in Escherichia coli. The network was trained and tested on the recently released sgRNA activity dataset. Our convolutional neural network achieved excellent performance, yielding average Spearman correlation coefficients of 0.5817, 0.7105, and 0.3602 for Cas9, eSpCas9, and Cas9 with a recA coding region deletion, respectively. We confirmed that the sgRNA prediction models trained on prokaryotes do not apply to eukaryotes and vice versa. We adopted perturbation-based approaches to analyze distinct biological patterns between prokaryotic and eukaryotic editing. Then, we improved the predictive performance of the prokaryotic Cas9 system by transfer learning. Finally, we determined that potential off-target scores accumulated on a genome-wide scale affect on-target activity, which could slightly improve on-target predictive performance. CONCLUSIONS We developed convolutional neural networks to predict sgRNA activity for wild type and mutant Cas9 in prokaryotes. Our results show that the prediction accuracy of our method is improved over state-of-the-art models.
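The Spearman correlation coefficients reported in this abstract measure how well the ranking of predicted sgRNA activities agrees with the ranking of measured activities. As a minimal sketch, assuming predictions and measurements are available as two equal-length lists, the coefficient can be computed as the Pearson correlation of the ranks. The function names below are illustrative and are not taken from the authors' code.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values all receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation computed on average ranks."""
    if len(predicted) != len(measured):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(predicted), average_ranks(measured))


if __name__ == "__main__":
    # Made-up activity values, used only to exercise the function.
    print(spearman([0.1, 0.4, 0.35, 0.8], [1.0, 2.0, 3.0, 4.0]))
```

Because only ranks enter the calculation, the score is unaffected by any monotonic rescaling of the predicted activities.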
Affiliation(s)
- Lei Wang
  - School of Life Science, Beijing Institute of Technology, South Zhongguancun Street, Beijing, 100081 China
- Juhua Zhang
  - School of Life Science, Beijing Institute of Technology, South Zhongguancun Street, Beijing, 100081 China
  - Key Laboratory of Convergence Medical Engineering System and Healthcare Technology, The Ministry of Industry and Information Technology, Beijing Institute of Technology, Beijing, China
53
Li F, Chen J, Leier A, Marquez-Lago T, Liu Q, Wang Y, Revote J, Smith AI, Akutsu T, Webb GI, Kurgan L, Song J. DeepCleave: a deep learning predictor for caspase and matrix metalloprotease substrates and cleavage sites. Bioinformatics 2019; 36:1057-1065. [PMID: 31566664 PMCID: PMC8215920 DOI: 10.1093/bioinformatics/btz721] [Citation(s) in RCA: 78] [Impact Index Per Article: 13.0] [Received: 06/07/2019] [Revised: 08/13/2019] [Accepted: 09/25/2019] [Indexed: 01/31/2023] Open
Abstract
MOTIVATION Proteases are enzymes that cleave target substrate proteins by catalyzing the hydrolysis of peptide bonds between specific amino acids. While the functional proteolysis regulated by proteases plays a central role in 'life and death' cellular processes, many of the corresponding substrates and their cleavage sites have not yet been identified. The availability of accurate predictors of substrates and cleavage sites would facilitate understanding of proteases' functions and physiological roles. Deep learning is a promising approach for the development of accurate predictors of substrate cleavage events. RESULTS We propose DeepCleave, the first deep learning-based predictor of protease-specific substrates and cleavage sites. DeepCleave uses protein substrate sequence data as input and employs convolutional neural networks with transfer learning to train accurate predictive models. The high predictive performance of our models stems from the use of high-quality cleavage site features extracted from the substrate sequences through the deep learning process, and from the application of transfer learning, multiple kernels, and an attention layer in the design of the deep network. Empirical tests against several related state-of-the-art methods demonstrate that DeepCleave outperforms these methods in predicting caspase and matrix metalloprotease substrate-cleavage sites. AVAILABILITY AND IMPLEMENTATION The DeepCleave webserver and source code are freely available at http://deepcleave.erc.monash.edu/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- André Leier
  - Department of Genetics, USA; Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
- Tatiana Marquez-Lago
  - Department of Genetics, USA; Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
- Quanzhong Liu
  - College of Information Engineering, Northwest A&F University, Yangling 712100, China
- Yanze Wang
  - College of Information Engineering, Northwest A&F University, Yangling 712100, China
- Jerico Revote
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- A Ian Smith
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Tatsuya Akutsu
  - Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0011, Japan
- Geoffrey I Webb
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
54
Akter S, Xu D, Nagel SC, Bromfield JJ, Pelch K, Wilshire GB, Joshi T. Machine Learning Classifiers for Endometriosis Using Transcriptomics and Methylomics Data. Front Genet 2019; 10:766. [PMID: 31552087 PMCID: PMC6737999 DOI: 10.3389/fgene.2019.00766] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Received: 04/18/2019] [Accepted: 07/19/2019] [Indexed: 12/29/2022] Open
Abstract
Endometriosis is a common, complex, and poorly understood gynecological disorder that affects about 176 million women worldwide, significantly impairing their quality of life and imposing an economic burden. Neither a definitive clinical symptom nor a minimally invasive diagnostic method is available, leading to an average diagnostic latency of 4 to 11 years. The discovery of relevant biological patterns from microarray expression or next generation sequencing (NGS) data has been advanced over the last several decades by applying various machine learning tools. We performed machine learning analysis using 38 RNA-seq and 80 enrichment-based DNA methylation (MBD-seq) datasets. We evaluated how well various supervised machine learning methods, such as decision tree, partial least squares discriminant analysis (PLSDA), support vector machine, and random forest, classify endometriosis samples against control samples when trained on both transcriptomics and methylomics data. The assessment was done from two different perspectives for improving classification performance: a) the effect of three different normalization techniques and b) the effect of differential analysis using the generalized linear model (GLM). Several candidate biomarker genes were identified by multiple machine learning experiments, including NOTCH3, SNAPC2, B4GALNT1, SMAP2, DDB2, GTF3C5, and PTOV1 from the transcriptomics data analysis and TRPM6, RASSF2, TNIP2, RP3-522J7.6, FGD3, and MFSD14B from the methylomics data analysis. We concluded that an appropriate machine learning diagnostic pipeline for endometriosis should use TMM normalization for transcriptomics data, quantile or voom normalization for methylomics data, and GLM for feature space reduction and classification performance maximization.
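Of the normalization techniques named in this abstract, quantile normalization is the simplest to show in a few lines. The sketch below is a generic textbook version, assuming a small matrix with features as rows and samples as columns; it is not the authors' pipeline, the function name is illustrative, and ties are broken by row order rather than averaged, which is a simplification.

```python
from typing import List


def quantile_normalize(matrix: List[List[float]]) -> List[List[float]]:
    """Force every column (sample) to share the same distribution of values.

    Each value is replaced by the mean, across samples, of the values that
    hold the same rank within their own sample.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[c] for row in matrix] for c in range(n_cols)]

    # Mean of the r-th smallest value over all samples.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(col[r] for col in sorted_cols) / n_cols for r in range(n_rows)]

    result = [[0.0] * n_cols for _ in range(n_rows)]
    for c, col in enumerate(columns):
        # Row indices ordered by their value in this sample (stable for ties).
        order = sorted(range(n_rows), key=lambda r: col[r])
        for rank, r in enumerate(order):
            result[r][c] = rank_means[rank]
    return result


if __name__ == "__main__":
    # Made-up values: 3 features (rows) measured in 2 samples (columns).
    print(quantile_normalize([[2.0, 4.0], [1.0, 6.0], [3.0, 5.0]]))
```

After normalization every sample contains exactly the same set of values, so differences between samples reflect only the ordering of features within each sample.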
Affiliation(s)
- Sadia Akter
  - Informatics Institute, University of Missouri, Columbia, MO, United States
- Dong Xu
  - Informatics Institute, University of Missouri, Columbia, MO, United States
  - Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States
  - Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Susan C. Nagel
  - OB/GYN and Women’s Health, University of Missouri School of Medicine, Columbia, MO, United States
- John J. Bromfield
  - OB/GYN and Women’s Health, University of Missouri School of Medicine, Columbia, MO, United States
- Katherine Pelch
  - OB/GYN and Women’s Health, University of Missouri School of Medicine, Columbia, MO, United States
- Trupti Joshi
  - Informatics Institute, University of Missouri, Columbia, MO, United States
  - Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
  - Health Management and Informatics, University of Missouri, Columbia, MO, United States
55
Liu ZX, Yu K, Dong J, Zhao L, Liu Z, Zhang Q, Li S, Du Y, Cheng H. Precise Prediction of Calpain Cleavage Sites and Their Aberrance Caused by Mutations in Cancer. Front Genet 2019; 10:715. [PMID: 31440276 PMCID: PMC6694742 DOI: 10.3389/fgene.2019.00715] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 05/05/2019] [Accepted: 07/05/2019] [Indexed: 02/05/2023] Open
Abstract
As a widespread post-translational modification of proteins, calpain-mediated cleavage regulates a broad range of cellular processes, including proliferation, differentiation, cytoskeletal reorganization, and apoptosis. The identification of proteins that undergo calpain cleavage in a site-specific manner is the necessary foundation for understanding the exact molecular mechanisms and regulatory roles of calpain-mediated cleavage. In contrast with time-consuming and labor-intensive experimental methods, computational approaches for detecting calpain cleavage sites have attracted wide attention due to their efficiency and convenience. In this study, we established a novel computational tool named DeepCalpain (http://deepcalpain.cancerbio.info/) for predicting potential calpain cleavage sites by adopting a deep neural network and the particle swarm optimization algorithm. Through critical evaluation and comparison, DeepCalpain exhibited superior performance compared with other existing tools. Meanwhile, we found that protein interactions could enrich the calpain-substrate regulatory relationship. Since calpain-mediated cleavage is critical for cancer development and progression, we comprehensively analyzed the calpain cleavage associated mutations across 11 cancers with the help of DeepCalpain, which demonstrated that calpain-mediated cleavage events were affected by mutations and heavily implicated in the regulation of cancer cells. These prediction and analysis results might provide helpful information to reveal the regulatory mechanism of calpain cleavage in biological pathways and different cancer types, which might open new avenues for the diagnosis and treatment of cancers.
Affiliation(s)
- Ze-Xian Liu
  - School of Life Sciences, Zhengzhou University, Zhengzhou, China; State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou, China
- Kai Yu
  - School of Life Sciences, Zhengzhou University, Zhengzhou, China; State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou, China
- Jingsi Dong
  - Lung Cancer Center, West China Hospital, Sichuan University, Chengdu, China
- Linhong Zhao
  - Institute of Life Sciences, Southeast University, Nanjing, China
- Zekun Liu
  - State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou, China
- Qingfeng Zhang
  - State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou, China
- Shihua Li
  - School of Life Sciences, Zhengzhou University, Zhengzhou, China
- Yimeng Du
  - School of Life Sciences, Zhengzhou University, Zhengzhou, China
- Han Cheng
  - School of Life Sciences, Zhengzhou University, Zhengzhou, China
56
Fang C, Shang Y, Xu D. A deep dense inception network for protein beta-turn prediction. Proteins 2019; 88:143-151. [PMID: 31294886 DOI: 10.1002/prot.25780] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 09/12/2018] [Revised: 06/17/2019] [Accepted: 07/06/2019] [Indexed: 12/13/2022]
Abstract
Beta-turn prediction is useful in protein function studies and experimental design. Although recent approaches using machine-learning techniques such as support vector machine (SVM), neural networks, and k-nearest neighbor have achieved good results for beta-turn prediction, there is still significant room for improvement. Because previous predictors used features in a sliding window of 4-20 residues to capture interactions among sequentially neighboring residues, such feature engineering may result in incomplete or biased features and neglect interactions among long-range residues. Deep neural networks provide a new opportunity to address these issues. Here, we propose a deep dense inception network (DeepDIN) for beta-turn prediction, which takes advantage of the state-of-the-art deep neural network design of dense networks and inception networks. A test on the recent BT6376 benchmark data set shows that DeepDIN significantly outperformed the previous best tool, BetaTPred3, in both overall prediction accuracy and nine-type beta-turn classification accuracy. A tool called MUFold-BetaTurn was developed, which is the first beta-turn prediction tool utilizing deep neural networks. The tool can be downloaded at http://dslsrv8.cs.missouri.edu/~cf797/MUFoldBetaTurn/download.html.
Affiliation(s)
- Chao Fang
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri
- Yi Shang
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri
- Dong Xu
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri; Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, Missouri
57
Deep learning in bioinformatics: Introduction, application, and perspective in the big data era. Methods 2019; 166:4-21. [PMID: 31022451 DOI: 10.1016/j.ymeth.2019.04.008] [Citation(s) in RCA: 137] [Impact Index Per Article: 22.8] [Received: 11/14/2018] [Revised: 03/23/2019] [Accepted: 04/15/2019] [Indexed: 12/13/2022] Open
Abstract
Deep learning, which is especially formidable in handling big data, has achieved great success in various fields, including bioinformatics. With the advances of the big data era in biology, it is foreseeable that deep learning will become increasingly important in the field and will be incorporated in the vast majority of analysis pipelines. In this review, we provide both an exoteric introduction to deep learning and concrete examples and implementations of its representative applications in bioinformatics. We start from the recent achievements of deep learning in the bioinformatics field, pointing out the problems for which deep learning is suitable. Next, we introduce deep learning in an easy-to-understand fashion, from shallow neural networks to legendary convolutional neural networks, legendary recurrent neural networks, graph neural networks, generative adversarial networks, variational autoencoders, and the most recent state-of-the-art architectures. We then provide eight examples, covering five bioinformatics research directions and all four kinds of data type, with implementations written in Tensorflow and Keras. Finally, we discuss the common issues, such as overfitting and interpretability, that users will encounter when adopting deep learning methods and provide corresponding suggestions. The implementations are freely available at https://github.com/lykaust15/Deep_learning_examples.
58
Oubounyt M, Louadi Z, Tayara H, Chong KT. DeePromoter: Robust Promoter Predictor Using Deep Learning. Front Genet 2019; 10:286. [PMID: 31024615 PMCID: PMC6460014 DOI: 10.3389/fgene.2019.00286] [Citation(s) in RCA: 80] [Impact Index Per Article: 13.3] [Received: 02/07/2019] [Accepted: 03/15/2019] [Indexed: 12/11/2022] Open
Abstract
The promoter region is located near the transcription start site and regulates transcription initiation of the gene by controlling the binding of RNA polymerase. Thus, promoter region recognition is an important area of interest in the field of bioinformatics. Numerous tools for promoter prediction have been proposed. However, the reliability of these tools still needs to be improved. In this work, we propose a robust deep learning model, called DeePromoter, to analyze the characteristics of short eukaryotic promoter sequences and accurately recognize human and mouse promoter sequences. DeePromoter combines a convolutional neural network (CNN) and a long short-term memory (LSTM) network. Additionally, instead of using non-promoter regions of the genome as a negative set, we derive a more challenging negative set from the promoter sequences. The proposed negative set reconstruction method improves the discrimination ability and significantly reduces the number of false positive predictions. Consequently, DeePromoter outperforms the previously proposed promoter prediction tools. In addition, a web server for promoter prediction was developed based on the proposed methods and made available at https://home.jbnu.ac.kr/NSCL/deepromoter.htm.
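One plausible way to derive a harder negative set from promoter sequences, as this abstract describes, is to keep part of each promoter intact and randomize the rest, so that negatives still share local content with real promoters. The Python sketch below illustrates that idea only. The function name, the chunk counts (`n_chunks=20`, `n_replaced=12`), and the uniform random base substitution are assumptions chosen for illustration; they are not stated in the abstract and should not be read as the authors' exact procedure.

```python
import random
from typing import Optional

BASES = "ACGT"


def make_hard_negative(promoter: str, n_chunks: int = 20, n_replaced: int = 12,
                       rng: Optional[random.Random] = None) -> str:
    """Build a negative example by randomizing some chunks of a promoter.

    The sequence is split into n_chunks contiguous pieces; n_replaced of them
    (chosen at random) are overwritten with random bases and the rest are kept.
    All parameter defaults are illustrative assumptions.
    """
    if rng is None:
        rng = random.Random()
    if len(promoter) < n_chunks or not 0 <= n_replaced <= n_chunks:
        raise ValueError("sequence too short or invalid chunk counts")

    # Chunk boundaries that cover the whole sequence even if it is not divisible.
    bounds = [i * len(promoter) // n_chunks for i in range(n_chunks + 1)]
    replaced = set(rng.sample(range(n_chunks), n_replaced))

    pieces = []
    for i in range(n_chunks):
        chunk = promoter[bounds[i]:bounds[i + 1]]
        if i in replaced:
            chunk = "".join(rng.choice(BASES) for _ in chunk)
        pieces.append(chunk)
    return "".join(pieces)


if __name__ == "__main__":
    toy_promoter = "ACGT" * 75  # made-up 300 bp sequence, not real data
    print(make_hard_negative(toy_promoter, rng=random.Random(0)))
```

Passing a seeded `random.Random` instance makes the generated negative set reproducible across runs.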
Affiliation(s)
- Mhaned Oubounyt
  - Department of Information and Electronics Engineering, Chonbuk National University, Jeonju, South Korea
- Zakaria Louadi
  - Department of Information and Electronics Engineering, Chonbuk National University, Jeonju, South Korea
- Hilal Tayara
  - Department of Information and Electronics Engineering, Chonbuk National University, Jeonju, South Korea
- Kil To Chong
  - Advanced Research Center of Information and Electronics Engineering, Chonbuk National University, Jeonju, South Korea