1
Kuśmirek W. Estimated Nucleotide Reconstruction Quality Symbols of Basecalling Tools for Oxford Nanopore Sequencing. Sensors (Basel, Switzerland) 2023; 23:6787. [PMID: 37571570 PMCID: PMC10422362 DOI: 10.3390/s23156787] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2023] [Revised: 07/21/2023] [Accepted: 07/27/2023] [Indexed: 08/13/2023]
Abstract
Nanopore sequencing is currently one of the fastest-growing DNA sequencing technologies. A key stage in processing sequencer data is basecalling, in which the sequence of currents measured at the sequencer's nanopores is translated into DNA sequences, called DNA reads. Many basecalling applications provide, together with the DNA sequence, the estimated quality of the reconstruction of each nucleotide (quality symbols are contained on every fourth line of the FASTQ file; each nucleotide in the FASTQ file corresponds to exactly one estimated nucleotide reconstruction quality symbol). Herein, we compare the estimated nucleotide reconstruction quality symbols (signs from every fourth line of the FASTQ file) reported by different basecallers. The experiments consisted of basecalling the same raw datasets from the nanopore device with different basecallers and comparing the provided quality symbols, which denote the estimated quality of the nucleotide reconstruction. The results show that the estimated quality reported by different basecallers may vary, depending on the tool used, particularly in terms of range and distribution. Moreover, we mapped basecalled DNA reads to reference genomes and calculated matched and mismatched rates for groups of nucleotides with the same quality symbol. Finally, the paper shows that the estimated nucleotide reconstruction quality reported in the basecalling process is not used in any investigated tool for processing nanopore DNA reads.
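To make the "every fourth line" convention concrete, the following is a minimal sketch of how quality symbols could be collected from a FASTQ stream and turned into Phred scores. It assumes the common four-line record layout (no line-wrapped sequences) and the Phred+33 encoding; the function names and the example reads are illustrative and do not come from the paper.

```python
from collections import Counter
from io import StringIO

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger) encoding


def quality_symbol_counts(handle):
    """Count the quality symbols found on every fourth line of a FASTQ stream."""
    counts = Counter()
    for index, line in enumerate(handle):
        if index % 4 == 3:  # lines 4, 8, 12, ... carry the quality string
            counts.update(line.rstrip("\n"))
    return counts


def phred_score(symbol):
    """Convert one quality symbol to its integer Phred score."""
    return ord(symbol) - PHRED_OFFSET


def error_probability(symbol):
    """Estimated probability that the base is wrong: 10^(-Q/10)."""
    return 10 ** (-phred_score(symbol) / 10)


# Two made-up reads, used only to exercise the functions.
example = StringIO("@read1\nACGT\n+\n!+5I\n@read2\nGG\n+\nII\n")
counts = quality_symbol_counts(example)
for symbol, n in sorted(counts.items()):
    print(symbol, n, phred_score(symbol), error_probability(symbol))
```

Grouping aligned bases by symbol and dividing matches by the group size would then give the per-symbol matched rate that the abstract describes; that step needs an alignment and is left out here.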
Affiliation(s)
- Wiktor Kuśmirek
- Institute of Computer Science, Warsaw University of Technology, 00-661 Warsaw, Poland
2
Boza V, Peresini P, Brejova B, Vinar T. Dynamic Pooling Improves Nanopore Base Calling Accuracy. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:3416-3424. [PMID: 34784283 DOI: 10.1109/tcbb.2021.3128366] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
In nanopore sequencing, electrical signal is measured as DNA molecules pass through the sequencing pores. Translating these signals into DNA bases (base calling) is a highly non-trivial task, and its quality has a large impact on the sequencing accuracy. The most successful nanopore base callers to date use convolutional neural networks (CNN) to accomplish the task. Convolutional layers in CNNs are typically composed of filters with constant window size, performing best in analysis of signals with uniform speed. However, the speed of nanopore sequencing varies greatly both within reads and between sequencing runs. Here, we present dynamic pooling, a novel neural network component, which addresses this problem by adaptively adjusting the pooling ratio. To demonstrate the usefulness of dynamic pooling, we developed two base callers: Heron and Osprey. Heron improves the accuracy beyond the experimental high-accuracy base caller Bonito developed by Oxford Nanopore. Osprey is a fast base caller that can compete in accuracy with Guppy high-accuracy mode, but does not require GPU acceleration and achieves a near real-time speed on common desktop CPUs. Availability: https://github.com/fmfi-compbio/osprey, https://github.com/fmfi-compbio/heron.
3
Li Z, Li Y, Zhang B, Li Y, Long Y, Zhou J, Zou X, Zhang M, Hu Y, Chen W, Gao X. DeeReCT-APA: Prediction of Alternative Polyadenylation Site Usage Through Deep Learning. Genomics, Proteomics & Bioinformatics 2022; 20:483-495. [PMID: 33662629 PMCID: PMC9801043 DOI: 10.1016/j.gpb.2020.05.004] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 01/13/2020] [Revised: 03/28/2020] [Accepted: 06/12/2020] [Indexed: 01/26/2023]
Abstract
Alternative polyadenylation (APA) is a crucial step in post-transcriptional regulation. Previous bioinformatic studies have mainly focused on the recognition of polyadenylation sites (PASs) in a given genomic sequence, which is a binary classification problem. Recently, computational methods for predicting the usage level of alternative PASs in the same gene have been proposed. However, all of them cast the problem as a non-quantitative pairwise comparison task and do not take the competition among multiple PASs into account. To address this, here we propose a deep learning architecture, Deep Regulatory Code and Tools for Alternative Polyadenylation (DeeReCT-APA), to quantitatively predict the usage of all alternative PASs of a given gene. To accommodate different genes with potentially different numbers of PASs, DeeReCT-APA treats the problem as a regression task with a variable-length target. Based on a convolutional neural network-long short-term memory (CNN-LSTM) architecture, DeeReCT-APA extracts sequence features with CNN layers, uses bidirectional LSTM to explicitly model the interactions among competing PASs, and outputs percentage scores representing the usage levels of all PASs of a gene. In addition to the fact that only our method can quantitatively predict the usage of all the PASs within a gene, we show that our method consistently outperforms other existing methods on three different tasks for which they are trained: pairwise comparison task, highest usage prediction task, and ranking task. Finally, we demonstrate that our method can be used to predict the effect of genetic variations on APA patterns and sheds light on future mechanistic understanding in APA regulation. Our code and data are available at https://github.com/lzx325/DeeReCT-APA-repo.
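The idea of a variable-length target whose entries are percentage scores can be illustrated with a small sketch. It assumes that per-site scores are normalized with a softmax within each gene; the abstract does not state the normalization, so this is an assumption, and the CNN-LSTM that would produce the scores is omitted. The function name `usage_percentages` is hypothetical.

```python
import math


def usage_percentages(scores):
    """Softmax over one gene's per-PAS scores, scaled to percentages.

    Works for any number of sites, so genes with different numbers
    of PASs can share the same output step.
    """
    if not scores:
        return []
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [100.0 * w / total for w in weights]


# Made-up scores for a gene with two sites and a gene with three sites.
two_sites = usage_percentages([0.0, 0.0])
three_sites = usage_percentages([2.0, 1.0, 0.1])
print(two_sites)
print(three_sites)
```

Because the percentages of one gene always sum to 100, raising the score of one site necessarily lowers the predicted usage of the others, which is one simple way to express competition among sites.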
Affiliation(s)
- Zhongxiao Li
  - King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955-6900, Saudi Arabia
- Yisheng Li
  - Department of Biology, Southern University of Science and Technology (SUSTech), Shenzhen 518055, China
- Bin Zhang
  - Cancer Science Institute of Singapore, Singapore 117599, Singapore
- Yu Li
  - King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955-6900, Saudi Arabia
- Yongkang Long
  - King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955-6900, Saudi Arabia
  - Department of Biology, Southern University of Science and Technology (SUSTech), Shenzhen 518055, China
- Juexiao Zhou
  - Department of Biology, Southern University of Science and Technology (SUSTech), Shenzhen 518055, China
- Xudong Zou
  - Department of Biology, Southern University of Science and Technology (SUSTech), Shenzhen 518055, China
- Min Zhang
  - Department of Biology, Southern University of Science and Technology (SUSTech), Shenzhen 518055, China
- Yuhui Hu (corresponding author)
  - Department of Biology, Southern University of Science and Technology (SUSTech), Shenzhen 518055, China
- Wei Chen (corresponding author)
  - Department of Biology, Southern University of Science and Technology (SUSTech), Shenzhen 518055, China
- Xin Gao (corresponding author)
  - King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955-6900, Saudi Arabia
4
Wang Y, Zhao Y, Bollas A, Wang Y, Au KF. Nanopore sequencing technology, bioinformatics and applications. Nat Biotechnol 2021; 39:1348-1365. [PMID: 34750572 PMCID: PMC8988251 DOI: 10.1038/s41587-021-01108-x] [Citation(s) in RCA: 537] [Impact Index Per Article: 179.0] [Received: 12/09/2019] [Accepted: 09/22/2021] [Indexed: 12/13/2022]
Abstract
Rapid advances in nanopore technologies for sequencing single long DNA and RNA molecules have led to substantial improvements in accuracy, read length and throughput. These breakthroughs have required extensive development of experimental and bioinformatics methods to fully exploit nanopore long reads for investigations of genomes, transcriptomes, epigenomes and epitranscriptomes. Nanopore sequencing is being applied in genome assembly, full-length transcript detection and base modification detection and in more specialized areas, such as rapid clinical diagnoses and outbreak surveillance. Many opportunities remain for improving data quality and analytical approaches through the development of new nanopores, base-calling methods and experimental protocols tailored to particular applications.
Affiliation(s)
- Yunhao Wang
  - Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
- Yue Zhao
  - Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
  - Biomedical Informatics Shared Resources, The Ohio State University, Columbus, OH, USA
- Audrey Bollas
  - Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
- Yuru Wang
  - Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
- Kin Fai Au
  - Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
  - Biomedical Informatics Shared Resources, The Ohio State University, Columbus, OH, USA
5
Konishi H, Yamaguchi R, Yamaguchi K, Furukawa Y, Imoto S. Halcyon: an accurate basecaller exploiting an encoder-decoder model with monotonic attention. Bioinformatics 2021; 37:1211-1217. [PMID: 33165508 PMCID: PMC8189681 DOI: 10.1093/bioinformatics/btaa953] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 02/24/2020] [Revised: 10/14/2020] [Accepted: 10/30/2020] [Indexed: 11/17/2022]
Abstract
Motivation
In recent years, nanopore sequencing technology has enabled inexpensive long-read sequencing, which promises reads longer than a few thousand bases. Such long-read sequences contribute to the precise detection of structural variations and accurate haplotype phasing. However, deciphering precise DNA sequences from noisy and complicated nanopore raw signals remains a crucial demand for downstream analyses based on higher-quality nanopore sequencing, although various basecallers have been introduced to date.
Results
To address this need, we developed a novel basecaller, Halcyon, that incorporates neural-network techniques frequently used in the field of machine translation. Our model employs monotonic-attention mechanisms to learn semantic correspondences between nucleotides and signal levels without any pre-segmentation against input signals. We evaluated performance with a human whole-genome sequencing dataset and demonstrated that Halcyon outperformed existing third-party basecallers and achieved competitive performance against the latest Oxford Nanopore Technologies’ basecallers.
Availability and implementation
The source code (halcyon) can be found at https://github.com/relastle/halcyon.
Affiliation(s)
- Kiyoshi Yamaguchi
  - Advanced Clinical Research Center, Institute of Medical Science, The University of Tokyo, Tokyo, Japan
- Yoichi Furukawa
  - Advanced Clinical Research Center, Institute of Medical Science, The University of Tokyo, Tokyo, Japan
- Seiya Imoto
  - Health Intelligence Center
  - Human Genome Center
6
Li Y, Xu Z, Han W, Cao H, Umarov R, Yan A, Fan M, Chen H, Duarte CM, Li L, Ho PL, Gao X. HMD-ARG: hierarchical multi-task deep learning for annotating antibiotic resistance genes. Microbiome 2021; 9:40. [PMID: 33557954 PMCID: PMC7871585 DOI: 10.1186/s40168-021-01002-3] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Received: 11/11/2020] [Accepted: 01/08/2021] [Indexed: 05/07/2023]
Abstract
BACKGROUND The spread of antibiotic resistance has become one of the most urgent threats to global health, which is estimated to cause 700,000 deaths each year globally. Its surrogates, antibiotic resistance genes (ARGs), are highly transmittable between food, water, animal, and human to mitigate the efficacy of antibiotics. Accurately identifying ARGs is thus an indispensable step to understanding the ecology, and transmission of ARGs between environmental and human-associated reservoirs. Unfortunately, the previous computational methods for identifying ARGs are mostly based on sequence alignment, which cannot identify novel ARGs, and their applications are limited by currently incomplete knowledge about ARGs. RESULTS Here, we propose an end-to-end Hierarchical Multi-task Deep learning framework for ARG annotation (HMD-ARG). Taking raw sequence encoding as input, HMD-ARG can identify, without querying against existing sequence databases, multiple ARG properties simultaneously, including if the input protein sequence is an ARG, and if so, what antibiotic family it is resistant to, what resistant mechanism the ARG takes, and if the ARG is an intrinsic one or acquired one. In addition, if the predicted antibiotic family is beta-lactamase, HMD-ARG further predicts the subclass of beta-lactamase that the ARG is resistant to. Comprehensive experiments, including cross-fold validation, third-party dataset validation in human gut microbiota, wet-experimental functional validation, and structural investigation of predicted conserved sites, demonstrate not only the superior performance of our method over the state-of-art methods, but also the effectiveness and robustness of the proposed method. CONCLUSIONS We propose a hierarchical multi-task method, HMD-ARG, which is based on deep learning and can provide detailed annotations of ARGs from three important aspects: resistant antibiotic class, resistant mechanism, and gene mobility. 
We believe that HMD-ARG can serve as a powerful tool to identify antibiotic resistance genes and, therefore, mitigate their global threat. Our method and the constructed database are available at http://www.cbrc.kaust.edu.sa/HMDARG/.
Affiliation(s)
- Yu Li
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
  - Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), Hong Kong, People's Republic of China
- Zeling Xu
  - School of Biological Sciences, The University of Hong Kong, Hong Kong, People's Republic of China
- Wenkai Han
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
- Huiluo Cao
  - Carol Yu Center for Infection and Department of Microbiology, The University of Hong Kong, Hong Kong, People's Republic of China
- Ramzan Umarov
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
- Aixin Yan
  - School of Biological Sciences, The University of Hong Kong, Hong Kong, People's Republic of China
- Ming Fan
  - Institute of Biomedical Engineering and Instrumentation, Hangzhou Dianzi University, Hangzhou, People's Republic of China
- Huan Chen
  - Key Laboratory of Microbial Technology and Bioinformatics of Zhejiang Province, Zhejiang Institute of Microbiology, Hangzhou, People's Republic of China
- Carlos M Duarte
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
  - Biological and Environmental Sciences and Engineering (BESE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
- Lihua Li
  - Institute of Biomedical Engineering and Instrumentation, Hangzhou Dianzi University, Hangzhou, People's Republic of China
- Pak-Leung Ho
  - Carol Yu Center for Infection and Department of Microbiology, The University of Hong Kong, Hong Kong, People's Republic of China
- Xin Gao
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
7
Li Y, Wang S, Bi C, Qiu Z, Li M, Gao X. DeepSimulator1.5: a more powerful, quicker and lighter simulator for Nanopore sequencing. Bioinformatics 2020; 36:2578-2580. [PMID: 31913436 PMCID: PMC7178411 DOI: 10.1093/bioinformatics/btz963] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 08/12/2019] [Revised: 11/17/2019] [Accepted: 01/03/2020] [Indexed: 11/13/2022]
Abstract
MOTIVATION Nanopore sequencing is one of the leading third-generation sequencing technologies. A number of computational tools have been developed to facilitate the processing and analysis of the Nanopore data. Previously, we have developed DeepSimulator1.0 (DS1.0), which is the first simulator for Nanopore sequencing to produce both the raw electrical signals and the reads. However, although DS1.0 can produce high-quality reads, for some sequences, the divergence between the simulated raw signals and the real signals can be large. Furthermore, the Nanopore sequencing technology has evolved greatly since DS1.0 was released. It is thus necessary to update DS1.0 to accommodate those changes. RESULTS We propose DeepSimulator1.5 (DS1.5), all three modules of which have been updated substantially from DS1.0. As for the sequence generator, we updated the sample read length distribution to reflect the newest real reads' features. In terms of the signal generator, which is the core of DeepSimulator, we added one more pore model, the context-independent pore model, which is much faster than the previous context-dependent one. Furthermore, to make the generated signals more similar to the real ones, we added a low-pass filter to post-process the pore model signals. Regarding the basecaller, we added the support for the newest official basecaller, Guppy, which can support both GPU and CPU. In addition, multiple optimizations, related to multiprocessing control, memory and storage management, have been implemented to make DS1.5 a much more amenable and lighter simulator than DS1.0. AVAILABILITY AND IMPLEMENTATION The main program and the data are available at https://github.com/lykaust15/DeepSimulator. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
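The two signal-generator ideas mentioned here, a context-independent pore model followed by a low-pass filter, can be sketched in a few lines. This is not DeepSimulator code: the per-base current levels are placeholder values, "context-independent" is read here as "each base maps to a level regardless of its neighbors", and a moving average stands in for whatever low-pass filter the tool actually uses.

```python
# Hypothetical mean current level (pA) per base; placeholder values only.
BASE_LEVEL = {"A": 80.0, "C": 90.0, "G": 100.0, "T": 110.0}


def ideal_signal(sequence, samples_per_base=4):
    """Context-independent model: every base maps to one fixed level."""
    signal = []
    for base in sequence:
        signal.extend([BASE_LEVEL[base]] * samples_per_base)
    return signal


def moving_average(signal, window=3):
    """Simple causal low-pass filter over the latest `window` samples."""
    smoothed = []
    for i in range(len(signal)):
        start = max(0, i - window + 1)
        chunk = signal[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


raw = ideal_signal("AC", samples_per_base=2)
filtered = moving_average(raw, window=2)
print(raw)       # step-like signal
print(filtered)  # the step between A and C is softened
```

The filter turns the abrupt step between bases into a gradual transition, which is the qualitative effect the abstract attributes to the post-processing step.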
Affiliation(s)
- Yu Li
  - Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Sheng Wang
  - Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
  - Tencent AI Lab, Shenzhen 518000, China
- Chongwei Bi
  - Biological and Environmental Sciences and Engineering (BESE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Zhaowen Qiu
  - Institute of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China
- Mo Li
  - Biological and Environmental Sciences and Engineering (BESE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Xin Gao
  - Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
8
Boža V, Perešíni P, Brejová B, Vinař T. DeepNano-blitz: a fast base caller for MinION nanopore sequencers. Bioinformatics 2020; 36:4191-4192. [DOI: 10.1093/bioinformatics/btaa297] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 02/12/2020] [Revised: 04/10/2020] [Accepted: 04/28/2020] [Indexed: 11/14/2022]
Abstract
Motivation
Oxford Nanopore MinION is a portable DNA sequencer that is marketed as a device that can be deployed anywhere. Current base callers, however, require a powerful GPU to analyze data produced by MinION in real time, which hampers field applications.
Results
We have developed a fast base caller DeepNano-blitz that can analyze stream from up to two MinION runs in real time using a common laptop CPU (i7-7700HQ), with no GPU requirements. The base caller settings allow trading accuracy for speed and the results can be used for real time run monitoring (i.e. sample composition, barcode balance, species identification, etc.) or prefiltering of results for more detailed analysis (i.e. filtering out human DNA from human–pathogen runs).
Availability and implementation
DeepNano-blitz has been developed and tested on Linux and Intel processors and is available under MIT license at https://github.com/fmfi-compbio/deepnano-blitz.
Contact
vladimir.boza@fmph.uniba.sk
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Vladimír Boža
  - Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Mlynská dolina, Bratislava, Slovakia
- Peter Perešíni
  - Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Mlynská dolina, Bratislava, Slovakia
- Broňa Brejová
  - Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Mlynská dolina, Bratislava, Slovakia
- Tomáš Vinař
  - Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Mlynská dolina, Bratislava, Slovakia
9
Zeng J, Cai H, Peng H, Wang H, Zhang Y, Akutsu T. Causalcall: Nanopore Basecalling Using a Temporal Convolutional Network. Front Genet 2020; 10:1332. [PMID: 32038706 PMCID: PMC6984161 DOI: 10.3389/fgene.2019.01332] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 08/29/2019] [Accepted: 12/05/2019] [Indexed: 11/13/2022]
Abstract
Nanopore sequencing is promising because of its long read length and high speed. During sequencing, a strand of DNA/RNA passes through a biological nanopore, which causes the current in the pore to fluctuate. During basecalling, context-dependent current measurements are translated into the base sequence of the DNA/RNA strand. Accurate and fast basecalling is vital for downstream analyses such as genome assembly and detecting single-nucleotide polymorphisms and genomic structural variants. However, owing to the various changes in DNA/RNA molecules, noise during sequencing, and limitations of basecalling methods, accurate basecalling remains a challenge. In this paper, we propose Causalcall, which uses an end-to-end temporal convolution-based deep learning model for accurate and fast nanopore basecalling. Developed on a temporal convolutional network (TCN) and a connectionist temporal classification decoder, Causalcall directly identifies base sequences of varying lengths from current measurements in long time series. In contrast to the basecalling models using recurrent neural networks (RNNs), the convolution-based model of Causalcall can speed up basecalling by matrix computation. Experiments on multiple species have demonstrated the great potential of the TCN-based model to improve basecalling accuracy and speed when compared to an RNN-based model. Besides, experiments on genome assembly indicate the utility of Causalcall in reference-based genome assembly.
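How a connectionist temporal classification (CTC) decoder turns a long series of per-frame predictions into a shorter base sequence can be shown with a minimal greedy decoder. This sketch assumes best-path (greedy) decoding and a five-symbol alphabet with `-` as the blank; the abstract does not say which decoding strategy Causalcall uses, and the frame scores below are made up.

```python
BLANK = "-"
ALPHABET = (BLANK, "A", "C", "G", "T")  # assumed label order


def ctc_greedy_decode(frame_scores, alphabet=ALPHABET):
    """Best-path CTC decoding: take the top label per frame,
    merge consecutive repeats, then drop blanks."""
    best_path = [
        alphabet[max(range(len(frame)), key=frame.__getitem__)]
        for frame in frame_scores
    ]
    decoded = []
    previous = None
    for label in best_path:
        if label != previous and label != BLANK:
            decoded.append(label)
        previous = label
    return "".join(decoded)


# Seven frames whose top labels are A, A, -, A, C, C, -.
frames = [
    [0.1, 0.9, 0.0, 0.0, 0.0],
    [0.1, 0.9, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0, 0.0],
    [0.2, 0.8, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.8, 0.1, 0.0],
    [0.1, 0.0, 0.8, 0.1, 0.0],
    [0.9, 0.0, 0.1, 0.0, 0.0],
]
called = ctc_greedy_decode(frames)
print(called)
```

The blank between the second and third `A` frames is what lets the decoder emit two consecutive `A` bases instead of merging them, which is why the output length can differ from the number of frames.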
Affiliation(s)
- Jingwen Zeng
  - School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
- Hongmin Cai
  - School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
- Hong Peng
  - School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
- Haiyan Wang
  - School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
- Yue Zhang
  - School of Computer Science, Guangdong Polytechnic Normal University, Guangzhou, China
- Tatsuya Akutsu
  - Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto, Japan
10
Deep learning in bioinformatics: Introduction, application, and perspective in the big data era. Methods 2019; 166:4-21. [PMID: 31022451 DOI: 10.1016/j.ymeth.2019.04.008] [Citation(s) in RCA: 134] [Impact Index Per Article: 26.8] [Received: 11/14/2018] [Revised: 03/23/2019] [Accepted: 04/15/2019] [Indexed: 12/13/2022]
Abstract
Deep learning, which is especially formidable in handling big data, has achieved great success in various fields, including bioinformatics. With the advances of the big data era in biology, it is foreseeable that deep learning will become increasingly important in the field and will be incorporated in vast majorities of analysis pipelines. In this review, we provide both the exoteric introduction of deep learning, and concrete examples and implementations of its representative applications in bioinformatics. We start from the recent achievements of deep learning in the bioinformatics field, pointing out the problems which are suitable to use deep learning. After that, we introduce deep learning in an easy-to-understand fashion, from shallow neural networks to legendary convolutional neural networks, legendary recurrent neural networks, graph neural networks, generative adversarial networks, variational autoencoder, and the most recent state-of-the-art architectures. After that, we provide eight examples, covering five bioinformatics research directions and all the four kinds of data type, with the implementation written in Tensorflow and Keras. Finally, we discuss the common issues, such as overfitting and interpretability, that users will encounter when adopting deep learning methods and provide corresponding suggestions. The implementations are freely available at https://github.com/lykaust15/Deep_learning_examples.