1
Liu Y, Yuan H, Zhang Q, Wang Z, Xiong S, Wen N, Zhang Y. Multiple sequence alignment based on deep reinforcement learning with self-attention and positional encoding. Bioinformatics 2023; 39:btad636. [PMID: 37856335 PMCID: PMC10628385 DOI: 10.1093/bioinformatics/btad636] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2022] [Revised: 07/24/2023] [Accepted: 10/17/2023] [Indexed: 10/21/2023] Open
Abstract
MOTIVATION Multiple sequence alignment (MSA) is an active area of current research and is commonly used in sequence analysis. However, there is no definitive solution for MSA because it is an NP-complete (nondeterministic polynomial-time complete) problem, and existing methods still leave room to improve accuracy. RESULTS We propose Deep reinforcement learning with Positional encoding and self-Attention for MSA, a method that enhances the accuracy of the alignment. Specifically, inspired by translation techniques in natural language processing, we introduce self-attention and positional encoding to improve accuracy and reliability. First, positional encoding encodes positions within the sequence to prevent the loss of nucleotide position information. Second, the self-attention model extracts the key features of the sequence. These features are then fed into a multi-layer perceptron, which calculates the insertion position of the gap from the features. In addition, a novel reinforcement learning environment converts classic progressive alignment into progressive column alignment, gradually generating each column's sub-alignment. Finally, the sub-alignments are merged into the complete alignment. Extensive experiments on several datasets validate the method's effectiveness for MSA, outperforming some state-of-the-art methods in terms of the Sum-of-pairs and Column scores. AVAILABILITY AND IMPLEMENTATION The method is implemented in Python and available as open-source software from https://github.com/ZhangLab312/DPAMSA.
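The abstract reports results in terms of Sum-of-pairs and Column scores but does not define them. The following is a minimal sketch of one common reference-free formulation, not the authors' implementation. The scoring values (`MATCH`, `MISMATCH`, `GAP`), the rule that gap/gap pairs score zero, and the definition of an "exact" column as gap-free with identical characters are all assumptions chosen for illustration.

```python
from itertools import combinations
from typing import List

# Assumed scoring values, for illustration only.
MATCH, MISMATCH, GAP = 4, -4, -4


def pair_score(a: str, b: str) -> int:
    """Score one pair of aligned characters."""
    if a == "-" and b == "-":
        return 0  # assumption: gap/gap pairs are ignored
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def sum_of_pairs(alignment: List[str]) -> int:
    """Sum pair_score over every column and every pair of rows."""
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b)
    return total


def column_score(alignment: List[str]) -> float:
    """Fraction of columns that are gap-free with all characters identical."""
    columns = list(zip(*alignment))
    if not columns:
        return 0.0
    exact = sum(1 for col in columns if "-" not in col and len(set(col)) == 1)
    return exact / len(columns)


alignment = ["ACG-T", "ACGAT", "AC-AT"]
sp = sum_of_pairs(alignment)   # 28 under the assumed scoring values
cs = column_score(alignment)   # 3 of 5 columns are exact, i.e. 0.6
```

Benchmark suites often compute both scores against a reference alignment instead; the reference-free variant above is only one possible reading.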
Affiliation(s)
- Yuhang Liu
- School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
- Hao Yuan
- School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
- Qiang Zhang
- School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
- Zixuan Wang
- College of Electronics and Information Engineering, Sichuan University, Chengdu 610065, China
- Shuwen Xiong
- School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
- Naifeng Wen
- School of Mechanical and Electrical Engineering, Dalian Minzu University, Dalian 116600, China
- Yongqing Zhang
- School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
2
Raslan MA, Raslan SA, Shehata EM, Mahmoud AS, Sabri NA. Advances in the Applications of Bioinformatics and Chemoinformatics. Pharmaceuticals (Basel) 2023; 16:1050. [PMID: 37513961 PMCID: PMC10384252 DOI: 10.3390/ph16071050] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 04/26/2023] [Revised: 07/19/2023] [Accepted: 07/20/2023] [Indexed: 07/30/2023] Open
Abstract
Chemoinformatics involves integrating the principles of physical chemistry with computer-based and information science methodologies, commonly referred to as "in silico techniques", in order to address a wide range of descriptive and prescriptive chemistry issues, including applications to biology, drug discovery, and related molecular areas. In addition, the incorporation of machine learning has been considered of high importance in the field of drug design, enabling the extraction of chemical data from enormous compound databases to develop drugs endowed with significant biological features. The present review discusses the field of chemoinformatics and proposes the use of virtual chemical libraries in virtual screening methods to increase the probability of discovering novel hit chemicals. The virtual libraries address the need to increase the quality of the compounds as well as to discover promising ones. The review also discusses various applications of bioinformatics in disease classification, diagnosis, and identification of multidrug-resistant organisms. The use of ensemble models and brute-force feature selection methodology has resulted in high accuracy rates for heart disease and COVID-19 diagnosis, and the role of special formulations for targeting meningitis and Alzheimer's disease is also covered. Additionally, the review presents the correlation between genomic variations and disease states such as obesity and chronic progressive external ophthalmoplegia, the investigation of the antibacterial activity of pyrazole and benzimidazole-based compounds against resistant microorganisms, and applications in chemoinformatics for the prediction of drug properties and toxicity.
Affiliation(s)
- Amr S Mahmoud
- Department of Obstetrics and Gynecology, Faculty of Medicine, Ain Shams University, Cairo P.O. Box 11566, Egypt
- Nagwa A Sabri
- Department of Clinical Pharmacy, Faculty of Pharmacy, Ain Shams University, Cairo P.O. Box 11566, Egypt
3
Deep reinforcement learning-based pairwise DNA sequence alignment method compatible with embedded edge devices. Sci Rep 2023; 13:2773. [PMID: 36797269 PMCID: PMC9935504 DOI: 10.1038/s41598-023-29277-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 10/17/2022] [Accepted: 02/01/2023] [Indexed: 02/18/2023] Open
Abstract
Sequence alignment is an essential component of bioinformatics, for identifying regions of similarity that may indicate functional, structural, or evolutionary relationships between the sequences. Genome-based diagnostics relying on DNA sequencing have benefited hugely from the boom in computing power in recent decades, particularly due to cloud-computing and the rise of graphics processing units (GPUs) and other advanced computing platforms for running advanced algorithms. Translating the success of such breakthroughs in diagnostics to affordable solutions for low-cost healthcare requires development of algorithms that can operate on the edge instead of in the cloud, using low-cost and low-power electronic systems such as microcontrollers and field programmable gate arrays (FPGAs). In this work, we present EdgeAlign, a deep reinforcement learning based method for performing pairwise DNA sequence alignment on stand-alone edge devices. EdgeAlign uses deep reinforcement learning to train a deep Q-network (DQN) agent for performing sequence alignment on fixed length sub-sequences, using a sliding window that is scanned over the length of the entire sequence. The hardware resource-consumption for implementing this scheme is thus independent of the lengths of the sequences to be aligned, and is further optimized using a novel AutoML based method for neural network model size reduction. Unlike other algorithms for sequence alignment reported in literature, the model demonstrated in this work is highly compact and deployed on two edge devices (NVIDIA Jetson Nano Developer Kit and Digilent Arty A7-100T, containing Xilinx XC7A35T Artix-7 FPGA) for demonstration of alignment for sequences from the publicly available Influenza sequences at the National Center for Biotechnology Information (NCBI) Virus Data Hub.
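The sliding-window idea described above can be illustrated with a short sketch. This is not EdgeAlign itself: the trained DQN agent is replaced by a hypothetical hand-written `greedy_policy`, and the window length `WINDOW` and the three-action set are assumptions. The point it shows is that the policy only ever sees two fixed-length windows, so the decision step uses the same amount of memory regardless of total sequence length.

```python
from typing import Callable, Tuple

WINDOW = 4  # assumed window length, for illustration only

# Assumed action set: advance both sequences, or place a gap in one of them.
FORWARD, GAP_IN_A, GAP_IN_B = 0, 1, 2


def greedy_policy(win_a: str, win_b: str) -> int:
    """Hypothetical stand-in for a trained agent; sees only the two windows."""
    if not win_a:
        return GAP_IN_A  # sequence a is exhausted
    if not win_b:
        return GAP_IN_B  # sequence b is exhausted
    if win_a[0] == win_b[0]:
        return FORWARD
    # Look one step ahead inside the window to decide where a gap helps.
    if len(win_b) > 1 and win_a[0] == win_b[1]:
        return GAP_IN_A
    if len(win_a) > 1 and win_a[1] == win_b[0]:
        return GAP_IN_B
    return FORWARD  # accept the mismatch


def windowed_align(
    a: str,
    b: str,
    policy: Callable[[str, str], int] = greedy_policy,
    window: int = WINDOW,
) -> Tuple[str, str]:
    """Align a and b by repeatedly querying the policy on fixed-length windows."""
    i = j = 0
    out_a, out_b = [], []
    while i < len(a) or j < len(b):
        action = policy(a[i:i + window], b[j:j + window])
        if action == FORWARD:
            out_a.append(a[i])
            out_b.append(b[j])
            i += 1
            j += 1
        elif action == GAP_IN_A:
            out_a.append("-")
            out_b.append(b[j])
            j += 1
        else:  # GAP_IN_B
            out_a.append(a[i])
            out_b.append("-")
            i += 1
    return "".join(out_a), "".join(out_b)


row_a, row_b = windowed_align("ACGTAC", "ACTAC")
# row_a == "ACGTAC", row_b == "AC-TAC"
```

Swapping `greedy_policy` for a learned Q-network would leave `windowed_align` unchanged, which is the design property the abstract highlights.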
4
Khodji H, Collet P, Thompson JD, Jeannin-Girardon A. De-MISTED: Image-based classification of erroneous multiple sequence alignments using convolutional neural networks. APPL INTELL 2023. [DOI: 10.1007/s10489-022-04390-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/11/2023]
5
Becker F, Stanke M. learnMSA: learning and aligning large protein families. Gigascience 2022; 11:giac104. [PMID: 36399060 PMCID: PMC9673500 DOI: 10.1093/gigascience/giac104] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2022] [Revised: 09/01/2022] [Accepted: 10/06/2022] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND The alignment of large numbers of protein sequences is a challenging task and its importance grows rapidly along with the size of biological datasets. State-of-the-art algorithms have a tendency to produce less accurate alignments with an increasing number of sequences. This is a fundamental problem since many downstream tasks rely on accurate alignments. RESULTS We present learnMSA, a novel statistical learning approach of profile hidden Markov models (pHMMs) based on batch gradient descent. Fundamentally different from popular aligners, we fit a custom recurrent neural network architecture for (p)HMMs to potentially millions of sequences with respect to a maximum a posteriori objective and decode an alignment. We rely on automatic differentiation of the log-likelihood, and thus, our approach is different from existing HMM training algorithms like Baum-Welch. Our method does not involve progressive, regressive, or divide-and-conquer heuristics. We use uniform batch sampling to adapt to large datasets in linear time without the requirement of a tree. When tested on ultra-large protein families with up to 3.5 million sequences, learnMSA is both more accurate and faster than state-of-the-art tools. On the established benchmarks HomFam and BaliFam with smaller sequence sets, it matches state-of-the-art performance. All experiments were done on a standard workstation with a GPU. CONCLUSIONS Our results show that learnMSA does not share the counterintuitive drawback of many popular heuristic aligners, which can substantially lose accuracy when many additional homologs are input. LearnMSA is a future-proof framework for large alignments with many opportunities for further improvements.
Affiliation(s)
- Felix Becker
- Institute of Mathematics and Computer Science, University of Greifswald, Walther-Rathenau-Straße 47, 17489 Greifswald, Germany
- Mario Stanke
- Institute of Mathematics and Computer Science, University of Greifswald, Walther-Rathenau-Straße 47, 17489 Greifswald, Germany
6
Zhang Y, Zhang Q, Zhou J, Zou Q. A survey on the algorithm and development of multiple sequence alignment. Brief Bioinform 2022; 23:6546258. [PMID: 35272347 DOI: 10.1093/bib/bbac069] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 12/09/2021] [Revised: 01/30/2022] [Accepted: 02/09/2022] [Indexed: 12/21/2022] Open
Abstract
Multiple sequence alignment (MSA) is an essential cornerstone in bioinformatics, which can reveal the potential information in biological sequences, such as function, evolution and structure. MSA is widely used in many bioinformatics scenarios, such as phylogenetic analysis, protein analysis and genomic analysis. However, MSA faces new challenges with the gradual increase in sequence scale and the increasing demand for alignment accuracy. Therefore, developing an efficient and accurate strategy for MSA has become one of the research hotspots in bioinformatics. In this work, we mainly summarize the algorithms for MSA and its applications in bioinformatics. To provide a structured and clear perspective, we systematically introduce MSA's knowledge, including background, database, metric and benchmark. Besides, we list the most common applications of MSA in the field of bioinformatics, including database searching, phylogenetic analysis, genomic analysis, metagenomic analysis and protein analysis. Furthermore, we categorize and analyze classical and state-of-the-art algorithms, divided into progressive alignment, iterative algorithm, heuristics, machine learning and divide-and-conquer. Moreover, we also discuss the challenges and opportunities of MSA in bioinformatics. Our work provides a comprehensive survey of MSA applications and their relevant algorithms. It could bring valuable insights for researchers to contribute their knowledge to MSA and relevant studies.
Affiliation(s)
- Yongqing Zhang
- School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China; School of Computer Science and Engineering, University of Electronic Science and Technology of China, 611731, Chengdu, China
- Qiang Zhang
- School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Jiliu Zhou
- School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, 610054, Chengdu, China
7
Song YJ, Cho DH. Local Alignment of DNA Sequence Based on Deep Reinforcement Learning. IEEE OPEN JOURNAL OF ENGINEERING IN MEDICINE AND BIOLOGY 2021; 2:170-178. [PMID: 35402982 PMCID: PMC8975175 DOI: 10.1109/ojemb.2021.3076156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2021] [Revised: 04/20/2021] [Accepted: 04/23/2021] [Indexed: 11/06/2022] Open
Abstract
Goal: Over the decades, sequence alignment algorithms have improved significantly in aspects such as complexity and accuracy. However, human-defined algorithms have an explicit limitation in view of developmental completeness. This paper introduces a novel local alignment method to obtain optimal sequence alignment based on reinforcement learning. Methods: DQNalign is an existing algorithm that learns and performs sequence alignment through deep reinforcement learning. This paper proposes a DQN x-drop algorithm that performs local alignment without human intervention by combining the x-drop algorithm with the DQNalign algorithm. The proposed algorithm performs local alignment by repeatedly observing the subsequences and selecting the next alignment direction until the x-drop algorithm terminates the DQNalign algorithm. Its advantage over conventional local alignment algorithms is linear computational complexity. Results: This paper compares alignment performance (coverage and identity) and complexity for a fair comparison between the proposed DQN x-drop algorithm and the conventional greedy x-drop algorithm. First, we prove the proposed algorithm's superiority by comparing the two algorithms' computational complexity through numerical analysis. We then test the alignment performance on actual HEV and E. coli sequence datasets. The proposed method shows identity and coverage performance comparable to the conventional alignment method while having linear complexity for the [Formula: see text] parameter. Conclusions: This study confirms the feasibility of a new local alignment algorithm that minimizes computational complexity without human intervention.
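The x-drop termination rule mentioned above can be sketched in a few lines. This is a deliberately simplified, ungapped extension and not the DQN x-drop algorithm from the paper: there is no learned agent and no gap handling. The function name `xdrop_extend` and the `match`/`mismatch` values are assumptions for illustration. The extension stops as soon as the running score falls more than `x` below the best score seen so far.

```python
from typing import Tuple


def xdrop_extend(
    a: str,
    b: str,
    x: int,
    match: int = 1,       # assumed reward for a matching pair
    mismatch: int = -2,   # assumed penalty for a mismatching pair
) -> Tuple[int, int]:
    """Extend an ungapped alignment from position 0 of both sequences.

    Returns (best_score, best_length): the highest running score reached
    and the extension length at which it was reached.
    """
    score = 0
    best = 0
    best_len = 0
    for k, (ca, cb) in enumerate(zip(a, b), start=1):
        score += match if ca == cb else mismatch
        if score > best:
            best, best_len = score, k
        elif best - score > x:
            break  # x-drop condition: score dropped too far below the best
    return best, best_len


result = xdrop_extend("ACGTACGTTTTT", "ACGTACGAAAAA", x=3)
# result == (7, 7): seven matches, then two mismatches trigger the drop-off
```

Each position is visited at most once, so this simplified extension runs in time linear in the extension length.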
Affiliation(s)
- Yong-Joon Song
- School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon 305-701, South Korea
- Dong-Ho Cho
- School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon 305-701, South Korea
8
Auslander N, Gussow AB, Koonin EV. Incorporating Machine Learning into Established Bioinformatics Frameworks. Int J Mol Sci 2021; 22:2903. [PMID: 33809353 PMCID: PMC8000113 DOI: 10.3390/ijms22062903] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Received: 02/15/2021] [Revised: 03/08/2021] [Accepted: 03/10/2021] [Indexed: 12/23/2022] Open
Abstract
The exponential growth of biomedical data in recent years has urged the application of numerous machine learning techniques to address emerging problems in biology and clinical research. By enabling the automatic feature extraction, selection, and generation of predictive models, these methods can be used to efficiently study complex biological systems. Machine learning techniques are frequently integrated with bioinformatic methods, as well as curated databases and biological networks, to enhance training and validation, identify the best interpretable features, and enable feature and model investigation. Here, we review recently developed methods that incorporate machine learning within the same framework with techniques from molecular evolution, protein structure analysis, systems biology, and disease genomics. We outline the challenges posed for machine learning, and, in particular, deep learning in biomedicine, and suggest unique opportunities for machine learning techniques integrated with established bioinformatics approaches to overcome some of these challenges.
Affiliation(s)
- Eugene V. Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
9
Lim D, Blanchette M. EvoLSTM: context-dependent models of sequence evolution using a sequence-to-sequence LSTM. Bioinformatics 2021; 36:i353-i361. [PMID: 32657367 PMCID: PMC7355264 DOI: 10.1093/bioinformatics/btaa447] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/18/2022] Open
Abstract
Motivation Accurate probabilistic models of sequence evolution are essential for a wide variety of bioinformatics tasks, including sequence alignment and phylogenetic inference. The ability to realistically simulate sequence evolution is also at the core of many benchmarking strategies. Yet, mutational processes have complex context dependencies that remain poorly modeled and understood. Results We introduce EvoLSTM, a recurrent neural network-based evolution simulator that captures mutational context dependencies. EvoLSTM uses a sequence-to-sequence long short-term memory model trained to predict mutation probabilities at each position of a given sequence, taking into consideration the 14 flanking nucleotides. EvoLSTM can realistically simulate mammalian and plant DNA sequence evolution and reveals unexpectedly strong long-range context dependencies in mutation probabilities. EvoLSTM brings modern machine-learning approaches to bear on sequence evolution. It will serve as a useful tool to study and simulate complex mutational processes. Availability and implementation Code and dataset are available at https://github.com/DongjoonLim/EvoLSTM. Supplementary information Supplementary data are available at Bioinformatics online.
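To make the notion of flanking context concrete, here is a small sketch of how a per-position context window could be extracted from a DNA string. It is not the EvoLSTM preprocessing code. The helper name `context_windows`, the padding symbol `N`, and the reading of "14 flanking nucleotides" as 7 on each side (`flank=7`) are all assumptions; the abstract does not say how the 14 nucleotides are split.

```python
from typing import List


def context_windows(seq: str, flank: int = 7, pad: str = "N") -> List[str]:
    """Return, for each position, the base plus `flank` neighbours on each side.

    Positions near the ends are padded with `pad` so every window has
    length 2 * flank + 1.
    """
    padded = pad * flank + seq + pad * flank
    width = 2 * flank + 1
    return [padded[i:i + width] for i in range(len(seq))]


windows = context_windows("ACGT", flank=1)
# windows == ["NAC", "ACG", "CGT", "GTN"]
```

A model predicting mutation probabilities at position `i` would then take `windows[i]` as its input context under these assumptions.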
Affiliation(s)
- Dongjoon Lim
- School of Computer Science, McGill University, Montreal, Quebec H3A 0G4, Canada
- Mathieu Blanchette
- School of Computer Science, McGill University, Montreal, Quebec H3A 0G4, Canada
10
Prihoda D, Maritz JM, Klempir O, Dzamba D, Woelk CH, Hazuda DJ, Bitton DA, Hannigan GD. The application potential of machine learning and genomics for understanding natural product diversity, chemistry, and therapeutic translatability. Nat Prod Rep 2021; 38:1100-1108. [PMID: 33245088 DOI: 10.1039/d0np00055h] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Indexed: 12/28/2022]
Abstract
Covering: up to the end of 2020. The machine learning field can be defined as the study and application of algorithms that perform classification and prediction tasks through pattern recognition instead of explicitly defined rules. Among other areas, machine learning has excelled in natural language processing. As such methods have excelled at understanding written languages (e.g. English), they are also being applied to biological problems to better understand the "genomic language". In this review we focus on recent advances in applying machine learning to natural products and genomics, and how those advances are improving our understanding of natural product biology, chemistry, and drug discovery. We discuss machine learning applications in genome mining (identifying biosynthetic signatures in genomic data), predictions of what structures will be created from those genomic signatures, and the types of activity we might expect from those molecules. We further explore the application of these approaches to data derived from complex microbiomes, with a focus on the human microbiome. We also review challenges in leveraging machine learning approaches in the field, and how the availability of other "omics" data layers provides value. Finally, we provide insights into the challenges associated with interpreting machine learning models and the underlying biology and promises of applying machine learning to natural product drug discovery. We believe that the application of machine learning methods to natural product research is poised to accelerate the identification of new molecular entities that may be used to treat a variety of disease indications.
Affiliation(s)
- David Prihoda
- R&D Informatics Solutions, MSD Czech Republic s.r.o., Prague, Czech Republic and Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology, Prague, Czech Republic
- Julia M Maritz
- Exploratory Science Center, Merck & Co., Inc., Cambridge, MA, USA
- Ondrej Klempir
- R&D Informatics Solutions, MSD Czech Republic s.r.o., Prague, Czech Republic
- David Dzamba
- R&D Informatics Solutions, MSD Czech Republic s.r.o., Prague, Czech Republic
- Daria J Hazuda
- Exploratory Science Center, Merck & Co., Inc., Cambridge, MA, USA
- Danny A Bitton
- R&D Informatics Solutions, MSD Czech Republic s.r.o., Prague, Czech Republic