1
Liu Y, Yuan H, Zhang Q, Wang Z, Xiong S, Wen N, Zhang Y. Multiple sequence alignment based on deep reinforcement learning with self-attention and positional encoding. Bioinformatics 2023; 39:btad636. [PMID: 37856335 PMCID: PMC10628385 DOI: 10.1093/bioinformatics/btad636] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2022] [Revised: 07/24/2023] [Accepted: 10/17/2023] [Indexed: 10/21/2023]
Abstract
MOTIVATION Multiple sequence alignment (MSA) is an active research topic and is widely used in sequence analysis. However, MSA has no definitive solution because it is an NP-complete problem, and existing methods still leave room for improved accuracy. RESULTS We propose Deep reinforcement learning with Positional encoding and self-Attention for MSA (DPAMSA), a method based on deep reinforcement learning, to enhance alignment accuracy. Specifically, inspired by translation techniques in natural language processing, we introduce self-attention and positional encoding to improve accuracy and reliability. First, positional encoding encodes each position in the sequence to prevent the loss of nucleotide position information. Second, a self-attention model extracts the key features of the sequence. These features are then fed into a multi-layer perceptron, which computes the gap insertion position from them. In addition, a novel reinforcement learning environment converts classic progressive alignment into progressive column alignment, gradually generating a sub-alignment for each column. Finally, the sub-alignments are merged into the complete alignment. Extensive experiments on several datasets validate the method's effectiveness for MSA; it outperforms some state-of-the-art methods in terms of the sum-of-pairs and column scores. AVAILABILITY AND IMPLEMENTATION The method is implemented in Python and available as open-source software from https://github.com/ZhangLab312/DPAMSA.
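The abstract reports results as sum-of-pairs and column scores without defining them. The sketch below shows one common way to compute both for a small nucleotide alignment. It is an illustration only: the scoring values (match +4, mismatch -4, gap -4), the choice to ignore gap-gap pairs, and the function names are assumptions made here, not details taken from the paper or the DPAMSA repository.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=4, mismatch=-4, gap=-4):
    """Sum a pairwise score over every pair of rows in every column.

    The default scores are illustrative assumptions.
    """
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue  # assumption: gap-gap pairs contribute nothing
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


def column_score(alignment):
    """Fraction of columns in which every row holds the same non-gap residue."""
    columns = list(zip(*alignment))
    exact = sum(1 for col in columns if "-" not in col and len(set(col)) == 1)
    return exact / len(columns)


# Toy alignment of three sequences, one string per row.
alignment = ["ACG-T", "ACGAT", "A-GAT"]
print(sum_of_pairs(alignment), column_score(alignment))
```

Benchmark suites often compute these scores against a reference alignment instead; the reference-free form above is only the simplest variant.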
Affiliation(s)
- Yuhang Liu
  - School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
- Hao Yuan
  - School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
- Qiang Zhang
  - School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
- Zixuan Wang
  - College of Electronics and Information Engineering, Sichuan University, Chengdu 610065, China
- Shuwen Xiong
  - School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
- Naifeng Wen
  - School of Mechanical and Electrical Engineering, Dalian Minzu University, Dalian 116600, China
- Yongqing Zhang
  - School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
2
Marriam S, Afghan MS, Nadeem M, Sajid M, Ahsan M, Basit A, Wajid M, Sabri S, Sajid M, Zafar I, Rashid S, Sehgal SA, Alkhalifah DHM, Hozzein WN, Chen KT, Sharma R. Elucidation of novel compounds and epitope-based peptide vaccine design against C30 endopeptidase regions of SARS-CoV-2 using immunoinformatics approaches. Front Cell Infect Microbiol 2023; 13:1134802. [PMID: 37293206 PMCID: PMC10244718 DOI: 10.3389/fcimb.2023.1134802] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/30/2022] [Accepted: 04/29/2023] [Indexed: 06/10/2023]
Abstract
There has been progressive improvement in immunoinformatics approaches for epitope-based peptide design. Computational immunoinformatics approaches were applied to identify epitopes of SARS-CoV-2 for vaccine development. The accessibility of the SARS-CoV-2 protein surface was analyzed: the hexa-peptide sequence KTPKYK, located between amino acids 97 and 102, had the maximum score of 8.254, whereas FSVLAC, at amino acids 112 to 117, showed the lowest score of 0.114. The predicted surface flexibility of the target protein ranged from 0.864, for the hepta-peptide FCYMHHM at amino acids 159 to 165, to 1.099, for YNGSPSG at amino acids 118 to 124. B-cell epitopes and cytotoxic T-lymphocyte (CTL) epitopes were also identified against SARS-CoV-2. In molecular docking analyses, global energies of -0.54 to -26.21 kcal/mol were observed for the selected CTL epitopes, with binding energies of -3.33 to -26.36 kcal/mol. Based on optimization, eight epitopes (SEDMLNPNY, GSVGFNIDY, LLEDEFTPF, DYDCVSFCY, GTDLEGNFY, QTFSVLACY, TVNVLAWLY, and TANPKTPKY) showed reliable findings. The study calculated the HLA alleles associated with MHC-I and MHC-II and found that MHC-I epitopes had higher population coverage (0.9019% and 0.5639%) than MHC-II epitopes, which ranged from 58.49% to 34.71% in Italy and China, respectively. The CTL epitopes were docked with antigenic sites and analyzed with MHC-I HLA protein. In addition, virtual screening was conducted using the ZINC database library, which contained 3,447 compounds. The 10 top-ranked molecules (ZINC222731806, ZINC077293241, ZINC014880001, ZINC003830427, ZINC030731133, ZINC003932831, ZINC003816514, ZINC004245650, ZINC000057255, and ZINC011592639) exhibited the lowest binding energies (-8.8 to -7.5 kcal/mol). The molecular dynamics (MD) and immune simulation data suggest that these epitopes could be used to design an effective peptide-based SARS-CoV-2 vaccine. Our identified CTL epitopes have the potential to inhibit SARS-CoV-2 replication.
Affiliation(s)
- Saigha Marriam
  - Department of Microbiology and Molecular Genetics, Faculty of Life Sciences, University of Okara, Okara, Pakistan
- Muhammad Sher Afghan
  - Department of Ear, Nose, and Throat (ENT), District Headquarter (DHQ) Teaching Hospital Faisalabad, Faisalabad, Punjab, Pakistan
- Mazhar Nadeem
  - Department of Ear, Nose, and Throat (ENT), District Headquarter (DHQ) Teaching Hospital Faisalabad, Faisalabad, Punjab, Pakistan
- Muhammad Sajid
  - Department of Biotechnology, Faculty of Life Sciences, University of Okara, Okara, Pakistan
- Muhammad Ahsan
  - Institute of Environmental and Agricultural Sciences, University of Okara, Okara, Pakistan
- Abdul Basit
  - Department of Microbiology, University of Jhang, Jhang, Pakistan
- Muhammad Wajid
  - Department of Zoology, Faculty of Life Sciences, University of Okara, Okara, Pakistan
- Sabeen Sabri
  - Department of Microbiology and Molecular Genetics, Faculty of Life Sciences, University of Okara, Okara, Pakistan
- Muhammad Sajid
  - Department of Biotechnology, Faculty of Life Sciences, University of Okara, Okara, Pakistan
- Imran Zafar
  - Department of Bioinformatics and Computational Biology, Virtual University, Punjab, Pakistan
- Summya Rashid
  - Department of Pharmacology and Toxicology, College of Pharmacy, Prince Sattam Bin Abdulaziz University, Al-Kharj, Saudi Arabia
- Sheikh Arslan Sehgal
  - Department of Bioinformatics, Faculty of Life Sciences, University of Okara, Okara, Pakistan
  - Department of Bioinformatics, Institute of Biochemistry, Biotechnology and Bioinformatics, The Islamia University of Bahawalpur, Bahawalpur, Pakistan
- Dalal Hussien M Alkhalifah
  - Department of Biology, College of Science, Princess Nourah Bint Abdulrahman University, Riyadh, Saudi Arabia
- Wael N Hozzein
  - Botany and Microbiology Department, Faculty of Science, Beni-Suef University, Beni-Suef, Egypt
- Kow-Tong Chen
  - Department of Occupational Medicine, Tainan Municipal Hospital (managed by ShowChwan Medical Care Corporation), Tainan, Taiwan
  - Department of Public Health, College of Medicine, National Cheng Kung University, Tainan, Taiwan
- Rohit Sharma
  - Department of Rasa Shastra and Bhaishajya Kalpana, Faculty of Ayurveda, Institute of Medical Sciences, Banaras Hindu University, Varanasi, India
3
Kuang M, Zhang Y, Lam TW, Ting HF. MLProbs: A Data-Centric Pipeline for Better Multiple Sequence Alignment. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:524-533. [PMID: 35120007 DOI: 10.1109/tcbb.2022.3148382] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
In this paper, we explore using the data-centric approach to tackle the Multiple Sequence Alignment (MSA) construction problem. Unlike the algorithm-centric approach, which reduces the construction problem to a combinatorial optimization problem based on an abstract mathematical model, the data-centric approach explores using classification models trained from existing benchmark data to guide the construction. We identified two simple classifications to help us choose a better alignment tool and determine whether and how much to carry out realignment. We show that shallow machine-learning algorithms suffice to train sensitive models for these classifications. Based on these models, we implemented a new multiple sequence alignment pipeline, called MLProbs. Compared with 10 other popular alignment tools over four benchmark databases (namely, BAliBASE, OXBench, OXBench-X and SABMark), MLProbs consistently gives the highest TC score. More importantly, MLProbs shows non-trivial improvement for protein families with low similarity; in particular, when evaluated against the 1,356 protein families with similarity ≤ 50%, MLProbs achieves a TC score of 56.93, while the next best three tools are in the range of [55.41, 55.91] (increased by more than 1.8%). We also compared the performance of MLProbs and other MSA tools in two real-life applications - Phylogenetic Tree Construction Analysis and Protein Secondary Structure Prediction - and MLProbs also had the best performance. In our study, we used only shallow machine-learning algorithms to train our models. It would be interesting to study whether deep-learning methods can help make further improvements, so we suggest some possible research directions in the conclusion section.
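The "more than 1.8%" figure above can be checked with a short calculation. The sketch below reads it as the relative gain of MLProbs over the next-best tools; that reading is an assumption made here, and the helper name `relative_gain` is hypothetical.

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


mlprobs_tc = 56.93                # TC score reported for MLProbs
next_best_range = (55.41, 55.91)  # reported range of the next three tools

for tc in next_best_range:
    print(f"vs {tc}: +{relative_gain(mlprobs_tc, tc):.2f}%")
```

Under this reading, the smallest gain is the one over the strongest competitor (55.91), which is the figure the abstract appears to quote.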
4
Hubley R, Wheeler TJ, Smit AFA. Accuracy of multiple sequence alignment methods in the reconstruction of transposable element families. NAR Genom Bioinform 2022; 4:lqac040. [PMID: 35591887 PMCID: PMC9112768 DOI: 10.1093/nargab/lqac040] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/18/2021] [Revised: 03/29/2022] [Accepted: 04/29/2022] [Indexed: 02/06/2023]
Abstract
The construction of a high-quality multiple sequence alignment (MSA) from copies of a transposable element (TE) is a critical step in the characterization of a new TE family. Most studies of MSA accuracy have been conducted on protein or RNA sequence families, where structural features and strong signals of selection may assist with alignment. Less attention has been given to the quality of sequence alignments involving neutrally evolving DNA sequences such as those resulting from TE replication. Transposable element sequences are challenging to align due to their wide divergence ranges, fragmentation, and predominantly neutral mutation patterns. To gain insight into the effects of these properties on MSA accuracy, we developed a simulator of TE sequence evolution, and used it to generate a benchmark with which we evaluated the MSA predictions produced by several popular aligners, along with Refiner, a method we developed in the context of our RepeatModeler software. We find that MAFFT and Refiner generally outperform other aligners for low to medium divergence simulated sequences, while Refiner is uniquely effective when tasked with aligning highly divergent and fragmented instances of a family.
Affiliation(s)
- Robert Hubley
  - Institute for Systems Biology, Seattle, WA 98109, USA
- Travis J Wheeler
  - Department of Computer Science, University of Montana, Missoula, MT 59801, USA
5
Shrestha B, Adhikari B. Scoring protein sequence alignments using deep learning. Bioinformatics 2022; 38:2988-2995. [PMID: 35385080 DOI: 10.1093/bioinformatics/btac210] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2021] [Revised: 04/01/2022] [Accepted: 04/05/2022] [Indexed: 11/12/2022]
Abstract
BACKGROUND A high-quality sequence alignment (SA) is the most important input feature for accurate protein structure prediction. For a protein sequence, there are many methods to generate an SA. However, when more than one SA is available for a protein sequence, there are no methods to predict which SA may lead to more accurate models without actually building the models. In this work, we describe a method to predict the quality of a protein's SA. METHODS We created our own dataset by generating a variety of SAs for a set of 1,351 representative proteins and investigated various deep learning architectures to predict the local distance difference test (lDDT) scores of distance maps predicted with SAs as the input. These lDDT scores serve as indicators of the quality of the SAs. RESULTS Using two independent test datasets consisting of CASP13 and CASP14 targets, we show that our method is effective for scoring and ranking SAs when a pool of SAs is available for a protein sequence. With an example, we further discuss how SA selection using our method can lead to improved structure prediction. AVAILABILITY Code and datasets are available at https://github.com/ba-lab/Alignment-Score/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Bikash Shrestha
  - Department of Computer Science, University of Missouri-St. Louis, St. Louis, MO 63132, USA
- Badri Adhikari
  - Department of Computer Science, University of Missouri-St. Louis, St. Louis, MO 63132, USA
6
Zhang Y, Zhang Q, Zhou J, Zou Q. A survey on the algorithm and development of multiple sequence alignment. Brief Bioinform 2022; 23:6546258. [PMID: 35272347 DOI: 10.1093/bib/bbac069] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 12/09/2021] [Revised: 01/30/2022] [Accepted: 02/09/2022] [Indexed: 12/21/2022]
Abstract
Multiple sequence alignment (MSA) is an essential cornerstone in bioinformatics, which can reveal the potential information in biological sequences, such as function, evolution and structure. MSA is widely used in many bioinformatics scenarios, such as phylogenetic analysis, protein analysis and genomic analysis. However, MSA faces new challenges with the gradual increase in sequence scale and the increasing demand for alignment accuracy. Therefore, developing an efficient and accurate strategy for MSA has become one of the research hotspots in bioinformatics. In this work, we mainly summarize the algorithms for MSA and its applications in bioinformatics. To provide a structured and clear perspective, we systematically introduce MSA's knowledge, including background, database, metric and benchmark. Besides, we list the most common applications of MSA in the field of bioinformatics, including database searching, phylogenetic analysis, genomic analysis, metagenomic analysis and protein analysis. Furthermore, we categorize and analyze classical and state-of-the-art algorithms, divided into progressive alignment, iterative algorithm, heuristics, machine learning and divide-and-conquer. Moreover, we also discuss the challenges and opportunities of MSA in bioinformatics. Our work provides a comprehensive survey of MSA applications and their relevant algorithms. It could bring valuable insights for researchers to contribute their knowledge to MSA and relevant studies.
Affiliation(s)
- Yongqing Zhang
  - School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
  - School of Computer Science and Engineering, University of Electronic Science and Technology of China, 611731, Chengdu, China
- Qiang Zhang
  - School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Jiliu Zhou
  - School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, 610054, Chengdu, China
7
Tati S, Alisaraie L. Analysis of the Structural Mechanism of ATP Inhibition at the AAA1 Subunit of Cytoplasmic Dynein-1 Using a Chemical "Toolkit". Int J Mol Sci 2021; 22:ijms22147704. [PMID: 34299323 PMCID: PMC8304172 DOI: 10.3390/ijms22147704] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/13/2021] [Revised: 07/11/2021] [Accepted: 07/14/2021] [Indexed: 11/28/2022]
Abstract
Dynein is a ~1.2 MDa cytoskeletal motor protein that carries organelles via retrograde transport in eukaryotic cells. The motor protein belongs to the family of ATPases associated with diverse cellular activities (AAA+) and plays a critical role in transporting cargoes to the minus end of the microtubules. The motor domain of dynein possesses a hexameric head, where ATP hydrolysis occurs. The presented work analyzes the structure–activity relationship (SAR) of dynapyrazole A and B, as well as ciliobrevin A and D, in their various protonated states, and of their 46 analogues, for binding in the AAA1 subunit, the leading ATP hydrolytic site of the motor domain. This study uses in silico methods to examine the analogues’ effects on the functionally essential subsites of the motor domain of dynein 1, since no similar experimental structural data are available. Ciliobrevin and its analogues bind to the ATP motifs of AAA1, namely the Walker-A (W-A, or P-loop), the Walker-B (W-B), and the sensor I and II motifs. Ciliobrevin A shows a better binding affinity than its D analogue. Although the double bond in ciliobrevin A and D was expected to decrease ligand potency, these compounds show a better affinity for the AAA1 binding site than dynapyrazole A and B, which lack the bond. In addition, protonation of the nitrogen atom at the N9 site of ciliobrevin A and D and at the N7 site of dynapyrazole A and B increased their binding affinity. Exploring the geometrical configuration of ciliobrevin A suggests that the E isomer has a superior binding profile over the Z isomer due to binding at the critical ATP motifs. The refined structure of the motor domain, obtained through a protein conformational search in this study, indicates that Arg1852 of yeast cytoplasmic dynein could be involved in the “glutamate switch” mechanism in cytoplasmic dynein 1 in lieu of the Asn conserved in the AAA+ protein family.
8
Abadi S, Avram O, Rosset S, Pupko T, Mayrose I. ModelTeller: Model Selection for Optimal Phylogenetic Reconstruction Using Machine Learning. Mol Biol Evol 2021; 37:3338-3352. [PMID: 32585030 DOI: 10.1093/molbev/msaa154] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Indexed: 02/03/2023]
Abstract
Statistical criteria have long been the standard for selecting the best model for phylogenetic reconstruction and downstream statistical inference. Although model selection is regarded as a fundamental step in phylogenetics, existing methods for this task consume computational resources for long processing time, they are not always feasible, and sometimes depend on preliminary assumptions which do not hold for sequence data. Moreover, although these methods are dedicated to revealing the processes that underlie the sequence data, they do not always produce the most accurate trees. Notably, phylogeny reconstruction consists of two related tasks, topology reconstruction and branch-length estimation. It was previously shown that in many cases the most complex model, GTR+I+G, leads to topologies that are as accurate as using existing model selection criteria, but overestimates branch lengths. Here, we present ModelTeller, a computational methodology for phylogenetic model selection, devised within the machine-learning framework, optimized to predict the most accurate nucleotide substitution model for branch-length estimation. We demonstrate that ModelTeller leads to more accurate branch-length inference than current model selection criteria on data sets simulated under realistic processes. ModelTeller relies on a readily implemented machine-learning model and thus the prediction according to features extracted from the sequence data results in a substantial decrease in running time compared with existing strategies. By harnessing the machine-learning framework, we distinguish between features that mostly contribute to branch-length optimization, concerning the extent of sequence divergence, and features that are related to estimates of the model parameters that are important for the selection made by current criteria.
Affiliation(s)
- Shiran Abadi
  - School of Plant Sciences and Food security, Tel-Aviv University, Tel-Aviv, Israel
- Oren Avram
  - School of Molecular Cell Biology & Biotechnology, Tel-Aviv University, Tel-Aviv, Israel
- Saharon Rosset
  - Department of Statistics and Operations Research, School of Mathematical Sciences, Tel-Aviv University, Tel-Aviv, Israel
- Tal Pupko
  - School of Molecular Cell Biology & Biotechnology, Tel-Aviv University, Tel-Aviv, Israel
- Itay Mayrose
  - School of Plant Sciences and Food security, Tel-Aviv University, Tel-Aviv, Israel
9
N H, P SR, Sura M, Daddam JR. Structure prediction, molecular simulations of RmlD from Mycobacterium tuberculosis, and interaction studies of Rhodanine derivatives for anti-tuberculosis activity. J Mol Model 2021; 27:75. [PMID: 33547544 DOI: 10.1007/s00894-021-04696-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 01/13/2020] [Accepted: 01/26/2021] [Indexed: 12/14/2022]
Abstract
Tuberculosis causes more deaths than any other disease caused by a single infectious agent. Because of multidrug resistance in Mycobacterium tuberculosis strains, new drugs and drug targets are needed. In this work, we selected RmlD (α-dTDP-6-deoxy-lyxo-4-hexulose reductase) in the dTDP-rhamnose pathway as a drug target for controlling tuberculosis with Rhodanine analogues. To study the interaction of RmlD with Rhodanine analogues, a three-dimensional model based on crystal structures such as 1VLO from Clostridium, 1KBZ from Salmonella typhimurium, and 2GGS from Sulfolobus was generated using Modeller 9v7. The reliability of the modeled structure was checked using programs such as PROCHECK, WHAT IF, ProSA, Verify 3D, and ERRAT. To find new inhibitors of the RmlD enzyme, docking studies were performed with Rhodanine and a series of its analogues. Detailed analysis of enzyme-inhibitor interactions identified specific key residues, SER5, VAL9, ILE51, HIS54, and GLY55, that form hydrogen bonds important for binding affinity. Homology modeling and docking studies on the RmlD model provided valuable insight for the rational design of better inhibitors as novel anti-tuberculosis drugs.
Affiliation(s)
- Harathi N
  - Department of Biochemistry, G. Pulla Reddy Dental College, Kurnool, India
- Sreenivasa Reddy P
  - Department of Oral and Maxillofacial Surgery, G. Pulla Reddy Dental College & Hospital, Kurnool, 518002, India
- Mounica Sura
  - Department of Food Technology, Jawaharlal Nehru Technological University Anantapur, Anantapur, 515001, India
- Jayasimha Rayalu Daddam
  - Cardiovascular and Mitochondria Related Diseases Research Center, Hualien Tzu Chi Hospital, Buddhist Tzu Chi Medical Foundation, Hualien, Taiwan
10
Papathoti NK, Saengchan C, Daddam JR, Thongprom N, Tonpho K, Thanh TL, Buensanteai N. Plant systemic acquired resistance compound salicylic acid as a potent inhibitor against SCF (SKP1-CUL1-F-box protein) mediated complex in Fusarium oxysporum by homology modeling and molecular dynamics simulations. J Biomol Struct Dyn 2020; 40:1472-1479. [PMID: 33047664 DOI: 10.1080/07391102.2020.1828168] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 02/05/2023]
Abstract
Fusarium oxysporum causes significant economic losses in many crop plants by causing root rot, necrosis, and wilting symptoms. Homology modeling and molecular dynamics studies are promising tools for evaluating, in F. oxysporum, the systemic resistance compound salicylic acid for control of the SKP1-CUL1-F-box protein complex. The structure of the SKP1-CUL1-F-box subunit Skp1 from F. oxysporum was built with Modeller 9v7 for docking studies. The Skp1 model is based on the crystal structure of yeast Cdc4/Skp1 (PDB ID: 3MKS A) from the Protein Data Bank. The final predicted structure was subjected to molecular dynamics simulation and further evaluated with the 3D and PROCHECK test programs, which verified the accuracy of the final model. Applying GOLD 3.0.1, SCF complex Skp1 is used to prevent stress-tolerant operation. The predicted model of the SCF structure was stabilized and confirmed to be a reliable structure for docking studies. The results indicated that GLN8, LYS9, VAL10, TRP11, GLU48, and ASN49 in the SCF complex are important determinant residues in binding, as they form strong hydrogen bonds with salicylic acid, which showed the best docking results with the SKP1-CUL1-F-box complex subunit Skp1, with a docking score of 25.25 kJ/mol. In silico studies were used to determine the mode of action of salicylic acid for Fusarium control. Salicylic acid hinders the SKP1-CUL1-F-box complex, which is important in protein-like interactions, through hydrogen bonding. Docking results showed that salicylic acid had the best energy for SKP1-CUL1-F-box. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Narendra Kumar Papathoti
  - School of Crop Production Technology, Institute of Agricultural Technology, Suranaree University of Technology, Nakhon Ratchasima, Thailand
- Chanon Saengchan
  - School of Crop Production Technology, Institute of Agricultural Technology, Suranaree University of Technology, Nakhon Ratchasima, Thailand
- Jayasimha Rayulu Daddam
  - Department of Cardiovascular and Mitochondrial Related Disease Research Center, Hualien Tzu Chi Hospital, Hualien, Taiwan
- Nattaya Thongprom
  - School of Crop Production Technology, Institute of Agricultural Technology, Suranaree University of Technology, Nakhon Ratchasima, Thailand
- Kodchaphon Tonpho
  - School of Crop Production Technology, Institute of Agricultural Technology, Suranaree University of Technology, Nakhon Ratchasima, Thailand
- Toan Le Thanh
  - Crop Protection Department, College of Agriculture, Can Tho University, Can Tho city, Vietnam
- Natthiya Buensanteai
  - School of Crop Production Technology, Institute of Agricultural Technology, Suranaree University of Technology, Nakhon Ratchasima, Thailand
11
Trivedi R, Nagarajaram HA. Substitution scoring matrices for proteins - An overview. Protein Sci 2020; 29:2150-2163. [PMID: 32954566 DOI: 10.1002/pro.3954] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 08/17/2020] [Revised: 09/17/2020] [Accepted: 09/18/2020] [Indexed: 01/17/2023]
Abstract
Sequence analysis is the primary and simplest approach to discovering structural, functional and evolutionary details of related proteins. All alignment-based approaches to sequence analysis make use of amino acid substitution matrices, and the accuracy of the results largely depends on the type of scoring matrix used to perform the alignment. An amino acid substitution matrix is a 20 × 20 matrix in which the individual elements encapsulate the rates at which each of the 20 amino acid residues in proteins is substituted by other amino acid residues over time. In contrast to most globular/ordered proteins, whose amino acid composition is considered standard, there are several classes of proteins (e.g., transmembrane proteins) in which certain types of amino acid (e.g., hydrophobic residues) are enriched. These compositional differences among various classes of proteins are manifested in their underlying residue substitution frequencies. Therefore, each compositionally distinct class of proteins or protein segments should be studied using specific scoring matrices that reflect its distinct residue substitution pattern. In this review, we describe the development and application of various substitution scoring matrices specific to proteins with standard and biased compositions. Along with the most commonly used standard matrices (PAM, BLOSUM, MD and VTML), which act as default parameters in various homolog search and alignment tools, different substitution scoring matrices specific to compositionally distinct classes of proteins are discussed in detail.
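To make the idea of a substitution score concrete, the sketch below derives log-odds scores from counts of aligned residue pairs, the general construction behind BLOSUM-style matrices. It is a toy illustration: the two-letter alphabet, the counts, the scale factor of 2, and the function name `log_odds_scores` are all assumptions chosen here, not values or methods from the review.

```python
import math
from collections import defaultdict


def log_odds_scores(pair_counts, scale=2.0):
    """Turn counts of aligned residue pairs into rounded log-odds scores.

    `pair_counts` maps an unordered pair (a, b), stored once, to its count.
    Every count must be positive.
    """
    total = sum(pair_counts.values())
    # Observed frequency of each unordered pair.
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue, implied by the pair frequencies.
    p = defaultdict(float)
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        # Frequency expected if residues paired up by chance alone.
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(scale * math.log2(freq / expected))
    return scores


# Toy counts over a two-letter alphabet (illustrative only).
toy_counts = {("A", "A"): 50, ("A", "B"): 20, ("B", "B"): 30}
print(log_odds_scores(toy_counts))
```

Pairs seen more often than chance predicts get positive scores and rarer pairs get negative ones, which is why a matrix built from a compositionally biased protein class can differ from a standard one.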
Affiliation(s)
- Rakesh Trivedi
  - Laboratory of Computational Biology, Centre for DNA Fingerprinting and Diagnostics, Uppal, Hyderabad, Telangana, India
  - Graduate School, Manipal Academy of Higher Education, Manipal, Karnataka, India
- Hampapathalu Adimurthy Nagarajaram
  - Laboratory of Computational Biology, Department of Systems and Computational Biology, School of Life Sciences, University of Hyderabad, Hyderabad, Telangana, India
  - Centre for Modelling, Simulation and Design, University of Hyderabad, Hyderabad, Telangana, India
12
Zhan Q, Fu Y, Jiang Q, Liu B, Peng J, Wang Y. SpliVert: A Protein Multiple Sequence Alignment Refinement Method Based on Splitting-Splicing Vertically. Protein Pept Lett 2020; 27:295-302. [PMID: 31385760 DOI: 10.2174/0929866526666190806143959] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 04/06/2019] [Revised: 04/26/2019] [Accepted: 06/14/2019] [Indexed: 11/22/2022]
Abstract
BACKGROUND Multiple Sequence Alignment (MSA) is a fundamental task in bioinformatics and is required for many biological analysis tasks. The more accurate the alignments are, the more credible the downstream analyses. Most protein MSA algorithms refine an alignment by dividing it horizontally into two groups and then realigning the two groups. However, this strategy does not consider that different regions of the sequences have different conservation; this property may lead to incorrect residue-residue or residue-gap pairs, which cannot be corrected by this strategy. OBJECTIVE In this article, our motivation is to develop a novel refinement method based on splitting-splicing vertically. METHODS Here, we present a novel refinement method based on splitting-splicing vertically, called SpliVert. For an alignment, we split it vertically into 3 parts, remove the gap characters in the middle part, realign the middle part alone, and splice the realigned middle part with the other two initial pieces to obtain a refined alignment. In the realignment procedure of our method, the aligner focuses only on a certain part, ignoring the disturbance of the other parts, which could help fix the incorrect pairs. RESULTS We tested our refinement strategy for 2 leading MSA tools on 3 standard benchmarks, according to the commonly used average SP (and TC) score. The results show that, given appropriate proportions for splitting the initial alignment, the average scores are comparable or slightly increased after using our method. We also compared the alignments refined by our method with alignments directly refined by the original alignment tools. The results suggest that using our SpliVert method to refine alignments can also outperform direct use of the original alignment tools. CONCLUSION The results reveal that splitting vertically and realigning part of the alignment is a good strategy for the refinement of protein multiple sequence alignments.
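The split, strip, realign, and splice steps described in METHODS can be sketched as follows. This is a minimal illustration under stated assumptions, not the SpliVert implementation: the function names are hypothetical, the cut columns are passed in by hand, and `toy_realign` is a stand-in that only left-justifies and pads with gaps where a real pipeline would call an MSA tool.

```python
def split_vertically(alignment, left, right):
    """Cut every row at columns `left` and `right` into three column blocks."""
    head = [row[:left] for row in alignment]
    middle = [row[left:right] for row in alignment]
    tail = [row[right:] for row in alignment]
    return head, middle, tail


def toy_realign(sequences):
    """Hypothetical stand-in aligner: left-justify and pad with gaps."""
    width = max(len(s) for s in sequences)
    return [s.ljust(width, "-") for s in sequences]


def refine_middle(alignment, left, right, realign=toy_realign):
    """Realign only the middle block, then splice it back between the others."""
    head, middle, tail = split_vertically(alignment, left, right)
    ungapped = [row.replace("-", "") for row in middle]  # drop gap characters
    realigned = realign(ungapped)
    return [h + m + t for h, m, t in zip(head, realigned, tail)]


alignment = ["AC-G-TTA", "ACAG--TA", "AC-GC-TA"]
refined = refine_middle(alignment, left=2, right=6)
print(refined)
```

Because the outer blocks are untouched and the middle block is realigned as a unit, the spliced rows stay equal in length and each row still spells its original sequence once gaps are removed.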
Affiliation(s)
- Qing Zhan
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Yilei Fu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Qinghua Jiang
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Bo Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Jiajie Peng
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China
| | - Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
|
13
|
Heale KA, Alisaraie L. C-terminal Tail of β-Tubulin and its Role in the Alterations of Dynein Binding Mode. Cell Biochem Biophys 2020; 78:331-345. [PMID: 32462384 PMCID: PMC10020315 DOI: 10.1007/s12013-020-00920-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/31/2020] [Accepted: 05/18/2020] [Indexed: 12/25/2022]
Abstract
Dynein is a cytoskeletal molecular motor protein that moves along the microtubule (MT) and transports various cellular cargos during its movement. Using standard Molecular Dynamics (MD) simulation, Principal Component Analysis (PCA), and Normal Mode Analysis (NMA) methods, this investigation studied large-scale movements and local interactions of dynein's Microtubule Binding Domain (MTBD) when bound to tubulin heterodimer subunits. Examination of the interactions between the MTBD segments, and their adjustments in terms of intra- and intermolecular distances at the interfacial area with tubulin heterodimer, particularly at α-H16, β-H18, and β-tubulin C-terminal tail (CTT), was the main focus of this study. The specific intramolecular interactions, electrostatic forces, and the salt bridge residue pairs were shown to be the dominating factors in orchestrating movements of the MTBD and MT interfacial segments in the dynein's low-high-affinity binding modes. Important interactions included β-Glu447 and β-Glu449 (CTT) with Arg3469 (MTBD-H6), Lys3472 (MTBD-H6-H7 loop) and Lys3479 (MTBD-H7); β-Glu449 with Lys3384 (MTBD-H8), Lys3386 and His3387 (MTBD-H1). The structural and precise position, orientation, and functional effects of the CTTs on the MT-MTBD, within reasonable cut-off distance for non-bonding interactions and under physiological conditions, are unavailable from previous studies. The absence of the residues in the highly flexible MT-CTTs in the experimentally solved structures is perhaps in some cases due to insufficient data from density maps, but these segments are crucial in protein binding. The presented work contributes to the information useful for the MT-MTBD structure refinement.
Affiliation(s)
- Kali A Heale
- School of Pharmacy, Memorial University of Newfoundland, 300 Prince Philip Dr., St. John's, NL, A1B 3V6, Canada
| | - Laleh Alisaraie
- School of Pharmacy, Memorial University of Newfoundland, 300 Prince Philip Dr., St. John's, NL, A1B 3V6, Canada.
|
14
|
Daddam JR, Sreenivasulu B, Umamahesh K, Peddanna K, Rao DM. In Silico Studies on Anti-Stress Compounds of Ethanolic Root Extract of Hemidesmus indicus L. Curr Pharm Biotechnol 2020; 21:502-515. [PMID: 31823700 DOI: 10.2174/1389201021666191211152754] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 11/01/2019] [Revised: 11/25/2019] [Accepted: 11/25/2019] [Indexed: 12/17/2022]
Abstract
BACKGROUND Alternative medicine is available for those diseases which cannot be treated by conventional medicine. Ayurveda and herbal medicines are important alternative methods in which the treatment is done with extracts of different medicinal plants. This work is concerned with the evaluation of anti-stress bioactive compounds from the ethanolic root extract of Hemidesmus indicus. METHODS Gas chromatography and mass spectrum studies are used to identify the compounds present in the ethanolic extract based on retention time and area. In order to perform docking studies, a Vasopressin model is generated with Modeller 9v7. The Vasopressin structure is developed based on the crystal structure of neurophysin-oxytocin from Bos taurus (PDB ID: 1NPO_A) collected from the PDB data bank. Using molecular dynamics simulation methods, the final predicted structure is obtained and further analyzed with the VERIFY3D and PROCHECK programs, which confirmed that the final model is reliable. The identified compounds are docked to vasopressin for the prediction of anti-stress activity using GOLD 3.0.1 software. RESULTS The predicted model of the Vasopressin structure is stable and is confirmed to be a reliable structure for docking studies. The results indicated that ARG4, THR7, ASP9, ASP26, ALA32 and ALA80 in Vasopressin are important determinant residues in binding, as they have strong hydrogen bonding with phytocompounds. Among the 21 phytocompounds identified and docked, the molecule Deoxiinositol, pentakis-O-(trimethylsilyl) showed the best docking results with Vasopressin. CONCLUSION The identified compounds were evaluated for anti-stress activity in silico against Vasopressin, which plays an important role in causing stress and was hence selected for inhibitory studies with phytocompounds. The phytocompounds inhibit vasopressin through hydrogen bonding, which is important in protein-ligand interactions. Docking results showed that, out of twenty-one compounds, Deoxiinositol, pentakis-O-(trimethylsilyl) showed the best docking energy with Vasopressin.
Affiliation(s)
- Jayasimha R Daddam
- Department of Biotechnology, JNTUA, Anantapur, Andhra Pradesh 515 002, India
| | - Basha Sreenivasulu
- Department of Microbiology, Sri Venkateswara University, Tirupati, Andhra Pradesh 517 502, India.,Department of Biological Sciences, University of Arkansas, Arkansas, Fayetteville AR 72701, United States
| | - Katike Umamahesh
- Department of Biochemistry, Sri Venkateswara University, Tirupati, Andhra Pradesh 517 502, India.,Cardiovascular and Mitochondrial Related Diseased Research Center, Hualien Tzu Chi Hospital, Buddhist Tzu Chi Medical Foundation, Hualien 970, Taiwan
| | - Kotha Peddanna
- Department of Biochemistry, Sri Venkateswara University, Tirupati, Andhra Pradesh 517 502, India.,School of Chinese Medicine, College of Chinese Medicine, China Medical University, Taichung 404, Taiwan
| | - Dowlathabad M Rao
- Department of Biotechnology, Sri Krishnadevaraya University, Anantapur, Andhra Pradesh 515 003, India
|
15
|
Carpentier M, Chomilier J. Protein multiple alignments: sequence-based versus structure-based programs. Bioinformatics 2020; 35:3970-3980. [PMID: 30942864 DOI: 10.1093/bioinformatics/btz236] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 10/04/2018] [Revised: 03/05/2019] [Accepted: 04/02/2019] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Multiple sequence alignment programs have proved to be very useful and have already been evaluated in the literature, but alignment programs based on structure, or on both sequence and structure, have not. In the present article we wish to evaluate the added value provided through considering structures. RESULTS We compared the multiple alignments resulting from 25 programs, based on sequence, structure or both, to reference alignments deposited in five databases (BALIBASE 2 and 3, HOMSTRAD, OXBENCH and SISYPHUS). On the whole, the structure-based methods compute more reliable alignments than the sequence-based ones, and even than the sequence+structure-based programs, whatever the database. Two programs lead, MAMMOTH and MATRAS; nevertheless, the performances of MUSTANG, MATT, 3DCOMB, TCOFFEE+TM_ALIGN and TCOFFEE+SAP are better for some alignments. The advantage of structure-based methods increases at low levels of sequence identity, and for residues in regular secondary structures or buried ones. Concerning gap management, sequence-based programs set fewer gaps than structure-based programs. Concerning the databases, the alignments of the manually built databases are more challenging for the programs. AVAILABILITY AND IMPLEMENTATION All data and results presented in this study are available at: http://wwwabi.snv.jussieu.fr/people/mathilde/download/AliMulComp/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mathilde Carpentier
- Institut Systématique Evolution Biodiversité (ISYEB), Sorbonne Université, MNHN, CNRS, EPHE, Paris, France
| | - Jacques Chomilier
- Sorbonne Université, MNHN, CNRS, IRD, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie (IMPMC), BiBiP, Paris, France
|
16
|
Nwaiwu O, Aduba CC. An in silico analysis of acquired antimicrobial resistance genes in Aeromonas plasmids. AIMS Microbiol 2020; 6:75-91. [PMID: 32226916 PMCID: PMC7099201 DOI: 10.3934/microbiol.2020005] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 02/12/2020] [Accepted: 03/13/2020] [Indexed: 12/17/2022] Open
Abstract
Sequences of 105 Aeromonas species plasmids were probed for acquired anti-microbial resistance (AMR) genes using a bioinformatics approach. The plasmids showed no positive linear correlation between size and GC content and up to 55 acquired AMR genes were found in 39 (37%) plasmids after in silico screening for resistance against 15 antibiotic drug classes. Overall, potential multiple antibiotic resistance (p-MAR) index ranged from 0.07 to 0.53. Up to 18 plasmids were predicted to mediate multiple drug resistance (MDR). Plasmids pS121-1a (A. salmonicida), pWCX23_1 (A. hydrophila) and pASP-a58 (A. veronii) harboured 18, 15 and 14 AMR genes respectively. The five most occurring drug classes for which AMR genes were detected were aminoglycosides (27%), followed by beta-lactams (17%), sulphonamides (13%), fluoroquinolones (13%), and phenicols (10%). The most prevalent genes were a sulphonamide resistant gene Sul1, the gene aac (6')-Ib-cr (aminoglycoside 6'-N-acetyl transferase type Ib-cr) resistant to aminoglycosides and the blaKPC-2 gene, which encodes carbapenemase-production. Plasmid acquisition of AMR genes was mainly inter-genus rather than intra-genus. Eighteen plasmids showed template or host genes acquired from Pseudomonas monteilii, Salmonella enterica or Escherichia coli. The most occurring antimicrobial resistance determinants (ARDs) were beta-lactamase, followed by aminoglycosides acetyl-transferases, and then efflux pumps. Screening of new isolates in vitro and in vivo is required to ascertain the level of phenotypic expression of colistin and other acquired AMR genes detected.
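The abstract does not state how the p-MAR index is computed. A minimal sketch, assuming it follows the usual multiple antibiotic resistance ratio (number of drug classes with at least one detected AMR gene divided by the number of classes screened), is shown below. The function name `p_mar_index` and its parameters are hypothetical. Under this assumption, the reported range of 0.07 to 0.53 would correspond to 1 and 8 of the 15 screened classes.

```python
def p_mar_index(classes_with_genes: int, classes_screened: int = 15) -> float:
    """Assumed definition: fraction of screened drug classes with a detected AMR gene."""
    if classes_screened <= 0:
        raise ValueError("classes_screened must be positive")
    if not 0 <= classes_with_genes <= classes_screened:
        raise ValueError("classes_with_genes must lie between 0 and classes_screened")
    return classes_with_genes / classes_screened


if __name__ == "__main__":
    # Print the index for the two class counts matching the reported range.
    for n in (1, 8):
        print(n, round(p_mar_index(n), 2))
```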
Affiliation(s)
- Ogueri Nwaiwu
- School of Biosciences, University of Nottingham, Sutton Bonington Campus, United Kingdom
| | - Chiugo Claret Aduba
- Department of Science Laboratory Technology, University of Nigeria, Nsukka, Nigeria
|
17
|
Daddam JR, Sreenivasulu B, Peddanna K, Umamahesh K. Designing, docking and molecular dynamics simulation studies of novel cloperastine analogues as anti-allergic agents: homology modeling and active site prediction for the human histamine H1 receptor. RSC Adv 2020; 10:4745-4754. [PMID: 35495246 PMCID: PMC9049021 DOI: 10.1039/c9ra09245e] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 11/07/2019] [Accepted: 01/09/2020] [Indexed: 11/21/2022] Open
Abstract
The present study predicts a three-dimensional model for the histamine H1 receptor and the design of antihistamine inhibitors using cloperastine as the core molecule by docking studies. In this work, we predicted a three-dimensional structure of the histamine H1 receptor using the MODELLER9V7 software. The protein structure was developed based on the crystal structure of the histamine H1 receptor, the lysozyme chimera of Escherichia virus T4 (PDB ID: 3RZE_A) target collected from the PDB data bank. Using molecular dynamics simulation methods, the final predicted structure is obtained and further analyzed by VERIFY3D and PROCHECK programs, confirming that the final model is reliable. The drug derivatives of cloperastine were designed and docking was performed with the designed ligands along with the drug. The predicted model of the histamine H1 receptor structure is stable and confirms that it is a reliable structure for docking studies. The results indicate that MET 183, THR 184 and ILE 187 in the histamine H1 receptor are important determinant residues for binding as they have strong hydrogen bonding with cloperastine derivatives. The drug derivatives were docked to the histamine H1 receptor protein by hydrogen bonding interactions and these interactions played an important role in the binding studies. The molecule 1-{2-[(4-chlorophenyl) (phenyl) methoxy] ethyl}-4-methylenepiperidine showed the best docking results with the histamine H1 receptor. The docking results predicted the best compounds, which may act as better drugs than cloperastine and in the future, these may be developed for anti-allergy therapy.
Affiliation(s)
| | - Basha Sreenivasulu
- Department of Microbiology, Sri Venkateswara University, Tirupati, India-517502
- Department of Biological Sciences
| | - Kotha Peddanna
- Department of Biochemistry, Sri Venkateswara University, Tirupati, India-517502
- School of Chinese Medicine
| | - Katike Umamahesh
- Department of Biochemistry, Sri Venkateswara University, Tirupati, India-517502
|
18
|
Zhan Q, Wang N, Jin S, Tan R, Jiang Q, Wang Y. ProbPFP: a multiple sequence alignment algorithm combining hidden Markov model optimized by particle swarm optimization with partition function. BMC Bioinformatics 2019; 20:573. [PMID: 31760933 PMCID: PMC6876095 DOI: 10.1186/s12859-019-3132-7] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND When conducting multiple sequence alignment, it is essential to use the substitution scores of pairwise alignments. To compute adaptive scores for alignment, researchers usually use a hidden Markov model (HMM) or probabilistic consistency methods such as the partition function. Recent studies show that optimizing the parameters of the hidden Markov model, as well as integrating the hidden Markov model with the partition function, can raise the accuracy of alignment. However, the combination of the partition function and an optimized HMM, which could further improve alignment accuracy, was ignored by these studies. RESULTS A novel algorithm for MSA called ProbPFP is presented in this paper. It integrates an HMM optimized by particle swarm optimization (PSO) with the partition function. The PSO algorithm was applied to optimize the HMM's parameters. After that, the posterior probability obtained by the HMM was combined with the one obtained by the partition function to calculate an integrated substitution score for alignment. In order to evaluate the effectiveness of ProbPFP, we compared it with 13 outstanding or classic MSA methods. The results demonstrate that the alignments obtained by ProbPFP got the maximum mean TC scores and mean SP scores on two benchmark datasets, SABmark and OXBench, and the second highest mean TC scores and mean SP scores on the benchmark dataset BAliBASE. ProbPFP is also compared with 4 other outstanding methods by reconstructing the phylogenetic trees for six protein families extracted from the database TreeFam, based on the alignments obtained by these 5 methods. The result indicates that the reference trees are closer to the phylogenetic trees reconstructed from the alignments obtained by ProbPFP than to those of the other methods. CONCLUSIONS We propose a new multiple sequence alignment method combining an optimized HMM and the partition function in this paper. The performance validates that this method could make a great improvement in alignment accuracy.
Affiliation(s)
- Qing Zhan
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
| | - Nan Wang
- Department of Mathematics, Harbin Institute of Technology, Harbin, 150001, China
| | - Shuilin Jin
- Department of Mathematics, Harbin Institute of Technology, Harbin, 150001, China
| | - Renjie Tan
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
| | - Qinghua Jiang
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
| | - Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China.
|
19
|
Unified rational protein engineering with sequence-based deep representation learning. Nat Methods 2019; 16:1315-1322. [PMID: 31636460 DOI: 10.1038/s41592-019-0598-1] [Citation(s) in RCA: 438] [Impact Index Per Article: 87.6] [Received: 04/08/2019] [Accepted: 09/11/2019] [Indexed: 01/03/2023]
Abstract
Rational protein engineering requires a holistic understanding of protein function. Here, we apply deep learning to unlabeled amino-acid sequences to distill the fundamental features of a protein into a statistical representation that is semantically rich and structurally, evolutionarily and biophysically grounded. We show that the simplest models built on top of this unified representation (UniRep) are broadly applicable and generalize to unseen regions of sequence space. Our data-driven approach predicts the stability of natural and de novo designed proteins, and the quantitative function of molecularly diverse mutants, competitively with the state-of-the-art methods. UniRep further enables two orders of magnitude efficiency improvement in a protein engineering task. UniRep is a versatile summary of fundamental protein features that can be applied across protein engineering informatics.
|
20
|
Nakamura T, Yamada KD, Tomii K, Katoh K. Parallelization of MAFFT for large-scale multiple sequence alignments. Bioinformatics 2019; 34:2490-2492. [PMID: 29506019 PMCID: PMC6041967 DOI: 10.1093/bioinformatics/bty121] [Citation(s) in RCA: 501] [Impact Index Per Article: 100.2] [Received: 10/19/2017] [Accepted: 02/28/2018] [Indexed: 12/03/2022] Open
Abstract
Summary We report an update for the MAFFT multiple sequence alignment program to enable parallel calculation of large numbers of sequences. The G-INS-1 option of MAFFT was recently reported to have higher accuracy than other methods for large data, but this method has been impractical for most large-scale analyses, due to the requirement of large computational resources. We introduce a scalable variant, G-large-INS-1, which has equivalent accuracy to G-INS-1 and is applicable to 50 000 or more sequences. Availability and implementation This feature is available in MAFFT versions 7.355 or later at https://mafft.cbrc.jp/alignment/software/mpi.html. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Tsukasa Nakamura
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, Chiba, Japan.,Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
| | - Kazunori D Yamada
- Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan.,Graduate School of Information Sciences, Tohoku University, Sendai, Japan
| | - Kentaro Tomii
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, Chiba, Japan.,Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan.,Biotechnology Research Institute for Drug Discovery (BRD), AIST, Tokyo, Japan.,AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL), Tokyo, Japan
| | - Kazutaka Katoh
- Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan.,Research Institute for Microbial Diseases, Osaka University, Suita, Japan
|
21
|
Sievers F, Higgins DG. QuanTest2: benchmarking multiple sequence alignments using secondary structure prediction. Bioinformatics 2019; 36:90-95. [PMID: 31292629 PMCID: PMC9881607 DOI: 10.1093/bioinformatics/btz552] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 03/27/2019] [Revised: 06/17/2019] [Accepted: 07/09/2019] [Indexed: 02/02/2023] Open
Abstract
MOTIVATION Secondary structure prediction accuracy (SSPA) in the QuanTest benchmark can be used to measure the accuracy of a multiple sequence alignment. SSPA correlates well with the sum-of-pairs score if the results are averaged over many alignments, but not on an alignment-by-alignment basis. This is due to a sub-optimal selection of reference and non-reference sequences in QuanTest. RESULTS We develop an improved strategy for selecting reference and non-reference sequences for a new benchmark, QuanTest2. In QuanTest2, SSPA and SP correlate better on an alignment-by-alignment basis than in QuanTest. Guide-trees for QuanTest2 are more balanced with respect to reference sequences than in QuanTest. QuanTest2 scores correlate well with other well-established benchmarks. AVAILABILITY AND IMPLEMENTATION QuanTest2 is available at http://bioinf.ucd.ie/quantest2.tar and comprises reference and non-reference sequence sets and a scoring script. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Fabian Sievers
- Conway Institute, UCD School of Medicine, University College Dublin, Belfield, Dublin 4, Ireland
|
22
|
Rozewicki J, Li S, Amada KM, Standley DM, Katoh K. MAFFT-DASH: integrated protein sequence and structural alignment. Nucleic Acids Res 2019; 47:W5-W10. [PMID: 31062021 PMCID: PMC6602451 DOI: 10.1093/nar/gkz342] [Citation(s) in RCA: 242] [Impact Index Per Article: 48.4] [Received: 02/12/2019] [Revised: 04/07/2019] [Accepted: 04/25/2019] [Indexed: 12/22/2022] Open
Abstract
Here, we describe a web server that integrates structural alignments with the MAFFT multiple sequence alignment (MSA) tool. For this purpose, we have prepared a web-based Database of Aligned Structural Homologs (DASH), which provides structural alignments at the domain and chain levels for all proteins in the Protein Data Bank (PDB), and can be queried interactively or by a simple REST-like API. MAFFT-DASH integration can be invoked with a single flag on either the web (https://mafft.cbrc.jp/alignment/server/) or command-line versions of MAFFT. In our benchmarks using 878 cases from the BAliBase, HomFam, OXFam, Mattbench and SISYPHUS datasets, MAFFT-DASH showed 10-20% improvement over standard MAFFT for MSA problems with weak similarity, in terms of Sum-of-Pairs (SP), a measure of how well a program succeeds at aligning input sequences in comparison to a reference alignment. When MAFFT alignments were supplemented with homologous sequences, further improvement was observed. Potential applications of DASH beyond MSA enrichment include functional annotation through detection of remote homology and assembly of template libraries for homology modeling.
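The Sum-of-Pairs (SP) measure mentioned in this abstract can be illustrated with a small sketch. It assumes the common benchmark definition: the fraction of residue pairs aligned in the reference alignment that are also aligned in the test alignment. The function names `residue_pairs` and `sp_score` are hypothetical, and benchmark scoring programs may differ in details such as restricting the count to core columns.

```python
from itertools import combinations
from typing import Sequence, Set, Tuple

# One aligned pair: ((seq_index, residue_index), (seq_index, residue_index)).
Pair = Tuple[Tuple[int, int], Tuple[int, int]]


def residue_pairs(alignment: Sequence[str]) -> Set[Pair]:
    """Collect every pair of residues that share a column (gaps ignored)."""
    pairs: Set[Pair] = set()
    if not alignment:
        return pairs
    seen = [0] * len(alignment)  # residues consumed so far in each sequence
    for col in range(len(alignment[0])):
        present = []
        for i, row in enumerate(alignment):
            if row[col] != "-":
                present.append((i, seen[i]))
                seen[i] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sp_score(test: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of reference residue pairs that the test alignment reproduces."""
    ref_pairs = residue_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & residue_pairs(test)) / len(ref_pairs)


if __name__ == "__main__":
    reference = ["AC-T", "ACGT"]
    test = ["ACT-", "ACGT"]
    print(sp_score(test, reference))
```

Both alignments must contain the same sequences in the same order, since pairs are identified by sequence index and residue index.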
Affiliation(s)
- John Rozewicki
- Department of Genome Informatics, Genome Information Research Center, Research Institute for Microbial Diseases, Osaka University, 3-1 Yamadaoka, Suita 565-0871, Japan
- Systems Immunology Laboratory, Immunology Frontier Research Center, Osaka University, 3-1 Yamadaoka, Suita 565-0871, Japan
| | - Songling Li
- Department of Genome Informatics, Genome Information Research Center, Research Institute for Microbial Diseases, Osaka University, 3-1 Yamadaoka, Suita 565-0871, Japan
- Systems Immunology Laboratory, Immunology Frontier Research Center, Osaka University, 3-1 Yamadaoka, Suita 565-0871, Japan
| | - Karlou Mar Amada
- Systems Immunology Laboratory, Immunology Frontier Research Center, Osaka University, 3-1 Yamadaoka, Suita 565-0871, Japan
| | - Daron M Standley
- Department of Genome Informatics, Genome Information Research Center, Research Institute for Microbial Diseases, Osaka University, 3-1 Yamadaoka, Suita 565-0871, Japan
- Systems Immunology Laboratory, Immunology Frontier Research Center, Osaka University, 3-1 Yamadaoka, Suita 565-0871, Japan
| | - Kazutaka Katoh
- Department of Genome Informatics, Genome Information Research Center, Research Institute for Microbial Diseases, Osaka University, 3-1 Yamadaoka, Suita 565-0871, Japan
- Systems Immunology Laboratory, Immunology Frontier Research Center, Osaka University, 3-1 Yamadaoka, Suita 565-0871, Japan
|
23
|
Abstract
Background Protein sequence alignment analyses have become a crucial step for many bioinformatics studies during the past decades. Multiple sequence alignment (MSA) and pair-wise sequence alignment (PSA) are two major approaches in sequence alignment. Former benchmark studies revealed drawbacks of MSA methods on nucleotide sequence alignments. To test whether similar drawbacks also influence protein sequence alignment analyses, we propose a new benchmark framework for protein clustering based on cluster validity. This new framework directly reflects the biological ground truth of the application scenarios that adopt sequence alignments, and evaluates the alignment quality according to the achievement of the biological goal, rather than by comparison at the sequence level only, which averts the biases introduced by alignment scores or manual alignment templates. Compared with former studies, we calculate the cluster validity score based on sequence distances instead of clustering results. This strategy avoids the influence of different clustering methods and thus makes the results more dependable. Results Results showed that PSA methods performed better than MSA methods on most of the BAliBASE benchmark datasets. Analyses on the 80 re-sampled benchmark datasets constructed by randomly choosing 90% of each dataset 10 times showed similar results. Conclusions These results validated that the drawbacks of MSA methods revealed at the nucleotide level also exist in protein sequence alignment analyses and affect the accuracy of results. Electronic supplementary material The online version of this article (10.1186/s12859-018-2524-4) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Yingying Wang
- Research Center for Biomedical Information Technology, Shenzhen Institutes of Advanced Technologies, Chinese Academy of Sciences, Shenzhen, China
| | - Hongyan Wu
- Research Center for Biomedical Information Technology, Shenzhen Institutes of Advanced Technologies, Chinese Academy of Sciences, Shenzhen, China.
| | - Yunpeng Cai
- Research Center for Biomedical Information Technology, Shenzhen Institutes of Advanced Technologies, Chinese Academy of Sciences, Shenzhen, China.
|
24
|
Chaabane L. A hybrid solver for protein multiple sequence alignment problem. J Bioinform Comput Biol 2018; 16:1850015. [DOI: 10.1142/s0219720018500154] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022]
Abstract
In this work, a novel hybrid model called PSOSA for solving the multiple sequence alignment (MSA) problem is proposed. The developed approach is a combination of the particle swarm optimization (PSO) algorithm and the simulated annealing (SA) technique. In our PSOSA approach, PSO is exploited in global search, but it is easily trapped in a local optimum and may lead to premature convergence. SA is incorporated as a local improvement approach to overcome the local optimum problem and intensify the search in local regions to improve solution quality. Numerical results on the BAliBASE benchmark have shown the effectiveness of the proposed method and its ability to achieve good quality solutions when compared with those given by other existing methods.
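The abstract does not give the details of the SA step. As a hedged illustration of how simulated annealing can escape a local optimum, the sketch below shows the standard Metropolis acceptance rule for a maximization problem such as an alignment score. The function name `accept_move` and its parameters are hypothetical and are not taken from PSOSA.

```python
import math
import random


def accept_move(current_score: float, candidate_score: float,
                temperature: float, rng: random.Random) -> bool:
    """Metropolis rule: always accept improvements, sometimes accept worse moves."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    delta = candidate_score - current_score
    if delta >= 0:
        return True
    # A worse candidate is accepted with probability exp(delta / T), which
    # shrinks as the temperature falls or as the score loss grows.
    return rng.random() < math.exp(delta / temperature)


if __name__ == "__main__":
    rng = random.Random(0)
    hot = sum(accept_move(100.0, 99.0, 5.0, rng) for _ in range(1000))
    cold = sum(accept_move(100.0, 99.0, 0.1, rng) for _ in range(1000))
    print(hot, cold)  # worse moves are accepted far more often when hot
```

Accepting occasional score-decreasing moves at high temperature is what lets the local search leave a local optimum; lowering the temperature over time makes the search increasingly greedy.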
Affiliation(s)
- Lamiche Chaabane
- Department of Computer Science, Mohamed Boudiaf University, BP. 166 M’sila 28000, Algeria
|
25
|
SikanderAzam S, Ahmad S, Navid A, Sajid NUA, Ahmad I, Wadood A. Implications of sequence conservation patterns of serpin B family leading to structural and functional importance. GENE REPORTS 2018. [DOI: 10.1016/j.genrep.2018.05.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
|
26
|
DeBlasio D, Kececioglu J. Adaptive Local Realignment of Protein Sequences. J Comput Biol 2018; 25:780-793. [PMID: 29889553 DOI: 10.1089/cmb.2018.0045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] Open
Abstract
While mutation rates can vary markedly over the residues of a protein, multiple sequence alignment tools typically use the same values for their scoring-function parameters across a protein's entire length. We present a new approach, called adaptive local realignment, that in contrast automatically adapts to the diversity of mutation rates along protein sequences. This builds upon a recent technique known as parameter advising, which finds global parameter settings for an aligner, to now adaptively find local settings. Our approach in essence identifies local regions with low estimated accuracy, constructs a set of candidate realignments using a carefully-chosen collection of parameter settings, and replaces the region if a realignment has higher estimated accuracy. This new method of local parameter advising, when combined with prior methods for global advising, boosts alignment accuracy as much as 26% over the best default setting on hard-to-align protein benchmarks, and by 6.4% over global advising alone. Adaptive local realignment has been implemented within the Opal aligner using the Facet accuracy estimator.
Affiliation(s)
- Dan DeBlasio
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania
| | - John Kececioglu
- Department of Computer Science, The University of Arizona, Tucson, Arizona
|
27
|
Le Q, Sievers F, Higgins DG. Protein multiple sequence alignment benchmarking through secondary structure prediction. Bioinformatics 2018; 33:1331-1337. [PMID: 28093407 PMCID: PMC5408826 DOI: 10.1093/bioinformatics/btw840] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 08/22/2016] [Accepted: 01/10/2017] [Indexed: 12/26/2022] Open
Abstract
Motivation Multiple sequence alignment (MSA) is commonly used to analyze sets of homologous protein or DNA sequences. This has led to the development of many methods and packages for MSA over the past 30 years. Being able to compare different methods has been problematic and has relied on gold standard benchmark datasets of ‘true’ alignments or on MSA simulations. A number of protein benchmark datasets have been produced which rely on a combination of manual alignment and/or automated superposition of protein structures. These are either restricted to very small MSAs with few sequences or require manual alignment, which can be subjective. In both cases, it remains very difficult to properly test MSAs of more than a few dozen sequences. PREFAB and HomFam both rely on using a small subset of sequences of known structure and do not fairly test the quality of a full MSA. Results In this paper we describe QuanTest, a fully automated and highly scalable test system for protein MSAs which is based on using secondary structure prediction accuracy (SSPA) to measure alignment quality. This is based on the assumption that better MSAs will give more accurate secondary structure predictions when we include sequences of known structure. However, SSPA measures the quality of an entire alignment, not just the accuracy on a handful of selected sequences. It can be scaled to alignments of any size, but here we demonstrate its use on alignments of either 200 or 1000 sequences. This allows the testing of slow accurate programs as well as faster, less accurate ones. We show that the scores from QuanTest are highly correlated with existing benchmark scores. We also validate the method by comparing a wide range of MSA alignment options and by introducing different levels of misalignment into MSAs and examining the effects on the scores. 
Availability and Implementation QuanTest is available from http://www.bioinf.ucd.ie/download/QuanTest.tgz Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Quan Le
  - Conway Institute, UCD School of Medicine and Medical Science, University College Dublin, Belfield, Dublin, Dublin 4, Ireland
- Fabian Sievers
  - Conway Institute, UCD School of Medicine and Medical Science, University College Dublin, Belfield, Dublin, Dublin 4, Ireland
- Desmond G Higgins
  - Conway Institute, UCD School of Medicine and Medical Science, University College Dublin, Belfield, Dublin, Dublin 4, Ireland
28
Ksouri A, Ghedira K, Ben Abderrazek R, Shankar BG, Benkahla A, Bishop OT, Bouhaouala-Zahar B. Homology modeling and docking of AahII-Nanobody complexes reveal the epitope binding site on AahII scorpion toxin. Biochem Biophys Res Commun 2018; 496:1025-1032. [DOI: 10.1016/j.bbrc.2018.01.036] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 12/28/2017] [Accepted: 01/04/2018] [Indexed: 11/25/2022]
29
Rubio-Largo Á, Vanneschi L, Castelli M, Vega-Rodríguez MA. Using biological knowledge for multiple sequence aligner decision making. Inf Sci (N Y) 2017. [DOI: 10.1016/j.ins.2017.08.069] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
30
Bacterial Foraging Optimization-Genetic Algorithm for Multiple Sequence Alignment with Multi-Objectives. Sci Rep 2017; 7:8833. [PMID: 28821841 PMCID: PMC5562892 DOI: 10.1038/s41598-017-09499-1] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 04/05/2017] [Accepted: 07/27/2017] [Indexed: 01/06/2023] Open
Abstract
This research work focuses on multiple sequence alignment, as developing an exact multiple sequence alignment for different protein sequences is a difficult computational task. In this research, a hybrid algorithm named Bacterial Foraging Optimization-Genetic Algorithm (BFO-GA) is proposed to improve multiple sequence alignment with respect to several objectives. The proposed algorithm employs multiple objectives: minimization of a variable gap penalty, and maximization of similarity and of non-gap percentage. The proposed BFO-GA algorithm is compared with various MSA methods such as T-Coffee, Clustal Omega, Muscle, K-Align, MAFFT, GA, ACO, ABC and PSO. The experiments were carried out on four benchmark databases, BAliBASE 3.0, Prefab 4.0, SABmark 1.65 and Oxbench 1.3, and the outcomes indicate that the proposed BFO-GA algorithm obtains statistically significantly better results than the other well-known methods. This study also evaluates the practicability of the BFO-GA alignments by using the best alignment to predict a phylogenetic tree with the ClustalW2 Phylogeny tool, and compares it with the existing algorithms using the Robinson-Foulds (RF) distance metric. Lastly, the statistical significance of the proposed algorithm's results is assessed with the Wilcoxon matched-pairs signed-rank test, which also indicates better results.
31
Chowdhury B, Garai G. A review on multiple sequence alignment from the perspective of genetic algorithm. Genomics 2017; 109:419-431. [PMID: 28669847 DOI: 10.1016/j.ygeno.2017.06.007] [Citation(s) in RCA: 45] [Impact Index Per Article: 6.4] [Received: 04/26/2017] [Revised: 05/27/2017] [Accepted: 06/27/2017] [Indexed: 01/04/2023]
Abstract
Sequence alignment is an active research area in the field of bioinformatics. It is also a crucial task, as it guides many other tasks such as phylogenetic analysis and function and/or structure prediction of biological macromolecules like DNA, RNA, and protein. Proteins are the building blocks of every living organism. Although the protein alignment problem has been studied for several decades, the available methods still produce different alignments for the same alignment problem. Multiple sequence alignment is a problem of very high computational complexity. Many stochastic methods have therefore been considered for improving the accuracy of alignment, and among them the genetic algorithm is frequently used. In this study, we review the different types of methods applied to alignment and the recent trends in multiobjective genetic algorithms for solving multiple sequence alignment. Many recent studies have demonstrated considerable progress in improving alignment accuracy.
Affiliation(s)
- Biswanath Chowdhury
  - Department of Biophysics, Molecular Biology and Bioinformatics, University of Calcutta, Kolkata, WB, 700009, India
- Gautam Garai
  - Computational Sciences Division, Saha Institute of Nuclear Physics, Kolkata, WB 700064, India
32
Keul F, Hess M, Goesele M, Hamacher K. PFASUM: a substitution matrix from Pfam structural alignments. BMC Bioinformatics 2017; 18:293. [PMID: 28583067 PMCID: PMC5460430 DOI: 10.1186/s12859-017-1703-z] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 03/01/2017] [Accepted: 05/22/2017] [Indexed: 11/10/2022] Open
Abstract
Background Detecting homologous protein sequences and computing multiple sequence alignments (MSA) are fundamental tasks in molecular bioinformatics. These tasks usually require a substitution matrix for modeling evolutionary substitution events derived from a set of aligned sequences. Over the last years, the known sequence space increased drastically and several publications demonstrated that this can lead to significantly better performing matrices. Interestingly, matrices based on dated sequence datasets are still the de facto standard for both tasks even though their data basis may limit their capabilities. We address these aspects by presenting a new substitution matrix series called PFASUM. These matrices are derived from Pfam seed MSAs using a novel algorithm and thus build upon expert ground truth data covering a large and diverse sequence space. Results We show results for two use cases: First, we tested the homology search performance of PFASUM matrices on up-to-date ASTRAL databases with varying sequence similarity. Our study shows that the usage of PFASUM matrices can lead to significantly better homology search results when compared to conventional matrices. PFASUM matrices with comparable relative entropies to the commonly used substitution matrices BLOSUM50, BLOSUM62, PAM250, VTML160 and VTML200 outperformed their corresponding counterparts in 93% of all test cases. A general assessment also comparing matrices with different relative entropies showed that PFASUM matrices delivered the best homology search performance in the test set. Second, our results demonstrate that the usage of PFASUM matrices for MSA construction improves their quality when compared to conventional matrices. On up-to-date MSA benchmarks, at least 60% of all MSAs were reconstructed in an equal or higher quality when using MUSCLE with PFASUM31, PFASUM43 and PFASUM60 matrices instead of conventional matrices. 
This rate even increases to at least 76% for MSAs containing similar sequences. Conclusions We present the novel PFASUM substitution matrices derived from manually curated MSA ground truth data covering the currently known sequence space. Our results imply that PFASUM matrices improve homology search performance as well as MSA quality in many cases when compared to conventional substitution matrices. Hence, we encourage the usage of PFASUM matrices and especially PFASUM60 for these specific tasks. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1703-z) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Frank Keul
  - Computational Biology and Simulation, Department of Biology, Technische Universität Darmstadt, Schnittspahnstraße 2, Darmstadt, 64287, Germany
- Martin Hess
  - Graphics, Capture and Massively Parallel Computing, Department of Computer Science, Technische Universität Darmstadt, Rundeturmstraße 12, Darmstadt, 64283, Germany
- Michael Goesele
  - Graphics, Capture and Massively Parallel Computing, Department of Computer Science, Technische Universität Darmstadt, Rundeturmstraße 12, Darmstadt, 64283, Germany
- Kay Hamacher
  - Computational Biology and Simulation, Department of Biology, Technische Universität Darmstadt, Schnittspahnstraße 2, Darmstadt, 64287, Germany
33
Zambrano-Vega C, Nebro AJ, Durillo JJ, García-Nieto J, Aldana-Montes JF. Multiple Sequence Alignment with Multiobjective Metaheuristics. A Comparative Study. Int J Intell Syst 2017. [DOI: 10.1002/int.21892] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 11/06/2022]
Affiliation(s)
- Cristian Zambrano-Vega
  - Facultad de Ciencias de la Ingeniería; Universidad Técnica Estatal de Quevedo; Quevedo Ecuador
- Antonio J. Nebro
  - Edificio de Investigación Ada Byron; University of Málaga; Málaga Spain
- Juan J. Durillo
  - Distributed and Parallel Systems Group; University of Innsbruck; Innsbruck Austria
- José García-Nieto
  - Edificio de Investigación Ada Byron; University of Málaga; Málaga Spain
34
Vaitinadapoule A, Etchebest C. Molecular Modeling of Transporters: From Low Resolution Cryo-Electron Microscopy Map to Conformational Exploration. The Example of TSPO. Methods Mol Biol 2017; 1635:383-416. [PMID: 28755381 DOI: 10.1007/978-1-4939-7151-0_21] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
This chapter describes a protocol to establish a three-dimensional (3D) model of a protein and to explore its conformational landscape. It combines predictions from up-to-date bioinformatics methods with low-resolution experimental data. It also proposes to examine rapidly the dynamics of the protein using molecular dynamics simulations with a coarse-grained force field. Tools for analyzing these trajectories are suggested as well as those for constructing all-atoms models. Thus, starting from a protein sequence and using free software, the user can get important conformational information, which might improve the knowledge about the protein function.
Affiliation(s)
- Aurore Vaitinadapoule
  - Unité INSERM UMRS1134, Laboratory of Excellence, Institut National de la Transfusion Sanguine, Université Paris-Diderot, Sorbonne Paris Cité, Université de la Réunion, 6 rue Alexandre Cabanel, 75015, Paris Cedex 15, France
- Catherine Etchebest
  - Unité INSERM UMRS1134, Laboratory of Excellence, Institut National de la Transfusion Sanguine, Université Paris-Diderot, Sorbonne Paris Cité, Université de la Réunion, 6 rue Alexandre Cabanel, 75015, Paris Cedex 15, France
35
Abstract
The increasing importance of Next Generation Sequencing (NGS) techniques has highlighted the key role of multiple sequence alignment (MSA) in comparative structure and function analysis of biological sequences. MSA often leads to fundamental biological insight into sequence-structure-function relationships of nucleotide or protein sequence families. Significant advances have been achieved in this field, and many useful tools have been developed for constructing alignments, although many biological and methodological issues are still open. This chapter first provides some background information and considerations associated with MSA techniques, concentrating on the alignment of protein sequences. Then, a practical overview of currently available methods and a description of their specific advantages and limitations are given, to serve as a helpful guide or starting point for researchers who aim to construct a reliable MSA.
36
Deorowicz S, Debudaj-Grabysz A, Gudyś A. FAMSA: Fast and accurate multiple sequence alignment of huge protein families. Sci Rep 2016; 6:33964. [PMID: 27670777 PMCID: PMC5037421 DOI: 10.1038/srep33964] [Citation(s) in RCA: 65] [Impact Index Per Article: 8.1] [Received: 04/05/2016] [Accepted: 08/31/2016] [Indexed: 11/10/2022] Open
Abstract
Rapid development of modern sequencing platforms has contributed to the unprecedented growth of protein families databases. The abundance of sets containing hundreds of thousands of sequences is a formidable challenge for multiple sequence alignment algorithms. The article introduces FAMSA, a new progressive algorithm designed for fast and accurate alignment of thousands of protein sequences. Its features include the utilization of the longest common subsequence measure for determining pairwise similarities, a novel method of evaluating gap costs, and a new iterative refinement scheme. What matters is that its implementation is highly optimized and parallelized to make the most of modern computer platforms. Thanks to the above, quality indicators, i.e. sum-of-pairs and total-column scores, show FAMSA to be superior to competing algorithms, such as Clustal Omega or MAFFT for datasets exceeding a few thousand sequences. Quality does not compromise on time or memory requirements, which are an order of magnitude lower than those in the existing solutions. For example, a family of 415519 sequences was analyzed in less than two hours and required no more than 8 GB of RAM. FAMSA is available for free at http://sun.aei.polsl.pl/REFRESH/famsa.
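The longest common subsequence (LCS) measure mentioned above can be illustrated with the textbook dynamic program below. This is only a plain sketch of the measure itself; FAMSA's own implementation is described as highly optimized and parallelized and is not reproduced here. The normalization in `lcs_similarity` is a hypothetical choice made for illustration, not the formula from the paper.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (O(len(a) * len(b)))."""
    prev = [0] * (len(b) + 1)  # DP row for the previous character of a
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)  # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a char of a or b
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Hypothetical normalization: LCS length divided by the shorter length."""
    return lcs_length(a, b) / min(len(a), len(b))


print(lcs_length("AGGTAB", "GXTXAYB"))  # 4, the subsequence "GTAB"
```

A pairwise similarity of this kind needs no alignment of the two sequences, which is one reason an LCS-style measure is attractive when the number of sequence pairs is very large.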
Affiliation(s)
- Sebastian Deorowicz
  - Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
- Adam Gudyś
  - Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
37
Ye Y, Lam TW, Ting HF. PnpProbs: a better multiple sequence alignment tool by better handling of guide trees. BMC Bioinformatics 2016; 17 Suppl 8:285. [PMID: 27585754 PMCID: PMC5009527 DOI: 10.1186/s12859-016-1121-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND This paper describes a new MSA tool called PnpProbs, which constructs better multiple sequence alignments by better handling of guide trees. It classifies sequences into two types: normally related and distantly related. For normally related sequences, it uses an adaptive approach to construct the guide tree needed for progressive alignment; it first estimates the input's discrepancy by computing the standard deviation of their percent identities, and based on this estimate, it chooses the better method to construct the guide tree. For distantly related sequences, PnpProbs abandons the guide tree and uses instead some non-progressive alignment method to generate the alignment. RESULTS To evaluate PnpProbs, we have compared it with thirteen other popular MSA tools, and PnpProbs has the best alignment scores in all but one test. We have also used it for phylogenetic analysis, and found that the phylogenetic trees constructed from PnpProbs' alignments are closest to the model trees. CONCLUSIONS By combining the strength of the progressive and non-progressive alignment methods, we have developed an MSA tool called PnpProbs. We have compared PnpProbs with thirteen other popular MSA tools and our results showed that our tool usually constructed the best alignments.
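The adaptive choice of guide-tree method can be sketched as a decision on the spread of pairwise percent identities. Everything concrete below is an assumption for illustration: the abstract does not name the two guide-tree methods or give a cutoff, so the returned labels and the `threshold` parameter are hypothetical, and the percent identities are taken as precomputed inputs.

```python
from statistics import pstdev
from typing import Sequence


def choose_guide_tree_method(
    percent_identities: Sequence[float], threshold: float
) -> str:
    """Pick a guide-tree strategy from the spread of pairwise percent identities.

    The standard deviation serves as the discrepancy estimate; the labels
    returned here are placeholders, not names used by PnpProbs.
    """
    discrepancy = pstdev(percent_identities)
    if discrepancy <= threshold:
        return "method_for_low_discrepancy"
    return "method_for_high_discrepancy"
```

For example, identities of 50, 50 and 50 percent have zero spread and select the first label for any non-negative threshold, while 20, 50 and 80 percent have a standard deviation of about 24.5 and select the second label for a threshold of 10.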
Affiliation(s)
- Yongtao Ye
  - HKU-BGI Bioinformatics Algorithms & Core Technology Research Lab, Computer Science Department, University of Hong Kong, Hong Kong, China
- Tak-Wah Lam
  - HKU-BGI Bioinformatics Algorithms & Core Technology Research Lab, Computer Science Department, University of Hong Kong, Hong Kong, China
- Hing-Fung Ting
  - HKU-BGI Bioinformatics Algorithms & Core Technology Research Lab, Computer Science Department, University of Hong Kong, Hong Kong, China
38
Yamada KD, Tomii K, Katoh K. Application of the MAFFT sequence alignment program to large data-reexamination of the usefulness of chained guide trees. Bioinformatics 2016; 32:3246-3251. [PMID: 27378296 PMCID: PMC5079479 DOI: 10.1093/bioinformatics/btw412] [Citation(s) in RCA: 198] [Impact Index Per Article: 24.8] [Received: 03/28/2016] [Accepted: 06/20/2016] [Indexed: 11/26/2022] Open
Abstract
Motivation: Large multiple sequence alignments (MSAs), consisting of thousands of sequences, are becoming more and more common, due to advances in sequencing technologies. The MAFFT MSA program has several options for building large MSAs, but their performances have not been sufficiently assessed yet, because realistic benchmarking of large MSAs has been difficult. Recently, such assessments have been made possible through the HomFam and ContTest benchmark protein datasets. Along with the development of these datasets, an interesting theory was proposed: chained guide trees increase the accuracy of MSAs of structurally conserved regions. This theory challenges the basis of progressive alignment methods and needs to be examined by being compared with other known methods including computationally intensive ones. Results: We used HomFam, ContTest and OXFam (an extended version of OXBench) to evaluate several methods enabled in MAFFT: (1) a progressive method with approximate guide trees, (2) a progressive method with chained guide trees, (3) a combination of an iterative refinement method and a progressive method and (4) a less approximate progressive method that uses a rigorous guide tree and consistency score. Other programs, Clustal Omega and UPP, available for large MSAs, were also included in the comparison. The effect of method 2 (chained guide trees) was positive in ContTest but negative in HomFam and OXFam. Methods 3 and 4 increased the benchmark scores more consistently than method 2 for the three datasets, suggesting that they are safer to use. Availability and Implementation: http://mafft.cbrc.jp/alignment/software/ Contact: katoh@ifrec.osaka-u.ac.jp Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Kazunori D Yamada
  - Graduate School of Information Sciences, Tohoku University, Sendai 980-8579, Japan
  - Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 135-0064, Japan
- Kentaro Tomii
  - Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 135-0064, Japan
  - Biotechnology Research Institute for Drug Discovery, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 135-0064, Japan
- Kazutaka Katoh
  - Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 135-0064, Japan
  - Immunology Frontier Research Center, Osaka University, Suita 565-0871, Japan
39
Katoh K, Standley DM. A simple method to control over-alignment in the MAFFT multiple sequence alignment program. Bioinformatics 2016; 32:1933-42. [PMID: 27153688 PMCID: PMC4920119 DOI: 10.1093/bioinformatics/btw108] [Citation(s) in RCA: 318] [Impact Index Per Article: 39.8] [Received: 10/05/2015] [Accepted: 02/19/2016] [Indexed: 12/17/2022] Open
Abstract
Motivation: We present a new feature of the MAFFT multiple alignment program for suppressing over-alignment (aligning unrelated segments). Conventional MAFFT is highly sensitive in aligning conserved regions in remote homologs, but the risk of over-alignment has recently been growing, as low-quality or noisy sequences are increasing in protein sequence databases, due, for example, to sequencing errors and difficulty in gene prediction. Results: The proposed method utilizes a variable scoring matrix for different pairs of sequences (or groups) in a single multiple sequence alignment, based on the global similarity of each pair. This method significantly increases the correctly gapped sites in real examples and in simulations under various conditions. Regarding sensitivity, the effect of the proposed method is slightly negative in real protein-based benchmarks, and mostly neutral in simulation-based benchmarks. This approach is based on natural biological reasoning and should be compatible with many methods based on dynamic programming for multiple sequence alignment. Availability and implementation: The new feature is available in MAFFT versions 7.263 and higher. http://mafft.cbrc.jp/alignment/software/ Contact: katoh@ifrec.osaka-u.ac.jp Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Kazutaka Katoh
  - Immunology Frontier Research Center, Osaka University, Suita 565-0871, Japan
- Daron M Standley
  - Immunology Frontier Research Center, Osaka University, Suita 565-0871, Japan
  - Institute for Virus Research, Kyoto University, Kyoto 606-8507, Japan
40
Al-Shatnawi M, Ahmad MO, Swamy MNS. MSAIndelFR: a scheme for multiple protein sequence alignment using information on indel flanking regions. BMC Bioinformatics 2015; 16:393. [PMID: 26597571 PMCID: PMC4657235 DOI: 10.1186/s12859-015-0826-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 05/09/2015] [Accepted: 11/14/2015] [Indexed: 11/16/2022] Open
Abstract
Background The alignment of multiple protein sequences is one of the most commonly performed tasks in bioinformatics. In spite of considerable research and efforts that have been recently deployed for improving the performance of multiple sequence alignment (MSA) algorithms, finding a highly accurate alignment between multiple protein sequences is still a challenging problem. Results We propose a novel and efficient algorithm called MSAIndelFR for multiple sequence alignment using the information on the predicted locations of IndelFRs and the computed average log-loss values obtained from IndelFR predictors, each of which is designed for a different protein fold. We demonstrate that the introduction of a new variable gap penalty function based on the predicted locations of the IndelFRs and the computed average log-loss values into the proposed algorithm substantially improves the protein alignment accuracy. This is illustrated by evaluating the performance of the algorithm in aligning sequences belonging to the protein folds for which the IndelFR predictors already exist and by using the reference alignments of the four popular benchmarks, BAliBASE 3.0, OXBENCH, PREFAB 4.0, and SABRE (SABmark 1.65). Conclusions We have proposed a novel and efficient algorithm, the MSAIndelFR algorithm, for multiple protein sequence alignment incorporating a new variable gap penalty function. It is shown that the performance of the proposed algorithm is superior to that of the most widely used alignment algorithms, Clustal W2, Clustal Omega, Kalign2, MSAProbs, MAFFT, MUSCLE, ProbCons and Probalign, in terms of both the sum-of-pairs and total column metrics. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0826-3) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Mufleh Al-Shatnawi
  - Department of Electrical and Computer Engineering, Concordia University, 1455 De Maisonneuve Blvd. W., Montreal, H3G 1M8, Quebec, Canada
- M Omair Ahmad
  - Department of Electrical and Computer Engineering, Concordia University, 1455 De Maisonneuve Blvd. W., Montreal, H3G 1M8, Quebec, Canada
- M N S Swamy
  - Department of Electrical and Computer Engineering, Concordia University, 1455 De Maisonneuve Blvd. W., Montreal, H3G 1M8, Quebec, Canada
41
Wright ES. DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment. BMC Bioinformatics 2015; 16:322. [PMID: 26445311 PMCID: PMC4595117 DOI: 10.1186/s12859-015-0749-z] [Citation(s) in RCA: 198] [Impact Index Per Article: 22.0] [Received: 06/26/2015] [Accepted: 09/23/2015] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND Alignment of large and diverse sequence sets is a common task in biological investigations, yet there remains considerable room for improvement in alignment quality. Multiple sequence alignment programs tend to reach maximal accuracy when aligning only a few sequences, and then diminish steadily as more sequences are added. This drop in accuracy can be partly attributed to a build-up of error and ambiguity as more sequences are aligned. Most high-throughput sequence alignment algorithms do not use contextual information under the assumption that sites are independent. This study examines the extent to which local sequence context can be exploited to improve the quality of large multiple sequence alignments. RESULTS Two predictors based on local sequence context were assessed: (i) single sequence secondary structure predictions, and (ii) modulation of gap costs according to the surrounding residues. The results indicate that context-based predictors have appreciable information content that can be utilized to create more accurate alignments. Furthermore, local context becomes more informative as the number of sequences increases, enabling more accurate protein alignments of large empirical benchmarks. These discoveries became the basis for DECIPHER, a new context-aware program for sequence alignment, which outperformed other programs on large sequence sets. CONCLUSIONS Predicting secondary structure based on local sequence context is an efficient means of breaking the independence assumption in alignment. Since secondary structure is more conserved than primary sequence, it can be leveraged to improve the alignment of distantly related proteins. Moreover, secondary structure predictions increase in accuracy as more sequences are used in the prediction. This enables the scalable generation of large sequence alignments that maintain high accuracy even on diverse sequence sets. 
The DECIPHER R package and source code are freely available for download at DECIPHER.cee.wisc.edu and from the Bioconductor repository.
Affiliation(s)
- Erik S Wright
  - Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, 53715, USA
  - Wisconsin Institute for Discovery, University of Wisconsin-Madison, 330 N. Orchard St., Madison, WI, 53715, USA
42
Computational approaches to study the effects of small genomic variations. J Mol Model 2015; 21:251. [PMID: 26350246 DOI: 10.1007/s00894-015-2794-y] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 04/08/2015] [Accepted: 08/23/2015] [Indexed: 10/23/2022]
Abstract
Advances in DNA sequencing technologies have led to an avalanche-like increase in the number of gene sequences deposited in public databases over the last decade as well as the detection of an enormous number of previously unseen nucleotide variants therein. Given the size and complex nature of the genome-wide sequence variation data, as well as the rate of data generation, experimental characterization of the disease association of each of these variations or their effects on protein structure/function would be costly, laborious, time-consuming, and essentially impossible. Thus, in silico methods to predict the functional effects of sequence variations are constantly being developed. In this review, we summarize the major computational approaches and tools that are aimed at the prediction of the functional effect of mutations, and describe the state-of-the-art databases that can be used to obtain information about mutation significance. We also discuss future directions in this highly competitive field.
43
Bawono P, van der Velde A, Abeln S, Heringa J. Quantifying the displacement of mismatches in multiple sequence alignment benchmarks. PLoS One 2015; 10:e0127431. [PMID: 25993129 PMCID: PMC4438059 DOI: 10.1371/journal.pone.0127431] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 11/20/2014] [Accepted: 04/14/2015] [Indexed: 11/18/2022] Open
Abstract
Multiple Sequence Alignment (MSA) methods are typically benchmarked on sets of reference alignments. The quality of the alignment can then be represented by the sum-of-pairs (SP) or column (CS) scores, which measure the agreement between a reference and corresponding query alignment. Both the SP and CS scores treat mismatches between a query and reference alignment as equally bad, and do not take the separation into account between two amino acids in the query alignment, that should have been matched according to the reference alignment. This is significant since the magnitude of alignment shifts is often of relevance in biological analyses, including homology modeling and MSA refinement/manual alignment editing. In this study we develop a new alignment benchmark scoring scheme, SPdist, that takes the degree of discordance of mismatches into account by measuring the sequence distance between mismatched residue pairs in the query alignment. Using this new score along with the standard SP score, we investigate the discriminatory behavior of the new score by assessing how well six different MSA methods perform with respect to BAliBASE reference alignments. The SP score and the SPdist score yield very similar outcomes when the reference and query alignments are close. However, for more divergent reference alignments the SPdist score is able to distinguish between methods that keep alignments approximately close to the reference and those exhibiting larger shifts. We observed that by using SPdist together with SP scoring we were able to better delineate the alignment quality difference between alternative MSA methods. With a case study we exemplify why it is important, from a biological perspective, to consider the separation of mismatches. The SPdist scoring scheme has been implemented in the VerAlign web server (http://www.ibi.vu.nl/programs/veralignwww/). The code for calculating SPdist score is also available upon request.
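The two standard scores discussed above can be sketched as follows, using their common definitions: SP is the fraction of residue pairs aligned in the reference that are also aligned in the query, and CS is the fraction of reference columns reproduced exactly in the query. This is an illustrative sketch, not the VerAlign code, and it does not implement SPdist. It assumes alignments are lists of equal-length gapped strings with `-` as the gap character, rows in the same order in both alignments, and a reference containing at least one aligned residue pair.

```python
from itertools import combinations
from typing import List, Optional, Set, Tuple

Column = Tuple[Optional[int], ...]


def column_indices(alignment: List[str]) -> List[Column]:
    """For each column, the residue index of every row, or None at a gap."""
    counters = [0] * len(alignment)
    columns: List[Column] = []
    for c in range(len(alignment[0])):
        col = []
        for r, row in enumerate(alignment):
            if row[c] == "-":
                col.append(None)
            else:
                col.append(counters[r])
                counters[r] += 1
        columns.append(tuple(col))
    return columns


def aligned_pairs(alignment: List[str]) -> Set[Tuple[int, int, int, int]]:
    """All (row1, residue1, row2, residue2) pairs placed in the same column."""
    pairs = set()
    for col in column_indices(alignment):
        for (r1, i1), (r2, i2) in combinations(enumerate(col), 2):
            if i1 is not None and i2 is not None:
                pairs.add((r1, i1, r2, i2))
    return pairs


def sp_score(query: List[str], reference: List[str]) -> float:
    ref = aligned_pairs(reference)
    return len(ref & aligned_pairs(query)) / len(ref)


def column_score(query: List[str], reference: List[str]) -> float:
    ref_cols = column_indices(reference)
    query_cols = set(column_indices(query))
    return sum(col in query_cols for col in ref_cols) / len(ref_cols)


reference = ["AC-GT", "ACAGT"]
query = ["ACG-T", "ACAGT"]
print(sp_score(query, reference))      # 0.75: 3 of 4 reference pairs recovered
print(column_score(query, reference))  # 0.6: 3 of 5 reference columns recovered
```

In the example the misplaced gap shifts one residue by a single position, yet both scores count it simply as a miss; that insensitivity to the size of the shift is the limitation the abstract describes.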
Affiliation(s)
- Punto Bawono
- Centre for Integrative Bioinformatics (IBIVU), VU University Amsterdam, Amsterdam, The Netherlands
- * E-mail: (PB); (JH)
- Arjan van der Velde
- Centre for Integrative Bioinformatics (IBIVU), VU University Amsterdam, Amsterdam, The Netherlands
- Sanne Abeln
- Centre for Integrative Bioinformatics (IBIVU), VU University Amsterdam, Amsterdam, The Netherlands
- Jaap Heringa
- Centre for Integrative Bioinformatics (IBIVU), VU University Amsterdam, Amsterdam, The Netherlands
- Amsterdam Institute for Molecules Medicines and Systems (AIMMS), VU University Amsterdam, Amsterdam, The Netherlands
- * E-mail: (PB); (JH)
|
44
|
Herman JL, Novák Á, Lyngsø R, Szabó A, Miklós I, Hein J. Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs. BMC Bioinformatics 2015; 16:108. [PMID: 25888064 PMCID: PMC4395974 DOI: 10.1186/s12859-015-0516-1] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 03/24/2014] [Accepted: 02/24/2015] [Indexed: 11/30/2022] Open
Abstract
BACKGROUND A standard procedure in many areas of bioinformatics is to use a single multiple sequence alignment (MSA) as the basis for various types of analysis. However, downstream results may be highly sensitive to the alignment used, and neglecting the uncertainty in the alignment can lead to significant bias in the resulting inference. In recent years, a number of approaches have been developed for probabilistic sampling of alignments, rather than simply generating a single optimum. However, this type of probabilistic information is currently not widely used in the context of downstream inference, since most existing algorithms are set up to make use of a single alignment. RESULTS In this work we present a framework for representing a set of sampled alignments as a directed acyclic graph (DAG) whose nodes are alignment columns; each path through this DAG then represents a valid alignment. Since the probabilities of individual columns can be estimated from empirical frequencies, this approach enables sample-based estimation of posterior alignment probabilities. Moreover, due to conditional independencies between columns, the graph structure encodes a much larger set of alignments than the original set of sampled MSAs, such that the effective sample size is greatly increased. CONCLUSIONS The alignment DAG provides a natural way to represent a distribution in the space of MSAs, and allows for existing algorithms to be efficiently scaled up to operate on large sets of alignments. As an example, we show how this can be used to compute marginal probabilities for tree topologies, averaging over a very large number of MSAs. This framework can also be used to generate a statistically meaningful summary alignment; example applications show that this summary alignment is consistently more accurate than the majority of the alignment samples, leading to improvements in downstream tree inference. 
Implementations of the methods described in this article are available at http://statalign.github.io/WeaveAlign .
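The column-graph idea in this abstract can be sketched in a few lines of Python. The sketch below is a simplified illustration and not the WeaveAlign implementation: it assumes that two columns are the same node when every sequence has consumed the same number of residues before the column and has the same gap/residue state in it, and it assumes "-" as the gap character. All function names are hypothetical.

```python
from collections import defaultdict
from functools import lru_cache

START, END = "START", "END"


def columns(alignment):
    """Encode each column as a tuple of (residues consumed so far, is residue)."""
    consumed = [0] * len(alignment)
    cols = []
    for j in range(len(alignment[0])):
        col = []
        for i, row in enumerate(alignment):
            is_residue = row[j] != "-"
            col.append((consumed[i], is_residue))
            consumed[i] += is_residue
        cols.append(tuple(col))
    return cols


def build_dag(samples):
    """Merge sampled alignments into one graph; also count column occurrences."""
    edges = defaultdict(set)
    counts = defaultdict(int)
    for alignment in samples:
        path = [START] + columns(alignment) + [END]
        for a, b in zip(path, path[1:]):
            edges[a].add(b)
        for col in path[1:-1]:
            counts[col] += 1
    return edges, counts


def count_paths(edges):
    """Number of START-to-END paths, i.e. alignments encoded by the graph."""

    @lru_cache(maxsize=None)
    def paths_from(node):
        if node == END:
            return 1
        return sum(paths_from(nxt) for nxt in edges[node])

    return paths_from(START)


# Two sampled alignments of "AAGT" and "AGTT" that share only the G/G column.
samples = [["AAGT-", "-AGTT"], ["AAG-T", "A-GTT"]]
edges, counts = build_dag(samples)
n_paths = count_paths(edges)
shared = ((2, True), (1, True))
shared_frequency = counts[shared] / len(samples)
```

Here two samples yield four paths, because either prefix can be combined with either suffix around the shared column. This is the sense in which the graph encodes more alignments than were sampled.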
Affiliation(s)
- Joseph L Herman
- Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, UK.
- Division of Mathematical Biology, National Institute of Medical Research, The Ridgeway, London, NW7 1AA, UK.
- Ádám Novák
- Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, UK.
- Rune Lyngsø
- Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, UK.
- Adrienn Szabó
- Institute of Computer Science and Control, Hungarian Academy of Sciences, Lagymanyosi u. 11., Budapest, 1111, Hungary.
- István Miklós
- Institute of Computer Science and Control, Hungarian Academy of Sciences, Lagymanyosi u. 11., Budapest, 1111, Hungary.
- Department of Stochastics, Rényi Institute, Reáltanoda u. 13-15, Budapest, 1053, Hungary.
- Jotun Hein
- Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, UK.
|
45
|
Zhan Q, Ye Y, Lam TW, Yiu SM, Wang Y, Ting HF. Improving multiple sequence alignment by using better guide trees. BMC Bioinformatics 2015; 16 Suppl 5:S4. [PMID: 25859903 PMCID: PMC4402577 DOI: 10.1186/1471-2105-16-s5-s4] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 12/01/2022] Open
Abstract
Progressive sequence alignment is one of the most commonly used methods for multiple sequence alignment. Roughly speaking, the method first builds a guide tree, and then aligns the sequences progressively according to the topology of the tree. It is believed that guide trees are very important to progressive alignment; a better guide tree will give an alignment with higher accuracy. Recently, we have proposed an adaptive method for constructing guide trees. This paper studies the quality of the guide trees constructed by this method. Our study showed that our adaptive method can be used to improve the accuracy of many different progressive MSA tools. In fact, we present evidence showing that the guide trees constructed by the adaptive method are among the best.
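To make the guide-tree step concrete, here is a minimal Python sketch that builds a tree from pairwise distances with UPGMA (average-linkage clustering), a generic method often used for guide trees. It is not the adaptive method the abstract refers to, which is not described here. The distance values and the function name are hypothetical.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a nested-tuple guide tree from pairwise distances.

    dist maps frozenset({a, b}) to the distance between clusters a and b.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    while len(sizes) > 1:
        # Merge the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        # Distance to the new cluster is the size-weighted average.
        for c in sizes:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


# Hypothetical pairwise distances between four sequences.
pairwise = {
    ("A", "B"): 2.0, ("A", "C"): 6.0, ("A", "D"): 10.0,
    ("B", "C"): 6.0, ("B", "D"): 10.0, ("C", "D"): 8.0,
}
dist = {frozenset(pair): value for pair, value in pairwise.items()}
tree = upgma(["A", "B", "C", "D"], dist)
```

A progressive aligner would then follow the tree from the leaves upward: align A with B first, add C to that sub-alignment, and add D last.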
Affiliation(s)
- Qing Zhan
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Yongtao Ye
- HKU-BGI Bioinformatics Algorithms & Core Technology Research Lab, Computer Science Department, University of Hong Kong, Hong Kong, China
- Tak-Wah Lam
- HKU-BGI Bioinformatics Algorithms & Core Technology Research Lab, Computer Science Department, University of Hong Kong, Hong Kong, China
- Siu-Ming Yiu
- HKU-BGI Bioinformatics Algorithms & Core Technology Research Lab, Computer Science Department, University of Hong Kong, Hong Kong, China
- Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Hing-Fung Ting
- HKU-BGI Bioinformatics Algorithms & Core Technology Research Lab, Computer Science Department, University of Hong Kong, Hong Kong, China
|
46
|
Ye Y, Cheung DWL, Wang Y, Yiu SM, Zhan Q, Lam TW, Ting HF. GLProbs: Aligning Multiple Sequences Adaptively. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2015; 12:67-78. [PMID: 26357079 DOI: 10.1109/tcbb.2014.2316820] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 06/05/2023]
Abstract
This paper introduces a simple and effective approach to improve the accuracy of multiple sequence alignment. We use a natural measure to estimate the similarity of the input sequences, and based on this measure, we align the input sequences differently. For example, for inputs with high similarity, we consider the whole sequences and align them globally, while for those with moderately low similarity, we may ignore the flank regions and align them locally. To test the effectiveness of this approach, we have implemented a multiple sequence alignment tool called GLProbs and compared its performance with about one dozen leading alignment tools on three benchmark alignment databases, and GLProbs's alignments have the best scores in almost all tests. We have also evaluated the practicability of the alignments of GLProbs by applying the tool to three biological applications, namely phylogenetic tree construction, protein secondary structure prediction and the detection of high-risk members for cervical cancer in the HPV-E6 family, and the results are very encouraging.
|
47
|
Eser E, Can T, Ferhatosmanoğlu H. Div-BLAST: diversification of sequence search results. PLoS One 2014; 9:e115445. [PMID: 25531115 PMCID: PMC4274030 DOI: 10.1371/journal.pone.0115445] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/18/2014] [Accepted: 11/24/2014] [Indexed: 11/30/2022] Open
Abstract
Sequence similarity tools, such as BLAST, seek sequences most similar to a query from a database of sequences. They return results that are significantly similar to the query sequence and typically highly similar to each other. Most sequence analysis tasks in bioinformatics require an exploratory approach, where the initial results guide the user to new searches. However, diversity has not yet been considered an integral component of sequence search tools for this discipline. Some redundancy can be avoided by introducing non-redundancy during database construction, but it is not feasible to dynamically set a level of non-redundancy tailored to a query sequence. We introduce the problem of diverse search and browsing in sequence databases that produce non-redundant results optimized for any given query. We define diversity measures for sequences and propose methods to obtain diverse results extracted from current sequence similarity search tools. We also propose a new measure to evaluate the diversity of a set of sequences that is returned as a result of a sequence similarity query. We evaluate the effectiveness of the proposed methods in post-processing BLAST and PSI-BLAST results. We also assess the functional diversity of the returned results based on available Gene Ontology annotations. Additionally, we include a comparison with a current redundancy elimination tool, CD-HIT. Our experiments show that the proposed methods are able to achieve more diverse yet significant result sets compared to static non-redundancy approaches. In both sequence-based and functional diversity evaluation, the proposed diversification methods significantly outperform original BLAST results and other baselines. A web-based tool implementing the proposed methods, Div-BLAST, can be accessed at cedar.cs.bilkent.edu.tr/Div-BLAST.
Affiliation(s)
- Elif Eser
- Department of Computer Engineering, Bilkent University, Ankara, Turkey
- Tolga Can
- Department of Computer Engineering, Middle East Technical University, Ankara, Turkey
|
48
|
Lyras DP, Metzler D. ReformAlign: improved multiple sequence alignments using a profile-based meta-alignment approach. BMC Bioinformatics 2014; 15:265. [PMID: 25099134 PMCID: PMC4133627 DOI: 10.1186/1471-2105-15-265] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 03/24/2014] [Accepted: 07/29/2014] [Indexed: 11/16/2022] Open
Abstract
BACKGROUND Obtaining an accurate sequence alignment is fundamental for consistently analyzing biological data. Although this problem may be efficiently solved when only two sequences are considered, the exact inference of the optimal alignment easily becomes computationally intractable for the multiple sequence alignment case. To cope with the high computational expense, approximate heuristic methods have been proposed that address the problem indirectly by progressively aligning the sequences in pairs according to their relatedness. These methods, however, cannot revise the alignment of an already aligned group of sequences in view of new data, which compromises the quality of the resulting alignment. In this paper we present ReformAlign, a novel meta-alignment approach that may significantly improve the quality of the alignments derived from popular aligners. We call ReformAlign a meta-aligner as it requires an initial alignment, for which a variety of alignment programs can be used. The main idea behind ReformAlign is quite straightforward: first, an existing alignment is used to construct a standard profile which summarizes the initial alignment, and then all sequences are individually re-aligned against the formed profile. From each sequence-profile comparison, the alignment of each sequence against the profile is recorded, and the final alignment is indirectly inferred by merging all the individual sub-alignments into a unified set. The employment of ReformAlign may often result in alignments which are significantly more accurate than the starting alignments. RESULTS We evaluated the effect of ReformAlign on the alignments generated by ten leading alignment methods using real data of variable size and sequence identity. The experimental results suggest that the proposed meta-aligner approach may often lead to statistically significantly more accurate alignments. Furthermore, we show that ReformAlign results in more substantial improvement in cases where the starting alignment is of relatively inferior quality or when the input sequences are harder to align. CONCLUSIONS The proposed profile-based meta-alignment approach seems to be a promising and computationally efficient method that can be combined with practically all popular alignment methods and may lead to significant improvements in the generated alignments.
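The first step described in this abstract, summarizing an initial alignment as a profile, can be sketched as follows. This is a generic frequency profile written in Python under the assumption that "-" marks a gap and that gaps are counted as a symbol; it is not ReformAlign's actual profile model, and the re-alignment and merging steps are not shown. The function name is hypothetical.

```python
from collections import Counter


def build_profile(alignment):
    """Return, for each column, the relative frequency of every symbol."""
    n_sequences = len(alignment)
    return [
        {symbol: count / n_sequences for symbol, count in Counter(column).items()}
        for column in zip(*alignment)
    ]


# A hypothetical initial alignment produced by some other aligner.
initial = ["ACG-T", "A-GAT", "ACGAT"]
profile = build_profile(initial)
```

Each sequence would then be re-aligned against `profile` with a sequence-to-profile scoring scheme, which is the part that gives the meta-aligner a chance to revise earlier gap placements.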
Affiliation(s)
- Dimitrios P Lyras
- Faculty of Biology, Department II, Ludwig-Maximilians Universität München, Planegg-Martinsried 82152, Germany.
|
49
|
A comparative assessment and analysis of 20 representative sequence alignment methods for protein structure prediction. Sci Rep 2014; 3:2619. [PMID: 24018415 PMCID: PMC3965362 DOI: 10.1038/srep02619] [Citation(s) in RCA: 128] [Impact Index Per Article: 12.8] [Received: 07/01/2013] [Accepted: 08/22/2013] [Indexed: 11/08/2022] Open
Abstract
Protein sequence alignment is essential for template-based protein structure prediction and function annotation. We collect 20 sequence alignment algorithms, 10 published and 10 newly developed, which cover all representative sequence- and profile-based alignment approaches. These algorithms are benchmarked on 538 non-redundant proteins for protein fold-recognition on a uniform template library. Results demonstrate a dominant advantage of profile-profile based methods, which generate models with an average TM-score 26.5% higher than sequence-profile methods and 49.8% higher than sequence-sequence alignment methods. There is no obvious difference in results between methods with profiles generated from the PSI-BLAST PSSM matrix and from hidden Markov models. Accuracy of profile-profile alignments can be further improved by 9.6% or 21.4% when predicted or native structure features are incorporated. Nevertheless, TM-scores from profile-profile methods including experimental structural features are still 37.1% lower than those from TM-align, demonstrating that the fold-recognition problem cannot be solved solely by improving accuracy of structure feature predictions.
|
50
|
Gudyś A, Deorowicz S. QuickProbs--a fast multiple sequence alignment algorithm designed for graphics processors. PLoS One 2014; 9:e88901. [PMID: 24586435 PMCID: PMC3934876 DOI: 10.1371/journal.pone.0088901] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 08/12/2013] [Accepted: 01/15/2014] [Indexed: 12/03/2022] Open
Abstract
Multiple sequence alignment is a crucial task in a number of biological analyses like secondary structure prediction, domain searching, phylogeny, etc. MSAProbs is currently the most accurate alignment algorithm, but its effectiveness is obtained at the expense of computational time. In this paper we present QuickProbs, a variant of MSAProbs customised for graphics processors. We selected the two most time-consuming stages of MSAProbs to be redesigned for GPU execution: the posterior matrices calculation and the consistency transformation. Experiments on three popular benchmarks (BAliBASE, PREFAB, OXBench-X) on a quad-core PC equipped with a high-end graphics card show QuickProbs to be 5.7 to 9.7 times faster than the original CPU-parallel MSAProbs. Additional tests performed on several protein families from the Pfam database give an overall speed-up of 6.7. Compared to other algorithms like MAFFT, MUSCLE, or ClustalW, QuickProbs proved to be much more accurate at similar speed. Additionally, we introduce a tuned variant of QuickProbs which is significantly more accurate on sets of distantly related sequences than MSAProbs without exceeding its computation time. The GPU part of QuickProbs was implemented in OpenCL, thus the package is suitable for graphics processors produced by all major vendors.
Affiliation(s)
- Adam Gudyś
- Institute of Informatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
- * E-mail:
- Sebastian Deorowicz
- Institute of Informatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
|