1
Julian AT, Pombert JF. SYNY: a pipeline to investigate and visualize collinearity between genomes. bioRxiv 2024:2024.05.09.593317. [PMID: 38798446 PMCID: PMC11118330 DOI: 10.1101/2024.05.09.593317] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/29/2024]
Abstract
Investigating collinearity between chromosomes is often used in comparative genomics to help identify gene orthologs, pinpoint genes that might have been overlooked as part of annotation processes and/or perform various evolutionary inferences. Collinear segments, also known as syntenic blocks, can be inferred from sequence alignments and/or from the identification of genes arrayed in the same order and relative orientations between investigated genomes. To help perform these analyses and assess their outcomes, we built a simple pipeline called SYNY (for synteny) that implements the two distinct approaches and produces different visualizations. The SYNY pipeline was built with ease of use in mind and runs on modest hardware. The pipeline is written in Perl and Python and is available on GitHub (https://github.com/PombertLab/SYNY) under the permissive MIT license.
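The gene-order approach described in this abstract can be illustrated with a minimal sketch. It is not SYNY's implementation: the tuple layout, the `collinear_blocks` name, and the strict no-gap adjacency rule are all assumptions chosen for brevity.

```python
from typing import List, Tuple

# Hypothetical input: one tuple per ortholog pair, holding the gene's rank
# along a chromosome of genome A, its rank in genome B, and the relative
# orientation (+1 same strand, -1 opposite strand).
Pair = Tuple[int, int, int]

def collinear_blocks(pairs: List[Pair], min_size: int = 2) -> List[List[Pair]]:
    """Group ortholog pairs into runs that are adjacent in both genomes."""
    blocks: List[List[Pair]] = []
    current: List[Pair] = []
    for pair in sorted(pairs):
        if current:
            prev_a, prev_b, prev_o = current[-1]
            a, b, o = pair
            # A run continues when the gene is the next one in A, keeps the
            # same orientation, and steps by +1 (forward) or -1 (inverted) in B.
            same_run = (a - prev_a == 1 and o == prev_o and b - prev_b == o)
            if not same_run:
                if len(current) >= min_size:
                    blocks.append(current)
                current = []
        current.append(pair)
    if len(current) >= min_size:
        blocks.append(current)
    return blocks

pairs = [(0, 10, 1), (1, 11, 1), (2, 12, 1), (3, 30, -1), (4, 29, -1), (6, 50, 1)]
blocks = collinear_blocks(pairs)
for block in blocks:
    print(block)
```

In this toy input the first three genes form a forward block, the next two form an inverted block, and the last gene is left out because it has no adjacent partner. A practical tool would typically tolerate gaps between neighbouring genes, which this sketch does not.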
2
Wilson D, Rogers JD. Evaluating Compression-Based Phylogeny Estimation in the Presence of Incomplete Lineage Sorting. J Comput Biol 2023; 30:250-260. [PMID: 36848254 DOI: 10.1089/cmb.2022.0197] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/01/2023]
Abstract
This study assesses characteristics of the normalized compression distance (NCD) technique for building phylogenetic trees from molecular data. We examined results from a mammalian biological data set as well as a collection of simulated data with varying levels of incomplete lineage sorting. The implementation of NCD we analyze is a concatenation-based, distance-based, alignment-free, and model-free phylogeny estimation method, which takes concatenated unaligned sequence data as input and outputs a matrix of distances. We compare the NCD phylogeny estimation method with various other methods, including coalescent- and concatenation-based methods.
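The normalized compression distance is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. The sketch below applies that definition to toy sequences and builds a distance matrix. The choice of `zlib` as the compressor, the function names, and the simulated sequences are assumptions; the abstract does not say which compressor the study used.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (compressor choice is an assumption)."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def distance_matrix(seqs: dict) -> dict:
    """Pairwise NCD for every pair of named sequences."""
    names = list(seqs)
    return {a: {b: (0.0 if a == b else ncd(seqs[a], seqs[b])) for b in names}
            for a in names}

# Toy data: taxon_b is a lightly mutated copy of taxon_a, taxon_c is unrelated.
rng = random.Random(0)
ancestor = [rng.choice("ACGT") for _ in range(2000)]
relative = ancestor.copy()
for pos in rng.sample(range(2000), 20):
    relative[pos] = rng.choice("ACGT")
unrelated = [rng.choice("ACGT") for _ in range(2000)]

seqs = {
    "taxon_a": "".join(ancestor).encode(),
    "taxon_b": "".join(relative).encode(),
    "taxon_c": "".join(unrelated).encode(),
}
matrix = distance_matrix(seqs)
print(matrix["taxon_a"]["taxon_b"], matrix["taxon_a"]["taxon_c"])
```

The resulting matrix can be passed to any distance-based tree builder. Because compressing x followed by y is not guaranteed to give the same size as y followed by x, the matrix may be slightly asymmetric; averaging the two directions is one common workaround.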
Affiliation(s)
- Deangelo Wilson
  - School of Computing, DePaul University, Chicago, Illinois, USA
- John D Rogers
  - School of Computing, DePaul University, Chicago, Illinois, USA
3
Ray M, Sarkar S, Rath SN. Druggability for COVID-19: in silico discovery of potential drug compounds against nucleocapsid (N) protein of SARS-CoV-2. Genomics Inform 2020; 18:e43. [PMID: 33412759 PMCID: PMC7808868 DOI: 10.5808/gi.2020.18.4.e43] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 09/07/2020] [Revised: 10/19/2020] [Accepted: 10/20/2020] [Indexed: 12/13/2022]
Abstract
Coronavirus disease 2019 (COVID-19) is a contagious disease that has caused widespread mortality and morbidity throughout the world. The unavailability of vaccines and effective antiviral drugs has encouraged researchers to identify potential antiviral drugs for use against the virus. The RNA-binding domain in the nucleocapsid (N) protein of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which serves multiple critical functions during the viral life cycle, especially in viral replication, could be a potential drug target. Since vaccine development might take some time, the identification of a drug compound targeting viral replication might offer a solution for treatment. The study analyzed the phylogenetic relationship and sequence divergence of the N protein across 49 other coronavirus species and identified conserved regions by protein family through a conserved domain search. Molecular docking identified a few natural and/or synthetic phytocompounds or drugs with good structural binding affinities for the N protein. The selected compounds formed comparatively high numbers of hydrogen bonds, which supports their druggability. Among them, the established antiviral drug glycyrrhizic acid and the phytochemical theaflavin can be considered possible drug compounds against the N protein of SARS-CoV-2, as they showed lower (more favorable) docking binding scores. The findings of this study might lead to the development of a drug for SARS-CoV-2-mediated disease and offer a solution for the treatment of SARS-CoV-2 infection.
Affiliation(s)
- Manisha Ray
  - All India Institute of Medical Sciences, Bhubaneswar, Odisha 751019, India
- Saurav Sarkar
  - All India Institute of Medical Sciences, Bhubaneswar, Odisha 751019, India
- Surya Narayan Rath
  - Department of Bioinformatics, Odisha University of Agriculture and Technology, Bhubaneswar, Odisha 751003, India
4
Bhattacharjee A, Bayzid MS. Machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices. BMC Genomics 2020; 21:497. [PMID: 32689946 PMCID: PMC7370488 DOI: 10.1186/s12864-020-06892-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 01/27/2020] [Accepted: 07/07/2020] [Indexed: 02/08/2023]
Abstract
BACKGROUND With the rapid growth in the number of newly sequenced genomes, species tree inference from genes sampled throughout the whole genome has become a basic task in comparative and evolutionary biology. However, substantial challenges remain in leveraging these large-scale molecular data. One of the foremost challenges is to develop efficient methods that can handle missing data. Popular distance-based methods, such as NJ (neighbor joining) and UPGMA (unweighted pair group method with arithmetic mean), require complete distance matrices without any missing data. RESULTS We introduce two highly accurate machine learning-based distance imputation techniques. These methods are based on matrix factorization and autoencoder-based deep learning architectures. We evaluated these two methods on a collection of simulated and biological datasets. Experimental results suggest that our proposed methods match or improve upon the best alternative distance imputation techniques. Moreover, these methods are scalable to large datasets with hundreds of taxa and can handle a substantial amount of missing data. CONCLUSIONS This study shows, for the first time, the power and feasibility of applying deep learning techniques for imputing distance matrices. Thus, this study advances the state of the art in phylogenetic tree construction in the presence of missing data. The proposed methods are available in open source form at https://github.com/Ananya-Bhattacharjee/ImputeDistances.
Affiliation(s)
- Ananya Bhattacharjee
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205 Bangladesh
  - Department of Computer Science and Engineering, Eastern University, Dhaka, Bangladesh
- Md. Shamsuzzoha Bayzid
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205 Bangladesh
5
Nwaiwu O, Aduba CC. An in silico analysis of acquired antimicrobial resistance genes in Aeromonas plasmids. AIMS Microbiol 2020; 6:75-91. [PMID: 32226916 PMCID: PMC7099201 DOI: 10.3934/microbiol.2020005] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 02/12/2020] [Accepted: 03/13/2020] [Indexed: 12/17/2022]
Abstract
Sequences of 105 Aeromonas species plasmids were probed for acquired antimicrobial resistance (AMR) genes using a bioinformatics approach. The plasmids showed no positive linear correlation between size and GC content, and up to 55 acquired AMR genes were found in 39 (37%) plasmids after in silico screening for resistance against 15 antibiotic drug classes. Overall, the potential multiple antibiotic resistance (p-MAR) index ranged from 0.07 to 0.53. Up to 18 plasmids were predicted to mediate multiple drug resistance (MDR). Plasmids pS121-1a (A. salmonicida), pWCX23_1 (A. hydrophila) and pASP-a58 (A. veronii) harboured 18, 15 and 14 AMR genes, respectively. The five drug classes for which AMR genes were most frequently detected were aminoglycosides (27%), beta-lactams (17%), sulphonamides (13%), fluoroquinolones (13%) and phenicols (10%). The most prevalent genes were the sulphonamide resistance gene sul1, the aminoglycoside resistance gene aac(6')-Ib-cr (aminoglycoside 6'-N-acetyltransferase type Ib-cr) and blaKPC-2, which encodes a carbapenemase. Plasmid acquisition of AMR genes was mainly inter-genus rather than intra-genus. Eighteen plasmids showed template or host genes acquired from Pseudomonas monteilii, Salmonella enterica or Escherichia coli. The most frequent antimicrobial resistance determinants (ARDs) were beta-lactamases, followed by aminoglycoside acetyltransferases and then efflux pumps. Screening of new isolates in vitro and in vivo is required to ascertain the level of phenotypic expression of the colistin resistance genes and other acquired AMR genes detected.
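A multiple antibiotic resistance index is conventionally the number of agents (here, drug classes) to which resistance is found divided by the number tested. The sketch below assumes the p-MAR index in this abstract follows that convention with the 15 screened drug classes as the denominator; the function name is hypothetical. Under that assumption, the reported range of 0.07 to 0.53 is consistent with plasmids carrying resistance genes for between 1 and 8 of the 15 classes.

```python
def mar_index(resistant_classes: int, classes_tested: int) -> float:
    """Fraction of tested drug classes for which a resistance gene was detected."""
    if classes_tested <= 0:
        raise ValueError("classes_tested must be positive")
    if not 0 <= resistant_classes <= classes_tested:
        raise ValueError("resistant_classes must lie between 0 and classes_tested")
    return resistant_classes / classes_tested

CLASSES_SCREENED = 15  # number of antibiotic drug classes screened in the abstract

lowest = round(mar_index(1, CLASSES_SCREENED), 2)
highest = round(mar_index(8, CLASSES_SCREENED), 2)
print(lowest, highest)
```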
Affiliation(s)
- Ogueri Nwaiwu
  - School of Biosciences, University of Nottingham, Sutton Bonington Campus, United Kingdom
- Chiugo Claret Aduba
  - Department of Science Laboratory Technology, University of Nigeria, Nsukka, Nigeria
6
Ali RH, Bogusz M, Whelan S. Identifying Clusters of High Confidence Homologies in Multiple Sequence Alignments. Mol Biol Evol 2019; 36:2340-2351. [PMID: 31209473 PMCID: PMC6933875 DOI: 10.1093/molbev/msz142] [Citation(s) in RCA: 52] [Impact Index Per Article: 10.4] [Indexed: 12/16/2022]
Abstract
Multiple sequence alignment (MSA) is ubiquitous in evolution and bioinformatics. MSAs are usually taken to be a known and fixed quantity on which to perform downstream analysis, despite extensive evidence that MSA accuracy and uncertainty affect results. These errors are known to cause a wide range of problems for downstream evolutionary inference, ranging from false inference of positive selection to long branch attraction artifacts. The most popular approach to dealing with this problem is to remove (filter) specific columns in the MSA that are thought to be prone to error. Although popular, this approach has had mixed success, and several studies have even suggested that filtering might be detrimental to phylogenetic studies. We present a graph-based clustering method to address MSA uncertainty and error in the software Divvier (available at https://github.com/simonwhelan/Divvier), which uses a probabilistic model to identify clusters of characters that have strong statistical evidence of shared homology. These clusters can then be used to either filter characters from the MSA (partial filtering) or represent each of the clusters in a new column (divvying). We validate Divvier through its performance on real and simulated benchmarks, finding that Divvier substantially outperforms existing filtering software by retaining more true pairwise homology calls and removing more false positive pairwise homologies. We also find that Divvier, in contrast to other filtering tools, can alleviate long branch attraction artifacts induced by MSA and reduces the variation in tree estimates caused by MSA uncertainty.
Affiliation(s)
- Raja Hashim Ali
  - Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Uppsala, Sweden
  - Faculty of Computer Science and Engineering, Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi, Pakistan
- Marcin Bogusz
  - Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Uppsala, Sweden
- Simon Whelan
  - Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Uppsala, Sweden
7
Vizueta J, Rozas J, Sánchez-Gracia A. Comparative Genomics Reveals Thousands of Novel Chemosensory Genes and Massive Changes in Chemoreceptor Repertories across Chelicerates. Genome Biol Evol 2018; 10:1221-1236. [PMID: 29788250 PMCID: PMC5952958 DOI: 10.1093/gbe/evy081] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Accepted: 04/17/2018] [Indexed: 12/15/2022]
Abstract
Chemoreception is a widespread biological function that is essential for the survival, reproduction, and social communication of animals. Though the molecular mechanisms underlying chemoreception are relatively well known in insects, they are poorly studied in the other major arthropod lineages. Current availability of a number of chelicerate genomes constitutes a great opportunity to better characterize gene families involved in this important function in a lineage that emerged and colonized land independently of insects. At the same time, that offers new opportunities and challenges for the study of this interesting animal branch in many translational research areas. Here, we have performed a comprehensive comparative genomics study that explicitly considers the high fragmentation of available draft genomes and that for the first time included complete genome data that cover most of the chelicerate diversity. Our exhaustive searches exposed thousands of previously uncharacterized chemosensory sequences, most of them encoding members of the gustatory and ionotropic receptor families. The phylogenetic and gene turnover analyses of these sequences indicated that the whole-genome duplication events proposed for this subphylum would not explain the differences in the number of chemoreceptors observed across species. A constant and prolonged gene birth and death process, altered by episodic bursts of gene duplication yielding lineage-specific expansions, has contributed significantly to the extant chemosensory diversity in this group of animals. This study also provides valuable insights into the origin and functional diversification of other relevant chemosensory gene families different from receptors, such as odorant-binding proteins and other related molecules.
Affiliation(s)
- Joel Vizueta
  - Departament de Genètica, Microbiologia i Estadística and Institut de Recerca de la Biodiversitat (IRBio), Facultat de Biologia, Universitat de Barcelona, Barcelona, Spain
- Julio Rozas
  - Departament de Genètica, Microbiologia i Estadística and Institut de Recerca de la Biodiversitat (IRBio), Facultat de Biologia, Universitat de Barcelona, Barcelona, Spain
- Alejandro Sánchez-Gracia
  - Departament de Genètica, Microbiologia i Estadística and Institut de Recerca de la Biodiversitat (IRBio), Facultat de Biologia, Universitat de Barcelona, Barcelona, Spain
8
Levy Karin E, Shkedy D, Ashkenazy H, Cartwright RA, Pupko T. Inferring Rates and Length-Distributions of Indels Using Approximate Bayesian Computation. Genome Biol Evol 2017; 9:1280-1294. [PMID: 28453624 PMCID: PMC5438127 DOI: 10.1093/gbe/evx084] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Accepted: 04/25/2017] [Indexed: 02/07/2023]
Abstract
The most common evolutionary events at the molecular level are single-base substitutions, as well as insertions and deletions (indels) of short DNA segments. A large body of research has been devoted to developing probabilistic substitution models and to inferring their parameters using likelihood and Bayesian approaches. In contrast, relatively little has been done to model indel dynamics, probably due to the difficulty in writing explicit likelihood functions. Here, we contribute to the effort of modeling indel dynamics by presenting SpartaABC, an approximate Bayesian computation (ABC) approach to infer indel parameters from sequence data (either aligned or unaligned). SpartaABC circumvents the need to use an explicit likelihood function by extracting summary statistics from simulated sequences. First, summary statistics are extracted from the input sequence data. Second, SpartaABC samples indel parameters from a prior distribution and uses them to simulate sequences. Third, it computes summary statistics from the simulated sets of sequences. By computing a distance between the summary statistics extracted from the input and each simulation, SpartaABC can provide an approximation to the posterior distribution of indel parameters as well as point estimates. We study the performance of our methodology and show that it provides accurate estimates of indel parameters in simulations. We next demonstrate the utility of SpartaABC by studying the impact of alignment errors on the inference of positive selection. A C++ program implementing SpartaABC is freely available at http://spartaabc.tau.ac.il.
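The three steps in this abstract follow the general pattern of ABC rejection sampling, which a toy sketch can make concrete. This is not SpartaABC: the simulator, the single summary statistic (an indel count), the uniform prior on [0, 0.5], and all names below are assumptions chosen so the example stays small.

```python
import random

def simulate_indel_count(rate: float, n_sites: int, rng: random.Random) -> int:
    """Toy simulator: each site independently carries an indel with probability `rate`."""
    return sum(1 for _ in range(n_sites) if rng.random() < rate)

def abc_rejection(observed_count: int, n_sites: int, n_draws: int,
                  tolerance: int, rng: random.Random) -> list:
    """Keep prior draws whose simulated summary statistic is close to the observed one."""
    accepted = []
    for _ in range(n_draws):
        rate = rng.uniform(0.0, 0.5)                          # sample from the prior
        simulated = simulate_indel_count(rate, n_sites, rng)  # simulate data
        if abs(simulated - observed_count) <= tolerance:      # compare summaries
            accepted.append(rate)
    return accepted

rng = random.Random(42)
posterior = abc_rejection(observed_count=20, n_sites=200,
                          n_draws=5000, tolerance=2, rng=rng)
estimate = sum(posterior) / len(posterior)
print(len(posterior), round(estimate, 3))
```

The accepted draws approximate the posterior distribution of the rate, and their mean serves as a point estimate. Tightening `tolerance` sharpens the approximation at the cost of fewer accepted draws.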
Affiliation(s)
- Eli Levy Karin
  - Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Israel
  - Department of Molecular Biology & Ecology of Plants, George S. Wise Faculty of Life Sciences, Tel Aviv University, Israel
- Dafna Shkedy
  - Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Israel
- Haim Ashkenazy
  - Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Israel
- Reed A Cartwright
  - The Biodesign Institute, Arizona State University, Tempe, AZ
  - School of Life Sciences, Arizona State University, Tempe, AZ
- Tal Pupko
  - Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Israel
9
Mavrodiev EV, Dell C, Schroder L. A laid-back trip through the Hennigian Forests. PeerJ 2017; 5:e3578. [PMID: 28740753 PMCID: PMC5522724 DOI: 10.7717/peerj.3578] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.0] [Received: 04/13/2017] [Accepted: 06/23/2017] [Indexed: 11/20/2022]
Abstract
BACKGROUND This paper is a comment on the idea of matrix-free Cladistics. Demonstrating this idea's efficiency is a major goal of the study. Within the proposed framework, the ordinary (phenetic) matrix is necessary only as a "source" of Hennigian trees, not as a primary subject of the analysis. Switching from matrix-based thinking to the matrix-free Cladistic approach clearly reveals that optimizations of the character-state changes are related not to real processes, but to the form of the data representation. METHODS We focused our study on binary data. We wrote the simple Ruby-based script FORESTER version 1.0, which helps represent a binary matrix as an array of rooted trees (as a "Hennigian forest"). The binary representations of the genomic (DNA) data were made by script 1001. The Average Consensus method as well as the standard Maximum Parsimony (MP) approach were used to analyze the data. PRINCIPAL FINDINGS The binary matrix may be easily rewritten as a set of rooted trees (maximal relationships). The latter might be analyzed by the Average Consensus method. Paradoxically, this method, if applied to the Hennigian forests, can in principle help to identify clades despite the absence of direct evidence from the primary data. Our approach may handle clock-like or non-clock-like matrices, as well as hypothetical, molecular or morphological data. DISCUSSION Our proposal clearly differs from the numerous phenetic alignment-free techniques for constructing phylogenetic trees. Dealing with the relations, not with the actual "data", also distinguishes our approach from all optimization-based methods, if optimization is defined as a way to reconstruct the sequences of the character-state changes on a tree, whether by the standard alignment-based techniques or the "direct" alignment-free procedure.
We are not viewing our recent framework as an alternative to the three-taxon statement analysis (3TA), but there are two major differences between our recent proposal and the 3TA, as originally designed and implemented: (1) the 3TA deals with the three-taxon statements or minimal relationships. According to the logic of 3TA, the set of the minimal trees must be established as a binary matrix and used as an input for the parsimony program. In this paper, we operate directly with maximal relationships written just as trees, not as binary matrices, while also using the Average Consensus method instead of the MP analysis. The solely 'reversal'-based groups can always be found by our method without the separate scoring of the putative reversals before analyses.
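The idea of rewriting a binary matrix as a set of rooted trees can be sketched as follows. This is not the FORESTER script: the function names, the Newick output, and the convention that state 1 marks the grouping while state 0 sits at the root are assumptions made for illustration.

```python
from typing import Dict, List

def character_to_tree(taxa: List[str], states: List[int]) -> str:
    """Write one binary character as a rooted tree in Newick format.

    Assumption: taxa scored 1 form a clade; taxa scored 0 attach at the root.
    """
    inside = [t for t, s in zip(taxa, states) if s == 1]
    outside = [t for t, s in zip(taxa, states) if s == 0]
    if len(inside) < 2 or not outside:
        # The character groups nothing, so it yields an unresolved tree.
        return "(" + ",".join(taxa) + ");"
    return "(" + ",".join(outside + ["(" + ",".join(inside) + ")"]) + ");"

def matrix_to_forest(matrix: Dict[str, List[int]]) -> List[str]:
    """Convert every column of a taxon-by-character binary matrix into a tree."""
    taxa = list(matrix)
    n_chars = len(next(iter(matrix.values())))
    return [character_to_tree(taxa, [matrix[t][i] for t in taxa])
            for i in range(n_chars)]

matrix = {
    "A": [0, 0, 0],
    "B": [0, 1, 0],
    "C": [1, 1, 0],
    "D": [1, 1, 1],
}
forest = matrix_to_forest(matrix)
for tree in forest:
    print(tree)
```

Each tree in the resulting list can then be handed to a consensus or supertree method instead of analyzing the matrix itself, which is the shift in perspective the abstract describes.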
Affiliation(s)
- Evgeny V. Mavrodiev
  - University of Florida, Florida Museum of Natural History, Gainesville, FL, USA
- Christopher Dell
  - University of Florida, Florida Museum of Natural History, Gainesville, FL, USA
- Laura Schroder
  - Department of Anatomy and Neurobiology of University of Tennessee, University of Tennessee Health Science Center Washington, Memphis, TN, USA
  - Washington, Wyoming, Alaska, Montana and Idaho Medical Education Program, University of Idaho, Moscow, ID, USA
10
Nojoomi S, Koehl P. String kernels for protein sequence comparisons: improved fold recognition. BMC Bioinformatics 2017; 18:137. [PMID: 28245816 PMCID: PMC5331664 DOI: 10.1186/s12859-017-1560-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 10/21/2016] [Accepted: 02/23/2017] [Indexed: 11/28/2022]
Abstract
BACKGROUND The amino acid sequence of a protein is the blueprint from which its structure and ultimately function can be derived. Therefore, sequence comparison methods remain essential for the determination of similarity between proteins. Traditional approaches for comparing two protein sequences begin with strings of letters (amino acids) that represent the sequences, before generating textual alignments between these strings and providing scores for each alignment. When the similarity between the two protein sequences to be compared is low, however, the quality of the corresponding sequence alignment is usually poor, leading to poor performance for the recognition of similarity. RESULTS In this study, we develop an alignment-free alternative to these methods that is based on the concept of string kernels. Starting from recently proposed kernels on the discrete space of protein sequences (Shen et al., Found. Comput. Math., 2013, 14:951-984), we introduce our own version, SeqKernel. Its implementation depends on two parameters: a coefficient that tunes the substitution matrix and the maximum length of k-mers that it includes. We provide an exhaustive analysis of the impacts of these two parameters on the performance of SeqKernel for fold recognition. We show that with the right choice of parameters, use of the SeqKernel similarity measure improves fold recognition compared to the use of traditional alignment-based methods. We illustrate the application of SeqKernel to inferring phylogeny on RNA polymerases and show that it performs as well as methods based on multiple sequence alignments. CONCLUSION We have presented and characterized a new alignment-free method based on a mathematical kernel for scoring the similarity of protein sequences. We discuss possible improvements of this method, as well as an extension of its applications to other modeling methods that rely on sequence comparison.
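To show what an alignment-free string kernel looks like in practice, the sketch below implements a plain k-mer spectrum kernel with a maximum k-mer length. It is a deliberately simpler stand-in for SeqKernel: it counts exact k-mer matches only and omits the substitution-matrix coefficient that the abstract describes. All function names are hypothetical, and both sequences are assumed to be non-empty.

```python
from collections import Counter
from math import sqrt

def kmer_counts(seq: str, max_k: int) -> Counter:
    """Count every substring of length 1..max_k in the sequence."""
    counts: Counter = Counter()
    for k in range(1, max_k + 1):
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def spectrum_kernel(x: str, y: str, max_k: int = 3) -> float:
    """Inner product of the two k-mer count vectors (exact matches only)."""
    cx, cy = kmer_counts(x, max_k), kmer_counts(y, max_k)
    return float(sum(cx[w] * cy[w] for w in cx if w in cy))

def normalized_kernel(x: str, y: str, max_k: int = 3) -> float:
    """Cosine-style normalization so that a sequence scores 1.0 against itself."""
    kxy = spectrum_kernel(x, y, max_k)
    return kxy / sqrt(spectrum_kernel(x, x, max_k) * spectrum_kernel(y, y, max_k))

close = normalized_kernel("MKVLAAGI", "MKVLSAGI")
far = normalized_kernel("MKVLAAGI", "WWPPYYHH")
print(close, far)
```

A similarity score of this kind can be computed for any pair of sequences without building an alignment, which is what makes kernel methods attractive when sequence similarity is too low for reliable alignment.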
Affiliation(s)
- Saghi Nojoomi
  - Biotechnology program, University of California, Davis, 1, Shields Avenue, Davis, CA, 95616 USA
- Patrice Koehl
  - Department of Computer Science and Genome Center, 1, Shields Avenue, Davis, CA, 95616 USA