1
Arora UP, Dumont BL. Molecular evolution of the mammalian kinetochore complex. bioRxiv 2024:2024.06.27.600994. [PMID: 38979348 PMCID: PMC11230421 DOI: 10.1101/2024.06.27.600994] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/10/2024]
Abstract
Mammalian centromeres are satellite-rich chromatin domains that serve as sites for kinetochore complex assembly. Centromeres are highly variable in sequence and satellite organization across species, but the processes that govern the co-evolutionary dynamics between rapidly evolving centromeres and their associated kinetochore proteins remain poorly understood. Here, we pursue a course of phylogenetic analyses to investigate the molecular evolution of the complete kinetochore complex across primate and rodent species with divergent centromere repeat sequences and features. We show that many protein components of the core centromere associated network (CCAN) harbor signals of adaptive evolution, consistent with their intimate association with centromere satellite DNA and roles in the stability and recruitment of additional kinetochore proteins. Surprisingly, CCAN and outer kinetochore proteins exhibit comparable rates of adaptive divergence, suggesting that changes in centromere DNA can ripple across the kinetochore to drive adaptive protein evolution within distant domains of the complex. Our work further identifies kinetochore proteins subject to lineage-specific adaptive evolution, including rapidly evolving proteins in species with centromere satellites characterized by higher-order repeat structure and lacking CENP-B boxes. Thus, features of centromeric chromatin beyond the linear DNA sequence may drive selection on kinetochore proteins. Overall, our work spotlights adaptively evolving proteins with diverse centromere-associated functions, including centromere chromatin structure, kinetochore protein assembly, kinetochore-microtubule association, cohesion maintenance, and DNA damage response pathways. These adaptively evolving kinetochore protein candidates present compelling opportunities for future functional investigations exploring how their concerted changes with centromere DNA ensure the maintenance of genome stability.
2
Czernecki D, Nourisson A, Legrand P, Delarue M. Reclassification of family A DNA polymerases reveals novel functional subfamilies and distinctive structural features. Nucleic Acids Res 2023; 51:4488-4507. [PMID: 37070157 PMCID: PMC10201439 DOI: 10.1093/nar/gkad242] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2022] [Revised: 03/07/2023] [Accepted: 03/24/2023] [Indexed: 04/19/2023]
Abstract
Family A DNA polymerases (PolAs) form an important and well-studied class of extant polymerases participating in DNA replication and repair. Nonetheless, despite the characterization of multiple subfamilies in independent, dedicated works, their comprehensive classification thus far is missing. We therefore re-examine all presently available PolA sequences, converting their pairwise similarities into positions in Euclidean space, separating them into 19 major clusters. While 11 of them correspond to known subfamilies, eight had not been characterized before. For every group, we compile their general characteristics, examine their phylogenetic relationships and perform conservation analysis in the essential sequence motifs. While most subfamilies are linked to a particular domain of life (including phages), one subfamily appears in Bacteria, Archaea and Eukaryota. We also show that two new bacterial subfamilies contain functional enzymes. We use AlphaFold2 to generate high-confidence prediction models for all clusters lacking an experimentally determined structure. We identify new, conserved features involving structural alterations, ordered insertions and an apparent structural incorporation of a uracil-DNA glycosylase (UDG) domain. Finally, genetic and structural analyses of a subset of T7-like phages indicate a splitting of the 3'-5' exo and pol domains into two separate genes, observed in PolAs for the first time.
Affiliation(s)
- Dariusz Czernecki
  - Institut Pasteur, Université Paris Cité, CNRS UMR 3528, Unit of Architecture and Dynamics of Biological Macromolecules, 75015 Paris, France
  - Sorbonne Université, Collège Doctoral, ED 515, 75005 Paris, France
- Antonin Nourisson
  - Institut Pasteur, Université Paris Cité, CNRS UMR 3528, Unit of Architecture and Dynamics of Biological Macromolecules, 75015 Paris, France
  - Sorbonne Université, Collège Doctoral, ED 515, 75005 Paris, France
- Pierre Legrand
  - Institut Pasteur, Université Paris Cité, CNRS UMR 3528, Unit of Architecture and Dynamics of Biological Macromolecules, 75015 Paris, France
  - Synchrotron SOLEIL, L’Orme des Merisiers, 91190 Saint-Aubin, France
- Marc Delarue
  - Institut Pasteur, Université Paris Cité, CNRS UMR 3528, Unit of Architecture and Dynamics of Biological Macromolecules, 75015 Paris, France
3
Xie R, Zan X, Chu L, Su Y, Xu P, Liu W. Study of the error correction capability of multiple sequence alignment algorithm (MAFFT) in DNA storage. BMC Bioinformatics 2023; 24:111. [PMID: 36959531 PMCID: PMC10037887 DOI: 10.1186/s12859-023-05237-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2022] [Accepted: 03/17/2023] [Indexed: 03/25/2023]
Abstract
Synchronization (insertions-deletions) errors are still a major challenge for reliable information retrieval in DNA storage. Unlike traditional error correction codes (ECC) that add redundancy in the stored information, multiple sequence alignment (MSA) solves this problem by searching the conserved subsequences. In this paper, we conduct a comprehensive simulation study on the error correction capability of a typical MSA algorithm, MAFFT. Our results reveal that its capability exhibits a phase transition when there are around 20% errors. Below this critical value, increasing sequencing depth can eventually allow it to approach complete recovery. Otherwise, its performance plateaus at some poor levels. Given a reasonable sequencing depth (≤ 70), MSA could achieve complete recovery in the low error regime, and effectively correct 90% of the errors in the medium error regime. In addition, MSA is robust to imperfect clustering. It could also be combined with other means such as ECC, repeated markers, or any other code constraints. Furthermore, by selecting an appropriate sequencing depth, this strategy could achieve an optimal trade-off between cost and reading speed. MSA could be a competitive alternative for future DNA storage.
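The idea of recovering a stored sequence from aligned noisy reads can be sketched as a column-wise majority vote. This is only an illustrative sketch, not the authors' pipeline: it assumes the reads have already been aligned (for example by an MSA tool) into equal-length rows that use `-` for gaps, and the function name `consensus` is hypothetical.

```python
from collections import Counter

GAP = "-"  # assumed gap symbol in the aligned rows


def consensus(aligned_reads):
    """Return the column-wise majority sequence of equal-length aligned reads.

    Columns whose most common symbol is a gap are dropped, which removes
    insertions present in only a minority of reads.
    """
    if not aligned_reads:
        return ""
    width = len(aligned_reads[0])
    if any(len(read) != width for read in aligned_reads):
        raise ValueError("aligned reads must all have the same length")

    result = []
    for column in zip(*aligned_reads):
        # most_common(1) returns the single most frequent symbol in the column
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != GAP:
            result.append(symbol)
    return "".join(result)


# Four noisy copies of "ACGTAC": one clean, one deletion, one insertion,
# one substitution. The alignment itself is assumed to be given.
reads = [
    "A-CGTAC",
    "A-C-TAC",
    "ATCGTAC",
    "A-CGAAC",
]
print(consensus(reads))
```

With more reads per cluster (higher sequencing depth), each column's vote has more support, which matches the abstract's observation that depth helps in the low-error regime.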
Affiliation(s)
- Ranze Xie
  - Institution of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, China
- Xiangzhen Zan
  - Institution of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, China
- Ling Chu
  - Institution of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, China
- Yanqing Su
  - Institution of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, China
- Peng Xu
  - Institution of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, China
- Wenbin Liu
  - Institution of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, China
4
Zan X, Chu L, Xie R, Su Y, Yao X, Xu P, Liu W. An image cryptography method by highly error-prone DNA storage channel. Front Bioeng Biotechnol 2023; 11:1173763. [PMID: 37152655 PMCID: PMC10154519 DOI: 10.3389/fbioe.2023.1173763] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/25/2023] [Accepted: 03/30/2023] [Indexed: 05/09/2023]
Abstract
Introduction: Rapid development in synthetic technologies has boosted DNA as a potential medium for large-scale data storage. Meanwhile, how to implement data security in the DNA storage system is still an unsolved problem. Methods: In this article, we propose an image encryption method based on the modulation-based storage architecture. The key idea is to take advantage of the unpredictable modulation signals to encrypt images in highly error-prone DNA storage channels. Results and Discussion: Numerical results have demonstrated that our image encryption method is feasible and effective with excellent security against various attacks (statistical, differential, noise, and data loss). When compared with other methods such as the hybridization reactions of DNA molecules, the proposed method is more reliable and feasible for large-scale applications.
Affiliation(s)
- Xiangzhen Zan
  - Institute of Computational Science and Technology, Guangzhou University, Guangzhou, Guangdong, China
- Ling Chu
  - Institute of Computational Science and Technology, Guangzhou University, Guangzhou, Guangdong, China
- Ranze Xie
  - Institute of Computational Science and Technology, Guangzhou University, Guangzhou, Guangdong, China
- Yanqing Su
  - Institute of Computational Science and Technology, Guangzhou University, Guangzhou, Guangdong, China
- Xiangyu Yao
  - Institute of Computational Science and Technology, Guangzhou University, Guangzhou, Guangdong, China
- Peng Xu
  - Institute of Computational Science and Technology, Guangzhou University, Guangzhou, Guangdong, China
  - School of Computer Science of Information Technology, Qiannan Normal University for Nationalities, Duyun, Guizhou, China
  - Guangdong Provincial Key Laboratory of Artificial Intelligence in Medical Image Analysis and Application, Guangzhou, Guangdong, China
- Wenbin Liu
  - Institute of Computational Science and Technology, Guangzhou University, Guangzhou, Guangdong, China
  - Guangdong Provincial Key Laboratory of Artificial Intelligence in Medical Image Analysis and Application, Guangzhou, Guangdong, China
- *Correspondence: Peng Xu; Wenbin Liu
5
Waite DW, Liefting L, Delmiglio C, Chernyavtseva A, Ha HJ, Thompson JR. Development and Validation of a Bioinformatic Workflow for the Rapid Detection of Viruses in Biosecurity. Viruses 2022; 14:v14102163. [PMID: 36298719 PMCID: PMC9610911 DOI: 10.3390/v14102163] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 07/15/2022] [Accepted: 09/25/2022] [Indexed: 11/05/2022]
Abstract
The field of biosecurity has greatly benefited from the widespread adoption of high-throughput sequencing technologies, for its ability to deeply query plant and animal samples for pathogens for which no tests exist. However, the bioinformatics analysis tools designed for rapid analysis of these sequencing datasets are not developed with this application in mind, limiting the ability of diagnosticians to standardise their workflows using published tool kits. We sought to assess previously published bioinformatic tools for their ability to identify plant- and animal-infecting viruses while distinguishing from the host genetic material. We discovered that many of the current generation of virus-detection pipelines are not adequate for this task, being outperformed by more generic classification tools. We created synthetic MinION and HiSeq libraries simulating plant and animal infections of economically important viruses and assessed a series of tools for their suitability for rapid and accurate detection of infection, and further tested the top performing tools against the VIROMOCK Challenge dataset to ensure that our findings were reproducible when compared with international standards. Our work demonstrated that several methods provide sensitive and specific detection of agriculturally important viruses in a timely manner and provides a key piece of ground truthing for method development in this space.
Affiliation(s)
- David W. Waite
  - Plant Health and Environment Laboratory, Ministry for Primary Industries, P.O. Box 2095, Auckland 1140, New Zealand
  - Corresponding author
- Lia Liefting
  - Plant Health and Environment Laboratory, Ministry for Primary Industries, P.O. Box 2095, Auckland 1140, New Zealand
- Catia Delmiglio
  - Plant Health and Environment Laboratory, Ministry for Primary Industries, P.O. Box 2095, Auckland 1140, New Zealand
- Hye Jeong Ha
  - Animal Health Laboratory, Ministry for Primary Industries, Upper Hutt 5018, New Zealand
- Jeremy R. Thompson
  - Plant Health and Environment Laboratory, Ministry for Primary Industries, P.O. Box 2095, Auckland 1140, New Zealand
6
Hubley R, Wheeler TJ, Smit AFA. Accuracy of multiple sequence alignment methods in the reconstruction of transposable element families. NAR Genom Bioinform 2022; 4:lqac040. [PMID: 35591887 PMCID: PMC9112768 DOI: 10.1093/nargab/lqac040] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/18/2021] [Revised: 03/29/2022] [Accepted: 04/29/2022] [Indexed: 02/06/2023]
Abstract
The construction of a high-quality multiple sequence alignment (MSA) from copies of a transposable element (TE) is a critical step in the characterization of a new TE family. Most studies of MSA accuracy have been conducted on protein or RNA sequence families, where structural features and strong signals of selection may assist with alignment. Less attention has been given to the quality of sequence alignments involving neutrally evolving DNA sequences such as those resulting from TE replication. Transposable element sequences are challenging to align due to their wide divergence ranges, fragmentation, and predominantly-neutral mutation patterns. To gain insight into the effects of these properties on MSA accuracy, we developed a simulator of TE sequence evolution, and used it to generate a benchmark with which we evaluated the MSA predictions produced by several popular aligners, along with Refiner, a method we developed in the context of our RepeatModeler software. We find that MAFFT and Refiner generally outperform other aligners for low to medium divergence simulated sequences, while Refiner is uniquely effective when tasked with aligning high-divergent and fragmented instances of a family.
Affiliation(s)
- Robert Hubley
  - Institute for Systems Biology, Seattle, WA 98109, USA
- Travis J Wheeler
  - Department of Computer Science, University of Montana, Missoula, MT 59801, USA
7
Czech L, Stamatakis A, Dunthorn M, Barbera P. Metagenomic Analysis Using Phylogenetic Placement-A Review of the First Decade. Front Bioinform 2022; 2:871393. [PMID: 36304302 PMCID: PMC9580882 DOI: 10.3389/fbinf.2022.871393] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 02/08/2022] [Accepted: 04/11/2022] [Indexed: 12/20/2022]
Abstract
Phylogenetic placement refers to a family of tools and methods to analyze, visualize, and interpret the tsunami of metagenomic sequencing data generated by high-throughput sequencing. Compared to alternative (e.g., similarity-based) methods, it puts metabarcoding sequences into a phylogenetic context using a set of known reference sequences and taking evolutionary history into account. Thereby, one can increase the accuracy of metagenomic surveys and eliminate the requirement for having exact or close matches with existing sequence databases. Phylogenetic placement constitutes a valuable analysis tool per se, but also entails a plethora of downstream tools to interpret its results. A common use case is to analyze species communities obtained from metagenomic sequencing, for example via taxonomic assignment, diversity quantification, sample comparison, and identification of correlations with environmental variables. In this review, we provide an overview of the methods developed during the first 10 years. In particular, the goals of this review are 1) to motivate the usage of phylogenetic placement and illustrate some of its use cases, 2) to outline the full workflow, from raw sequences to publishable figures, including best practices, 3) to introduce the most common tools and methods and their capabilities, 4) to point out common placement pitfalls and misconceptions, and 5) to showcase typical placement-based analyses and how they can help to analyze, visualize, and interpret phylogenetic placement data.
Affiliation(s)
- Lucas Czech
  - Department of Plant Biology, Carnegie Institution for Science, Stanford, CA, United States
- Alexandros Stamatakis
  - Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
  - Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
- Micah Dunthorn
  - Natural History Museum, University of Oslo, Oslo, Norway
8
O’Boyle B, Shrestha S, Kochut K, Eyers PA, Kannan N. Computational tools and resources for pseudokinase research. Methods Enzymol 2022; 667:403-426. [PMID: 35525549 PMCID: PMC9733567 DOI: 10.1016/bs.mie.2022.03.040] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 12/13/2022]
Abstract
Pseudokinases regulate diverse cellular processes associated with normal cellular functions and disease. They are defined bioinformatically based on the absence of one or more catalytic residues that are required for canonical protein kinase functions. The ability to define pseudokinases based on primary sequence comparison has enabled the systematic mapping and cataloging of pseudokinase orthologs across the tree of life. While these sequences contain critical information regarding pseudokinase evolution and functional specialization, extracting this information and generating testable hypotheses based on integrative mining of sequence and structural data requires specialized computational tools and resources. In this chapter, we review recent advances in the development and application of open-source tools and resources for pseudokinase research. Specifically, we describe the application of an interactive data analytics framework, KinView, for visualizing the patterns of conservation and variation in the catalytic domain motifs of pseudokinases and evolutionarily related canonical kinases using a consistent set of curated alignments organized based on the widely used kinome evolutionary hierarchy. We also demonstrate the application of an integrated Protein Kinase Ontology (ProKinO) and an interactive viewer, ProtVista, for mapping and analyzing primary sequence motifs and annotations in the context of 3D structures and AlphaFold2 models. We provide examples and protocols for generating testable hypotheses on pseudokinase functions both for bench biologists and advanced users.
Affiliation(s)
- Brady O’Boyle
  - Department of Biochemistry & Molecular Biology, University of Georgia, Athens, GA 30602, USA
- Safal Shrestha
  - Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA
- Krzysztof Kochut
  - Department of Computer Science, University of Georgia, Athens, GA 30602, USA
- Patrick A Eyers
  - Department of Biochemistry and Systems Biology, Institute of Systems, Molecular and Integrative Biology, University of Liverpool, UK
- Natarajan Kannan
  - Department of Biochemistry & Molecular Biology, University of Georgia, Athens, GA 30602, USA
  - Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA
  - Corresponding author
9
Escobari B, Borsch T, Quedensley TS, Gruenstaeudl M. Plastid phylogenomics of the Gynoxoid group (Senecioneae, Asteraceae) highlights the importance of motif-based sequence alignment amid low genetic distances. Am J Bot 2021; 108:2235-2256. [PMID: 34636417 DOI: 10.1002/ajb2.1775] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/23/2021] [Accepted: 08/12/2021] [Indexed: 06/13/2023]
Abstract
Premise: The genus Gynoxys and relatives form a species-rich lineage of Andean shrubs and trees with low genetic distances within the sunflower subtribe Tussilaginineae. Previous molecular phylogenetic investigations of the Tussilaginineae have included few, if any, representatives of this Gynoxoid group or reconstructed ambiguous patterns of relationships for it. Methods: We sequenced complete plastid genomes of 21 species of the Gynoxoid group and related Tussilaginineae and conducted detailed comparisons of the phylogenetic relationships supported by the gene, intron, and intergenic spacer partitions of these genomes. We also evaluated the impact of manual, motif-based adjustments of automatic DNA sequence alignments on phylogenetic tree inference. Results: Our results indicate that the inclusion of all plastid genome partitions is needed to infer well-supported phylogenetic trees of the Gynoxoid group. Whole plastome-based tree inference suggests that the genera Gynoxys and Nordenstamia are polyphyletic and form the core clade of the Gynoxoid group. This clade is sister to a clade of Aequatorium and Paragynoxys and also includes some but not all representatives of Paracalia. Conclusions: The concatenation and combined analysis of all plastid genome partitions and the construction of manually-curated, motif-based DNA sequence alignments are found to be instrumental in the recovery of well-supported relationships of the Gynoxoid group. We demonstrate that the correct assessment of homology in genome-level plastid sequence data sets is crucial for subsequent phylogeny reconstruction and that the manual post-processing of multiple sequence alignments improves the reliability of such reconstructions amid low genetic distances between taxa.
Affiliation(s)
- Belen Escobari
  - Botanischer Garten und Botanisches Museum Berlin, Freie Universität Berlin, Berlin, 14195, Germany
  - Herbario Nacional de Bolivia, Universidad Mayor de San Andres, Casilla, La Paz, 10077, Bolivia
- Thomas Borsch
  - Botanischer Garten und Botanisches Museum Berlin, Freie Universität Berlin, Berlin, 14195, Germany
  - Institut für Biologie, Systematische Botanik und Pflanzengeographie, Freie Universität Berlin, Berlin, 14195, Germany
- Taylor S Quedensley
  - Department of Biology, Texas Christian University, Fort Worth, TX, 76109, USA
- Michael Gruenstaeudl
  - Institut für Biologie, Systematische Botanik und Pflanzengeographie, Freie Universität Berlin, Berlin, 14195, Germany
10
Aadland K, Kolaczkowski B. Alignment-Integrated Reconstruction of Ancestral Sequences Improves Accuracy. Genome Biol Evol 2021; 12:1549-1565. [PMID: 32785673 PMCID: PMC7523730 DOI: 10.1093/gbe/evaa164] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Accepted: 08/03/2020] [Indexed: 12/31/2022]
Abstract
Ancestral sequence reconstruction (ASR) uses an alignment of extant protein sequences, a phylogeny describing the history of the protein family and a model of the molecular-evolutionary process to infer the sequences of ancient proteins, allowing researchers to directly investigate the impact of sequence evolution on protein structure and function. Like all statistical inferences, ASR can be sensitive to violations of its underlying assumptions. Previous studies have shown that, whereas phylogenetic uncertainty has only a very weak impact on ASR accuracy, uncertainty in the protein sequence alignment can more strongly affect inferred ancestral sequences. Here, we show that errors in sequence alignment can produce errors in ASR across a range of realistic and simplified evolutionary scenarios. Importantly, sequence reconstruction errors can lead to errors in estimates of structural and functional properties of ancestral proteins, potentially undermining the reliability of analyses relying on ASR. We introduce an alignment-integrated ASR approach that combines information from many different sequence alignments. We show that integrating alignment uncertainty improves ASR accuracy and the accuracy of downstream structural and functional inferences, often performing as well as highly accurate structure-guided alignment. Given the growing evidence that sequence alignment errors can impact the reliability of ASR studies, we recommend that future studies incorporate approaches to mitigate the impact of alignment uncertainty. Probabilistic modeling of insertion and deletion events has the potential to radically improve ASR accuracy when the model reflects the true underlying evolutionary history, but further studies are required to thoroughly evaluate the reliability of these approaches under realistic conditions.
Affiliation(s)
- Kelsey Aadland
  - Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida
- Bryan Kolaczkowski
  - Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida
11
New evaluation methods of read mapping by 17 aligners on simulated and empirical NGS data: an updated comparison of DNA- and RNA-Seq data from Illumina and Ion Torrent technologies. Neural Comput Appl 2021; 33:15669-15692. [PMID: 34155424 PMCID: PMC8208613 DOI: 10.1007/s00521-021-06188-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/27/2020] [Accepted: 06/02/2021] [Indexed: 12/13/2022]
Abstract
Over the last 15 years, improved omics sequencing technologies have expanded the scale and resolution of various biological applications, generating high-throughput datasets that require carefully chosen software tools to be processed. Following these developments in sequencing, bioinformatics researchers have been challenged to implement alignment algorithms for next-generation sequencing reads. However, the selection of aligners based on genome characteristics remains poorly studied, so our benchmarking study extended the state of the art by comparing 17 different aligners. The chosen tools were assessed on empirical human DNA- and RNA-Seq data, as well as on simulated datasets in human and mouse, evaluating a set of parameters not previously considered in this kind of benchmark. As expected, we found that each tool was the best under specific conditions. For Ion Torrent single-end RNA-Seq samples, the most suitable aligners were CLC and BWA-MEM, which achieved the best results in terms of efficiency, accuracy, duplication rate, saturation profile and running time. For Illumina paired-end osteomyelitis transcriptomics data, the best-performing algorithm, together with CLC, was Novoalign, which excelled in accuracy and saturation analyses. Segemehl and DNASTAR performed the best on both DNA-Seq datasets, with Segemehl particularly suitable for exome data. In conclusion, our study could guide users in selecting a suitable aligner based on genome and transcriptome characteristics. However, several other aspects that emerged from our work should be considered as alignment research evolves, such as the use of artificial intelligence to support cloud computing and mapping to multiple genomes.
12
Gatter T, von Löhneysen S, Fallmann J, Drozdova P, Hartmann T, Stadler PF. LazyB: fast and cheap genome assembly. Algorithms Mol Biol 2021; 16:8. [PMID: 34074310 PMCID: PMC8168326 DOI: 10.1186/s13015-021-00186-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/31/2021] [Accepted: 05/06/2021] [Indexed: 12/27/2022]
Abstract
Background: Advances in genome sequencing over the last years have led to a fundamental paradigm shift in the field. With steadily decreasing sequencing costs, genome projects are no longer limited by the cost of raw sequencing data, but rather by computational problems associated with genome assembly. There is an urgent demand for more efficient and more accurate methods, in particular with regard to the highly complex and often very large genomes of animals and plants. Most recently, “hybrid” methods that integrate short and long read data have been devised to address this need. Results: LazyB is such a hybrid genome assembler. It has been designed specifically with an emphasis on utilizing low-coverage short and long reads. LazyB starts from a bipartite overlap graph between long reads and restrictively filtered short-read unitigs. This graph is translated into a long-read overlap graph G. Instead of the more conventional approach of removing tips, bubbles, and other local features, LazyB extracts, step by step, subgraphs whose global properties approach a disjoint union of paths. First, a consistently oriented subgraph is extracted, which in a second step is reduced to a directed acyclic graph. In the next step, properties of proper interval graphs are used to extract contigs as maximum weight paths. These paths are translated into genomic sequences only in the final step. A prototype implementation of LazyB, entirely written in Python, not only yields significantly more accurate assemblies of the yeast and fruit fly genomes compared to state-of-the-art pipelines but also requires much less computational effort. Conclusions: LazyB is a new low-cost genome assembler that copes well with large genomes and low coverage. It is based on a novel approach for reducing the overlap graph to a collection of paths, thus opening new avenues for future improvements. Availability: The LazyB prototype is available at https://github.com/TGatter/LazyB.
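The step of extracting a maximum weight path from a directed acyclic graph can be illustrated with a generic dynamic program over a topological order. This is a textbook sketch and not code from the LazyB repository: the function name `max_weight_path`, the edge-tuple input format, and the read labels are all assumptions made for illustration, and edge weights are assumed to be non-negative.

```python
from collections import defaultdict, deque


def max_weight_path(edges):
    """Return (total_weight, path) for the heaviest path in a DAG.

    `edges` is an iterable of (source, target, weight) tuples with
    non-negative weights; node labels are assumed to be sortable.
    """
    graph = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for source, target, weight in edges:
        graph[source].append((target, weight))
        indegree[target] += 1
        nodes.update((source, target))
    if not nodes:
        return 0, []

    # Kahn's algorithm: repeatedly emit nodes with no unprocessed predecessors.
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for target, _weight in graph[node]:
            indegree[target] -= 1
            if indegree[target] == 0:
                queue.append(target)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")

    # best[n] is the weight of the heaviest path ending at n.
    best = {n: 0 for n in nodes}
    previous = {n: None for n in nodes}
    for node in order:
        for target, weight in graph[node]:
            if best[node] + weight > best[target]:
                best[target] = best[node] + weight
                previous[target] = node

    # Walk back from the best end node to recover the path.
    end = max(order, key=lambda n: best[n])
    path = []
    node = end
    while node is not None:
        path.append(node)
        node = previous[node]
    return best[end], path[::-1]


# Hypothetical overlap graph: nodes are reads, weights are overlap scores.
overlaps = [
    ("r1", "r2", 5),
    ("r2", "r3", 4),
    ("r1", "r3", 7),
    ("r3", "r4", 2),
    ("r2", "r4", 1),
]
print(max_weight_path(overlaps))
```

Because each node and edge is visited a constant number of times, the running time is linear in the size of the graph, which is one reason reducing the overlap graph to a DAG first is attractive.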
13
Mangiaterra G, Cedraro N, Laudadio E, Minnelli C, Citterio B, Andreoni F, Mobbili G, Galeazzi R, Biavasco F. The Natural Alkaloid Berberine Can Reduce the Number of Pseudomonas aeruginosa Tolerant Cells. J Nat Prod 2021; 84:993-1001. [PMID: 33848161 DOI: 10.1021/acs.jnatprod.0c01151] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 06/12/2023]
Abstract
The eradication of recurrent Pseudomonas aeruginosa (PA) lung infection in cystic fibrosis (CF) patients may be hampered by the development of persistent bacterial forms, which can tolerate antibiotics through efflux pump overexpression. After demonstrating the efflux pump inhibitory effect of the alkaloid berberine on the PA MexXY-OprM efflux pump, in this study, we tested its ability (80/320 μg/mL) to enhance tobramycin (20xMIC/1000xMIC) activity against PA planktonic/biofilm cultures. Preliminary investigations of the involvement of MexY in PA tolerance to tobramycin treatment, performed on the isogenic pair PA K767 (wild type)/K1525 (ΔmexY) growing in planktonic and biofilm cultures, demonstrated that the ΔmexY mutant K1525 produced a lower (100 and 10 000 times, respectively) amount of tolerant cells than that of the wild type. Next, we grew broth cultures of PAO1, PA14, and 20 PA clinical isolates (of which 13 were from CF patients) in the presence of 20xMIC tobramycin with and without berberine 80 μg/mL. Accordingly, most strains showed a greater (from 10- to 1000-fold) tolerance reduction in the presence of berberine. These findings highlight the involvement of the MexXY-OprM system in the tobramycin tolerance of PA and suggest that berberine may be used in new valuable therapeutic combinations to counteract persister survival.
Affiliation(s)
- Gianmarco Mangiaterra
- Department of Life and Environmental Sciences, Polytechnic University of Marche, Ancona 60131, Italy
| | - Nicholas Cedraro
- Department of Life and Environmental Sciences, Polytechnic University of Marche, Ancona 60131, Italy
| | - Emiliano Laudadio
- Department of Materials, Environmental Sciences and Urban Planning, Polytechnic University of Marche, Ancona 60131, Italy
| | - Cristina Minnelli
- Department of Life and Environmental Sciences, Polytechnic University of Marche, Ancona 60131, Italy
| | - Barbara Citterio
- Department of Biomolecular Sciences, sect. Biotechnology, University of Urbino "Carlo Bo", Fano 61032, Italy
| | - Francesca Andreoni
- Department of Biomolecular Sciences, sect. Biotechnology, University of Urbino "Carlo Bo", Fano 61032, Italy
| | - Giovanna Mobbili
- Department of Life and Environmental Sciences, Polytechnic University of Marche, Ancona 60131, Italy
| | - Roberta Galeazzi
- Department of Life and Environmental Sciences, Polytechnic University of Marche, Ancona 60131, Italy
| | - Francesca Biavasco
- Department of Life and Environmental Sciences, Polytechnic University of Marche, Ancona 60131, Italy
| |
|
14
|
Repecka D, Jauniskis V, Karpus L, Rembeza E, Rokaitis I, Zrimec J, Poviloniene S, Laurynenas A, Viknander S, Abuajwa W, Savolainen O, Meskys R, Engqvist MKM, Zelezniak A. Expanding functional protein sequence spaces using generative adversarial networks. NAT MACH INTELL 2021. [DOI: 10.1038/s42256-021-00310-5] [Citation(s) in RCA: 63] [Impact Index Per Article: 21.0] [Indexed: 01/17/2023]
|
15
|
Czech L, Barbera P, Stamatakis A. Genesis and Gappa: processing, analyzing and visualizing phylogenetic (placement) data. Bioinformatics 2020; 36:3263-3265. [PMID: 32016344 PMCID: PMC7214027 DOI: 10.1093/bioinformatics/btaa070] [Citation(s) in RCA: 133] [Impact Index Per Article: 33.3] [Received: 05/29/2019] [Revised: 01/22/2020] [Accepted: 01/28/2020] [Indexed: 11/14/2022]
Abstract
Summary We present genesis, a library for working with phylogenetic data, and gappa, an accompanying command-line tool for conducting typical analyses on such data. The tools target phylogenetic trees and phylogenetic placements, sequences, taxonomies and other relevant data types, offer high-level simplicity as well as low-level customizability, and are computationally efficient, well-tested and field-proven. Availability and implementation Both genesis and gappa are written in modern C++11, and are freely available under GPLv3 at http://github.com/lczech/genesis and http://github.com/lczech/gappa. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Lucas Czech
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg 69118, Germany
| | - Pierre Barbera
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg 69118, Germany
| | - Alexandros Stamatakis
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg 69118, Germany; Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe 76131, Germany
| |
|
16
|
Mohamad Sobri MF, Abd-Aziz S, Abu Bakar FD, Ramli N. In-Silico Characterization of Glycosyl Hydrolase Family 1 β-Glucosidase from Trichoderma asperellum UPM1. Int J Mol Sci 2020; 21:ijms21114035. [PMID: 32512945 PMCID: PMC7311958 DOI: 10.3390/ijms21114035] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 02/26/2020] [Revised: 04/22/2020] [Accepted: 04/30/2020] [Indexed: 11/16/2022]
Abstract
β-glucosidases (Bgl) are widely utilized for releasing non-reducing terminal glucosyl residues. Nevertheless, feedback inhibition by the glucose end product has limited their application. A noticeable exception has been found for β-glucosidases of the glycoside hydrolase (GH) family 1, which exhibit tolerance and even stimulation by glucose. In this study, using local isolate Trichoderma asperellum UPM1, the gene encoding β-glucosidase from GH family 1, hereafter designated as TaBgl2, was isolated and characterized via in-silico analyses. A comparison of enzyme activity was subsequently made by heterologous expression in Escherichia coli BL21(DE3). The presence of N-terminal signature, cis-peptide bonds, conserved active site motifs, non-proline cis peptide bonds, substrate binding, and a lone conserved stabilizing tryptophan (W) residue confirms the identity of the isolated Trichoderma sp. GH family 1 β-glucosidase. Glucose tolerance was suggested by the presence of 14 of 22 known consensus residues, along with corresponding residues L167 and P172, crucial in the retention of the active site's narrow cavity. Retention of 40% of relative hydrolytic activity on p-nitrophenyl-β-D-glucopyranoside (pNPG) in a concentration of 0.2 M glucose was comparable to that of GH family 1 β-glucosidase (Cel1A) from Trichoderma reesei. This research thus underlines the potential of in-silico analyses for predicting the enzymatic function and, of industrial importance, the glucose tolerance of family 1 β-glucosidases.
Affiliation(s)
- Mohamad Farhan Mohamad Sobri
- Department of Bioprocess Technology, Faculty of Biotechnology and Biomolecular Sciences, Universiti Putra Malaysia, Serdang 43400 UPM, Selangor, Malaysia
- School of Bioprocess Engineering, Universiti Malaysia Perlis, Kompleks Pusat Pengajian Jejawi 3, Arau 02600, Perlis, Malaysia
| | - Suraini Abd-Aziz
- Department of Bioprocess Technology, Faculty of Biotechnology and Biomolecular Sciences, Universiti Putra Malaysia, Serdang 43400 UPM, Selangor, Malaysia
| | - Farah Diba Abu Bakar
- School of Biosciences and Biotechnology, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, Bangi 43600 UKM, Selangor, Malaysia
| | - Norhayati Ramli
- Department of Bioprocess Technology, Faculty of Biotechnology and Biomolecular Sciences, Universiti Putra Malaysia, Serdang 43400 UPM, Selangor, Malaysia
- Correspondence: Tel.: +60-3-9769-1948
| |
|
17
|
Sirt4 Modulates Oxidative Metabolism and Sensitivity to Rapamycin Through Species-Dependent Phenotypes in Drosophila mtDNA Haplotypes. G3-GENES GENOMES GENETICS 2020; 10:1599-1612. [PMID: 32152006 PMCID: PMC7202034 DOI: 10.1534/g3.120.401174] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/07/2023]
Abstract
The endosymbiotic theory proposes that eukaryotes evolved from the symbiotic relationship between anaerobic (host) and aerobic prokaryotes. Through iterative genetic transfers, the mitochondrial and nuclear genomes coevolved, establishing the mitochondria as the hub of oxidative metabolism. To study this coevolution, we disrupt mitochondrial-nuclear epistatic interactions by using strains that have mitochondrial DNA (mtDNA) and nuclear DNA (nDNA) from evolutionarily divergent species. We undertake a multifaceted approach generating introgressed Drosophila strains containing D. simulans mtDNA and D. melanogaster nDNA with Sirtuin 4 (Sirt4)-knockouts. Sirt4 is a nuclear-encoded enzyme that functions, exclusively within the mitochondria, as a master regulator of oxidative metabolism. We exposed flies to the drug rapamycin in order to eliminate TOR signaling, thereby compromising the cytoplasmic crosstalk between the mitochondria and nucleus. Our results indicate that D. simulans and D. melanogaster mtDNA haplotypes display opposite Sirt4-mediated phenotypes in the regulation of whole-fly oxygen consumption. Moreover, our data reflect that the deletion of Sirt4 rescued the metabolic response to rapamycin among the introgressed strains. We propose that Sirt4 is a suitable candidate for studying the properties of mitochondrial-nuclear epistasis in modulating mitochondrial metabolism.
|
18
|
Evolution of Photorespiratory Glycolate Oxidase among Archaeplastida. PLANTS 2020; 9:plants9010106. [PMID: 31952152 PMCID: PMC7020209 DOI: 10.3390/plants9010106] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 11/29/2019] [Revised: 01/10/2020] [Accepted: 01/11/2020] [Indexed: 12/17/2022]
Abstract
Photorespiration has been shown to be essential for all oxygenic phototrophs in the present-day oxygen-containing atmosphere. The strong similarity of the photorespiratory cycle in cyanobacteria and plants led to the hypothesis that oxygenic photosynthesis and photorespiration co-evolved in cyanobacteria, and then entered the eukaryotic algal lineages up to land plants via endosymbiosis. However, the evolutionary origin of the photorespiratory enzyme glycolate oxidase (GOX) is controversial, which challenges the common origin hypothesis. Here, we tested this hypothesis using phylogenetic and biochemical approaches with broad taxon sampling. Phylogenetic analysis supported the view that a cyanobacterial GOX-like protein of the 2-hydroxy-acid oxidase family most likely served as an ancestor for GOX in all eukaryotes. Furthermore, our results strongly indicate that GOX was recruited to the photorespiratory metabolism at the origin of Archaeplastida, because we verified that Glaucophyta, Rhodophyta, and Streptophyta all express GOX enzymes with preference for the substrate glycolate. Moreover, an “ancestral” protein synthetically derived from the node separating all prokaryotic from eukaryotic GOX-like proteins also preferred glycolate over l-lactate. These results support the notion that a cyanobacterial ancestral protein laid the foundation for the evolution of photorespiratory GOX enzymes in modern eukaryotic phototrophs.
|
19
|
Eren K, Murrell B. RIFRAF: a frame-resolving consensus algorithm. Bioinformatics 2019; 34:3817-3824. [PMID: 29850783 DOI: 10.1093/bioinformatics/bty426] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 11/30/2017] [Accepted: 05/22/2018] [Indexed: 01/08/2023]
Abstract
Motivation Protein coding genes can be studied using long-read next generation sequencing. However, high rates of indel sequencing errors are problematic, corrupting the reading frame. Even the consensus of multiple independent sequence reads retains indel errors. To solve this problem, we introduce Reference-Informed Frame-Resolving multiple-Alignment Free template inference algorithm (RIFRAF), a sequence consensus algorithm that takes a set of error-prone reads and a reference sequence and infers an accurate in-frame consensus. RIFRAF uses a novel structure, analogous to a two-layer hidden Markov model: the consensus is optimized to maximize alignment scores with both the set of noisy reads and with a reference. The template-to-reads component of the model encodes the preponderance of indels, and is sensitive to the per-base quality scores, giving greater weight to more accurate bases. The reference-to-template component of the model penalizes frame-destroying indels. A local search algorithm proceeds in stages to find the best consensus sequence for both objectives. Results Using Pacific Biosciences SMRT sequences from an HIV-1 env clone, NL4-3, we compare our approach to other consensus and frame correction methods. RIFRAF consistently finds a consensus sequence that is more accurate and in-frame, especially with small numbers of reads. It was able to perfectly reconstruct over 80% of consensus sequences from as few as three reads, whereas the best alternative required twice as many. RIFRAF is able to achieve these results and keep the consensus in-frame even with a distantly related reference sequence. Moreover, unlike other frame correction methods, RIFRAF can detect and keep true indels while removing erroneous ones. Availability and implementation RIFRAF is implemented in Julia, and source code is publicly available at https://github.com/MurrellGroup/Rifraf.jl. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Kemal Eren
- Bioinformatics and Systems Biology, University of California San Diego, La Jolla, CA, USA
| | - Ben Murrell
- Department of Medicine, University of California San Diego, La Jolla, CA, USA
| |
|
20
|
Boutte J, Fishbein M, Liston A, Straub SCK. NGS-Indel Coder: A pipeline to code indel characters in phylogenomic data with an example of its application in milkweeds (Asclepias). Mol Phylogenet Evol 2019; 139:106534. [PMID: 31212081 DOI: 10.1016/j.ympev.2019.106534] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 02/11/2019] [Revised: 05/12/2019] [Accepted: 06/13/2019] [Indexed: 12/30/2022]
Abstract
Targeted genome sequencing approaches allow characterization of evolutionary relationships using a considerable number of nuclear genes and informative characters. However, most phylogenomic analyses only utilize single nucleotide polymorphisms (SNPs). Studies at the species level, especially in groups that have recently radiated, often recover low amounts of phylogenetically informative variation in coding regions, and require non-coding sequences, which are richer in indels, to resolve gene trees. Here, NGS-Indel Coder, a pipeline to detect and omit false positive indels inferred from assemblies of short read sequence data, was developed to resolve the relationships among and within major clades of the American milkweeds (Asclepias), which are the result of a rapid and recent evolutionary radiation, and whose phylogeny has been difficult to resolve. This pipeline was applied to a Hyb-Seq data set of 768 loci including targeted exons and flanking intron regions from 33 milkweed species. Robust species tree inference was improved by excluding small alignment partitions (<100 bp) that increased gene tree ambiguity and incongruence. To further investigate the robustness of indel coding, data sets that included small and large indels were explored, and species trees derived from concatenated loci versus coalescent methods based on gene trees were compared. The phylogeny of Asclepias obtained using nuclear data was well resolved, and phylogenetic information from indels improved resolution of specific nodes. The Temperate North American, Mexican Highland, and Incarnatae clades were well supported as monophyletic. Asclepias coulteri, which has been considered part of the Sonoran Desert clade based on plastome analyses, was placed as sister to all the other milkweed species studied here, rather than as a member of that clade. Two groups within the Temperate North American and Mexican clades were not resolved, and the inferred relationships strongly conflicted when comparing results based on data sets that did or did not include indel characters. This new pipeline represents a step forward in making maximal use of the information content in phylogenomic data sets.
Affiliation(s)
- Julien Boutte
- Department of Biology, Hobart and William Smith Colleges, Geneva, NY, USA
| | - Mark Fishbein
- Department of Plant Biology, Ecology and Evolution, Oklahoma State University, Stillwater, OK, USA
| | - Aaron Liston
- Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR, USA
| | - Shannon C K Straub
- Department of Biology, Hobart and William Smith Colleges, Geneva, NY, USA.
| |
|
21
|
Pervez MT, Shah HA, Babar ME, Naveed N, Shoaib M. SAliBASE: A Database of Simulated Protein Alignments. Evol Bioinform Online 2019; 15:1176934318821080. [PMID: 30733625 PMCID: PMC6343434 DOI: 10.1177/1176934318821080] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 10/25/2018] [Accepted: 11/26/2018] [Indexed: 01/17/2023]
Abstract
Simulated alignments are alternatives to manually constructed multiple sequence alignments for evaluating the performance of multiple sequence alignment tools. The importance of simulated sequences is recognized because their true evolutionary history is known, which is very helpful for reconstructing accurate phylogenetic trees and alignments. However, generating simulated alignments requires expertise in bioinformatics tools and takes several hours to reconstruct even a few hundred simulated sequences. It becomes a tedious job for an end user who needs a few datasets of a variety of simulated sequences. Currently, there is no databank available from which researchers can download simulated sequences/alignments for their studies. The major focus of our study was to develop a database of simulated protein sequences (SAliBASE) based on varying parameters such as insertion rate, deletion rate, sequence length, number of sequences, and indel size. Each dataset has a corresponding alignment as well. This repository is very useful for evaluating multiple alignment methods.
Affiliation(s)
- Muhammad Tariq Pervez
- Department of Bioinformatics and Computational Biology, Virtual University of Pakistan, Lahore, Pakistan
| | - Hayat Ali Shah
- Department of Computer Science, Virtual University of Pakistan, Lahore, Pakistan
| | | | - Nasir Naveed
- Department of Computer Science, Virtual University of Pakistan, Lahore, Pakistan
| | - Muhammad Shoaib
- Department of Computer Science and Engineering, UET, Lahore, Pakistan
| |
|
22
|
High-Throughput Reconstruction of Ancestral Protein Sequence, Structure, and Molecular Function. Methods Mol Biol 2019; 1851:135-170. [PMID: 30298396 DOI: 10.1007/978-1-4939-8736-8_8] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 12/12/2022]
Abstract
Ancestral protein sequence reconstruction is a powerful technique for explicitly testing hypotheses about the evolution of molecular function, allowing researchers to meticulously dissect how historical changes in protein sequence impacted functional repertoire by altering the protein's 3D structure. These techniques have provided concrete, experimentally validated insights into ancient evolutionary processes and help illuminate the complex relationship between protein sequence, structure, and function. Inferring the protein family phylogenies on which ancestral sequence reconstruction depends and reconstructing the sequences themselves are amenable to high-throughput computational analysis. However, determining the structures of ancestral-reconstructed proteins and characterizing their functions typically rely on time-consuming and expensive laboratory analyses, limiting most current studies to examining a relatively small number of specific hypotheses. For this reason, we have little detailed, unbiased information about how molecular function evolves across large protein family phylogenies. Here we describe a generalized protocol that integrates ancestral sequence reconstruction with structural homology modeling and structure-based molecular affinity prediction to characterize historical changes in protein function across families with thousands of individual sequences. We highlight key steps in the analysis protocol requiring particularly careful attention to avoid introducing potential errors as well as steps for which computationally efficient subroutines can be substituted for more intensive approaches, allowing researchers to scale the analysis up or down, depending on available resources and requirements for reproducibility and scientific rigor. In our view, this approach provides a compelling complement to more laboratory-intensive procedures, generating important contextual information that can help guide detailed experiments.
|
23
|
The Adaptive Evolution Database (TAED): A New Release of a Database of Phylogenetically Indexed Gene Families from Chordates. J Mol Evol 2017; 85:46-56. [PMID: 28795237 DOI: 10.1007/s00239-017-9806-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 07/19/2017] [Accepted: 08/03/2017] [Indexed: 12/11/2022]
Abstract
With the large collections of gene and genome sequences, there is a need to generate curated comparative genomic databases that enable interpretation of results in an evolutionary context. Such resources can facilitate an understanding of the co-evolution of genes in the context of a genome mapped onto a phylogeny, of a protein structure, and of interactions within a pathway. A phylogenetically indexed gene family database, the adaptive evolution database (TAED), is presented that organizes gene families and their evolutionary histories in a species tree context. Gene families include alignments, phylogenetic trees, lineage-specific dN/dS ratios, reconciliation with the species tree to enable both the mapping and the identification of duplication events, mapping of gene families onto pathways, and mapping of amino acid substitutions onto protein structures. In addition to organization of the data, new phylogenetic visualization tools have been developed to aid in interpreting the data that are also available, including TreeThrasher and TAED Tree Viewer. A new resource of gene families organized by species and taxonomic lineage promises to be a valuable comparative genomics database for molecular biologists, evolutionary biologists, and ecologists. The new visualization tools and database framework will be of interest to both evolutionary biologists and bioinformaticians.
|
24
|
Richard J, Kim ED, Nguyen H, Kim CD, Kim S. Allostery Wiring Map for Kinesin Energy Transduction and Its Evolution. J Biol Chem 2016; 291:20932-20945. [PMID: 27507814 PMCID: PMC5076506 DOI: 10.1074/jbc.m116.733675] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 04/21/2016] [Indexed: 12/28/2022]
Abstract
How signals between the kinesin active and cytoskeletal binding sites are transmitted is an open question and an allosteric question. By extracting correlated evolutionary changes within 700+ sequences, we built a model of residues that are energetically coupled and that define molecular routes for signal transmission. Typically, these coupled residues are located at multiple distal sites and thus are predicted to form a complex, non-linear network that wires together different functional sites in the protein. Of note, our model connected the site for ATP hydrolysis with sites that ultimately utilize its free energy, such as the microtubule-binding site, drug-binding loop 5, and necklinker. To confirm the calculated energetic connectivity between non-adjacent residues, double-mutant cycle analysis was conducted with 22 kinesin mutants. There was a direct correlation between thermodynamic coupling in experiment and evolutionarily derived energetic coupling. We conclude that energy transduction is coordinated by multiple distal sites in the protein rather than only being relayed through adjacent residues. Moreover, this allosteric map forecasts how energetic orchestration gives rise to different nanomotor behaviors within the superfamily.
Affiliation(s)
- Jessica Richard
- From the Department of Biochemistry and Molecular Biology, Louisiana State University School of Medicine & Health Sciences Center, New Orleans, Louisiana 70112
| | - Elizabeth D Kim
- From the Department of Biochemistry and Molecular Biology, Louisiana State University School of Medicine & Health Sciences Center, New Orleans, Louisiana 70112
| | - Hoang Nguyen
- From the Department of Biochemistry and Molecular Biology, Louisiana State University School of Medicine & Health Sciences Center, New Orleans, Louisiana 70112
| | - Catherine D Kim
- From the Department of Biochemistry and Molecular Biology, Louisiana State University School of Medicine & Health Sciences Center, New Orleans, Louisiana 70112
| | - Sunyoung Kim
- From the Department of Biochemistry and Molecular Biology, Louisiana State University School of Medicine & Health Sciences Center, New Orleans, Louisiana 70112
| |
|
25
|
Lai D, Meyer IM. A comprehensive comparison of general RNA-RNA interaction prediction methods. Nucleic Acids Res 2016; 44:e61. [PMID: 26673718 PMCID: PMC4838349 DOI: 10.1093/nar/gkv1477] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.8] [Received: 06/19/2015] [Revised: 12/03/2015] [Accepted: 12/05/2015] [Indexed: 12/15/2022]
Abstract
RNA-RNA interactions are fast emerging as a major functional component in many newly discovered non-coding RNAs. Basepairing is believed to be a major contributor to the stability of these intermolecular interactions, much like intramolecular basepairs formed in RNA secondary structure. As such, using algorithms similar to those for predicting RNA secondary structure, computational methods have been recently developed for the prediction of RNA-RNA interactions. We provide the first comprehensive comparison comprising 14 methods that predict general intermolecular basepairs. To evaluate these, we compile an extensive data set of 54 experimentally confirmed fungal snoRNA-rRNA interactions and 102 bacterial sRNA-mRNA interactions. We test the performance accuracy of all methods, evaluating the effects of tool settings, sequence length, and multiple sequence alignment usage and quality. Our results show that, unlike for RNA secondary structure prediction, the overall best performing tools are non-comparative energy-based tools utilizing accessibility information that predict short interactions on this data set. Furthermore, we find that maintaining high accuracy across biologically different data sets and increasing input lengths remains a huge challenge, with implications for de novo transcriptome-wide searches. Finally, we make our interaction data set publicly available for future development and benchmarking efforts.
Affiliation(s)
- Daniel Lai
- Centre for High-Throughput Biology, Department of Computer Science and Department of Medical Genetics, University of British Columbia, Vancouver V6T 1Z4, Canada
| | - Irmtraud M Meyer
- Centre for High-Throughput Biology, Department of Computer Science and Department of Medical Genetics, University of British Columbia, Vancouver V6T 1Z4, Canada
| |
|
26
|
Computational approaches to study the effects of small genomic variations. J Mol Model 2015; 21:251. [PMID: 26350246 DOI: 10.1007/s00894-015-2794-y] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 04/08/2015] [Accepted: 08/23/2015] [Indexed: 10/23/2022]
Abstract
Advances in DNA sequencing technologies have led to an avalanche-like increase in the number of gene sequences deposited in public databases over the last decade as well as the detection of an enormous number of previously unseen nucleotide variants therein. Given the size and complex nature of the genome-wide sequence variation data, as well as the rate of data generation, experimental characterization of the disease association of each of these variations or their effects on protein structure/function would be costly, laborious, time-consuming, and essentially impossible. Thus, in silico methods to predict the functional effects of sequence variations are constantly being developed. In this review, we summarize the major computational approaches and tools that are aimed at the prediction of the functional effect of mutations, and describe the state-of-the-art databases that can be used to obtain information about mutation significance. We also discuss future directions in this highly competitive field.
|