1
Chong LC, Khan AM. A Systematic Bioinformatics Approach for Mapping the Minimal Set of a Viral Peptidome. Curr Protoc 2024; 4:e1056. [PMID: 38856995 DOI: 10.1002/cpz1.1056] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2024]
Abstract
Sequence changes in viral genomes generate protein sequence diversity that enables viruses to evade the host immune system, hindering the development of effective preventive and therapeutic interventions. The massive proliferation of sequence data provides unprecedented opportunities to study viral adaptation and evolution. An alignment-free approach removes various restrictions posed by an alignment-dependent approach for studying sequence diversity. The publicly available tool, UNIQmin, offers an alignment-free approach for studying viral sequence diversity at any given rank of taxonomy lineage and is big data ready. The tool performs an exhaustive search to determine the minimal set of sequences required to capture the peptidome diversity within a given dataset. This compression is possible through the removal of identical sequences and unique sequences that do not contribute effectively to the peptidome diversity pool. Herein, we describe a detailed four-part protocol utilizing UNIQmin to generate the minimal set for the purpose of viral diversity analyses, alignment-free at any rank of the taxonomy lineage, using the recent global public health threat Monkeypox virus (MPX) sequence data as a case study. The protocol enables a systematic bioinformatics approach to study sequence diversity across taxonomic lineages, which is crucial for our future preparedness against viral epidemics. This is particularly important when data are abundant, freely available, and alignment is not an option. © 2024 Wiley Periodicals LLC.
Basic Protocol 1: Tool installation and input file preparation
Basic Protocol 2: Generation of a minimal set of sequences for a given dataset
Basic Protocol 3: Comparative minimal set analysis across taxonomic lineage ranks
Basic Protocol 4: Factors affecting the minimal set of sequences
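The compression idea in this abstract (drop identical sequences, then drop sequences that add nothing to the peptidome pool) can be illustrated with a small sketch. This is not UNIQmin's implementation: it is a greedy set-cover heuristic, and it assumes that peptidome diversity is measured as the set of overlapping k-mers with k = 9. The names `kmers` and `minimal_set` are hypothetical, and a greedy cover is not guaranteed to be the smallest possible set.

```python
def kmers(seq, k):
    """Return the set of overlapping k-mers in a sequence (empty if len(seq) < k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minimal_set(sequences, k=9):
    """Greedy sketch: pick sequences until every k-mer in the dataset is covered.

    Assumption: k = 9 and the greedy strategy are illustrative choices,
    not taken from the UNIQmin source code.
    """
    # Step 1: remove exact duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(sequences))
    kmer_sets = {s: kmers(s, k) for s in unique}

    # Step 2: the diversity pool is every distinct k-mer in the dataset.
    uncovered = set().union(*kmer_sets.values())

    # Step 3: repeatedly take the sequence that covers the most uncovered k-mers.
    chosen = []
    while uncovered:
        best = max(unique, key=lambda s: len(kmer_sets[s] & uncovered))
        chosen.append(best)
        uncovered -= kmer_sets[best]
    return chosen
```

Called on a list of protein strings, `minimal_set` returns a subset whose k-mers equal the k-mers of the full input; sequences whose k-mers are all present in already chosen sequences are left out.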
Affiliation(s)
- Li Chuin Chong
  - Centre for Bioinformatics, School of Data Sciences, Perdana University, Kuala Lumpur, Malaysia
  - Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Beykoz, Turkey
  - Current affiliation: Institute for Experimental Virology, TWINCORE Centre for Experimental and Clinical Infection Research, a Medical School Hannover (MHH) and Helmholtz Centre for Infection Research (HZI) joint venture, Hannover, Germany
- Asif M Khan
  - Centre for Bioinformatics, School of Data Sciences, Perdana University, Kuala Lumpur, Malaysia
  - Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Beykoz, Turkey
  - Current affiliation: College of Computing and Information Technology, University of Doha for Science and Technology, Doha, Qatar
2
Saha G, Sawmya S, Saha A, Akil MA, Tasnim S, Rahman MS, Rahman MS. PRIEST: predicting viral mutations with immune escape capability of SARS-CoV-2 using temporal evolutionary information. Brief Bioinform 2024; 25:bbae218. [PMID: 38742520 PMCID: PMC11091746 DOI: 10.1093/bib/bbae218] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2023] [Revised: 04/04/2024] [Accepted: 04/06/2024] [Indexed: 05/16/2024] Open
Abstract
The dynamic evolution of the severe acute respiratory syndrome coronavirus 2 virus is primarily driven by mutations in its genetic sequence, culminating in the emergence of variants with increased capability to evade host immune responses. Accurate prediction of such mutations is fundamental in mitigating pandemic spread and developing effective control measures. This study introduces a robust and interpretable deep-learning approach called PRIEST. This innovative model leverages time-series viral sequences to foresee potential viral mutations. Our comprehensive experimental evaluations underscore PRIEST's proficiency in accurately predicting immune-evading mutations. Our work represents a substantial step in utilizing deep-learning methodologies for anticipatory viral mutation analysis and pandemic response.
Affiliation(s)
- Gourab Saha
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
- Shashata Sawmya
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
- Arpita Saha
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
- Md Ajwad Akil
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
- Sadia Tasnim
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
- Md Saifur Rahman
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
- M Sohel Rahman
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
3
Navhaya LT, Blessing DM, Yamkela M, Godlo S, Makhoba XH. A comprehensive review of the interaction between COVID-19 spike proteins with mammalian small and major heat shock proteins. Biomol Concepts 2024; 15:bmc-2022-0027. [PMID: 38872399 DOI: 10.1515/bmc-2022-0027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2023] [Accepted: 02/13/2023] [Indexed: 06/15/2024] Open
Abstract
Coronavirus disease 2019 (COVID-19) is a novel disease that has had devastating effects on human lives and on national economies worldwide. Like a parasite, the virus requires the host's biomolecules for its survival and propagation. The spike glycoprotein of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), located on the surface of the virus, is a potential hotspot for structure-based antiviral drug development. The virus calls into action the chaperonin system, which assists the attacker and hence favors infection. To investigate the interaction between the SARS-CoV-2 spike protein and the human molecular chaperones HSPA8 and sHSP27, we carried out sequence retrieval and analysis, followed by multiple sequence alignment, homology modeling, and protein-protein docking with ClusPro. Our findings showed that the SARS-CoV-2 spike protein consists of three distinct chains (A, B, and C), which form hydrogen bonds, hydrophobic interactions, and electrostatic interactions with both human HSPA8 and HSP27, with binding energies of -828.3 and -827.9 kcal/mol for HSPA8 and -1166.7 and -1165.9 kcal/mol for HSP27.
Affiliation(s)
- Liberty T Navhaya
  - Department of Biochemistry, Microbiology and Biotechnology, University of Limpopo, Turfloop Campus, Sovenga, 0727, South Africa
- Dzveta Mutsawashe Blessing
  - Department of Biochemistry and Microbiology, University of Fort Hare, Alice Campus, 1 King Williams Town, 5700, South Africa
- Mthembu Yamkela
  - Department of Life and Consumer Sciences, College of Agriculture and Environmental Sciences, University of South Africa (UNISA), Florida Campus, Roodepoort, 1709, South Africa
- Sesethu Godlo
  - Department of Life and Consumer Sciences, College of Agriculture and Environmental Sciences, University of South Africa (UNISA), Florida Campus, Roodepoort, 1709, South Africa
- Xolani Henry Makhoba
  - Department of Life and Consumer Sciences, College of Agriculture and Environmental Sciences, University of South Africa (UNISA), Florida Campus, Roodepoort, 1709, South Africa
4
Chakraborty A, Hussain A, Sabnam N. Uncovering the structural stability of Magnaporthe oryzae effectors: a secretome-wide in silico analysis. J Biomol Struct Dyn 2023:1-22. [PMID: 38109060 DOI: 10.1080/07391102.2023.2292795] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2023] [Accepted: 11/23/2023] [Indexed: 12/19/2023]
Abstract
Rice blast, caused by the ascomycete fungus Magnaporthe oryzae, is a deadly disease and a major threat to global food security. The pathogen secretes small proteinaceous effectors, virulence factors, inside the host to manipulate and perturb the host immune system, allowing the pathogen to colonize and establish a successful infection. While the molecular functions of several effectors are characterized, very little is known about the structural stability of these effectors. We analyzed a total of 554 small secretory proteins (SSPs) from the M. oryzae secretome to decipher key features of intrinsic disorder (ID) and the structural dynamics of the selected putative effectors through thorough and systematic in silico studies. Our results suggest that out of the total SSPs, 66% were predicted as effector proteins, released either into the apoplast or cytoplasm of the host cell. Of these, 68% were found to be intrinsically disordered effector proteins (IDEPs). Among the six distinct classes of disordered effectors, we observed peculiar relationships between the localization of several effectors in the apoplast or cytoplasm and the degree of disorder. We determined the degree of structural disorder and its impact on protein foldability across all the putative small secretory effector proteins from the blast pathogen, further validated by molecular dynamics simulation studies. This study provides definite clues toward unraveling the mystery behind the importance of structural distortions in effectors and their impact on plant-pathogen interactions. The study of these dynamical segments may help identify new effectors as well. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Afzal Hussain
  - Department of Bioinformatics, Maulana Azad National Institute of Technology, Bhopal, India
- Nazmiara Sabnam
  - Department of Life Sciences, Presidency University, Kolkata, India
5
Fruzangohar M, Moolhuijzen P, Bakaj N, Taylor J. CoreDetector: a flexible and efficient program for core-genome alignment of evolutionary diverse genomes. Bioinformatics 2023; 39:btad628. [PMID: 37878789 PMCID: PMC10663985 DOI: 10.1093/bioinformatics/btad628] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2023] [Revised: 09/20/2023] [Accepted: 10/23/2023] [Indexed: 10/27/2023] Open
Abstract
MOTIVATION: Whole genome alignment of eukaryote species remains an important method for the determination of sequence and structural variations and can also be used to ascertain the representative non-redundant core-genome sequence of a population. Many whole genome alignment tools were first developed for the more mature analysis of prokaryote species, and few current tools can process the larger genomes of eukaryotes or the genomes of more divergent species. In addition, these tools become computationally prohibitive because of the significant compute resources needed to handle larger genomes.
RESULTS: In this research, we present CoreDetector, an easy-to-use general-purpose program that can align the core-genome sequences for a range of genome sizes and divergence levels. To illustrate the flexibility of CoreDetector, we conducted alignments of a large set of closely related fungal pathogen and hexaploid wheat cultivar genomes as well as more divergent fly and rodent species genomes. In all tested cases, compared to existing multiple genome alignment tools, CoreDetector exhibited improved flexibility and efficiency, with competitive accuracy.
AVAILABILITY AND IMPLEMENTATION: CoreDetector was developed in Java, a cross-platform and easily deployable language. A packaged pipeline is readily executable in a bash terminal without any need for external Perl or Python environments. Installation, example data, and usage instructions for CoreDetector are freely available from https://github.com/mfruzan/CoreDetector.
Affiliation(s)
- Mario Fruzangohar
  - The Biometry Hub, School of Agriculture, Food and Wine, University of Adelaide, Urrbrae, South Australia 5064, Australia
- Paula Moolhuijzen
  - Centre for Crop Disease Management, School of Molecular and Life Sciences, Curtin University, Bentley, Western Australia 6102, Australia
- Nicolette Bakaj
  - The Biometry Hub, School of Agriculture, Food and Wine, University of Adelaide, Urrbrae, South Australia 5064, Australia
- Julian Taylor
  - The Biometry Hub, School of Agriculture, Food and Wine, University of Adelaide, Urrbrae, South Australia 5064, Australia
6
João M, Sena AC, Rebello VEF. On closing the inopportune gap with consistency transformation and iterative refinement. PLoS One 2023; 18:e0287483. [PMID: 37440507 PMCID: PMC10343097 DOI: 10.1371/journal.pone.0287483] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2023] [Accepted: 06/06/2023] [Indexed: 07/15/2023] Open
Abstract
The problem of aligning multiple biological sequences has fascinated scientists for a long time. Over the last four decades, tens of heuristic-based Multiple Sequence Alignment (MSA) tools have been proposed, the vast majority being built on the concept of Progressive Alignment. It is known, however, that this approach suffers from an inherent drawback regarding the inadvertent insertion of gaps when aligning sequences. Two well-known corrective solutions have frequently been adopted to help mitigate this: Consistency Transformation and Iterative Refinement. This paper takes a tool-independent technique-oriented look at the alignment quality benefits of these two strategies using problem instances from the HOMSTRAD and BAliBASE benchmarks. Eighty MSA aligners have been used to compare 4 classes of heuristics: Progressive Alignments, Iterative Alignments, Consistency-based Alignments, and Consistency-based Progressive Alignments with Iterative Refinement. Statistically, while both Consistency-based classes are better for alignments with low similarity, for sequences with higher similarity, the differences between the classes are less clear. Iterative Refinement has its own drawbacks resulting in there being statistically little advantage for Progressive Aligners to adopt this technique either with Consistency Transformation or without. Nevertheless, all 4 classes are capable of bettering each other, depending on the instance problem. This further motivates the development of MSA frameworks, such as the one being developed for this research, which simultaneously contemplate multiple classes and techniques in their attempt to uncover better solutions.
Affiliation(s)
- Mario João
  - Medical Sciences College, State University of Rio de Janeiro, Rio de Janeiro, Rio de Janeiro, Brazil
  - Institute of Computing, Fluminense Federal University, Niterói, Rio de Janeiro, Brazil
- Alexandre C Sena
  - Institute of Mathematics and Statistics, State University of Rio de Janeiro, Rio de Janeiro, Rio de Janeiro, Brazil
- Vinod E F Rebello
  - Institute of Computing, Fluminense Federal University, Niterói, Rio de Janeiro, Brazil
7
Khodji H, Collet P, Thompson JD, Jeannin-Girardon A. De-MISTED: Image-based classification of erroneous multiple sequence alignments using convolutional neural networks. APPL INTELL 2023. [DOI: 10.1007/s10489-022-04390-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/11/2023]
8
Genome structure-based Juglandaceae phylogenies contradict alignment-based phylogenies and substitution rates vary with DNA repair genes. Nat Commun 2023; 14:617. [PMID: 36739280 PMCID: PMC9899254 DOI: 10.1038/s41467-023-36247-z] [Citation(s) in RCA: 12] [Impact Index Per Article: 12.0] [Received: 05/23/2022] [Accepted: 01/20/2023] [Indexed: 02/06/2023] Open
Abstract
In lineages of allopolyploid origin, sets of homoeologous chromosomes may coexist that differ in gene content and syntenic structure. Presence or absence of genes and microsynteny along chromosomal blocks can serve to differentiate subgenomes and to infer phylogenies. We here apply genome-structural data to infer relationships in an ancient allopolyploid lineage, the walnut family (Juglandaceae), by using seven chromosome-level genomes, two of them newly assembled. Microsynteny and gene-content analyses yield identical topologies that place Platycarya with Engelhardia as did a 1980s morphological-cladistic study. DNA-alignment-based topologies here and in numerous earlier studies instead group Platycarya with Carya and Juglans, perhaps misled by past hybridization. All available data support a hybrid origin of Juglandaceae from extinct or unsampled progenitors nested within, or sister to, Myricaceae. Rhoiptelea chiliantha, sister to all other Juglandaceae, contains proportionally more DNA repair genes and appears to evolve at a rate 2.6- to 3.5-times slower than the remaining species.
9
Hu Y, Buehler MJ. End-to-End Protein Normal Mode Frequency Predictions Using Language and Graph Models and Application to Sonification. ACS NANO 2022; 16:20656-20670. [PMID: 36416536 DOI: 10.1021/acsnano.2c07681] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 06/16/2023]
Abstract
The prediction of mechanical and dynamical properties of proteins is an important frontier, especially given the greater availability of protein structures. Here we report a series of models that provide end-to-end predictions of nanodynamical properties of proteins, focused on high-throughput normal mode predictions directly from the amino acid sequence. Using neural network models within the family of Natural Language Processing and graph-based methods, we offer atomistically based mechanistic predictions of key protein mechanical features. The models include an end-to-end long short-term memory (LSTM) model, an end-to-end transformer model, a graph-based transformer model, and an equivariant graph neural network. All four models show exceptional performance, with the graph-based transformer architecture offering the best results but at the cost of requiring a graph structure as input. Conversely, the LSTM and transformer models offer end-to-end sequence-to-property prediction capabilities, providing efficient avenues for protein engineering, analysis, and design. We compare our results against published data based on a Principal Neighborhood Aggregation graph neural network, revealing that the transformer model offers better performance while also being able to predict a large set of the first 64 normal mode frequencies simultaneously. The use of the end-to-end transformer model may facilitate other downstream applications through the use of transfer learning, and it offers a comprehensive prediction of dynamical properties without any structural knowledge, directly from the amino acid sequence. We demonstrate a potential application in scientific sonification, where the normal mode frequencies are transposed to generate audible signals for a detailed analysis of subtle changes of protein sequences.
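The transposition step mentioned at the end of this abstract can be pictured with a minimal sketch. The abstract does not say how the frequencies are transposed, so the code assumes one simple scheme: every frequency is shifted by the same whole number of octaves, chosen so that the lowest mode lands near a reference pitch. The function name `transpose_to_audible` and the 440 Hz default are hypothetical.

```python
import math


def transpose_to_audible(freqs_hz, target_hz=440.0):
    """Shift all frequencies by the same number of octaves.

    Assumption: a uniform octave shift is used so that frequency ratios
    between modes are preserved; this is an illustrative choice, not the
    method reported in the paper.
    """
    lowest = min(freqs_hz)
    # Number of octaves separating the lowest mode from the reference pitch.
    octaves = round(math.log2(lowest / target_hz))
    factor = 2.0 ** octaves
    return [f / factor for f in freqs_hz]
```

Because every mode is divided by the same factor, the intervals between modes are unchanged, so two sequences with slightly different spectra would still sound different after transposition.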
Affiliation(s)
- Yiwen Hu
  - Laboratory for Atomistic and Molecular Mechanics, Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, Massachusetts 02139, United States
- Markus J Buehler
  - Laboratory for Atomistic and Molecular Mechanics, Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, Massachusetts 02139, United States
  - Center for Computational Science and Engineering, Schwarzman College of Computing, Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, Massachusetts 02139, United States
10
Rosignoli S, Paiardini A. Boosting the Full Potential of PyMOL with Structural Biology Plugins. Biomolecules 2022; 12:biom12121764. [PMID: 36551192 PMCID: PMC9775141 DOI: 10.3390/biom12121764] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 11/08/2022] [Revised: 11/23/2022] [Accepted: 11/24/2022] [Indexed: 11/29/2022] Open
Abstract
Over the past few decades, the number of available structural bioinformatics pipelines, libraries, plugins, web resources and software has increased exponentially and become accessible to the broad realm of life scientists. This expansion has shaped the field as a tangled network of methods, algorithms and user interfaces. In recent years PyMOL, widely used software for biomolecules visualization and analysis, has started to play a key role in providing an open platform for the successful implementation of expert knowledge into an easy-to-use molecular graphics tool. This review outlines the plugins and features that make PyMOL an eligible environment for supporting structural bioinformatics analyses.
11
Wu L, Yin C, Zhu J, Wu Z, He L, Xia Y, Xie S, Qin T, Liu TY. SPRoBERTa: protein embedding learning with local fragment modeling. Brief Bioinform 2022; 23:6711410. [PMID: 36136367 DOI: 10.1093/bib/bbac401] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/04/2022] [Revised: 07/18/2022] [Accepted: 08/18/2022] [Indexed: 12/14/2022] Open
Abstract
A good understanding of protein function and structure in computational biology helps in the understanding of human beings. Because few proteins are annotated structurally and functionally, the scientific community has embraced self-supervised pre-training on large amounts of unlabeled protein sequences for protein embedding learning. However, a protein is usually represented as a sequence of individual amino acids with a limited vocabulary size (e.g. 20 amino acid types), without considering the strong local semantics that exist in protein sequences. In this work, we propose a novel pre-training modeling approach, SPRoBERTa. We first present an unsupervised protein tokenizer to learn protein representations with local fragment patterns. Then, a novel framework for a deep pre-training model is introduced to learn protein embeddings. After pre-training, our method can be easily fine-tuned for different protein tasks, including amino acid-level prediction tasks (e.g. secondary structure prediction), amino acid pair-level prediction tasks (e.g. contact prediction), and protein-level prediction tasks (remote homology prediction, protein function prediction). Experiments show that our approach achieves significant improvements in all tasks and outperforms the previous methods. We also provide detailed ablation studies and analysis for our protein tokenizer and training framework.
Affiliation(s)
- Lijun Wu
  - Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China
- Chengcan Yin
  - National Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Road, Qixia District, 210023, Nanjing, Jiangsu Province, China
- Jinhua Zhu
  - CAS Key Laboratory of GIPAS, EEIS Department, University of Science and Technology of China, No.96, JinZhai Road Baohe District, 230026, Hefei, Anhui Province, China
- Zhen Wu
  - National Key Laboratory for Novel Software Technology, Nanjing University, 163 Xianlin Road, Qixia District, 210023, Nanjing, Jiangsu Province, China
- Liang He
  - Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China
- Yingce Xia
  - Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China
- Shufang Xie
  - Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China
- Tao Qin
  - Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China
- Tie-Yan Liu
  - Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, 100080, Beijing, China
12
Hubley R, Wheeler TJ, Smit AFA. Accuracy of multiple sequence alignment methods in the reconstruction of transposable element families. NAR Genom Bioinform 2022; 4:lqac040. [PMID: 35591887 PMCID: PMC9112768 DOI: 10.1093/nargab/lqac040] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 08/18/2021] [Revised: 03/29/2022] [Accepted: 04/29/2022] [Indexed: 02/06/2023] Open
Abstract
The construction of a high-quality multiple sequence alignment (MSA) from copies of a transposable element (TE) is a critical step in the characterization of a new TE family. Most studies of MSA accuracy have been conducted on protein or RNA sequence families, where structural features and strong signals of selection may assist with alignment. Less attention has been given to the quality of sequence alignments involving neutrally evolving DNA sequences such as those resulting from TE replication. Transposable element sequences are challenging to align due to their wide divergence ranges, fragmentation, and predominantly neutral mutation patterns. To gain insight into the effects of these properties on MSA accuracy, we developed a simulator of TE sequence evolution, and used it to generate a benchmark with which we evaluated the MSA predictions produced by several popular aligners, along with Refiner, a method we developed in the context of our RepeatModeler software. We find that MAFFT and Refiner generally outperform other aligners for low to medium divergence simulated sequences, while Refiner is uniquely effective when tasked with aligning highly divergent and fragmented instances of a family.
Affiliation(s)
- Robert Hubley
  - Institute for Systems Biology, Seattle, WA 98109, USA
- Travis J Wheeler
  - Department of Computer Science, University of Montana, Missoula, MT 59801, USA
13
Nayeem MA, Bayzid MS, Rahman AH, Shahriyar R, Rahman MS. Multiobjective Formulation of Multiple Sequence Alignment for Phylogeny Inference. IEEE TRANSACTIONS ON CYBERNETICS 2022; 52:2775-2786. [PMID: 33044939 DOI: 10.1109/tcyb.2020.3020308] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/11/2023]
Abstract
Multiple sequence alignment (MSA) is a preliminary task for estimating phylogenies. It is used for homology inference among the sequences of a set of species. Generally, the MSA task is handled as a single-objective optimization process. The alignments computed under one criterion may be different from the alignments generated by other criteria, inferring discordant homologies and thus leading to different hypothesized evolutionary histories relating the sequences. The multiobjective (MO) formulation of MSA has recently been advocated by several researchers to address this issue. An MO approach independently optimizes multiple (often conflicting) objective functions at the same time and outputs a set of competitive alignments. However, no conceptual or experimental rationale from a real-world application perspective has been reported so far for any MO formulation of MSA. This article investigates the impact of MO formulation in the context of an important scientific problem, namely, phylogeny estimation. Employing popular evolutionary MO algorithms, we show that: 1) trees inferred based on alignments produced by the existing MSA methods used in practice are substantially worse in quality than the trees inferred based on the alignments output by an MO algorithm and 2) even high-quality alignments (according to popular measures available in the literature) may fail to achieve acceptable accuracy in generating phylogenetic trees. Thus, we essentially ask the following natural question: "can a phylogeny-aware (i.e., application-aware) metric guide the selection of appropriate MO formulations to ensure better phylogeny estimation?" Here, we report a carefully designed extensive experimental study that positively answers this question.
14
Kostenko DO, Korotkov EV. Application of the MAHDS Method for Multiple Alignment of Highly Diverged Amino Acid Sequences. Int J Mol Sci 2022; 23:ijms23073764. [PMID: 35409125 PMCID: PMC8998981 DOI: 10.3390/ijms23073764] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2022] [Revised: 03/23/2022] [Accepted: 03/23/2022] [Indexed: 12/10/2022] Open
Abstract
The aim of this work was to compare the multiple alignment methods MAHDS, T-Coffee, MUSCLE, Clustal Omega, Kalign, MAFFT, and PRANK in their ability to align highly divergent amino acid sequences. To accomplish this, we created test amino acid sequences with an average number of substitutions per amino acid (x) from 0.6 to 5.6, a total of 81 sets. Comparison of the performance of sequence alignments constructed by MAHDS and previously developed algorithms using the CS and Z score criteria and the benchmark alignment database (BAliBASE) indicated that, although the quality of the alignments built with MAHDS was somewhat lower than that of the other algorithms, it was compensated by greater statistical significance. MAHDS could construct statistically significant alignments of artificial sequences with x ≤ 4.8, whereas the other algorithms (T-Coffee, MUSCLE, Clustal Omega, Kalign, MAFFT, and PRANK) could not perform that at x > 2.4. The application of MAHDS to align 21 families of highly diverged proteins (identity < 20%) from Pfam and HOMSTRAD databases showed that it could calculate statistically significant alignments in cases when the other methods failed. Thus, MAHDS could be used to construct statistically significant multiple alignments of highly divergent protein sequences, which accumulated multiple mutations during evolution.
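The identity threshold quoted in this abstract (identity < 20%) depends on how identity is computed from an alignment. The sketch below shows one common convention, assumed here for illustration only: count matching residues over the aligned columns where neither sequence has a gap. The abstract does not state which definition the authors used, and the function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity of two rows taken from the same alignment.

    Assumption: columns with a gap in either row are excluded from the
    denominator. Other conventions (e.g. dividing by the alignment
    length or by the shorter sequence length) give different values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both rows contain a residue.
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)
```

For example, the rows `ACD-EF` and `ACEKEF` share five gap-free columns, four of which match, so this convention gives 80% identity.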
15
Petrov PB, Awoniyi LO, Šuštar V, Balci MÖ, Mattila PK. AutoCoEv—A High-Throughput In Silico Pipeline for Predicting Inter-Protein Coevolution. Int J Mol Sci 2022; 23:ijms23063351. [PMID: 35328772 PMCID: PMC8952222 DOI: 10.3390/ijms23063351] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2022] [Revised: 03/15/2022] [Accepted: 03/17/2022] [Indexed: 11/16/2022] Open
Abstract
Protein–protein interactions govern cellular processes via complex regulatory networks, which are still far from being understood. Thus, identifying and understanding connections between proteins can significantly facilitate our comprehension of the mechanistic principles of protein functions. Coevolution between proteins is a sign of functional communication and, as such, provides a powerful approach to search for novel direct or indirect molecular partners. However, an evolutionary analysis of large arrays of proteins in silico is a highly time-consuming effort that has limited the usage of this method to protein pairs or small protein groups. Here, we developed AutoCoEv, a user-friendly, open source, computational pipeline for the search of coevolution between a large number of proteins. By driving 15 individual programs, culminating in CAPS2 as the software for detecting coevolution, AutoCoEv achieves a seamless automation and parallelization of the workflow. Importantly, we provide a patch to the CAPS2 source code to strengthen its statistical output, allowing for multiple comparison corrections and an enhanced analysis of the results. We apply the pipeline to inspect coevolution among 324 proteins identified to be located in the vicinity of the lipid rafts of B lymphocytes. We successfully detected multiple coevolutionary relations between the proteins, predicting many novel partners and previously unidentified clusters of functionally related molecules. We conclude that AutoCoEv can be used to predict functional interactions from large datasets in a time- and cost-efficient manner.
Affiliation(s)
- Petar B. Petrov
- MediCity Research Laboratories, Institute of Biomedicine, University of Turku, 20014 Turku, Finland; (L.O.A.); (V.Š.); (M.Ö.B.)
- Turku Bioscience Centre, University of Turku and Åbo Akademi University, 20520 Turku, Finland
- Correspondence: (P.B.P.); (P.K.M.)
| | - Luqman O. Awoniyi
- MediCity Research Laboratories, Institute of Biomedicine, University of Turku, 20014 Turku, Finland; (L.O.A.); (V.Š.); (M.Ö.B.)
- Turku Bioscience Centre, University of Turku and Åbo Akademi University, 20520 Turku, Finland
| | - Vid Šuštar
- MediCity Research Laboratories, Institute of Biomedicine, University of Turku, 20014 Turku, Finland; (L.O.A.); (V.Š.); (M.Ö.B.)
| | - M. Özge Balci
- MediCity Research Laboratories, Institute of Biomedicine, University of Turku, 20014 Turku, Finland; (L.O.A.); (V.Š.); (M.Ö.B.)
- Turku Bioscience Centre, University of Turku and Åbo Akademi University, 20520 Turku, Finland
| | - Pieta K. Mattila
- MediCity Research Laboratories, Institute of Biomedicine, University of Turku, 20014 Turku, Finland; (L.O.A.); (V.Š.); (M.Ö.B.)
- Turku Bioscience Centre, University of Turku and Åbo Akademi University, 20520 Turku, Finland
- Correspondence: (P.B.P.); (P.K.M.)
| |
|
16
|
Alpert A, Nahman O, Starosvetsky E, Hayun M, Curiel TJ, Ofran Y, Shen-Orr SS. Alignment of single-cell trajectories by tuMap enables high-resolution quantitative comparison of cancer samples. Cell Syst 2022; 13:71-82.e8. [PMID: 34624253 PMCID: PMC8776581 DOI: 10.1016/j.cels.2021.09.003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/17/2020] [Revised: 06/20/2021] [Accepted: 09/09/2021] [Indexed: 01/21/2023]
Abstract
Single-cell technologies allow characterization of cancer samples as continuous developmental trajectories. Yet, the obtained temporal resolution cannot be leveraged for a comparative analysis due to the large phenotypic heterogeneity existing between patients. Here, we present the tuMap algorithm that exploits high-dimensional single-cell data of cancer samples exhibiting an underlying developmental structure to align them with the healthy development, yielding the tuMap pseudotime axis that allows their systematic, meaningful comparison. We applied tuMap to single-cell mass cytometry data of acute lymphoblastic and myeloid leukemia to reveal associations between the tuMap pseudotime axis and clinics that outperform cellular assignment into developmental populations. Application of the tuMap algorithm to single-cell RNA sequencing data further identified gene signatures of stem cells residing at the very early parts of the cancer trajectories. The quantitative framework provided by tuMap allows generation of metrics for the evaluation of cancer patients.
Affiliation(s)
- Ayelet Alpert
- Department of Immunology, Faculty of Medicine, Technion Israel Institute of Technology, Haifa 3525422, Israel
| | - Ornit Nahman
- Department of Immunology, Faculty of Medicine, Technion Israel Institute of Technology, Haifa 3525422, Israel
| | - Elina Starosvetsky
- Department of Immunology, Faculty of Medicine, Technion Israel Institute of Technology, Haifa 3525422, Israel
| | - Michal Hayun
- Department of Hematology and Bone Marrow Transplantation, Rambam Health Care Campus, Haifa 3109601, Israel
| | - Tyler J Curiel
- Department of Medicine/Hematology & Medical Oncology, School of Medicine, the University of Texas Health San Antonio, San Antonio, TX 78229, USA
| | - Yishai Ofran
- Department of Immunology, Faculty of Medicine, Technion Israel Institute of Technology, Haifa 3525422, Israel; Department of Hematology and Bone Marrow Transplantation, Rambam Health Care Campus, Haifa 3109601, Israel; Department of Hematology, Shaare Zedek Medical Center, Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem 9103102, Israel.
| | - Shai S Shen-Orr
- Department of Immunology, Faculty of Medicine, Technion Israel Institute of Technology, Haifa 3525422, Israel.
| |
|
17
|
Biological sequence analysis. Bioinformatics 2022. [DOI: 10.1016/b978-0-323-89775-4.00003-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
|
18
|
De Luca D, Lauritano C. Transcriptome Mining to Identify Genes of Interest: From Local Databases to Phylogenetic Inference. Methods Mol Biol 2022; 2498:43-51. [PMID: 35727539 DOI: 10.1007/978-1-0716-2313-8_3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Advances in next-generation sequencing technologies and falling sequencing costs have increased the amount of transcriptome data generated each year. These data have great potential for identifying genes and molecular pathways of interest across a plethora of organisms. However, navigating these resources requires some bioinformatics and evolutionary skills. Here, we describe a protocol of transcriptome data mining for genes of interest, from the creation of a protein database to the inference of phylogenetic trees, which was used for marine protists but can be used as a general pipeline across different taxa.
Affiliation(s)
- Daniele De Luca
- Department of Biology, University of Naples Federico II, Botanic Garden of Naples, Naples, Italy.
| | - Chiara Lauritano
- Department of Ecosustainable Marine Biotechnology, Stazione Zoologica Anton Dohrn, Naples, Italy.
| |
|
19
|
Spielman SJ, Miraglia ML. Relative model selection of evolutionary substitution models can be sensitive to multiple sequence alignment uncertainty. BMC Ecol Evol 2021; 21:214. [PMID: 34844571 PMCID: PMC8628390 DOI: 10.1186/s12862-021-01931-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/19/2021] [Accepted: 10/06/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND Multiple sequence alignments (MSAs) represent the fundamental unit of data inputted to most comparative sequence analyses. In phylogenetic analyses in particular, errors in MSA construction have the potential to induce further errors in downstream analyses such as phylogenetic reconstruction itself, ancestral state reconstruction, and divergence time estimation. In addition to providing phylogenetic methods with an MSA to analyze, researchers must also specify a suitable evolutionary model for the given analysis. Most commonly, researchers apply relative model selection to select a model from a candidate set and then provide both the MSA and the selected model as input to subsequent analyses. While the influence of MSA errors has been explored for most stages of phylogenetics pipelines, the potential effects of MSA uncertainty on the relative model selection procedure itself have not been explored. RESULTS We assessed the consistency of relative model selection when presented with multiple perturbed versions of a given MSA. We find that while relative model selection is mostly robust to MSA uncertainty, in a substantial proportion of circumstances, relative model selection identifies distinct best-fitting models from different MSAs created from the same set of sequences. We find that this issue is more pervasive for nucleotide data compared to amino-acid data. However, we also find that it is challenging to predict whether relative model selection will be robust or sensitive to uncertainty in a given MSA. CONCLUSIONS We find that MSA uncertainty can affect virtually all steps of phylogenetic analysis pipelines to a greater extent than has previously been recognized, including relative model selection.
Affiliation(s)
| | - Molly L Miraglia
- Department of Molecular and Cellular Biosciences, Rowan University, Glassboro, NJ, 08028, USA.,Fox Chase Cancer Center, Philadelphia, PA, 19111, USA
| |
|
20
|
Generator based approach to analyze mutations in genomic datasets. Sci Rep 2021; 11:21084. [PMID: 34702945 PMCID: PMC8548350 DOI: 10.1038/s41598-021-00609-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/11/2021] [Accepted: 10/13/2021] [Indexed: 11/09/2022]
Abstract
In contrast to the conventional approach of directly comparing genomic sequences using sequence alignment tools, we propose a computational approach that performs comparisons between sequence generators. These sequence generators are learned via a data-driven approach that empirically computes the state machine generating the genomic sequence of interest. As the state machine based generator of the sequence is independent of the sequence length, it provides us with an efficient method to compute the statistical distance between large sets of genomic sequences. Moreover, our technique provides a fast and efficient method to cluster large datasets of genomic sequences, characterize their temporal and spatial evolution in a continuous manner, get insights into the locality sensitive information about the sequences without any need for alignment. Furthermore, we show that the technique can be used to detect local regions with mutation activity, which can then be applied to aid alignment techniques for the fast discovery of mutations. To demonstrate the efficacy of our technique on real genomic data, we cluster different strains of SARS-CoV-2 viral sequences, characterize their evolution and identify regions of the viral sequence with mutations.
|
21
|
Wang Y, Zhao Y, Pan Q. Advances, challenges and opportunities of phylogenetic and social network analysis using COVID-19 data. Brief Bioinform 2021; 23:6380452. [PMID: 34601563 DOI: 10.1093/bib/bbab406] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2021] [Revised: 08/04/2021] [Accepted: 09/03/2021] [Indexed: 11/15/2022]
Abstract
Coronavirus disease 2019 (COVID-19) has attracted research interest from all fields. Phylogenetic and social network analyses based on connectivity between either COVID-19 patients or geographic regions and similarity between severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequences provide unique angles to answer public health and pharmaco-biological questions such as relationships between various SARS-CoV-2 mutants, the transmission pathways in a community and the effectiveness of prevention policies. This paper serves as a systematic review of current phylogenetic and social network analyses with applications in COVID-19 research. Challenges in current phylogenetic network analysis on SARS-CoV-2, such as unreliable inferences, sampling bias and batch effects, are discussed, as well as potential solutions. Social network analysis combined with epidemiology models helps to identify key transmission characteristics and measure the effectiveness of prevention and control strategies. Finally, future new directions of network analysis motivated by COVID-19 data are summarized.
Affiliation(s)
- Yue Wang
- School of Mathematical and Natural Science, Arizona State University, 4701 W Thunderbird Rd, 85306, Arizona, USA
| | - Yunpeng Zhao
- School of Mathematical and Natural Science, Arizona State University, 4701 W Thunderbird Rd, 85306, Arizona, USA
| | - Qing Pan
- Department of Statistics, George Washington University, 801 22nd St. NW, 20052, Washington DC, USA
| |
|
22
|
Li Y. Sequence Alignment with Q-Learning Based on the Actor-Critic Model. ACM T ASIAN LOW-RESO 2021. [DOI: 10.1145/3433540] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
Abstract
Multiple sequence alignment methods refer to a series of algorithmic solutions for the alignment of evolutionarily related sequences while taking into account evolutionary events such as mutations, insertions, deletions, and rearrangements under certain conditions. In this article, we propose a method with Q-learning based on the Actor-Critic model for sequence alignment. We transform the sequence alignment problem into an agent's autonomous learning process. In this process, the reward of each possible next action is calculated, along with the cumulative reward of the entire process. The results show that the method we propose is better than the genetic algorithm and the dynamic programming method.
Affiliation(s)
- Yarong Li
- The Experimental High School Attached to Beijing Normal University, Beijing, China
| |
|
23
|
Poirier D, Théolier J, Marega R, Delahaut P, Gillard N, Godefroy SB. Evaluation of the discriminatory potential of antibodies created from synthetic peptides derived from wheat, barley, rye and oat gluten. PLoS One 2021; 16:e0257466. [PMID: 34555094 PMCID: PMC8459967 DOI: 10.1371/journal.pone.0257466] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/14/2021] [Accepted: 09/01/2021] [Indexed: 11/18/2022]
Abstract
Celiac disease (CD) is triggered by ingestion of gluten-containing cereals such as wheat, barley, rye and in some cases oat. The only way for affected individuals to avoid symptoms of this condition is to adopt a gluten-free diet. Thus, gluten-free foodstuffs need to be monitored in order to ensure their innocuity. For this purpose, commercial immunoassays based on recognition of defined linear gluten sequences are currently used. These immunoassays are designed to detect or quantify total gluten regardless of the cereal, and often result in over- or underestimation of the exact gluten content. In addition, Canadian regulations require a declaration of the source of gluten on the label of prepackaged foods, which cannot be done due to the limitations of existing methods. In this study, new antibodies targeting discrimination of gluten sources were developed using synthetic peptides as an immunization strategy. Fourteen synthetic peptides selected from unique linear amino acid sequences of gluten were bioconjugated to Concholepas concholepas hemocyanin (CCH) as a protein carrier, to elicit antibodies in rabbits. The resulting polyclonal antibodies (pAbs) successfully discriminated wheat, barley and oat prolamins during indirect ELISA assessments. pAbs raised against rye synthetic peptides cross-reacted evenly with wheat and rye prolamins but could still be useful to discriminate gluten sources in combination with the other pAbs. Discrimination of gluten sources can be further refined and enhanced by raising monoclonal antibodies using a similar immunization strategy. A methodology capable of discriminating gluten sources, such as the one proposed in this study, could facilitate compliance with Canadian regulations on this matter. This type of discrimination could also complement current immunoassays by settling the issue of over- and underestimation of gluten content, thus improving the safety of food intended for CD and wheat-allergic patients.
Affiliation(s)
- David Poirier
- Department of Food Science and Nutrition, Pavillon Paul-Comtois, Université Laval, Québec, Québec, Canada
- Institute of Nutrition and Functional Foods, Université Laval, Québec, Québec, Canada
| | - Jérémie Théolier
- Department of Food Science and Nutrition, Pavillon Paul-Comtois, Université Laval, Québec, Québec, Canada
- Institute of Nutrition and Functional Foods, Université Laval, Québec, Québec, Canada
| | - Samuel Benrejeb Godefroy
- Department of Food Science and Nutrition, Pavillon Paul-Comtois, Université Laval, Québec, Québec, Canada
- Institute of Nutrition and Functional Foods, Université Laval, Québec, Québec, Canada
| |
|
24
|
Neuwald AF, Lanczycki CJ, Hodges TK, Marchler-Bauer A. Obtaining extremely large and accurate protein multiple sequence alignments from curated hierarchical alignments. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2021; 2020:5850901. [PMID: 32500917 PMCID: PMC7297217 DOI: 10.1093/database/baaa042] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/26/2019] [Revised: 04/01/2020] [Accepted: 05/06/2020] [Indexed: 11/12/2022]
Abstract
For optimal performance, machine learning methods for protein sequence/structural analysis typically require as input a large multiple sequence alignment (MSA), which is often created using query-based iterative programs, such as PSI-BLAST or JackHMMER. However, because these programs align database sequences using a query sequence as a template, they may fail to detect or may tend to misalign sequences distantly related to the query. More generally, automated MSA programs often fail to align sequences correctly due to the unpredictable nature of protein evolution. Addressing this problem typically requires manual curation in the light of structural data. However, curated MSAs tend to contain too few sequences to serve as input for statistically based methods. We address these shortcomings by making publicly available a set of 252 curated hierarchical MSAs (hiMSAs), containing a total of 26 212 066 sequences, along with programs for generating from these extremely large MSAs. Each hiMSA consists of a set of hierarchically arranged MSAs representing individual subgroups within a superfamily along with template MSAs specifying how to align each subgroup MSA against MSAs higher up the hierarchy. Central to this approach is the MAPGAPS search program, which uses a hiMSA as a query to align (potentially vast numbers of) matching database sequences with accuracy comparable to that of the curated hiMSA. We illustrate this process for the exonuclease–endonuclease–phosphatase superfamily and for pleckstrin homology domains. A set of extremely large MSAs generated from the hiMSAs in this way is available as input for deep learning, big data analyses. MAPGAPS, auxiliary programs CDD2MGS, AddPhylum, PurgeMSA and ConvertMSA and links to National Center for Biotechnology Information data files are available at https://www.igs.umaryland.edu/labs/neuwald/software/mapgaps/.
Affiliation(s)
- Andrew F Neuwald
- Institute for Genome Sciences.,Department of Biochemistry & Molecular Biology, University of Maryland School of Medicine, 670 W. Baltimore Street, Baltimore, MD 21201, USA
| | - Christopher J Lanczycki
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38 A, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Aron Marchler-Bauer
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38 A, 8600 Rockville Pike, Bethesda, MD 20894, USA
| |
|
25
|
Storer JM, Hubley R, Rosen J, Smit AFA. Curation Guidelines for de novo Generated Transposable Element Families. Curr Protoc 2021; 1:e154. [PMID: 34138525 DOI: 10.1002/cpz1.154] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Indexed: 11/07/2022]
Abstract
Transposable elements (TEs) have the ability to alter individual genomic landscapes and shape the course of evolution for species in which they reside. Such profound changes can be understood by studying the biology of the organism and the interplay of the TEs it hosts. Characterizing and curating TEs across a wide range of species is a fundamental first step in this endeavor. This protocol employs techniques honed while developing TE libraries for a wide range of organisms and specifically addresses: (1) the extension of truncated de novo results into full-length TE families; (2) the iterative refinement of TE multiple sequence alignments; and (3) the use of alignment visualization to assess model completeness and subfamily structure. © 2021 Wiley Periodicals LLC. Basic Protocol: Extension and edge polishing of consensi and seed alignments derived from de novo repeat finders Support Protocol: Generating seed alignments using a library of consensi and a genome assembly.
Affiliation(s)
| | | | - Jeb Rosen
- Institute for Systems Biology, Seattle, Washington
| | | |
|
26
|
Bhardwaj V, Pevzner PA, Rashtchian C, Safonova Y. Trace Reconstruction Problems in Computational Biology. IEEE TRANSACTIONS ON INFORMATION THEORY 2021; 67:3295-3314. [PMID: 34176957 PMCID: PMC8224466 DOI: 10.1109/tit.2020.3030569] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/13/2023]
Abstract
The problem of reconstructing a string from its error-prone copies, the trace reconstruction problem, was introduced by Vladimir Levenshtein two decades ago. While there has been considerable theoretical work on trace reconstruction, practical solutions have only recently started to emerge in the context of two rapidly developing research areas: immunogenomics and DNA data storage. In immunogenomics, traces correspond to mutated copies of genes, with mutations generated naturally by the adaptive immune system. In DNA data storage, traces correspond to noisy copies of DNA molecules that encode digital data, with errors being artifacts of the data retrieval process. In this paper, we introduce several new trace generation models and open questions relevant to trace reconstruction for immunogenomics and DNA data storage, survey theoretical results on trace reconstruction, and highlight their connections to computational biology. Throughout, we discuss the applicability and shortcomings of known solutions and suggest future research directions.
Affiliation(s)
- Vinnu Bhardwaj
- Electrical and Computer Engineering Department, University of California San Diego, La Jolla, USA
| | - Pavel A. Pevzner
- Computer Science and Engineering Department, University of California San Diego, La Jolla, USA
| | - Cyrus Rashtchian
- Computer Science and Engineering Department, University of California San Diego, La Jolla, USA
- Qualcomm Institute, University of California San Diego, La Jolla, USA
| | - Yana Safonova
- Computer Science and Engineering Department, University of California San Diego, La Jolla, USA
| |
|
27
|
Neuwald AF, Kolaczkowski BD, Altschul SF. eCOMPASS: evaluative comparison of multiple protein alignments by statistical score. Bioinformatics 2021; 37:3456-3463. [PMID: 33983436 PMCID: PMC8545322 DOI: 10.1093/bioinformatics/btab374] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/08/2020] [Revised: 03/31/2021] [Accepted: 05/12/2021] [Indexed: 11/21/2022]
Abstract
Motivation Detecting subtle biologically relevant patterns in protein sequences often requires the construction of a large and accurate multiple sequence alignment (MSA). Methods for constructing MSAs are usually evaluated using benchmark alignments, which, however, typically contain very few sequences and are therefore inappropriate when dealing with large numbers of proteins. Results eCOMPASS addresses this problem using a statistical measure of relative alignment quality based on direct coupling analysis (DCA): to maintain protein structural integrity over evolutionary time, substitutions at one residue position typically result in compensating substitutions at other positions. eCOMPASS computes the statistical significance of the congruence between high scoring directly coupled pairs and 3D contacts in corresponding structures, which depends upon properly aligned homologous residues. We illustrate eCOMPASS using both simulated and real MSAs. Availability and implementation The eCOMPASS executable, C++ open source code and input data sets are available at https://www.igs.umaryland.edu/labs/neuwald/software/compass Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Andrew F Neuwald
- Department of Biochemistry & Molecular Biology, University of Maryland School of Medicine, Baltimore, MD 21201, USA
| | - Bryan D Kolaczkowski
- Department of Microbiology & Cell Science, University of Florida, Gainesville, FL 32611, USA
| | - Stephen F Altschul
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
| |
|
28
|
Akand EH, Murray JM. NGlyAlign: an automated library building tool to align highly divergent HIV envelope sequences. BMC Bioinformatics 2021; 22:54. [PMID: 33557755 PMCID: PMC7869453 DOI: 10.1186/s12859-020-03901-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/04/2020] [Accepted: 11/23/2020] [Indexed: 08/29/2023]
Abstract
BACKGROUND The high variability in envelope regions of some viruses such as HIV allows the virus to establish infection and to escape subsequent immune surveillance. This variability, as well as increasing incorporation of N-linked glycosylation sites, is fundamental to this evasion. It also creates difficulties for multiple sequence alignment (MSA) methods that provide the first step in their analysis. Existing MSA tools often fail to properly align highly variable HIV envelope sequences, requiring extensive manual editing that is impractical with even a moderate number of these variable sequences. RESULTS We developed an automated library building tool, NGlyAlign, that organizes similar N-linked glycosylation sites as block constraints and statistically conserved global sites as single site constraints to automatically enforce partial columns in consistency-based MSA methods such as Dialign. This combined method accurately aligns variable HIV-1 envelope sequences. We tested the method on two datasets: a set of 156 founder and chronic gp160 HIV-1 subtype B sequences as well as a set of reference sequences of gp120 in the highly variable region 1. On measures such as entropy scores, sum of pair scores, column score, and similarity heat maps, NGlyAlign+Dialign proved superior to methods such as T-Coffee, ClustalOmega, ClustalW, Praline, HIValign and Muscle. The method is scalable to large sequence sets, producing accurate alignments without requiring manual editing. As well as this application to HIV, our method can be used for other highly variable glycoproteins such as hepatitis C virus envelope. CONCLUSIONS NGlyAlign is an automated tool for mapping and building glycosylation motif libraries to accurately align highly variable regions in HIV sequences. It can provide the basis for many studies reliant on single robust alignments.
NGlyAlign has been developed as an open-source tool and is freely available at https://github.com/UNSW-Mathematical-Biology/NGlyAlign_v1.0 .
Affiliation(s)
- Elma H Akand
- School of Mathematics and Statistics, UNSW, Sydney, NSW, Australia.
| | - John M Murray
- School of Mathematics and Statistics, UNSW, Sydney, NSW, Australia
| |
|
29
|
New Approaches for Inferring Phylogenies in the Presence of Paralogs. Trends Genet 2021; 37:174-187. [DOI: 10.1016/j.tig.2020.08.012] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Received: 07/08/2020] [Revised: 08/13/2020] [Accepted: 08/19/2020] [Indexed: 12/18/2022]
|
30
|
Optical pattern generator for efficient bio-data encoding in a photonic sequence comparison architecture. PLoS One 2021; 16:e0245095. [PMID: 33449928 PMCID: PMC7810328 DOI: 10.1371/journal.pone.0245095] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 05/25/2020] [Accepted: 12/21/2020] [Indexed: 11/19/2022]
Abstract
In this study, optical technology is considered as a solution to sequence alignment (SA) issues, with the potential to increase speed, overcome memory limitations, reduce power consumption, and increase output accuracy. We therefore examine the effect of bio-data encoding and the creation of input images on the pattern-recognition error rate at the output of an optical Vander Lugt correlator. Moreover, we present a genetic algorithm-based coding approach, named GAC, to minimize the output noise of cross-correlating data. As a case study, we adopt the proposed coding approach within a correlation-based optical architecture for counting k-mers in a DNA string. As verified by simulations on the Salmonella whole genome, we can improve sensitivity and speed by more than 86% and 81%, respectively, compared to BLAST by feeding the coding set generated by the GAC method to the proposed optical correlator system. Moreover, we present a comprehensive report on the impact of 1D and 2D cross-correlation approaches, as well as various coding parameters, on the output noise, which motivates system designers to customize the coding sets within the optical setup.
|
31
|
Perrin A, Rocha EPC. PanACoTA: a modular tool for massive microbial comparative genomics. NAR Genom Bioinform 2021; 3:lqaa106. [PMID: 33575648 PMCID: PMC7803007 DOI: 10.1093/nargab/lqaa106] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 09/16/2020] [Revised: 11/10/2020] [Accepted: 12/01/2020] [Indexed: 02/06/2023]
Abstract
The study of the gene repertoires of microbial species, their pangenomes, has become a key part of microbial evolution and functional genomics. Yet, the increasing number of genomes available complicates the establishment of the basic building blocks of comparative genomics. Here, we present PanACoTA (https://github.com/gem-pasteur/PanACoTA), a tool that allows users to download all genomes of a species, build a database with those passing quality and redundancy controls, uniformly annotate them, and then build their pangenome, several variants of core genomes, their alignments, and a rapid but accurate phylogenetic tree. While many programs for building pangenomes have become available in the last few years, we have focused on a modular method that tackles all the key steps of the process, from download to phylogenetic inference. While all steps are integrated, they can also be run separately and multiple times to allow rapid and extensive exploration of the parameters of interest. PanACoTA is built in Python 3, includes a Singularity container, and has features to facilitate its future development. We believe PanACoTA is an interesting addition to the current set of comparative genomics tools, since it will accelerate and standardize the more routine parts of the work, allowing microbial genomicists to more quickly tackle their specific questions.
Affiliation(s)
- Amandine Perrin
  - Microbial Evolutionary Genomics, CNRS, UMR3525, Institut Pasteur, 28, rue Dr Roux, Paris 75015, France
- Eduardo P C Rocha
  - Microbial Evolutionary Genomics, CNRS, UMR3525, Institut Pasteur, 28, rue Dr Roux, Paris 75015, France
|
32
|
Risser F, Collin S, Dos Santos-Morais R, Gruez A, Chagot B, Weissman KJ. Towards improved understanding of intersubunit interactions in modular polyketide biosynthesis: Docking in the enacyloxin IIa polyketide synthase. J Struct Biol 2020; 212:107581. [DOI: 10.1016/j.jsb.2020.107581] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 05/03/2020] [Revised: 07/14/2020] [Accepted: 07/16/2020] [Indexed: 12/26/2022]
|
33
|
Portik DM, Wiens JJ. Do Alignment and Trimming Methods Matter for Phylogenomic (UCE) Analyses? Syst Biol 2020; 70:440-462. [PMID: 32797207 DOI: 10.1093/sysbio/syaa064] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 05/15/2019] [Revised: 08/02/2020] [Accepted: 08/03/2020] [Indexed: 11/14/2022]
Abstract
Alignment is a crucial issue in molecular phylogenetics because different alignment methods can potentially yield very different topologies for individual genes. But it is unclear if the choice of alignment methods remains important in phylogenomic analyses, which incorporate data from hundreds or thousands of genes. For example, problematic biases in alignment might be multiplied across many loci, whereas alignment errors in individual genes might become irrelevant. The issue of alignment trimming (i.e., removing poorly aligned regions or missing data from individual genes) is also poorly explored. Here, we test the impact of 12 different combinations of alignment and trimming methods on phylogenomic analyses. We compare these methods using published phylogenomic data from ultraconserved elements (UCEs) from squamate reptiles (lizards and snakes), birds, and tetrapods. We compare the properties of alignments generated by different alignment and trimming methods (e.g., length, informative sites, missing data). We also test whether these data sets can recover well-established clades when analyzed with concatenated (RAxML) and species-tree methods (ASTRAL-III), using the full data (~5000 loci) and subsampled data sets (10% and 1% of loci). We show that different alignment and trimming methods can significantly impact various aspects of phylogenomic data sets (e.g., length, informative sites). However, these different methods generally had little impact on the recovery and support values for well-established clades, even across very different numbers of loci. Nevertheless, our results suggest several "best practices" for alignment and trimming. Intriguingly, the choice of phylogenetic methods impacted the phylogenetic results most strongly, with concatenated analyses recovering significantly more well-established clades (with stronger support) than the species-tree analyses.
[Alignment; concatenated analysis; phylogenomics; sequence length heterogeneity; species-tree analysis; trimming].
Affiliation(s)
- Daniel M Portik
  - Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721, USA
  - California Academy of Sciences, San Francisco, CA 94118, USA
- John J Wiens
  - Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721, USA
|
34
|
Carpentier M, Chomilier J. Protein multiple alignments: sequence-based versus structure-based programs. Bioinformatics 2020; 35:3970-3980. [PMID: 30942864 DOI: 10.1093/bioinformatics/btz236] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 10/04/2018] [Revised: 03/05/2019] [Accepted: 04/02/2019] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Multiple sequence alignment programs have proved to be very useful and have already been evaluated in the literature, but alignment programs based on structure, or on both sequence and structure, have not. In the present article, we evaluate the added value provided by considering structures. RESULTS: We compared the multiple alignments resulting from 25 programs based on sequence, structure, or both to reference alignments deposited in five databases (BALIBASE 2 and 3, HOMSTRAD, OXBENCH and SISYPHUS). On the whole, the structure-based methods compute more reliable alignments than the sequence-based ones, and even than the sequence+structure-based programs, whatever the database. Two programs lead, MAMMOTH and MATRAS; nevertheless, the performance of MUSTANG, MATT, 3DCOMB, TCOFFEE+TM_ALIGN and TCOFFEE+SAP is better for some alignments. The advantage of structure-based methods increases at low levels of sequence identity, or for residues that are buried or in regular secondary structures. Concerning gap management, sequence-based programs set fewer gaps than structure-based programs. Concerning the databases, the alignments of the manually built databases are more challenging for the programs. AVAILABILITY AND IMPLEMENTATION: All data and results presented in this study are available at: http://wwwabi.snv.jussieu.fr/people/mathilde/download/AliMulComp/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mathilde Carpentier
  - Institut Systématique Evolution Biodiversité (ISYEB), Sorbonne Université, MNHN, CNRS, EPHE, Paris, France
- Jacques Chomilier
  - Sorbonne Université, MNHN, CNRS, IRD, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie (IMPMC), BiBiP, Paris, France
|
35
|
Jermiin LS, Catullo RA, Holland BR. A new phylogenetic protocol: dealing with model misspecification and confirmation bias in molecular phylogenetics. NAR Genom Bioinform 2020; 2:lqaa041. [PMID: 33575594 PMCID: PMC7671319 DOI: 10.1093/nargab/lqaa041] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 01/17/2020] [Revised: 05/18/2020] [Accepted: 06/04/2020] [Indexed: 12/15/2022]
Abstract
Molecular phylogenetics plays a key role in comparative genomics and has increasingly significant impacts on science, industry, government, public health and society. In this paper, we posit that the current phylogenetic protocol is missing two critical steps, and that their absence allows model misspecification and confirmation bias to unduly influence phylogenetic estimates. Based on the potential offered by well-established but under-used procedures, such as assessment of phylogenetic assumptions and tests of goodness of fit, we introduce a new phylogenetic protocol that will reduce confirmation bias and increase the accuracy of phylogenetic estimates.
Affiliation(s)
- Lars S Jermiin
  - CSIRO Land & Water, Canberra, ACT 2601, Australia
  - Research School of Biology, Australian National University, Canberra, ACT 2601, Australia
  - School of Biology & Environment Science, University College Dublin, Belfield, Dublin 4, Ireland
  - Earth Institute, University College Dublin, Belfield, Dublin 4, Ireland
- Renee A Catullo
  - CSIRO Land & Water, Canberra, ACT 2601, Australia
  - Research School of Biology, Australian National University, Canberra, ACT 2601, Australia
  - School of Science and Health & Hawkesbury Institute of the Environment, Western Sydney University, Penrith, NSW 2751, Australia
- Barbara R Holland
  - School of Natural Sciences, University of Tasmania, Hobart, TAS 7001, Australia
|
36
|
Features spaces and a learning system for structural-temporal data, and their application on a use case of real-time communication network validation data. PLoS One 2020; 15:e0228434. [PMID: 32027668 PMCID: PMC7004316 DOI: 10.1371/journal.pone.0228434] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2018] [Accepted: 01/15/2020] [Indexed: 11/21/2022]
Abstract
The service quality and system dependability of real-time communication networks strongly depend on the analysis of monitored data to identify concrete problems and their causes. Many of these can be described by either their structural or temporal properties, or a combination of both. As current research is short of approaches sufficiently addressing both properties simultaneously, we propose a new feature space specifically suited for this task, which we analyze for its theoretical properties and its practical relevance. We evaluate its classification performance when used on real-world data sets of structural-temporal mobile communication data, and compare it to the performance achieved by feature representations used in related work. For this purpose, we propose a system which allows the automatic detection and prediction of classes of pre-defined sequence behavior, greatly reducing costs caused by the otherwise required manual analysis. With our proposed feature spaces, this system achieves a precision of more than 93% at recall values of 100%, with an up to 6.7% higher effective recall than otherwise similarly performing alternatives, notably outperforming alternative deep learning, kernel learning and ensemble learning approaches of related work. Furthermore, the supported system calibration allows separating reliable from unreliable predictions more effectively, which is highly relevant for any practical application.
|
37
|
Nunez-Castilla J, Siltberg-Liberles J. An Easy Protocol for Evolutionary Analysis of Intrinsically Disordered Proteins. Methods Mol Biol 2020; 2141:147-177. [PMID: 32696356 DOI: 10.1007/978-1-0716-0524-0_7] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 06/11/2023]
Abstract
We present an easy protocol for evolutionary analysis of proteins, with an emphasis on studying the evolutionary dynamics of disordered regions. Using the p53 protein family as an example, we provide a guide for finding homologous sequences in a database and refining a dataset before constructing the evolutionary context by building a phylogenetic tree. We show how a multiple sequence alignment and phylogeny for a protein family can be further partitioned into smaller datasets in order to investigate the changes in disorder content across the phylogeny. Based on the evolutionary context, we also investigate site-specific conservation of disorder. Last, we address how to evaluate the evolutionary dynamics of disorder-to-order transitions.
Affiliation(s)
- Janelle Nunez-Castilla
  - Department of Biological Sciences, Biomolecular Sciences Institute, Florida International University, Miami, FL, USA
- Jessica Siltberg-Liberles
  - Department of Biological Sciences, Biomolecular Sciences Institute, Florida International University, Miami, FL, USA
|
38
|
Emms DM, Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol 2019; 20:238. [PMID: 31727128 PMCID: PMC6857279 DOI: 10.1186/s13059-019-1832-y] [Citation(s) in RCA: 2922] [Impact Index Per Article: 584.4] [Received: 04/24/2019] [Accepted: 09/23/2019] [Indexed: 12/22/2022]
Abstract
Here, we present a major advance of the OrthoFinder method. This extends OrthoFinder's high accuracy orthogroup inference to provide phylogenetic inference of orthologs, rooted gene trees, gene duplication events, the rooted species tree, and comparative genomics statistics. Each output is benchmarked on appropriate real or simulated datasets, and where comparable methods exist, OrthoFinder is equivalent to or outperforms these methods. Furthermore, OrthoFinder is the most accurate ortholog inference method on the Quest for Orthologs benchmark test. Finally, OrthoFinder's comprehensive phylogenetic analysis is achieved with equivalent speed and scalability to the fastest, score-based heuristic methods. OrthoFinder is available at https://github.com/davidemms/OrthoFinder.
Affiliation(s)
- David M Emms
  - Department of Plant Sciences, University of Oxford, South Parks Road, Oxford, OX1 3RB, UK
- Steven Kelly
  - Department of Plant Sciences, University of Oxford, South Parks Road, Oxford, OX1 3RB, UK
|
39
|
Zhang D, Gao F, Jakovlić I, Zou H, Zhang J, Li WX, Wang GT. PhyloSuite: An integrated and scalable desktop platform for streamlined molecular sequence data management and evolutionary phylogenetics studies. Mol Ecol Resour 2019; 20:348-355. [DOI: 10.1111/1755-0998.13096] [Citation(s) in RCA: 825] [Impact Index Per Article: 165.0] [Received: 05/13/2019] [Revised: 09/12/2019] [Accepted: 09/24/2019] [Indexed: 01/12/2023]
Affiliation(s)
- Dong Zhang
  - Key Laboratory of Aquaculture Disease Control, Ministry of Agriculture, and State Key Laboratory of Freshwater Ecology and Biotechnology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, China
  - University of Chinese Academy of Sciences, Beijing, China
- Fangluan Gao
  - Institute of Plant Virology, Fujian Agriculture and Forestry University, Fuzhou, Fujian, China
- Hong Zou
  - Key Laboratory of Aquaculture Disease Control, Ministry of Agriculture, and State Key Laboratory of Freshwater Ecology and Biotechnology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, China
- Wen X. Li
  - Key Laboratory of Aquaculture Disease Control, Ministry of Agriculture, and State Key Laboratory of Freshwater Ecology and Biotechnology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, China
- Gui T. Wang
  - Key Laboratory of Aquaculture Disease Control, Ministry of Agriculture, and State Key Laboratory of Freshwater Ecology and Biotechnology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, China
|
40
|
Cuevas-Caballé C, Riutort M, Álvarez-Presas M. Diet assessment of two land planarian species using high-throughput sequencing data. Sci Rep 2019; 9:8679. [PMID: 31213615 PMCID: PMC6581950 DOI: 10.1038/s41598-019-44952-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 11/29/2018] [Accepted: 05/29/2019] [Indexed: 11/30/2022]
Abstract
Geoplanidae (Platyhelminthes: Tricladida) feed on soil invertebrates. Observations of their predatory behavior in nature are scarce, and most of the information has been obtained from food preference experiments. Although these experiments are based on a wide variety of prey, this catalog is often far from being representative of the fauna present in the natural habitat of planarians. As some geoplanid species have recently become invasive, obtaining accurate knowledge about their feeding habits is crucial for the development of plans to control and prevent their expansion. Using high throughput sequencing data, we perform a metagenomic analysis to identify the in situ diet of two endemic and codistributed species of geoplanids from the Brazilian Atlantic Forest: Imbira marcusi and Cephaloflexa bergi. We have tested four different methods of taxonomic assignment and find that phylogenetic-based assignment methods outperform those based on similarity. The results show that the diet of I. marcusi is restricted to earthworms, whereas C. bergi preys on spiders, harvestmen, woodlice, grasshoppers, Hymenoptera, Lepidoptera and possibly other geoplanids. Furthermore, both species change their feeding habits among the different sample locations. In conclusion, the integration of metagenomics with phylogenetics should be considered when establishing studies on the feeding habits of invertebrates.
Affiliation(s)
- Cristian Cuevas-Caballé
  - Departament de Genètica, Microbiologia i Estadística, Facultat de Biologia, Universitat de Barcelona, Barcelona, Spain
  - Institut de Recerca de la Biodiversitat (IRBio), Universitat de Barcelona, Barcelona, Spain
- Marta Riutort
  - Departament de Genètica, Microbiologia i Estadística, Facultat de Biologia, Universitat de Barcelona, Barcelona, Spain
  - Institut de Recerca de la Biodiversitat (IRBio), Universitat de Barcelona, Barcelona, Spain
- Marta Álvarez-Presas
  - Departament de Genètica, Microbiologia i Estadística, Facultat de Biologia, Universitat de Barcelona, Barcelona, Spain
  - Institut de Recerca de la Biodiversitat (IRBio), Universitat de Barcelona, Barcelona, Spain
|
41
|
Assessing the Evolutionary Conservation of Protein Disulphide Bonds. Methods Mol Biol 2019. [PMID: 31069762 DOI: 10.1007/978-1-4939-9187-7_2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/13/2023]
Abstract
Studying the evolutionary conservation of proteins can be a valuable tool for understanding their function. At the sequence level, the conservation of each residue can be used to infer the importance of particular regions of a protein. In the case of protein disulphide bonds, the conservation of the cysteines involved can be used to infer the conservation of the disulphide bond itself. In this chapter, bioinformatics methods are described that can be used to assess the conservation of a protein disulphide bond, with a focus on the study of human proteins. Conservation will be assessed at the species level and at the human population level. The methods described make use of publicly available databases and can be applied by any researcher using a standard desktop computer with Internet access.
|
42
|
Alazem O, Abramyan J. Reptile enamel matrix proteins: Selection, divergence, and functional constraint. J Exp Zool B Mol Dev Evol 2019; 332:136-148. [PMID: 31045323 DOI: 10.1002/jez.b.22857] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 07/30/2018] [Revised: 02/24/2019] [Accepted: 04/01/2019] [Indexed: 12/14/2022]
Abstract
The three major enamel matrix proteins (EMPs): amelogenin (AMEL), ameloblastin (AMBN), and enamelin (ENAM), are intrinsically linked to tooth development in tetrapods. However, reptiles and mammals exhibit significant differences in dental patterning and development, potentially affecting how EMPs evolve in each group. In most reptiles, teeth are replaced continuously throughout life, while mammals have reduced replacement to only one or two generations. Reptiles also form structurally simple, aprismatic enamel while mammalian enamel is composed of highly organized hydroxyapatite prisms. These differences, combined with reported low sequence homology in reptiles, led us to predict that reptiles may experience lower selection pressure on their EMPs as compared with mammals. However, we found that like mammals, reptile EMPs are under moderate purifying selection, with some differences evident between AMEL, AMBN, and ENAM. We also demonstrate that sequence homology in reptile EMPs is closely associated with divergence times, with more recently diverged lineages exhibiting high homology, along with strong phylogenetic signal. Lastly, despite sequence divergence, none of the reptile species in our study exhibited mutations consistent with diseases that cause degeneration of enamel (e.g. amelogenesis imperfecta). Despite short tooth retention time and simplicity in enamel structure, reptile EMPs still exhibit purifying selection required to form durable enamel.
Affiliation(s)
- Omar Alazem
  - Department of Natural Sciences, University of Michigan-Dearborn, Dearborn, Michigan
- John Abramyan
  - Department of Natural Sciences, University of Michigan-Dearborn, Dearborn, Michigan
|
43
|
Nute M, Saleh E, Warnow T. Evaluating Statistical Multiple Sequence Alignment in Comparison to Other Alignment Methods on Protein Data Sets. Syst Biol 2019; 68:396-411. [PMID: 30329135 PMCID: PMC6472439 DOI: 10.1093/sysbio/syy068] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 04/18/2018] [Revised: 09/27/2018] [Accepted: 10/11/2018] [Indexed: 01/15/2023]
Abstract
The estimation of multiple sequence alignments of protein sequences is a basic step in many bioinformatics pipelines, including protein structure prediction, protein family identification, and phylogeny estimation. Statistical coestimation of alignments and trees under stochastic models of sequence evolution has long been considered the most rigorous technique for estimating alignments and trees, but little is known about the accuracy of such methods on biological benchmarks. We report the results of an extensive study evaluating the most popular protein alignment methods as well as the statistical coestimation method BAli-Phy on 1192 protein data sets from established benchmarks as well as on 120 simulated data sets. Our study (which used more than 230 CPU years for the BAli-Phy analyses alone) shows that BAli-Phy has better precision and recall (with respect to the true alignments) than the other alignment methods on the simulated data sets but has consistently lower recall on the biological benchmarks (with respect to the reference alignments) than many of the other methods. In other words, we find that BAli-Phy systematically underaligns when operating on biological sequence data but shows no sign of this on simulated data. There are several potential causes for this change in performance, including model misspecification, errors in the reference alignments, and conflicts between structural alignment and evolutionary alignments, and future research is needed to determine the most likely explanation. We conclude with a discussion of the potential ramifications for each of these possibilities. [BAli-Phy; homology; multiple sequence alignment; protein sequences; structural alignment.]
Affiliation(s)
- Michael Nute
  - Department of Statistics, University of Illinois at Urbana-Champaign, 725 S Wright St #101, Champaign, IL 61820, USA
- Ehsan Saleh
  - Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N. Goodwin Ave, Urbana, IL 61801, USA
- Tandy Warnow
  - Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N. Goodwin Ave, Urbana, IL 61801, USA
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, 1205 W. Clark St., Urbana, IL 61801, USA
  - National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
|
44
|
Mangul S, Martin LS, Hill BL, Lam AKM, Distler MG, Zelikovsky A, Eskin E, Flint J. Systematic benchmarking of omics computational tools. Nat Commun 2019; 10:1393. [PMID: 30918265 PMCID: PMC6437167 DOI: 10.1038/s41467-019-09406-4] [Citation(s) in RCA: 86] [Impact Index Per Article: 17.2] [Received: 07/23/2018] [Accepted: 03/06/2019] [Indexed: 01/11/2023]
Abstract
Computational omics methods packaged as software have become essential to modern biological research. The increasing dependence of scientists on these powerful software tools creates a need for systematic assessment of these methods, known as benchmarking. Adopting a standardized benchmarking practice could help researchers who use omics data to better leverage recent technological innovations. Our review summarizes benchmarking practices from 25 recent studies and discusses the challenges, advantages, and limitations of benchmarking across various domains of biology. We also propose principles that can make computational biology benchmarking studies more sustainable and reproducible, ultimately increasing the transparency of biomedical data and results. Benchmarking studies are important for comprehensively understanding and evaluating different computational omics methods. Here, the authors review practices from 25 recent studies and propose principles to improve the quality of benchmarking studies.
Affiliation(s)
- Serghei Mangul
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA, 90095, USA
  - Institute for Quantitative and Computational Biosciences, University of California Los Angeles, 611 Charles E Young Drive East, Los Angeles, CA, 90095, USA
- Lana S Martin
  - Institute for Quantitative and Computational Biosciences, University of California Los Angeles, 611 Charles E Young Drive East, Los Angeles, CA, 90095, USA
- Brian L Hill
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA, 90095, USA
- Angela Ka-Mei Lam
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA, 90095, USA
- Margaret G Distler
  - Department of Psychiatry and Biobehavioral Sciences, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Alex Zelikovsky
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30303, USA
  - The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, 119991, Russia
- Eleazar Eskin
  - Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA, 90095, USA
  - Department of Human Genetics, University of California Los Angeles, 695 Charles E. Young, Los Angeles, CA, USA
- Jonathan Flint
  - Department of Psychiatry and Biobehavioral Sciences, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, 90095, USA
|
45
|
Liu L, Wang H. The Recent Applications and Developments of Bioinformatics and Omics Technologies in Traditional Chinese Medicine. Curr Bioinform 2019. [DOI: 10.2174/1574893614666190102125403] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 12/18/2022]
Abstract
Background: Traditional Chinese Medicine (TCM) is widely utilized as complementary health care in China, but its acceptance is still hindered by conventional scientific research methodology, although it has been practiced for nearly 2000 years. Identifying the molecular mechanisms, targets and bioactive components in TCM is a critical step in the modernization of TCM because of the complexity and uniqueness of the TCM system. With recent advances in computational approaches and high throughput technologies, it has become possible to understand the potential TCM mechanisms at the molecular and systematic level, and to evaluate the effectiveness and toxicity of TCM treatments. Bioinformatics, which has emerged as an interdisciplinary approach owing to the explosion of omics data and the development of computer science, is gaining considerable attention as a way to unearth the in-depth molecular mechanisms of TCM. Systems biology, based on the omics techniques, opens up a new perspective which enables us to investigate the holistic modulation effect on the body. Objective: This review aims to sum up the recent efforts of bioinformatics and omics techniques in the research of TCM, including Systems biology, Metabolomics, Proteomics, Genomics and Transcriptomics. Conclusion: Overall, bioinformatics tools combined with omics techniques have been extensively used to provide scientific support for the ancient practice of TCM, and to help it become scientific and international, through the acquisition, storage and analysis of biomedical data.
Affiliation(s)
- Lin Liu
  - Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin 14195, Germany
- Hao Wang
  - Institute of Chemistry and Biochemistry, Freie Universität Berlin, Berlin 14195, Germany
|
46
|
High-Throughput Reconstruction of Ancestral Protein Sequence, Structure, and Molecular Function. Methods Mol Biol 2019; 1851:135-170. [PMID: 30298396 DOI: 10.1007/978-1-4939-8736-8_8] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 12/12/2022]
Abstract
Ancestral protein sequence reconstruction is a powerful technique for explicitly testing hypotheses about the evolution of molecular function, allowing researchers to meticulously dissect how historical changes in protein sequence impacted functional repertoire by altering the protein's 3D structure. These techniques have provided concrete, experimentally validated insights into ancient evolutionary processes and help illuminate the complex relationship between protein sequence, structure, and function. Inferring the protein family phylogenies on which ancestral sequence reconstruction depends and reconstructing the sequences themselves are amenable to high-throughput computational analysis. However, determining the structures of ancestral-reconstructed proteins and characterizing their functions typically rely on time-consuming and expensive laboratory analyses, limiting most current studies to examining a relatively small number of specific hypotheses. For this reason, we have little detailed, unbiased information about how molecular function evolves across large protein family phylogenies. Here we describe a generalized protocol that integrates ancestral sequence reconstruction with structural homology modeling and structure-based molecular affinity prediction to characterize historical changes in protein function across families with thousands of individual sequences. We highlight key steps in the analysis protocol requiring particularly careful attention to avoid introducing potential errors, as well as steps for which computationally efficient subroutines can be substituted for more intensive approaches, allowing researchers to scale the analysis up or down, depending on available resources and requirements for reproducibility and scientific rigor. In our view, this approach provides a compelling complement to more laboratory-intensive procedures, generating important contextual information that can help guide detailed experiments.
|
47
|
Ashkenazy H, Sela I, Levy Karin E, Landan G, Pupko T. Multiple Sequence Alignment Averaging Improves Phylogeny Reconstruction. Syst Biol 2019; 68:117-130. [PMID: 29771363 PMCID: PMC6657586 DOI: 10.1093/sysbio/syy036] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 05/12/2017] [Revised: 05/07/2018] [Accepted: 05/09/2018] [Indexed: 01/11/2023]
Abstract
The classic methodology of inferring a phylogenetic tree from sequence data is composed of two steps. First, a multiple sequence alignment (MSA) is computed. Then, a tree is reconstructed assuming the MSA is correct. Yet, inferred MSAs were shown to be inaccurate and alignment errors reduce tree inference accuracy. It was previously proposed that filtering unreliable alignment regions can increase the accuracy of tree inference. However, it was also demonstrated that the benefit of this filtering is often obscured by the resulting loss of phylogenetic signal. In this work we explore an approach, in which instead of relying on a single MSA, we generate a large set of alternative MSAs and concatenate them into a single SuperMSA. By doing so, we account for phylogenetic signals contained in columns that are not present in the single MSA computed by alignment algorithms. Using simulations, we demonstrate that this approach results, on average, in more accurate trees compared to 1) using an unfiltered MSA and 2) using a single MSA with weights assigned to columns according to their reliability. Next, we explore in which regions of the MSA space our approach is expected to be beneficial. Finally, we provide a simple criterion for deciding whether or not the extra effort of computing a SuperMSA and inferring a tree from it is beneficial. Based on these assessments, we expect our methodology to be useful for many cases in which diverged sequences are analyzed. The option to generate such a SuperMSA is available at http://guidance.tau.ac.il.
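As a rough illustration of the concatenation step described in this abstract, the sketch below joins several alternative alignments of the same sequences column-wise into one SuperMSA. The function name `concatenate_msas`, the dictionary-based alignment representation, and the toy sequences are assumptions made for illustration only; they are not taken from the GUIDANCE server or the authors' code.

```python
from typing import Dict, List

# Hypothetical representation: sequence name -> aligned (gapped) row.
Alignment = Dict[str, str]


def concatenate_msas(msas: List[Alignment]) -> Alignment:
    """Concatenate alternative MSAs of the same sequences column-wise."""
    if not msas:
        raise ValueError("need at least one alignment")
    names = list(msas[0])
    for msa in msas:
        # Every alternative alignment must cover the same set of sequences.
        if set(msa) != set(names):
            raise ValueError("all alignments must contain the same sequences")
        # Rows inside a single alignment must all have the same width.
        if len({len(row) for row in msa.values()}) != 1:
            raise ValueError("rows within an alignment must have equal length")
    # Append the columns of each alternative alignment, in order.
    return {name: "".join(msa[name] for msa in msas) for name in names}


# Toy data (assumed): two alternative alignments of the same three sequences.
msa_a = {"s1": "AC-GT", "s2": "ACTGT", "s3": "A--GT"}
msa_b = {"s1": "ACG-T", "s2": "ACTGT", "s3": "A-G-T"}
super_msa = concatenate_msas([msa_a, msa_b])
```

The resulting `super_msa` has as many columns as the alternative alignments combined, so a tree inferred from it sees columns that any single alignment would have missed. How the alternative alignments are generated is outside the scope of this sketch.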
Affiliation(s)
- Haim Ashkenazy
  - Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Ramat Aviv 69978, Tel Aviv, Israel
- Itamar Sela
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Eli Levy Karin
  - Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Ramat Aviv 69978, Tel Aviv, Israel
  - Department of Molecular Biology & Ecology of Plants, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Giddy Landan
  - Institute of Microbiology, Christian-Albrechts-University of Kiel, 24118 Kiel, Germany
- Tal Pupko
  - Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Ramat Aviv 69978, Tel Aviv, Israel
|
48
|
Abstract
Background: Protein sequence alignment analyses have become a crucial step in many bioinformatics studies over the past decades. Multiple sequence alignment (MSA) and pairwise sequence alignment (PSA) are the two major approaches to sequence alignment. Earlier benchmark studies revealed drawbacks of MSA methods on nucleotide sequence alignments. To test whether similar drawbacks also affect protein sequence alignment analyses, we propose a new benchmark framework for protein clustering based on cluster validity. This framework directly reflects the biological ground truth of the application scenarios that adopt sequence alignments, and evaluates alignment quality by how well the biological goal is achieved rather than by sequence-level comparison alone, which avoids the biases introduced by alignment scores or manual alignment templates. Unlike earlier studies, we calculate the cluster validity score from sequence distances instead of clustering results. This strategy could avoid the influence of different clustering methods and thus make the results more dependable. Results: PSA methods performed better than MSA methods on most of the BAliBASE benchmark datasets. Analyses of the 80 re-sampled benchmark datasets, constructed by randomly choosing 90% of each dataset 10 times, showed similar results. Conclusions: These results confirm that the drawbacks of MSA methods revealed at the nucleotide level also exist in protein sequence alignment analyses and affect the accuracy of results. Electronic supplementary material: The online version of this article (10.1186/s12859-018-2524-4) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Yingying Wang
  - Research Center for Biomedical Information Technology, Shenzhen Institutes of Advanced Technologies, Chinese Academy of Sciences, Shenzhen, China
- Hongyan Wu
  - Research Center for Biomedical Information Technology, Shenzhen Institutes of Advanced Technologies, Chinese Academy of Sciences, Shenzhen, China
- Yunpeng Cai
  - Research Center for Biomedical Information Technology, Shenzhen Institutes of Advanced Technologies, Chinese Academy of Sciences, Shenzhen, China
|
49
|
Zhou PY, Sze-To A, Wong AKC. Discovery and disentanglement of aligned residue associations from aligned pattern clusters to reveal subgroup characteristics. BMC Med Genomics 2018; 11:103. [PMID: 30453949 PMCID: PMC6245498 DOI: 10.1186/s12920-018-0417-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/23/2022]
Abstract
Background: A protein family has similar and diverse functions that are locally conserved. An aligned pattern cluster (APC) can reflect the conserved functionality. Discovering aligned residue associations (ARAs) in APCs can reveal subtle inner working characteristics of conserved regions of protein families. However, ARAs corresponding to different functionalities/subgroups/classes could be entangled because of multiple subtle, entwined factors. Methods: To discover and disentangle patterns from mixed-mode datasets, such as APCs in which the residues are replaced by a list of their fundamental biochemical properties, this paper presents a novel method, Extended Aligned Residual Association Discovery and Disentanglement (E-ARADD). E-ARADD discretizes the numerical dataset to transform the mixed-mode dataset into an event-value dataset, constructs an ARA Frequency Matrix, and then converts it into an adjusted Statistical Residual (SR) Vector Space (SRV) capturing statistical deviation from randomness. By applying Principal Component (PC) Decomposition on the SRV, PCs ranked by their variance are obtained. Finally, the disentangled ARAs are discovered when the projections on a PC are re-projected to a vector space with the same basis vectors as the SRV. Results: Experiments on synthetic, cytochrome c and class A scavenger data have shown that E-ARADD can (a) disentangle the entwined ARAs in APCs (with residues or biochemical properties) and (b) reveal subtle AR clusters relating to classes, subtle subgroups or specific functionalities. Conclusions: E-ARADD can discover and disentangle ARs and ARAs entangled in functionality and location of protein families to reveal functional subgroups and subgroup characteristics of biological conserved regions. Experimental results on synthetic data provide proof-of-concept validation of the successful disentanglement that reveals class-associated ARAs with or without class labels as input. Experiments on cytochrome c data proved the efficacy of E-ARADD in handling both types of residue data. Our novel methodology is able not only to discover and disentangle ARs and ARAs in specific statistical/functional (PCs and RSRVs) spaces, but also to reveal their locations in the protein family functional domains. The success of E-ARADD shows its great potential for proteomic research, drug discovery and precision and personalized genetic medicine.
Affiliation(s)
- Pei-Yuan Zhou
  - Systems Design Engineering, University of Waterloo, Waterloo, ON, Canada
- Antonio Sze-To
  - Systems Design Engineering, University of Waterloo, Waterloo, ON, Canada
- Andrew K C Wong
  - Systems Design Engineering, University of Waterloo, Waterloo, ON, Canada
|
50
|
Lowe EK, Garm AL, Ullrich-Lüter E, Cuomo C, Arnone MI. The crowns have eyes: multiple opsins found in the eyes of the crown-of-thorns starfish Acanthaster planci. BMC Evol Biol 2018; 18:168. [PMID: 30419810 PMCID: PMC6233551 DOI: 10.1186/s12862-018-1276-0] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 12/27/2017] [Accepted: 10/18/2018] [Indexed: 01/01/2023]
Abstract
Background: Opsins are G protein-coupled receptors used for both visual and non-visual photoreception, and these proteins evolutionarily date back to the base of the bilaterians. In the current sequencing age, phylogenomic analysis has proven to be a powerful tool, facilitating the increase in knowledge about diversity within the opsin subclasses and, so far, at least nine types of opsins have been identified. Within echinoderms, opsins have been studied in Echinoidea and Ophiuroidea, which do not possess proper image-forming eyes, but rather widely dispersed dermal photoreceptors. However, most species of Asteroidea, the starfish, possess true eyes, and studying them will shed light on the diversity of opsin usage within echinoderms and help resolve the evolutionary history of opsins. Results: Using high-throughput RNA sequencing, we have sequenced and analyzed the transcriptomes of different Acanthaster planci tissue samples: eyes, radial nerve, tube feet and a mixture of tissues from other organs. At least ten opsins were identified, and eight of them were found to be significantly differentially expressed in both eyes and radial nerve, with r-opsin being the most highly expressed in the eye. Conclusion: This study provides important new insight into the involvement of opsins in visual and non-visual photoreception. Of relevance, we found the first indication of an r-opsin photopigment expressed in a well-developed visual eye in a deuterostome animal. Additionally, we provided tissue-specific A. planci transcriptomes that will aid in future Evo Devo studies.
Affiliation(s)
- Elijah K Lowe
  - Biology and Evolution of Marine Organisms, Stazione Zoologica Anton Dohrn, Villa comunale, 80122, Naples, Italy
- Anders L Garm
  - Marine Biological Section, University of Copenhagen, Copenhagen, Denmark
- Claudia Cuomo
  - Biology and Evolution of Marine Organisms, Stazione Zoologica Anton Dohrn, Villa comunale, 80122, Naples, Italy
- Maria I Arnone
  - Biology and Evolution of Marine Organisms, Stazione Zoologica Anton Dohrn, Villa comunale, 80122, Naples, Italy
|