1
Jia K, Kilinc M, Jernigan RL. New alignment method for remote protein sequences by the direct use of pairwise sequence correlations and substitutions. Frontiers in Bioinformatics 2023; 3:1227193. [PMID: 37900964 PMCID: PMC10602800 DOI: 10.3389/fbinf.2023.1227193] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/22/2023] [Accepted: 08/14/2023] [Indexed: 10/31/2023]
Abstract
Understanding protein sequences and how they relate to protein function is extremely important. Sequence alignment is one of the most basic operations in bioinformatics, and usually the first thing learned from an alignment is which positions are the most conserved; these are often critical parts of the structure, such as enzyme active site residues. In addition, the contact pairs in a protein usually correspond closely to the correlations between residue positions in the multiple sequence alignment, and these positions usually change in a systematic and coordinated way: if one position changes, then the other member of the pair also changes to compensate. In the present work, these correlated pairs are taken as anchor points for a new type of sequence alignment. The main advantage of the method is that it combines the remote homolog detection of our method PROST with pairwise sequence substitutions in the rigorous method of Kleinjung et al. We show a few examples of the resulting sequence alignments and how they can lead to improved alignments for function, even for a disordered protein.
Affiliation(s)
- Kejue Jia
  - Roy J. Carver Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, United States
- Mesih Kilinc
  - Roy J. Carver Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, United States
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA, United States
- Robert L. Jernigan
  - Roy J. Carver Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, United States
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA, United States
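The abstract of entry 1 describes residue positions that change in a coordinated way across a multiple sequence alignment. The sketch below illustrates that idea only; it is not the authors' method (PROST or the approach of Kleinjung et al.). It assumes mutual information as a stand-in correlation measure, and the toy alignment and the function names `column` and `mutual_information` are hypothetical.

```python
import math
from collections import Counter


def column(msa, i):
    """Return the residues found at position i of every aligned sequence."""
    return [seq[i] for seq in msa]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in count_ab.items():
        p_joint = count / n
        p_independent = (count_a[a] / n) * (count_b[b] / n)
        mi += p_joint * math.log2(p_joint / p_independent)
    return mi


# Hypothetical toy alignment: positions 1 and 3 co-vary (K with E, R with D),
# while position 0 is fully conserved.
toy_msa = ["AKLE", "AKIE", "ARLD", "ARVD"]

covarying = mutual_information(column(toy_msa, 1), column(toy_msa, 3))
conserved = mutual_information(column(toy_msa, 0), column(toy_msa, 1))
print(covarying, conserved)
```

In this toy case the co-varying pair scores higher than the pair that includes a fully conserved column, which carries no covariation signal. Real analyses typically need corrections for phylogenetic bias and small sample size that this sketch omits.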
2
Goodheart JA, Collins AG, Cummings MP, Egger B, Rawlinson KA. A phylogenomic approach to resolving interrelationships of polyclad flatworms, with implications for life-history evolution. Royal Society Open Science 2023; 10:220939. [PMID: 36998763 PMCID: PMC10049750 DOI: 10.1098/rsos.220939] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/22/2022] [Accepted: 03/07/2023] [Indexed: 06/19/2023]
Abstract
Platyhelminthes (flatworms) are a diverse invertebrate phylum useful for exploring life-history evolution. Within Platyhelminthes, only two clades develop through a larval stage: free-living polyclads and parasitic neodermatans. Neodermatan larvae are considered evolutionarily derived, whereas polyclad larvae are hypothesized to be ancestral due to ciliary band similarities among polyclad and other spiralian larvae. However, larval evolution has been challenging to investigate within polyclads due to low support for deeper phylogenetic relationships. To investigate polyclad life-history evolution, we generated transcriptomic data for 21 species of polyclads to build a well-supported phylogeny for the group. The resulting tree provides strong support for deeper nodes, and we recover a new monophyletic clade of early branching cotyleans. We then used ancestral state reconstructions to investigate ancestral modes of development within Polycladida and more broadly within flatworms. In polyclads, we were unable to reconstruct the ancestral state of deeper nodes with significant support because early branching clades show diverse modes of development. This suggests a complex history of larval evolution in polyclads that likely includes multiple losses and/or multiple gains. However, our ancestral state reconstruction across a previously published platyhelminth phylogeny supports a direct developing prorhynchid/polyclad ancestor, which suggests that a larval stage in the life cycle evolved along the polyclad stem lineage or within polyclads.
Affiliation(s)
- Jessica A. Goodheart
  - Division of Invertebrate Zoology, American Museum of Natural History, New York, NY 10024, USA
  - Scripps Institution of Oceanography, University of California, San Diego, La Jolla, CA 92037, USA
- Allen G. Collins
  - NMFS, National Systematics Laboratory, National Museum of Natural History, Smithsonian Institution, MRC-153, PO Box 37012, Washington, DC 20013, USA
- Michael P. Cummings
  - Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
- Bernhard Egger
  - Universität Innsbruck, Department of Zoology, Technikerstr. 25, 6020 Innsbruck, Austria
- Kate A. Rawlinson
  - Wellcome Sanger Institute, Hinxton, Cambridgeshire CB10 1SA, UK
  - Josephine Bay Paul Center, Marine Biological Laboratory, Woods Hole, MA, 02543
3
Polyanovsky V, Lifanov A, Esipova N, Tumanyan V. The ranging of amino acids substitution matrices of various types in accordance with the alignment accuracy criterion. BMC Bioinformatics 2020; 21:294. [PMID: 32921315 PMCID: PMC7489204 DOI: 10.1186/s12859-020-03616-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/11/2020] [Accepted: 06/18/2020] [Indexed: 11/15/2022]
Abstract
Background: The alignment of character sequences is important in bioinformatics. The quality of this procedure is determined by the substitution matrix and the parameters of the insertion-deletion penalty function. These matrices are derived from sequence alignment and thus reflect the evolutionary process. Currently, in addition to evolutionary matrices, a large number of matrices of other backgrounds have been obtained. To make an optimal choice of the substitution matrix and the penalty parameters, we conducted a numerical experiment using a representative sample of existing matrices of various types and origins.
Results: We tested the classical evolutionary matrix series (PAM, Blosum, VTML, Pfasum), structural-alignment-based matrices, a contact energy matrix, and a matrix based on the properties of the genetic code. This study presents results for two test set types: first, we simulated sequences that reflect divergent evolution; second, we performed tests on Balibase sequences. In both cases, we obtained the dependence of the alignment quality (Accuracy, Confidence) on the evolutionary distance between sequences and on the evolutionary distance to which the substitution matrices correspond. Optimization of a combination of matrices and the penalty parameters was carried out for local and global alignment over the values of the penalty function parameters. Consequently, we found that the best alignment quality is achieved with matrices corresponding to the largest evolutionary distance. These matrices prove to be universal, i.e. suitable for aligning sequences separated by both large and small evolutionary distances. We analysed the correspondence of the correlation coefficients of matrices to the alignment quality. It was found that matrices showing high-quality alignment have an above-average correlation value, but the converse is not true.
Conclusions: This study showed that the best alignment quality is achieved with evolutionary matrices designed for long distances: Gonnet, VTML250, PAM250, MIQS, and Pfasum050. The same property is inherent not only in matrices of evolutionary origin, but also in matrices of other backgrounds that correspond to a large evolutionary distance. Therefore, matrices based on structural data show alignment quality close to that of evolutionary matrices. This agrees with the idea that spatial structure is more conserved than protein sequence.
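Entry 3 notes that substitution matrices are derived from sequence alignments. As a hedged illustration of that general idea, the sketch below computes log-odds scores from a hypothetical set of aligned residue pairs. It is not the procedure used for any specific matrix named in the abstract; the function name `log_odds_scores`, the two-letter alphabet, and the pair counts are all assumptions made for the example.

```python
import math
from collections import Counter


def log_odds_scores(aligned_pairs):
    """Log-odds score (in bits) for each unordered residue pair.

    score(a, b) = log2(observed pair frequency / frequency expected
    if residues paired independently at their background frequencies).
    """
    pair_counts = Counter(tuple(sorted(pair)) for pair in aligned_pairs)
    total_pairs = sum(pair_counts.values())

    # Background frequencies: every aligned pair contributes two residues.
    residue_counts = Counter()
    for (a, b), count in pair_counts.items():
        residue_counts[a] += count
        residue_counts[b] += count
    background = {r: c / (2 * total_pairs) for r, c in residue_counts.items()}

    scores = {}
    for (a, b), count in pair_counts.items():
        observed = count / total_pairs
        # Unordered mismatches can arise in two orders, hence the factor of 2.
        expected = background[a] * background[b] * (1 if a == b else 2)
        scores[(a, b)] = math.log2(observed / expected)
    return scores


# Hypothetical aligned residue pairs (not taken from any real alignment).
toy_pairs = [("L", "L")] * 6 + [("L", "I")] * 2 + [("I", "I")] * 2

for pair, score in sorted(log_odds_scores(toy_pairs).items()):
    print(pair, round(score, 2))
```

Pairs seen more often than chance get positive scores and pairs seen less often get negative ones. Published matrix series additionally cluster or weight sequences and scale and round the scores, which this sketch leaves out.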
4
Goodheart JA, Bazinet AL, Collins AG, Cummings MP. Relationships within Cladobranchia (Gastropoda: Nudibranchia) based on RNA-Seq data: an initial investigation. Royal Society Open Science 2015; 2:150196. [PMID: 26473045 PMCID: PMC4593679 DOI: 10.1098/rsos.150196] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.3] [Received: 05/07/2015] [Accepted: 08/26/2015] [Indexed: 05/28/2023]
Abstract
Cladobranchia (Gastropoda: Nudibranchia) is a diverse (approx. 1000 species) but understudied group of sea slug molluscs. In order to fully comprehend the diversity of nudibranchs and the evolution of character traits within Cladobranchia, a solid understanding of evolutionary relationships is necessary. To date, only two direct attempts have been made to understand the evolutionary relationships within Cladobranchia, neither of which resulted in well-supported phylogenetic hypotheses. In addition to these studies, several others have addressed some of the relationships within this clade while investigating the evolutionary history of more inclusive groups (Nudibranchia and Euthyneura). However, all of the resulting phylogenetic hypotheses contain conflicting topologies within Cladobranchia. In this study, we address some of these long-standing issues regarding the evolutionary history of Cladobranchia using RNA-Seq data (transcriptomes). We sequenced 16 transcriptomes and combined these with four transcriptomes from the NCBI Sequence Read Archive. Transcript assembly using Trinity and orthology determination using HaMStR yielded 839 orthologous groups for analysis. These data provide a well-supported and almost fully resolved phylogenetic hypothesis for Cladobranchia. Our results support the monophyly of Cladobranchia and the sub-clade Aeolidida, but reject the monophyly of Dendronotida.
Affiliation(s)
- Jessica A. Goodheart
  - Laboratory of Molecular Evolution, Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
  - NMFS, National Systematics Laboratory, National Museum of Natural History, Smithsonian Institution, MRC-153, PO Box 37012, Washington, DC 20013, USA
- Adam L. Bazinet
  - Laboratory of Molecular Evolution, Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
- Allen G. Collins
  - NMFS, National Systematics Laboratory, National Museum of Natural History, Smithsonian Institution, MRC-153, PO Box 37012, Washington, DC 20013, USA
- Michael P. Cummings
  - Laboratory of Molecular Evolution, Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
5
Rios S, Fernandez MF, Caltabiano G, Campillo M, Pardo L, Gonzalez A. GPCRtm: An amino acid substitution matrix for the transmembrane region of class A G Protein-Coupled Receptors. BMC Bioinformatics 2015; 16:206. [PMID: 26134144 PMCID: PMC4489126 DOI: 10.1186/s12859-015-0639-4] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 03/20/2015] [Accepted: 06/06/2015] [Indexed: 01/08/2023]
Abstract
Background: Protein sequence alignments and database search methods use standard scoring matrices calculated from amino acid substitution frequencies in general sets of proteins. These general-purpose matrices are not optimal for accurately aligning sequences with marked compositional biases, such as the hydrophobic transmembrane regions found in membrane proteins. In this work, an amino acid substitution matrix (GPCRtm) is calculated for the membrane-spanning segments of the G protein-coupled receptor (GPCR) rhodopsin family, one of the largest transmembrane protein families in humans, with great importance in health and disease.
Results: The GPCRtm matrix reveals the amino acid compositional bias distinctive of the GPCR rhodopsin family and differs from other standard substitution matrices. These membrane receptors, as expected, are characterized by a high content of hydrophobic residues relative to globular proteins. On the other hand, the presence of polar and charged residues is higher than in average membrane proteins, and these residues display high frequencies of replacement among themselves.
Conclusions: Analysis of amino acid frequencies and values obtained from the GPCRtm matrix reveals patterns of residue replacement different from those of other standard substitution matrices. GPCRs prioritize the reactivity properties of the amino acids over their bulkiness in the transmembrane regions. A distinctive feature is that charged and polar residues seem to evolve at different rates than other amino acids. This observation is related to the role of the transmembrane bundle in the binding of ligands, which in many cases involves electrostatic and hydrogen bond interactions. This new matrix can be useful in database searches and for the construction of more accurate sequence alignments of GPCRs.
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0639-4) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Santiago Rios
  - Laboratori de Medicina Computacional, Unitat de Bioestadística, Facultat de Medicina, Universitat Autònoma de Barcelona, 08193, Bellaterra, Barcelona, Spain
- Marta F Fernandez
  - Laboratori de Medicina Computacional, Unitat de Bioestadística, Facultat de Medicina, Universitat Autònoma de Barcelona, 08193, Bellaterra, Barcelona, Spain
- Gianluigi Caltabiano
  - Laboratori de Medicina Computacional, Unitat de Bioestadística, Facultat de Medicina, Universitat Autònoma de Barcelona, 08193, Bellaterra, Barcelona, Spain
- Mercedes Campillo
  - Laboratori de Medicina Computacional, Unitat de Bioestadística, Facultat de Medicina, Universitat Autònoma de Barcelona, 08193, Bellaterra, Barcelona, Spain
- Leonardo Pardo
  - Laboratori de Medicina Computacional, Unitat de Bioestadística, Facultat de Medicina, Universitat Autònoma de Barcelona, 08193, Bellaterra, Barcelona, Spain
- Angel Gonzalez
  - Laboratori de Medicina Computacional, Unitat de Bioestadística, Facultat de Medicina, Universitat Autònoma de Barcelona, 08193, Bellaterra, Barcelona, Spain
6
Leclercq S, Dittmer J, Bouchon D, Cordaux R. Phylogenomics of "Candidatus Hepatoplasma crinochetorum," a lineage of mollicutes associated with noninsect arthropods. Genome Biol Evol 2015; 6:407-15. [PMID: 24482531 PMCID: PMC3942034 DOI: 10.1093/gbe/evu020] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Indexed: 12/14/2022]
Abstract
Bacterial gut communities of arthropods are highly diverse and tightly related to host feeding habits. However, our understanding of the origin and role of the symbionts is often hindered by the lack of genetic information. “Candidatus Hepatoplasma crinochetorum” is a Mollicutes symbiont found in the midgut glands of terrestrial isopods. The only available nucleotide sequence for this symbiont is a partial 16S rRNA gene sequence. Here, we present the 657,101 bp assembled genome of Candidatus Hepatoplasma crinochetorum isolated from the terrestrial isopod Armadillidium vulgare. While previous 16S rRNA gene-based analyses have provided inconclusive results regarding the phylogenetic position of Candidatus Hepatoplasma crinochetorum within Mollicutes, we performed a phylogenomic analysis of 127 Mollicutes orthologous genes which confidently branches the species as a sister group to the Hominis group of Mycoplasma. Several genome properties of Candidatus Hepatoplasma crinochetorum are also highlighted compared with other Mollicutes genomes, including adjacent tryptophan tRNA genes, which further our understanding of the evolutionary dynamics of these genes in Mollicutes, and the presence of a probably inactivated CRISPR/Cas system, which constitutes a testimony of past interactions between Candidatus Hepatoplasma crinochetorum and mobile genetic elements, despite their current lack in this streamlined genome. Overall, the availability of the complete genome sequence of Candidatus Hepatoplasma crinochetorum paves the way for further investigation of its ecology and evolution.
Affiliation(s)
- Sébastien Leclercq
  - Université de Poitiers, UMR CNRS 7267 Ecologie et Biologie des Interactions, Equipe Ecologie Evolution Symbiose, Poitiers, France
7
Yamada K, Tomii K. Revisiting amino acid substitution matrices for identifying distantly related proteins. Bioinformatics 2013; 30:317-25. [PMID: 24281694 PMCID: PMC3904525 DOI: 10.1093/bioinformatics/btt694] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Indexed: 12/04/2022]
Abstract
Motivation: Although many amino acid substitution matrices have been developed, it has not been well understood which is the best for similarity searches, especially for remote homology detection. Therefore, we collected information related to existing matrices, condensed it and derived a novel matrix that can detect more remote homology than ever.
Results: Using principal component analysis with existing matrices and benchmarks, we developed a novel matrix, which we designate as MIQS. The detection performance of MIQS is validated and compared with that of existing general purpose matrices using SSEARCH with optimized gap penalties for each matrix. Results show that MIQS is able to detect more remote homology than the existing matrices on an independent dataset. In addition, the performance of our developed matrix was superior to that of CS-BLAST, which was a novel similarity search method with no amino acid matrix. We also evaluated the alignment quality of matrices and methods, which revealed that MIQS shows higher alignment sensitivity than that with the existing matrix series and CS-BLAST. Fundamentally, these results are expected to constitute good proof of the availability and/or importance of amino acid matrices in sequence analysis. Moreover, with our developed matrix, sophisticated similarity search methods such as sequence–profile and profile–profile comparison methods can be improved further.
Availability and implementation: Newly developed matrices and datasets used for this study are available at http://csas.cbrc.jp/Ssearch/.
Contact: k-tomii@aist.go.jp
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Kazunori Yamada
  - Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo 135-0064, Japan
8
Characterization of the in vitro core surface proteome of Mycoplasma mycoides subsp. mycoides, the causative agent of contagious bovine pleuropneumonia. Vet Microbiol 2013; 168:116-23. [PMID: 24332827 DOI: 10.1016/j.vetmic.2013.10.025] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Received: 08/25/2013] [Revised: 10/13/2013] [Accepted: 10/31/2013] [Indexed: 11/20/2022]
Abstract
Contagious bovine pleuropneumonia (CBPP), caused by Mycoplasma mycoides subsp. mycoides (Mmm), is a severe cattle disease present in many countries in sub-Saharan Africa. The development of improved diagnostic tests and vaccines for CBPP control remains a research priority. Polyacrylamide gel electrophoresis and mass spectrometry were used to characterize the Triton X-114 soluble proteome of nine Mmm strains isolated from Europe or Africa. Of a total of 250 proteins detected, 67 were present in all strains investigated. Of these, 44 were predicted to be lipoproteins or cytoplasmic membrane-associated proteins and are thus likely to be members of the core in vitro surface membrane-associated proteome of Mmm. Moreover, the presence of all identified proteins in other ruminant Mycoplasma pathogens was investigated. Two proteins of the core proteome were identified only in other cattle pathogens of the genus Mycoplasma, pointing towards a role in host-pathogen interactions. The data generated will facilitate the identification and prioritization of candidate Mycoplasma antigens for improved control measures, as it is likely that surface-exposed membrane proteins will include those that are involved in host-pathogen interactions.