1
Bromberg Y, Prabakaran R, Kabir A, Shehu A. Variant Effect Prediction in the Age of Machine Learning. Cold Spring Harb Perspect Biol 2024; 16:a041467. [PMID: 38621825 PMCID: PMC11216171 DOI: 10.1101/cshperspect.a041467] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/17/2024]
Abstract
Over the years, many computational methods have been created for the analysis of the impact of single amino acid substitutions resulting from single-nucleotide variants in genome coding regions. Historically, all methods have been supervised and thus limited by the inadequate sizes of experimentally curated data sets and by the lack of a standardized definition of variant effect. The emergence of unsupervised, deep learning (DL)-based methods raised an important question: Can machines learn the language of life from the unannotated protein sequence data well enough to identify significant errors in the protein "sentences"? Our analysis suggests that some unsupervised methods perform as well as or better than existing supervised methods. Unsupervised methods are also faster and can thus be useful in large-scale variant evaluations. For all other methods, however, performance varies both by evaluation metric and by the type of variant effect being predicted. We also note that the evaluation of method performance is still lacking on less-studied, nonhuman proteins, where unsupervised methods hold the most promise.
Affiliation(s)
- Yana Bromberg
  - Department of Biology, Emory University, Atlanta 30322, Georgia, USA
  - Department of Computer Science, Emory University, Atlanta 30322, Georgia, USA
- R Prabakaran
  - Department of Biology, Emory University, Atlanta 30322, Georgia, USA
- Anowarul Kabir
  - Department of Computer Science, George Mason University, Fairfax 22030, Virginia, USA
- Amarda Shehu
  - Department of Computer Science, George Mason University, Fairfax 22030, Virginia, USA
2
Insana G, Martin MJ, Pearson WR. Improved selection of canonical proteins for reference proteomes. NAR Genom Bioinform 2024; 6:lqae066. [PMID: 38863529 PMCID: PMC11165316 DOI: 10.1093/nargab/lqae066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2024] [Revised: 05/04/2024] [Accepted: 05/23/2024] [Indexed: 06/13/2024] Open
Abstract
The 'canonical' protein sets distributed by UniProt are widely used for similarity searching, and functional and structural annotation. For many investigators, canonical sequences are the only version of a protein examined. However, higher eukaryotes often encode multiple isoforms of a protein from a single gene. For unreviewed (UniProtKB/TrEMBL) protein sequences, the longest sequence in a Gene-Centric group is chosen as canonical. This choice can create inconsistencies, selecting >95% identical orthologs with dramatically different lengths, which is biologically unlikely. We describe the ortho2tree pipeline, which examines Reference Proteome canonical and isoform sequences from sets of orthologous proteins, builds multiple alignments, constructs gap-distance trees, and identifies low-cost clades of isoforms with similar lengths. After examining 140 000 proteins from eight mammals in UniProtKB release 2022_05, ortho2tree proposed 7804 canonical changes for release 2023_01, while confirming 53 434 canonicals. Gap distributions for isoforms selected by ortho2tree are similar to those in bacterial and yeast alignments, organisms unaffected by isoform selection, suggesting ortho2tree canonicals more accurately reflect genuine biological variation. 82% of ortho2tree proposed-changes agreed with MANE; for confirmed canonicals, 92% agreed with MANE. Ortho2tree can improve canonical assignment among orthologous sequences that are >60% identical, a group that includes vertebrates and higher plants.
Affiliation(s)
- Giuseppe Insana
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Maria J Martin
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- William R Pearson
  - Department of Biochemistry and Molecular Genetics, University of Virginia School of Medicine, Charlottesville, VA 22908, USA
3
Vitting-Seerup K. Most protein domains exist as variants with distinct functions across cells, tissues and diseases. NAR Genom Bioinform 2023; 5:lqad084. [PMID: 37745975 PMCID: PMC10516350 DOI: 10.1093/nargab/lqad084] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2023] [Revised: 08/09/2023] [Accepted: 09/05/2023] [Indexed: 09/26/2023] Open
Abstract
Protein domains are the active subunits that provide proteins with specific functions through precise three-dimensional structures. Such domains facilitate most protein functions, including molecular interactions and signal transduction. Currently, these protein domains are described and analyzed as invariable molecular building blocks with fixed functions. Here, I show that most human protein domains exist as multiple distinct variants termed 'domain isotypes'. Domain isotypes are used in a cell, tissue and disease-specific manner and have surprisingly different 3D structures. Accordingly, domain isotypes, compared to each other, modulate or abolish the functionality of protein domains. These results challenge the current view of protein domains as invariable building blocks and have significant implications for both wet- and dry-lab workflows. The extensive use of protein domain isotypes within protein isoforms adds to the literature indicating we need to transition to an isoform-centric research paradigm.
Affiliation(s)
- Kristoffer Vitting-Seerup
  - The Bioinformatics Section, Department of Health Technology, The Technical University of Denmark (DTU), Denmark
4
Bacala R, Hatcher DW, Perreault H, Fu BX. Challenges and opportunities for proteomics and the improvement of bread wheat quality. JOURNAL OF PLANT PHYSIOLOGY 2022; 275:153743. [PMID: 35749977 DOI: 10.1016/j.jplph.2022.153743] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2022] [Revised: 05/13/2022] [Accepted: 05/30/2022] [Indexed: 06/15/2023]
Abstract
Wheat remains a critical global food source, pressured by climate change and the need to maximize yield, improve processing and nutritional quality and ensure safety. An enormous amount of research has been conducted to understand gluten protein composition and structure in relation to end-use quality, yet progress has become stagnant. This is mainly due to the need and inability to biochemically characterize the intact functional glutenin polymer in order to correlate to quality, necessitating reduction to monomeric subunits and a loss of contextual information. While some individual gluten proteins might have a positive or negative influence on gluten quality, it is the sum total of these proteins, their relative and absolute expression, their sub-cellular trafficking, the amount and size of glutenin polymers, and ratios between gluten protein classes that define viscoelasticity of gluten. The sub-cellular trafficking of gluten proteins during seed maturation is still not completely clear and there is evidence of dual pathways and therefore different destinations for proteins, either constitutively or temporally. The trafficking of proteins is also unclear in endosperm cells as they undergo programmed cell death; Golgi disappear around 12 DPA but protein filling continues at least to 25 DPA. Modulation of the timing of cellular events will invariably affect protein deposition and therefore gluten strength and function. Existing and emerging proteomics technologies such as proteoform profiling and top-down proteomics offer new tools to study gluten protein composition as a whole system and identify compositional patterns that can modify gluten structure with improved functionality.
Affiliation(s)
- Ray Bacala
  - Canadian Grain Commission, Grain Research Laboratory, 1404-303 Main Street, Winnipeg, Manitoba, R3C 3G8, Canada
  - University of Manitoba, Department of Chemistry, 144 Dysart Road, Winnipeg, Manitoba, R3T 2N2, Canada
- Dave W Hatcher
  - Canadian Grain Commission, Grain Research Laboratory, 1404-303 Main Street, Winnipeg, Manitoba, R3C 3G8, Canada
- Héléne Perreault
  - University of Manitoba, Department of Chemistry, 144 Dysart Road, Winnipeg, Manitoba, R3T 2N2, Canada
- Bin Xiao Fu
  - Canadian Grain Commission, Grain Research Laboratory, 1404-303 Main Street, Winnipeg, Manitoba, R3C 3G8, Canada
  - Department of Food and Human Nutritional Sciences, 209 - 35 Chancellor's Circle, University of Manitoba, Winnipeg, Manitoba, R3T 2N2, Canada
5
Kuo TCY, Hatakeyama M, Tameshige T, Shimizu KK, Sese J. Homeolog expression quantification methods for allopolyploids. Brief Bioinform 2021; 21:395-407. [PMID: 30590436 PMCID: PMC7299288 DOI: 10.1093/bib/bby121] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 09/24/2018] [Revised: 11/06/2018] [Accepted: 11/21/2018] [Indexed: 12/19/2022] Open
Abstract
Genome duplication with hybridization, or allopolyploidization, occurs in animals, fungi and plants, and is especially common in crop plants. There is an increasing interest in the study of allopolyploids because of advances in polyploid genome assembly; however, the high level of sequence similarity in duplicated gene copies (homeologs) poses many challenges. Here we compared standard RNA-seq expression quantification approaches currently used for diploid species against subgenome-classification approaches, which map reads to each subgenome separately. We examined mapping error using our previous and new RNA-seq data in which a subgenome is experimentally added (synthetic allotetraploid Arabidopsis kamchatica) or reduced (allohexaploid wheat Triticum aestivum versus extracted allotetraploid) as ground truth. The error rates in the two species were very similar. The standard approaches showed higher error rates (>10% using pseudo-alignment with Kallisto) while subgenome-classification approaches showed much lower error rates (<1% using EAGLE-RC, <2% using HomeoRoq). Although downstream analysis may partly mitigate mapping errors, the difference in methods was substantial in hexaploid wheat, where Kallisto appeared to have systematic differences relative to other methods. Only approximately half of the differentially expressed homeologs detected using Kallisto overlapped with those detected by any other method in wheat. In general, disagreement in low-expression genes was responsible for most of the discordance between methods, which is consistent with known biases in Kallisto. We also observed that uncertainties in genome sequences and annotation can affect each method differently. Overall, subgenome-classification approaches tend to perform better than standard approaches, with EAGLE-RC having the highest precision.
Affiliation(s)
- Tony C Y Kuo
  - Artificial Intelligence Research Center, AIST, 2-3-26 Aomi, Koto-ku, Tokyo 135-0064, Japan
  - AIST-Tokyo Tech RWBC-OIL, 2-12-1 Okayama, Meguro-ku, Tokyo 152-8550, Japan
- Masaomi Hatakeyama
  - Department of Evolutionary Biology and Environmental Studies, University of Zurich, Winterthurerstrasse 190, Zurich CH-8057, Switzerland
  - Functional Genomics Center Zurich, Winterthurerstrasse 190, Zurich CH-8057, Switzerland
  - Swiss Institute of Bioinformatics, Quartier Sorge - Batiment Genopode, Lausanne 1015, Switzerland
- Toshiaki Tameshige
  - Kihara Institute for Biological Research, Yokohama City University, 641-12, Maioka, Totsuka-ku, Yokohama 244-0813, Japan
- Kentaro K Shimizu
  - Department of Evolutionary Biology and Environmental Studies, University of Zurich, Winterthurerstrasse 190, Zurich CH-8057, Switzerland
  - Kihara Institute for Biological Research, Yokohama City University, 641-12, Maioka, Totsuka-ku, Yokohama 244-0813, Japan
- Jun Sese
  - Artificial Intelligence Research Center, AIST, 2-3-26 Aomi, Koto-ku, Tokyo 135-0064, Japan
  - AIST-Tokyo Tech RWBC-OIL, 2-12-1 Okayama, Meguro-ku, Tokyo 152-8550, Japan
6
Tran HKR, Grebenc DW, Klein TA, Whitney JC. Bacterial type VII secretion: An important player in host-microbe and microbe-microbe interactions. Mol Microbiol 2021; 115:478-489. [PMID: 33410158 DOI: 10.1111/mmi.14680] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 11/01/2020] [Revised: 01/03/2021] [Accepted: 01/04/2021] [Indexed: 12/19/2022]
Abstract
Type VII secretion systems (T7SSs) are poorly understood protein export apparatuses found in mycobacteria and many species of Gram-positive bacteria. To date, this pathway has predominantly been studied in Mycobacterium tuberculosis, where it has been shown to play an essential role in virulence; however, much less studied is an evolutionarily divergent subfamily of T7SSs referred to as the T7SSb. The T7SSb is found in the major Gram-positive phylum Firmicutes where it was recently shown to target both eukaryotic and prokaryotic cells, suggesting a dual role for this pathway in host-microbe and microbe-microbe interactions. In this review, we compare the current understanding of the molecular architectures and substrate repertoires of the well-studied mycobacterial T7SSa systems to that of recently characterized T7SSb pathways and highlight how these differences may explain the observed biological functions of this understudied protein export machine.
Affiliation(s)
- Hiu-Ki R Tran
  - Michael DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, ON, Canada
  - Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, ON, Canada
- Dirk W Grebenc
  - Michael DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, ON, Canada
  - Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, ON, Canada
- Timothy A Klein
  - Michael DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, ON, Canada
  - Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, ON, Canada
- John C Whitney
  - Michael DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, ON, Canada
  - Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, ON, Canada
  - David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, ON, Canada
7
Kemena C, Dohmen E, Bornberg-Bauer E. DOGMA: a web server for proteome and transcriptome quality assessment. Nucleic Acids Res 2020; 47:W507-W510. [PMID: 31076763 PMCID: PMC6602495 DOI: 10.1093/nar/gkz366] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 03/01/2019] [Revised: 04/18/2019] [Accepted: 04/29/2019] [Indexed: 11/16/2022] Open
Abstract
Even in the era of next-generation sequencing, in which bioinformatics tools abound, annotating transcriptomes and proteomes remains a challenge. This can have major implications for the reliability of studies based on these datasets. Therefore, quality assessment represents a crucial step prior to downstream analyses on novel transcriptomes and proteomes. DOGMA allows such a quality assessment to be carried out. The data of interest are evaluated based on a comparison with a core set of conserved protein domains and domain arrangements. Depending on the studied species, DOGMA offers precomputed core sets for different phylogenetic clades. We have now developed a web server for the DOGMA software, offering a user-friendly, simple-to-use interface. Additionally, the server provides a graphical representation of the analysis results and their placement in comparison to publicly available data. The server is freely available at https://domainworld-services.uni-muenster.de/dogma/. For large-scale analyses, the software can be downloaded free of charge from https://domainworld.uni-muenster.de.
Affiliation(s)
- Carsten Kemena
  - Institute for Evolution and Biodiversity, Westfälische Wilhelms-Universität Münster, Hüfferstrasse 1, NRW, 48149 Münster, Germany
- Elias Dohmen
  - Institute for Evolution and Biodiversity, Westfälische Wilhelms-Universität Münster, Hüfferstrasse 1, NRW, 48149 Münster, Germany
- Erich Bornberg-Bauer
  - Institute for Evolution and Biodiversity, Westfälische Wilhelms-Universität Münster, Hüfferstrasse 1, NRW, 48149 Münster, Germany
8
Genomic analysis of the tryptome reveals molecular mechanisms of gland cell evolution. EvoDevo 2019; 10:23. [PMID: 31583070 PMCID: PMC6767649 DOI: 10.1186/s13227-019-0138-1] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 07/20/2019] [Accepted: 09/13/2019] [Indexed: 12/25/2022] Open
Abstract
Background: Understanding the drivers of morphological diversity is a persistent challenge in evolutionary biology. Here, we investigate functional diversification of secretory cells in the sea anemone Nematostella vectensis to understand the mechanisms promoting cellular specialization across animals. Results: We demonstrate regionalized expression of gland cell subtypes in the internal ectoderm of N. vectensis and show that adult gland cell identity is acquired very early in development. A phylogenetic survey of trypsins across animals suggests that this gene family has undergone numerous expansions. We reveal unexpected diversity in trypsin protein structure and show that trypsin diversity arose through independent acquisitions of non-trypsin domains. Finally, we show that trypsin diversification in N. vectensis was effected through a combination of tandem duplication, exon shuffling, and retrotransposition. Conclusions: Together, these results reveal the numerous evolutionary mechanisms that drove trypsin duplication and divergence during the morphological specialization of cell types and suggest that the secretory cell phenotype is highly adaptable as a vehicle for novel secretory products.
9
Deutekom ES, Vosseberg J, van Dam TJP, Snel B. Measuring the impact of gene prediction on gene loss estimates in Eukaryotes by quantifying falsely inferred absences. PLoS Comput Biol 2019; 15:e1007301. [PMID: 31461468 PMCID: PMC6736253 DOI: 10.1371/journal.pcbi.1007301] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 12/14/2018] [Revised: 09/10/2019] [Accepted: 08/01/2019] [Indexed: 12/25/2022] Open
Abstract
In recent years it became clear that in eukaryotic genome evolution gene loss is prevalent over gene gain. However, the absence of genes in an annotated genome is not always equivalent to the loss of genes. Due to sequencing issues, or incorrect gene prediction, genes can be falsely inferred as absent. This implies that loss estimates are overestimated and, more generally, that falsely inferred absences impact genomic comparative studies. However, reliable estimates of how prevalent this issue is are lacking. Here we quantified the impact of gene prediction on gene loss estimates in eukaryotes by analysing 209 phylogenetically diverse eukaryotic organisms and comparing their predicted proteomes to that of their respective six-frame translated genomes. We observe that 4.61% of domains per species were falsely inferred to be absent for Pfam domains predicted to have been present in the last eukaryotic common ancestor. Between phylogenetically different categories this estimate varies substantially: for clade-specific loss (ancestral loss) we found 1.30% and for species-specific loss 16.88% to be falsely inferred as absent. For BUSCO 1-to-1 orthologous families, 18.30% were falsely inferred to be absent. Finally, we showed that falsely inferred absences indeed impact loss estimates, with the number of losses decreasing by 11.78%. Our work strengthens the increasing number of studies showing that gene loss is an important factor in eukaryotic genome evolution. However, while we demonstrate that on average inferring gene absences from predicted proteomes is reliable, caution is warranted when inferring species-specific absences.
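The abstract above compares predicted proteomes against six-frame translated genomes. As a rough illustration of what a six-frame translation is (not the authors' pipeline), the sketch below translates a DNA string in the three forward and three reverse-complement reading frames using the standard genetic code. The function names and the frame labels (`+1` to `-3`) are hypothetical choices made for this example.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons (e.g. containing N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> dict[str, str]:
    """Translate both strands at offsets 0, 1 and 2 (hypothetical labels '+1'..'-3')."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, peptide in six_frame_translation("ATGGCCTAA").items():
        print(label, peptide)
```

Stop codons are kept as `*` rather than used to split the output, so a domain search over these strings would see every possible reading of the genome, independent of any gene prediction.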
Affiliation(s)
- Eva S. Deutekom
  - Theoretical Biology and Bioinformatics, Department of Biology, Science faculty, Utrecht University, Utrecht, The Netherlands
- Julian Vosseberg
  - Theoretical Biology and Bioinformatics, Department of Biology, Science faculty, Utrecht University, Utrecht, The Netherlands
- Teunis J. P. van Dam
  - Theoretical Biology and Bioinformatics, Department of Biology, Science faculty, Utrecht University, Utrecht, The Netherlands
- Berend Snel
  - Theoretical Biology and Bioinformatics, Department of Biology, Science faculty, Utrecht University, Utrecht, The Netherlands
10
Kirsip H, Abroi A. Protein Structure-Guided Hidden Markov Models (HMMs) as A Powerful Method in the Detection of Ancestral Endogenous Viral Elements. Viruses 2019; 11:v11040320. [PMID: 30986983 PMCID: PMC6520822 DOI: 10.3390/v11040320] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 01/29/2019] [Revised: 03/23/2019] [Accepted: 03/27/2019] [Indexed: 12/19/2022] Open
Abstract
It has long been believed that the transfer and fixation of genetic material from RNA viruses to eukaryote genomes is very unlikely. However, during the last decade, several cases of “virus-to-host” gene transfer from various viral families into various eukaryotic phyla have been described. These transfers have been identified by sequence similarity, which may disappear very quickly, especially in the case of RNA viruses. Protein structure, however, is known to be more conserved than sequence. Applying protein structure-guided, protein domain-specific Hidden Markov Models, we detected homologues of the Virgaviridae capsid protein in Schizophora flies. Further data analysis supported “virus-to-host” transfer into Schizophora ancestors as a single transfer event. This transfer was not identifiable by BLAST or by other methods we applied. Our data show that structure-guided Hidden Markov Models should be used to detect ancestral virus-to-host transfers.
Affiliation(s)
- Heleri Kirsip
  - Department of Bioinformatics, University of Tartu, Tartu, 51010, Riia 23, Estonia
- Aare Abroi
  - Institute of Technology, University of Tartu, Tartu, 50411, Nooruse 1, Estonia
11
Mahajan S, Ramya TNC. Nature-inspired engineering of an F-type lectin for increased binding strength. Glycobiology 2019; 28:933-948. [PMID: 30202877 DOI: 10.1093/glycob/cwy082] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 12/14/2017] [Accepted: 09/07/2018] [Indexed: 11/13/2022] Open
Abstract
Individual lectin-carbohydrate interactions are usually of low affinity. However, high avidity is frequently attained by the multivalent presentation of glycans on biological surfaces coupled with the occurrence of high order lectin oligomers or tandem repeats of lectin domains in the polypeptide. F-type lectins are l-fucose binding lectins with a typical sequence motif, HX(26)RXDX(4)R/K, whose residues participate in l-fucose binding. We previously reported the presence of a few eukaryotic F-type lectin domains with partial sequence duplication that results in the presence of two l-fucose-binding sequence motifs. We hypothesized that such partial sequence duplication would result in greater avidity of lectin-ligand interactions. Inspired by this example from Nature, we attempted to engineer a bacterial F-type lectin domain from Streptosporangium roseum to attain avid binding by mimicking partial duplication. The engineered lectin demonstrated 12-fold greater binding strength than the wild-type lectin to multivalent fucosylated glycoconjugates. However, the affinity to the monosaccharide l-fucose in solution was similar and partial sequence duplication did not result in an additional functional l-fucose binding site. We also cloned, expressed and purified a Branchiostoma floridae F-type lectin domain with naturally occurring partial sequence duplication and confirmed that the duplicated region with the F-type lectin sequence motif did not participate in l-fucose binding. We found that the greater binding strength of the engineered lectin from S. roseum was instead due to increased oligomerization. We believe that this Nature-inspired strategy might be useful for engineering lectins to improve binding strength in various applications.
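The motif HX(26)RXDX(4)R/K quoted in the abstract above can be scanned for with a regular expression. The sketch below assumes the usual reading of this notation: X is any residue, a number in parentheses is a repeat count, and R/K means either arginine or lysine. That reading, and the function name `find_f_type_motifs`, are assumptions made for this example and are not taken from the paper.

```python
import re

# HX(26)RXDX(4)R/K read as: H, 26 arbitrary residues, R, one arbitrary residue, D,
# 4 arbitrary residues, then R or K. The lookahead allows overlapping hits to be reported.
F_TYPE_MOTIF = re.compile(r"(?=(H.{26}R.D.{4}[RK]))")


def find_f_type_motifs(protein: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched segment) for every motif occurrence."""
    return [(m.start(), m.group(1)) for m in F_TYPE_MOTIF.finditer(protein.upper())]


if __name__ == "__main__":
    # Synthetic sequence built only to exercise the pattern; it is not a real lectin.
    synthetic = "MM" + "H" + "G" * 26 + "RGD" + "G" * 4 + "K" + "MM"
    for start, segment in find_f_type_motifs(synthetic):
        print(start, segment)
```

A sequence with partial duplication of the kind the abstract describes would be expected to return two hits, though a pattern match alone says nothing about whether the site is functional, which is the point the abstract makes.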
Affiliation(s)
- Sonal Mahajan
  - Institute of Microbial Technology, Sector 39-A, Chandigarh, India
- T N C Ramya
  - Institute of Microbial Technology, Sector 39-A, Chandigarh, India
12
Vaattovaara A, Leppälä J, Salojärvi J, Wrzaczek M. High-throughput sequencing data and the impact of plant gene annotation quality. JOURNAL OF EXPERIMENTAL BOTANY 2019; 70:1069-1076. [PMID: 30590678 PMCID: PMC6382340 DOI: 10.1093/jxb/ery434] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 08/16/2018] [Accepted: 11/28/2018] [Indexed: 06/02/2023]
Abstract
The use of draft genomes of different species and re-sequencing of accessions and populations are now common tools for plant biology research. The de novo assembled draft genomes make it possible to identify pivotal divergence points in the plant lineage and provide an opportunity to investigate the genomic basis and timing of biological innovations by inferring orthologs between species. Furthermore, re-sequencing facilitates the mapping and subsequent molecular characterization of causative loci for traits, such as those for plant stress tolerance and development. In both cases, high-quality gene annotation (the identification of protein-coding regions, gene promoters, and 5'- and 3'-untranslated regions) is critical for investigation of gene function. Annotations are constantly improving but automated gene annotations still require manual curation and experimental validation. This is particularly important for genes with large introns, genes located in regions rich with transposable elements or repeats, large gene families, and segmentally duplicated genes. In this opinion paper, we highlight the impact of annotation quality on evolutionary analyses, genome-wide association studies, and the identification of orthologous genes in plants. Furthermore, we predict that incorporating accurate information from manual curation into databases will dramatically improve the performance of automated gene predictors.
Affiliation(s)
- Aleksia Vaattovaara
  - Organismal and Evolutionary Biology Research Programme, Viikki Plant Science Centre, VIPS, Faculty of Biological and Environmental Sciences, University of Helsinki, Viikinkaari 1 (POB65), Helsinki, Finland
- Johanna Leppälä
  - Department of Ecology and Environmental Science, Umeå University, Linnaeus väg 6, Umeå, Sweden
- Jarkko Salojärvi
  - Organismal and Evolutionary Biology Research Programme, Viikki Plant Science Centre, VIPS, Faculty of Biological and Environmental Sciences, University of Helsinki, Viikinkaari 1 (POB65), Helsinki, Finland
  - School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Michael Wrzaczek
  - Organismal and Evolutionary Biology Research Programme, Viikki Plant Science Centre, VIPS, Faculty of Biological and Environmental Sciences, University of Helsinki, Viikinkaari 1 (POB65), Helsinki, Finland
13
Cocker JM, Wright J, Li J, Swarbreck D, Dyer S, Caccamo M, Gilmartin PM. Primula vulgaris (primrose) genome assembly, annotation and gene expression, with comparative genomics on the heterostyly supergene. Sci Rep 2018; 8:17942. [PMID: 30560928 PMCID: PMC6299000 DOI: 10.1038/s41598-018-36304-4] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Received: 03/15/2018] [Accepted: 11/14/2018] [Indexed: 11/24/2022] Open
Abstract
Primula vulgaris (primrose) exhibits heterostyly: plants produce self-incompatible pin- or thrum-form flowers, with anthers and stigma at reciprocal heights. Darwin concluded that this arrangement promotes insect-mediated cross-pollination; later studies revealed control by a cluster of genes, or supergene, known as the S (Style length) locus. The P. vulgaris S locus is absent from pin plants and hemizygous in thrum plants (thrum-specific); mutation of S locus genes produces self-fertile homostyle flowers with anthers and stigma at equal heights. Here, we present a 411 Mb P. vulgaris genome assembly of a homozygous inbred long homostyle, representing ~87% of the genome. We annotate over 24,000 P. vulgaris genes, and reveal more genes up-regulated in thrum than pin flowers. We show reduced genomic read coverage across the S locus in other Primula species, including P. veris, where we define the conserved structure and expression of the S locus genes in thrum. Further analysis reveals the S locus has elevated repeat content (64%) compared to the wider genome (37%). Our studies suggest conservation of S locus genetic architecture in Primula, and provide a platform for identification and evolutionary analysis of the S locus and downstream targets that regulate heterostyly in diverse heterostylous species.
Collapse
Affiliation(s)
- Jonathan M Cocker
- School of Biological Sciences, University of East Anglia, Norwich Research Park, Norwich, NR4 7TJ, United Kingdom.,Earlham Institute, Norwich Research Park, Norwich, NR4 7UZ, United Kingdom
| | - Jonathan Wright
- Earlham Institute, Norwich Research Park, Norwich, NR4 7UZ, United Kingdom
| | - Jinhong Li
- School of Biological Sciences, University of East Anglia, Norwich Research Park, Norwich, NR4 7TJ, United Kingdom.,Earlham Institute, Norwich Research Park, Norwich, NR4 7UZ, United Kingdom
| | - David Swarbreck
- Earlham Institute, Norwich Research Park, Norwich, NR4 7UZ, United Kingdom
| | - Sarah Dyer
- National Institute for Agricultural Botany, Huntingdon Road, Cambridge, CB3 0LE, United Kingdom
| | - Mario Caccamo
- National Institute for Agricultural Botany, Huntingdon Road, Cambridge, CB3 0LE, United Kingdom
| | - Philip M Gilmartin
- School of Biological Sciences, University of East Anglia, Norwich Research Park, Norwich, NR4 7TJ, United Kingdom. .,Earlham Institute, Norwich Research Park, Norwich, NR4 7UZ, United Kingdom.
|
14
|
Silveira MC, Azevedo da Silva R, Faria da Mota F, Catanho M, Jardim R, Guimarães ACR, de Miranda AB. Systematic Identification and Classification of β-Lactamases Based on Sequence Similarity Criteria: β-Lactamase Annotation. Evol Bioinform Online 2018; 14:1176934318797351. [PMID: 30210232 PMCID: PMC6131288 DOI: 10.1177/1176934318797351] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 03/15/2018] [Accepted: 08/08/2018] [Indexed: 12/11/2022] Open
Abstract
β-lactamases, the enzymes responsible for resistance to β-lactam antibiotics, are widespread among prokaryotic genera. However, current β-lactamase classification schemes do not represent their present diversity. Here, we propose a workflow to identify and classify β-lactamases. Initially, a set of curated sequences was used as a model for the construction of profile Hidden Markov Models (HMMs) specific for each β-lactamase class. An extensive, nonredundant set of β-lactamase sequences was constructed from 7 different resistance protein databases to test the methodology. The profile HMMs were improved for their specificity and sensitivity and then applied to fully assembled genomes. Five hierarchical classification levels are described, and a new class of β-lactamases with fused domains is proposed. Our profile HMMs provide a better annotation of β-lactamases, with classes and subclasses defined by objective criteria such as sequence similarity. This classification offers a solid basis for studies on the diversity, dispersion, prevalence, and evolution of the different classes and subclasses of this critical enzymatic activity.
Affiliation(s)
- Melise Chaves Silveira
- Laboratório de Biologia Computacional e Sistemas, Instituto Oswaldo Cruz, Fiocruz, Rio de Janeiro, Brazil
| | - Rangeline Azevedo da Silva
- Laboratório de Biologia Computacional e Sistemas, Instituto Oswaldo Cruz, Fiocruz, Rio de Janeiro, Brazil
| | - Fábio Faria da Mota
- Laboratório de Biologia Computacional e Sistemas, Instituto Oswaldo Cruz, Fiocruz, Rio de Janeiro, Brazil
| | - Marcos Catanho
- Laboratório de Genômica Funcional e Bioinformática, Instituto Oswaldo Cruz, Fiocruz, Rio de Janeiro, Brazil
| | - Rodrigo Jardim
- Laboratório de Biologia Computacional e Sistemas, Instituto Oswaldo Cruz, Fiocruz, Rio de Janeiro, Brazil
| | - Ana Carolina R Guimarães
- Laboratório de Genômica Funcional e Bioinformática, Instituto Oswaldo Cruz, Fiocruz, Rio de Janeiro, Brazil
| | - Antonio B de Miranda
- Laboratório de Biologia Computacional e Sistemas, Instituto Oswaldo Cruz, Fiocruz, Rio de Janeiro, Brazil
|
15
|
Kinjo AR. Cooperative "folding transition" in the sequence space facilitates function-driven evolution of protein families. J Theor Biol 2018; 443:18-27. [PMID: 29355538 DOI: 10.1016/j.jtbi.2018.01.019] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/27/2017] [Revised: 01/16/2018] [Accepted: 01/17/2018] [Indexed: 12/23/2022]
Abstract
In the protein sequence space, natural proteins form clusters of families which are characterized by their unique native folds, whereas the great majority of random polypeptides are neither clustered nor foldable to unique structures. Since a given polypeptide can be either foldable or unfoldable, a kind of "folding transition" is expected at the boundary of a protein family in the sequence space. By Monte Carlo simulations of a statistical mechanical model of protein sequence alignment that coherently incorporates both short-range and long-range interactions as well as variable-length insertions to reproduce the statistics of the multiple sequence alignment of a given protein family, we demonstrate the existence of such a transition between natural-like sequences and random sequences in the sequence subspaces for 15 domain families of various folds. The transition was found to be highly cooperative and two-state-like. Furthermore, enforcing or suppressing consensus residues on a few of the well-conserved sites enhanced or diminished, respectively, the natural-like pattern formation over the entire sequence. In most families, the key sites included ligand binding sites. These results suggest that some selective pressure on the key residues, such as ligand binding activity, may cooperatively facilitate the emergence of a protein family during evolution. From a more practical aspect, the present results highlight an essential role of long-range effects in precisely defining protein families, which are absent in conventional sequence models.
Affiliation(s)
- Akira R Kinjo
- Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita, Osaka 565-0871, Japan.
|
16
|
Menichelli C, Gascuel O, Bréhélin L. Improving pairwise comparison of protein sequences with domain co-occurrence. PLoS Comput Biol 2018; 14:e1005889. [PMID: 29293498 PMCID: PMC5766236 DOI: 10.1371/journal.pcbi.1005889] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 04/28/2017] [Revised: 01/12/2018] [Accepted: 11/23/2017] [Indexed: 01/17/2023] Open
Abstract
Comparing and aligning protein sequences is an essential task in bioinformatics. More specifically, local alignment tools like BLAST are widely used for identifying conserved protein sub-sequences, which likely correspond to protein domains or functional motifs. However, to limit the number of false positives, these tools are used with stringent sequence-similarity thresholds and hence can miss several hits, especially for species that are phylogenetically distant from reference organisms. A solution to this problem is to integrate additional contextual information into the procedure. Here, we propose to use domain co-occurrence to increase the sensitivity of pairwise sequence comparisons. Domain co-occurrence is a strong feature of proteins, since most protein domains tend to appear with a limited number of other domains on the same protein. We propose a method to take this information into account in a typical BLAST analysis and to construct new domain families on the basis of these results. We used Plasmodium falciparum as a case study to evaluate our method. The experimental findings showed a 14% increase in the number of significant BLAST hits and a 25% increase in the proteome area that can be covered with a domain. Our method identified 2240 new domains for which, in most cases, no model of the Pfam database could be linked. Moreover, our study of the quality of the new domains in terms of alignment and physicochemical properties shows that they are close to standard Pfam domains. Source code of the proposed approach and supplementary data are available at: https://gite.lirmm.fr/menichelli/pairwise-comparison-with-cooccurrence
Deciphering the functions of the different proteins of an organism constitutes a first step toward the understanding of its biology. Because they provide strong clues regarding protein functions, domains occupy a key position among the relevant annotations that can be assigned to a protein. Protein domains are sequential motifs that are conserved along evolution and are found in different proteins and in different combinations. One common approach for identifying the domains of a protein is to run sequence-sequence comparisons with local alignment tools such as BLAST. However, these approaches sometimes miss several hits, especially for species that are phylogenetically distant from reference organisms. We propose here an approach to increase the sensitivity of pairwise sequence comparisons. This approach makes use of the fact that protein domains tend to appear with a limited number of other domains on the same protein (the domain co-occurrence property). On P. falciparum, our approach identifies 2240 new domains for which, in most cases, no domain of the Pfam database could be linked.
Affiliation(s)
| | - Olivier Gascuel
- IBC, LIRMM, Univ. Montpellier, CNRS, Montpellier, France
- Unité de Bioinformatique Evolutive, C3BI - USR 3756, Institut Pasteur et CNRS, Paris, France
| | - Laurent Bréhélin
- IBC, LIRMM, Univ. Montpellier, CNRS, Montpellier, France
|
17
|
Complex evolutionary footprints revealed in an analysis of reused protein segments of diverse lengths. Proc Natl Acad Sci U S A 2017; 114:11703-11708. [PMID: 29078314 PMCID: PMC5676897 DOI: 10.1073/pnas.1707642114] [Citation(s) in RCA: 55] [Impact Index Per Article: 7.9] [Indexed: 01/13/2023] Open
Abstract
We question a central paradigm: namely, that the protein domain is the “atomic unit” of evolution. In conflict with the current textbook view, our results unequivocally show that duplication of protein segments happens both above and below the domain level among amino acid segments of diverse lengths. Indeed, we show that significant evolutionary information is lost when the protein is approached as a string of domains. Our finer-grained approach reveals a far more complicated picture, where reused segments often intertwine and overlap with each other. Our results are consistent with a recursive model of evolution, in which segments of various lengths, typically smaller than domains, “hop” between environments. The fit segments remain, leaving traces that can still be detected.
Proteins share similar segments with one another. Such “reused parts”—which have been successfully incorporated into other proteins—are likely to offer an evolutionary advantage over de novo evolved segments, as most of the latter will not even have the capacity to fold. To systematically explore the evolutionary traces of segment “reuse” across proteins, we developed an automated methodology that identifies reused segments from protein alignments. We search for “themes”—segments of at least 35 residues of similar sequence and structure—reused within representative sets of 15,016 domains [Evolutionary Classification of Protein Domains (ECOD) database] or 20,398 chains [Protein Data Bank (PDB)]. We observe that theme reuse is highly prevalent and that reuse is more extensive when the length threshold for identifying a theme is lower. Structural domains, the best characterized form of reuse in proteins, are just one of many complex and intertwined evolutionary traces. Others include long themes shared among a few proteins, which encompass and overlap with shorter themes that recur in numerous proteins. The observed complexity is consistent with evolution by duplication and divergence, and some of the themes might include descendants of ancestral segments. The observed recursive footprints, where the same amino acid can simultaneously participate in several intertwined themes, could be a useful concept for protein design. Data are available at http://trachel-srv.cs.haifa.ac.il/rachel/ppi/themes/.
|
18
|
Koehorst JJ, Saccenti E, Schaap PJ, Martins Dos Santos VAP, Suarez-Diez M. Protein domain architectures provide a fast, efficient and scalable alternative to sequence-based methods for comparative functional genomics. F1000Res 2016; 5:1987. [PMID: 27703668 PMCID: PMC5031134 DOI: 10.12688/f1000research.9416.3] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Accepted: 06/26/2017] [Indexed: 11/20/2022] Open
Abstract
A functional comparative genome analysis is essential to understand the mechanisms underlying bacterial evolution and adaptation. Detection of functional orthologs using standard global sequence similarity methods faces several problems: the need to define arbitrary acceptance thresholds for similarity and alignment length, lateral gene acquisition, and the high computational cost of finding bi-directional best matches at a large scale. We investigated the use of protein domain architectures for large-scale functional comparative analysis as an alternative method. The performance of both approaches was assessed through functional comparison of 446 bacterial genomes sampled at different taxonomic levels. We show that protein domain architectures provide a fast and efficient alternative to methods based on sequence similarity for identifying groups of functionally equivalent proteins within and across taxonomic boundaries, and that this approach is suitable for large-scale comparative analysis. Running both methods in parallel pinpoints potential functional adaptations that may add to bacterial fitness.
Affiliation(s)
- Jasper J Koehorst
- Laboratory of Systems and Synthetic Biology, Wageningen University and Research, Wageningen, Netherlands
| | - Edoardo Saccenti
- Laboratory of Systems and Synthetic Biology, Wageningen University and Research, Wageningen, Netherlands
| | - Peter J Schaap
- Laboratory of Systems and Synthetic Biology, Wageningen University and Research, Wageningen, Netherlands
| | - Vitor A P Martins Dos Santos
- Laboratory of Systems and Synthetic Biology, Wageningen University and Research, Wageningen, Netherlands; LifeGlimmer GmbH, Berlin, Germany
| | - Maria Suarez-Diez
- Laboratory of Systems and Synthetic Biology, Wageningen University and Research, Wageningen, Netherlands
|
19
|
Lees JG, Dawson NL, Sillitoe I, Orengo CA. Functional innovation from changes in protein domains and their combinations. Curr Opin Struct Biol 2016; 38:44-52. [DOI: 10.1016/j.sbi.2016.05.016] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Received: 03/22/2016] [Revised: 05/17/2016] [Accepted: 05/24/2016] [Indexed: 10/21/2022]
|
20
|
Punta M, Mistry J. Homology-Based Annotation of Large Protein Datasets. Methods Mol Biol 2016; 1415:153-176. [PMID: 27115632 DOI: 10.1007/978-1-4939-3572-7_8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2023]
Abstract
Advances in DNA sequencing technologies have led to an increasing amount of protein sequence data being generated. Only a small fraction of this protein sequence data will have experimental annotation associated with it. Here, we describe a protocol for in silico homology-based annotation of large protein datasets that makes extensive use of manually curated collections of protein families. We focus on annotations provided by the Pfam database and suggest ways to identify family outliers and family variations. This protocol may be useful to people who are new to protein data analysis, or who are unfamiliar with the current computational tools that are available.
Affiliation(s)
- Marco Punta
- Sorbonne Universités, UPMC-Univ P6, CNRS, Laboratoire de Biologie Computationnelle et Quantitative - UMR 7238, 15 rue de l'Ecole de Médecine, Paris, France.
| | - Jaina Mistry
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
|
21
|
Kelley LA, Sternberg MJE. Partial protein domains: evolutionary insights and bioinformatics challenges. Genome Biol 2015; 16:100. [PMID: 25986583 PMCID: PMC4436111 DOI: 10.1186/s13059-015-0663-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 11/10/2022] Open
Abstract
Protein domains are generally thought to correspond to units of evolution. New research raises questions about how such domains are defined with bioinformatics tools and sheds light on how evolution has enabled partial domains to be viable.
Affiliation(s)
- Lawrence A Kelley
- Structural Bioinformatics Group, Centre for Integrative Systems Biology and Bioinformatics, Department of Life Sciences, Imperial College London, London, SW7 2AZ, UK.
| | - Michael J E Sternberg
- Structural Bioinformatics Group, Centre for Integrative Systems Biology and Bioinformatics, Department of Life Sciences, Imperial College London, London, SW7 2AZ, UK.
|
22
|
Prakash A, Bateman A. Domain atrophy creates rare cases of functional partial protein domains. Genome Biol 2015; 16:88. [PMID: 25924720 PMCID: PMC4432964 DOI: 10.1186/s13059-015-0655-8] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 11/20/2014] [Accepted: 04/15/2015] [Indexed: 01/12/2023] Open
Abstract
BACKGROUND: Protein domains display a range of structural diversity, with numerous additions and deletions of secondary structural elements between related domains. We have observed a small number of cases of surprising large-scale deletions of core elements of structural domains. We propose a new concept called domain atrophy, where protein domains lose a significant number of core structural elements. RESULTS: Here, we implement a new pipeline to systematically identify new cases of domain atrophy across all known protein sequences. The output of this pipeline was carefully checked by hand, which filtered out partial domain instances that were unlikely to represent true domain atrophy due to misannotations or un-annotated sequence fragments. We identify 75 cases of domain atrophy, of which eight cases are found in a three-dimensional protein structure and 67 cases have been inferred based on mapping to a known homologous structure. Domains with structural variations include ancient folds such as the TIM-barrel and Rossmann folds. Most of these domains are observed to show structural loss that does not affect their functional sites. CONCLUSION: Our analysis has significantly increased the known cases of domain atrophy. We discuss specific instances of domain atrophy and see that there has often been a compensatory mechanism that helps to maintain the stability of the partial domain. Our study indicates that although domain atrophy is an extremely rare phenomenon, protein domains under certain circumstances can tolerate extreme mutations giving rise to partial, but functional, domains.
Affiliation(s)
- Ananth Prakash
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK.
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK.
|