1. Khodji H, Collet P, Thompson JD, Jeannin-Girardon A. De-MISTED: Image-based classification of erroneous multiple sequence alignments using convolutional neural networks. APPL INTELL 2023. [DOI: 10.1007/s10489-022-04390-7]

2. Bagheri H, Severin AJ, Rajan H. Detecting and correcting misclassified sequences in the large-scale public databases. Bioinformatics 2020; 36:4699-4705. [PMID: 32579213 PMCID: PMC7821992 DOI: 10.1093/bioinformatics/btaa586] [Received: 04/02/2020] [Revised: 06/10/2020] [Accepted: 06/16/2020]
Abstract
Motivation: As the cost of sequencing decreases, the amount of data being deposited into public repositories is increasing rapidly. Public databases rely on the user to provide metadata for each submission that is prone to user error. Unfortunately, most public databases, such as non-redundant (NR), rely on user input and do not have methods for identifying errors in the provided metadata, leading to the potential for error propagation. Previous research on a small subset of the NR database analyzed misclassification based on sequence similarity. To the best of our knowledge, the amount of misclassification in the entire database has not been quantified. We propose a heuristic method to detect potentially misclassified taxonomic assignments in the NR database. We applied a curation technique and quality control to find the most probable taxonomic assignment. Our method incorporates provenance and frequency of each annotation from manually and computationally created databases and clustering information at 95% similarity.
Results: We found more than two million potentially taxonomically misclassified proteins in the NR database. Using simulated data, we show a high precision of 97% and a recall of 87% for detecting taxonomically misclassified proteins. The proposed approach and findings could also be applied to other databases.
Availability and implementation: Source code, dataset, documentation, Jupyter notebooks and Docker container are available at https://github.com/boalang/nr.
Supplementary information: Supplementary data are available at Bioinformatics online.
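For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how precision and recall are computed from detection counts. The function name and the example counts are hypothetical and are not taken from the paper or its repository.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Return (precision, recall) for a binary detector.

    precision = TP / (TP + FP): fraction of flagged proteins that are truly misclassified.
    recall    = TP / (TP + FN): fraction of truly misclassified proteins that were flagged.
    """
    if true_pos + false_pos == 0 or true_pos + false_neg == 0:
        raise ValueError("precision or recall is undefined for these counts")
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Hypothetical counts, for illustration only (not data from the study).
precision, recall = precision_recall(true_pos=8, false_pos=2, false_neg=2)
```

In a simulated-data evaluation such as the one the abstract mentions, the true labels are known by construction, which is what makes both counts computable.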
Affiliation(s)
- Andrew J Severin
  - Genome Informatics Facility, Iowa State University, Ames, IA 50011, USA

3. Soluri MF, Puccio S, Caredda G, Grillo G, Licciulli VF, Consiglio A, Edomi P, Santoro C, Sblattero D, Peano C. Interactome-Seq: A Protocol for Domainome Library Construction, Validation and Selection by Phage Display and Next Generation Sequencing. J Vis Exp 2018. [PMID: 30346377 DOI: 10.3791/56981]
Abstract
Folding reporters are proteins with easily identifiable phenotypes, such as antibiotic resistance, whose folding and function are compromised when fused to poorly folding proteins or random open reading frames. We have developed a strategy where, by using TEM-1 β-lactamase (the enzyme conferring ampicillin resistance) on a genomic scale, we can select collections of correctly folded protein domains from the coding portion of the DNA of any intronless genome. The protein fragments obtained by this approach, the so-called "domainome", will be well expressed and soluble, making them suitable for structural/functional studies. By cloning and displaying the "domainome" directly in a phage display system, we have shown that it is possible to select specific protein domains with the desired binding properties (e.g., to other proteins or to antibodies), thus providing essential experimental information for gene annotation or antigen identification. The identification of the most enriched clones in a selected polyclonal population can be achieved by using novel next-generation sequencing technologies (NGS). For these reasons, we introduce deep sequencing analysis of the library itself and the selection outputs to provide complete information on diversity, abundance and precise mapping of each of the selected fragments. The protocols presented here show the key steps for library construction, characterization, and validation.
Affiliation(s)
- Maria Felicia Soluri
  - Department of Health Sciences, Università del Piemonte Orientale & IRCAD, Novara, Italy
- Simone Puccio
  - Institute of Biomedical Technologies, National Research Council, Segrate, Milan, Italy
- Giada Caredda
  - Institute of Biomedical Technologies, National Research Council, Segrate, Milan, Italy
- Giorgio Grillo
  - Institute of Biomedical Technologies, National Research Council, Bari, Italy
- Arianna Consiglio
  - Institute of Biomedical Technologies, National Research Council, Bari, Italy
- Paolo Edomi
  - Department of Life Sciences, University of Trieste, Italy
- Claudio Santoro
  - Department of Health Sciences, Università del Piemonte Orientale & IRCAD, Novara, Italy
- Clelia Peano
  - Institute of Genetic and Biomedical Research, National Research Council, Rozzano, Milan, Italy; Humanitas Clinical and Research Center, Rozzano, Milan, Italy

4. Bányai L, Kerekes K, Trexler M, Patthy L. Morphological Stasis and Proteome Innovation in Cephalochordates. Genes (Basel) 2018; 9:genes9070353. [PMID: 30013013 PMCID: PMC6071037 DOI: 10.3390/genes9070353] [Received: 05/24/2018] [Revised: 07/11/2018] [Accepted: 07/11/2018]
Abstract
Lancelets, extant representatives of basal chordates, are prototypic examples of evolutionary stasis; they preserved a morphology and body-plan most similar to the fossil chordates from the early Cambrian. Such a low level of morphological evolution is in harmony with a low rate of amino acid substitution; cephalochordate proteins were shown to evolve slower than those of the slowest evolving vertebrate, the elephant shark. Surprisingly, a study comparing the predicted proteomes of Chinese amphioxus, Branchiostoma belcheri and the Florida amphioxus, Branchiostoma floridae has led to the conclusion that the rate of creation of novel domain combinations is orders of magnitude greater in lancelets than in any other Metazoa, a finding that contradicts the notion that high rates of protein innovation are usually associated with major evolutionary innovations. Our earlier studies on a representative sample of proteins have provided evidence suggesting that the differences in the domain architectures of predicted proteins of these two lancelet species reflect annotation errors, rather than true innovations. In the present work, we have extended these studies to include a larger sample of genes and two additional lancelet species, Asymmetron lucayanum and Branchiostoma lanceolatum. These analyses have confirmed that the domain architecture differences of orthologous proteins of the four lancelet species are because of errors of gene prediction, the error rate in the given species being inversely related to the quality of the transcriptome dataset that was used to aid gene prediction.
Affiliation(s)
- László Bányai
  - Institute of Enzymology, Research Centre for Natural Sciences, Hungarian Academy of Sciences, H-1117 Budapest, Hungary.
- Krisztina Kerekes
  - Institute of Enzymology, Research Centre for Natural Sciences, Hungarian Academy of Sciences, H-1117 Budapest, Hungary.
- Mária Trexler
  - Institute of Enzymology, Research Centre for Natural Sciences, Hungarian Academy of Sciences, H-1117 Budapest, Hungary.
- László Patthy
  - Institute of Enzymology, Research Centre for Natural Sciences, Hungarian Academy of Sciences, H-1117 Budapest, Hungary.

5. Stroehlein AJ, Young ND, Gasser RB. Improved strategy for the curation and classification of kinases, with broad applicability to other eukaryotic protein groups. Sci Rep 2018; 8:6808. [PMID: 29717207 PMCID: PMC5931623 DOI: 10.1038/s41598-018-25020-8] [Received: 01/16/2018] [Accepted: 04/12/2018]
Abstract
Despite the substantial amount of genomic and transcriptomic data available for a wide range of eukaryotic organisms, most genomes are still in a draft state and can have inaccurate gene predictions. To gain a sound understanding of the biology of an organism, it is crucial that inferred protein sequences are accurately identified and annotated. However, this can be challenging to achieve, particularly for organisms such as parasitic worms (helminths), as most gene prediction approaches do not account for substantial phylogenetic divergence from model organisms, such as Caenorhabditis elegans and Drosophila melanogaster, whose genomes are well-curated. In this paper, we describe a bioinformatic strategy for the curation of gene families and subsequent annotation of encoded proteins. This strategy relies on pairwise gene curation between at least two closely related species using genomic and transcriptomic data sets, and is built on recent work on kinase complements of parasitic worms. Here, we discuss salient technical aspects of this strategy and its implications for the curation of protein families more generally.
Affiliation(s)
- Andreas J Stroehlein
  - Melbourne Veterinary School, Department of Veterinary Biosciences, Faculty of Veterinary and Agricultural Sciences, The University of Melbourne, Parkville, Victoria, 3010, Australia.
- Neil D Young
  - Melbourne Veterinary School, Department of Veterinary Biosciences, Faculty of Veterinary and Agricultural Sciences, The University of Melbourne, Parkville, Victoria, 3010, Australia.
- Robin B Gasser
  - Melbourne Veterinary School, Department of Veterinary Biosciences, Faculty of Veterinary and Agricultural Sciences, The University of Melbourne, Parkville, Victoria, 3010, Australia.

6. VirusSeeker, a computational pipeline for virus discovery and virome composition analysis. Virology 2017; 503:21-30. [PMID: 28110145 DOI: 10.1016/j.virol.2017.01.005] [Received: 11/18/2016] [Revised: 01/07/2017] [Accepted: 01/10/2017]
Abstract
The advent of Next Generation Sequencing (NGS) has vastly increased our ability to discover novel viruses and to systematically define the spectrum of viruses present in a given specimen. Such studies have led to the discovery of novel viral pathogens as well as broader associations of the virome with diverse diseases including inflammatory bowel disease, severe acute malnutrition and HIV/AIDS. Critical to the success of these efforts are robust bioinformatic pipelines for rapid classification of microbial sequences. Existing computational tools are typically focused on either eukaryotic virus discovery or virome composition analysis but not both. Here we present VirusSeeker, a BLAST-based NGS data analysis pipeline designed for both purposes. VirusSeeker has been successfully applied in several previously published virome studies. Here we demonstrate the functionality of VirusSeeker in both novel virus discovery and virome composition analysis.

7. Gradnigo JS, Majumdar A, Norgren RB, Moriyama EN. Advantages of an Improved Rhesus Macaque Genome for Evolutionary Analyses. PLoS One 2016; 11:e0167376. [PMID: 27911958 PMCID: PMC5135103 DOI: 10.1371/journal.pone.0167376] [Received: 12/01/2015] [Accepted: 11/14/2016]
Abstract
The rhesus macaque (Macaca mulatta) is widely used in molecular evolutionary analyses, particularly to identify genes under adaptive or unique evolution in the human lineage. For such studies, it is necessary to align nucleotide sequences of homologous protein-coding genes among multiple species. The validity of these analyses is dependent on high quality genomic data. However, for most mammalian species (other than humans and mice), only draft genomes are available. There has been concern that some results obtained from evolutionary analyses using draft genomes may not be correct. The rhesus macaque provides a unique opportunity to determine whether an improved genome (MacaM) yields better results than a draft genome (rheMac2) for evolutionary studies. We compared protein-coding genes annotated in the rheMac2 and MacaM genomes with their human orthologs. We found many genes annotated in rheMac2 had apparently spurious sequences not present in genes derived from MacaM. The rheMac2 annotations also appeared to inflate a frequently used evolutionary index, ω (the ratio of nonsynonymous to synonymous substitution rates). Genes with these spurious sequences must be filtered out from evolutionary analyses to obtain correct results. With the MacaM genome, improved sequence information means many more genes can be examined for indications of selection. These results indicate how upgrading genomes from draft status to a higher level of quality can improve interpretation of evolutionary patterns.
Affiliation(s)
- Julien S. Gradnigo
  - School of Biological Sciences, University of Nebraska-Lincoln, Lincoln, Nebraska, United States of America
- Abhishek Majumdar
  - Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, Nebraska, United States of America
- Robert B. Norgren
  - Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, Nebraska, United States of America
- Etsuko N. Moriyama
  - School of Biological Sciences and Center for Plant Science Innovation, University of Nebraska-Lincoln, Lincoln, Nebraska, United States of America

8. Leuthaeuser JB, Morris JH, Harper AF, Ferrin TE, Babbitt PC, Fetrow JS. DASP3: identification of protein sequences belonging to functionally relevant groups. BMC Bioinformatics 2016; 17:458. [PMID: 27835946 PMCID: PMC5106842 DOI: 10.1186/s12859-016-1295-z] [Received: 04/22/2016] [Accepted: 10/20/2016]
Abstract
Background: Development of automatable processes for clustering proteins into functionally relevant groups is a critical hurdle as an increasing number of sequences are deposited into databases. Experimental function determination is exceptionally time-consuming and cannot keep pace with the identification of protein sequences. A tool, DASP (Deacon Active Site Profiler), was previously developed to identify protein sequences with active site similarity to a query set. Development of two iterative, automatable methods for clustering proteins into functionally relevant groups exposed algorithmic limitations in DASP.
Results: The accuracy and efficiency of DASP were significantly improved through six algorithmic enhancements implemented in two stages: DASP2 and DASP3. Validation demonstrated that DASP3 provides greater score separation between true positives and false positives than earlier versions. In addition, DASP3 shows similar performance to previous versions in clustering protein structures into isofunctional groups (validated against manual curation), but DASP3 gathers and clusters protein sequences into isofunctional groups more efficiently than DASP and DASP2.
Conclusions: DASP algorithmic enhancements resulted in improved efficiency and accuracy of identifying proteins that contain active site features similar to those of the query set. These enhancements provide incremental improvement in structure database searches and initial sequence database searches; however, the enhancements show significant improvement in iterative sequence searches, suggesting DASP3 is an appropriate tool for the iterative processes required for clustering proteins into isofunctional groups.
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1295-z) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Janelle B Leuthaeuser
  - Molecular Genetics and Genomics Program, Wake Forest University, Winston-Salem, NC, 27106, USA. Present address: University of Richmond, Gottwald Hall C302, Richmond, VA, 23173, USA.
- John H Morris
  - Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, CA, 94158, USA
- Angela F Harper
  - Department of Physics, Wake Forest University, Winston-Salem, NC, 27106, USA
- Thomas E Ferrin
  - Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, CA, 94158, USA
- Patricia C Babbitt
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, 94158, USA
- Jacquelyn S Fetrow
  - Department of Chemistry, University of Richmond, Richmond, VA, 23173, USA

9. Putative extremely high rate of proteome innovation in lancelets might be explained by high rate of gene prediction errors. Sci Rep 2016; 6:30700. [PMID: 27476717 PMCID: PMC4967905 DOI: 10.1038/srep30700] [Received: 04/26/2016] [Accepted: 07/06/2016]
Abstract
A recent analysis of the genomes of Chinese and Florida lancelets has concluded that the rate of creation of novel protein domain combinations is orders of magnitude greater in lancelets than in other metazoa and it was suggested that continuous activity of transposable elements in lancelets is responsible for this increased rate of protein innovation. Since morphologically Chinese and Florida lancelets are highly conserved, this finding would contradict the observation that high rates of protein innovation are usually associated with major evolutionary innovations. Here we show that the conclusion that the rate of proteome innovation is exceptionally high in lancelets may be unjustified: the differences observed in domain architectures of orthologous proteins of different amphioxus species probably reflect high rates of gene prediction errors rather than true innovation.

10. Meng X, Li C, Xiu C, Zhang J, Li J, Huang L, Zhang Y, Liu Z. Identification and Biochemical Properties of Two New Acetylcholinesterases in the Pond Wolf Spider (Pardosa pseudoannulata). PLoS One 2016; 11:e0158011. [PMID: 27337188 PMCID: PMC4919072 DOI: 10.1371/journal.pone.0158011] [Received: 03/16/2016] [Accepted: 06/08/2016]
Abstract
Acetylcholinesterase (AChE), an important neurotransmitter hydrolase in both invertebrates and vertebrates, is targeted by organophosphorus and carbamate insecticides. In this study, two new AChEs were identified in the pond wolf spider Pardosa pseudoannulata, an important predatory natural enemy of several insect pests. In total, four AChEs were found in P. pseudoannulata (including two AChEs previously identified in our laboratory). The new putative AChEs PpAChE3 and PpAChE4 contain most of the common features of the AChE family, including cysteine residues, choline binding sites, the conserved sequence 'FGESAG' and conserved aromatic residues but with a catalytic triad of 'SDH' rather than 'SEH'. Recombinant enzymes expressed in Sf9 cells showed significant differences in biochemical properties compared to other AChEs, such as the optimal pH, substrate specificity, and catalytic efficiency. Among three test substrates, PpAChE1, PpAChE3 and PpAChE4 showed the highest catalytic efficiency (Vmax/KM) for ATC (acetylthiocholine iodide), with PpAChE3 exhibiting a clear preference for ATC based on the VmaxATC/VmaxBTC ratio. In addition, the four PpAChEs were more sensitive to the AChE-specific inhibitor BW284C51, which acts against ATC hydrolysis, than to the BChE-specific inhibitor ISO-OMPA, which acts against BTC hydrolysis, with at least an 8.5-fold difference in IC50 values for each PpAChE. PpAChE3, PpAChE4, and PpAChE1 were more sensitive than PpAChE2 to the tested Carb insecticides, and PpAChE3 was more sensitive than the other three AChEs to the tested OP insecticides. Based on all the results, two new functional AChEs were identified from P. pseudoannulata. The differences in AChE sequence between this spider and insects enrich our knowledge of invertebrate AChE diversity, and our findings will be helpful for understanding the selectivity of insecticides between insects and natural enemy spiders.
Affiliation(s)
- Xiangkun Meng
  - Key Laboratory of Integrated Management of Crop Diseases and Pests (Ministry of Education), College of Plant Protection, Nanjing Agricultural University, Weigang 1, Nanjing, 210095, China
- Chunrui Li
  - Key Laboratory of Integrated Management of Crop Diseases and Pests (Ministry of Education), College of Plant Protection, Nanjing Agricultural University, Weigang 1, Nanjing, 210095, China
- Chunli Xiu
  - Key Laboratory of Integrated Management of Crop Diseases and Pests (Ministry of Education), College of Plant Protection, Nanjing Agricultural University, Weigang 1, Nanjing, 210095, China
- Jianhua Zhang
  - Key Laboratory of Integrated Management of Crop Diseases and Pests (Ministry of Education), College of Plant Protection, Nanjing Agricultural University, Weigang 1, Nanjing, 210095, China
- Jingjing Li
  - Key Laboratory of Integrated Management of Crop Diseases and Pests (Ministry of Education), College of Plant Protection, Nanjing Agricultural University, Weigang 1, Nanjing, 210095, China
- Lixin Huang
  - Key Laboratory of Integrated Management of Crop Diseases and Pests (Ministry of Education), College of Plant Protection, Nanjing Agricultural University, Weigang 1, Nanjing, 210095, China
- Yixi Zhang
  - Key Laboratory of Integrated Management of Crop Diseases and Pests (Ministry of Education), College of Plant Protection, Nanjing Agricultural University, Weigang 1, Nanjing, 210095, China
- Zewen Liu
  - Key Laboratory of Integrated Management of Crop Diseases and Pests (Ministry of Education), College of Plant Protection, Nanjing Agricultural University, Weigang 1, Nanjing, 210095, China

11. Holliday GL, Bairoch A, Bagos PG, Chatonnet A, Craik DJ, Finn RD, Henrissat B, Landsman D, Manning G, Nagano N, O’Donovan C, Pruitt KD, Rawlings ND, Saier M, Sowdhamini R, Spedding M, Srinivasan N, Vriend G, Babbitt PC, Bateman A. Key challenges for the creation and maintenance of specialist protein resources. Proteins 2015; 83:1005-13. [PMID: 25820941 PMCID: PMC4446195 DOI: 10.1002/prot.24803] [Received: 01/12/2015] [Revised: 03/06/2015] [Accepted: 03/20/2015]
Abstract
As the volume of data relating to proteins increases, researchers rely more and more on the analysis of published data, thus increasing the importance of good access to these data that vary from the supplemental material of individual articles, all the way to major reference databases with professional staff and long-term funding. Specialist protein resources fill an important middle ground, providing interactive web interfaces to their databases for a focused topic or family of proteins, using specialized approaches that are not feasible in the major reference databases. Many are labors of love, run by a single lab with little or no dedicated funding and there are many challenges to building and maintaining them. This perspective arose from a meeting of several specialist protein resources and major reference databases held at the Wellcome Trust Genome Campus (Cambridge, UK) on August 11 and 12, 2014. During this meeting some common key challenges involved in creating and maintaining such resources were discussed, along with various approaches to address them. In laying out these challenges, we aim to inform users about how these issues impact our resources and illustrate ways in which our working together could enhance their accuracy, currency, and overall value.
Affiliation(s)
- Gemma L Holliday
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, California, 94158
- Amos Bairoch
  - SIB Swiss Institute of Bioinformatics, University of Geneva, Geneva, Switzerland
- Pantelis G Bagos
  - Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, 35100, Greece
- Arnaud Chatonnet
  - INRA, Umr866 Dynamique Musculaire Et Métabolisme, Montpellier, F-34000, France
  - Université Montpellier, Montpellier, F-34000, France
- David J Craik
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, Queensland, 4072, Australia
- Robert D Finn
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
- Bernard Henrissat
  - Architecture Et Fonction Des Macromolécules Biologiques, CNRS, Aix-Marseille Université, Marseille, 13288, France
  - Department of Biological Sciences, King Abdulaziz University, Jeddah, Saudi Arabia
- David Landsman
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, 20892
- Gerard Manning
  - Department of Bioinformatics & Computational Biology, Genentech, 1 DNA Way, South San Francisco, California, 98010
- Nozomi Nagano
  - Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, 135-0064, Japan
- Claire O’Donovan
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
- Kim D Pruitt
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, 20892
- Neil D Rawlings
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
  - Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
- Milton Saier
  - Department of Molecular Biology, University of California at San Diego, La Jolla, California, 92093
- Ramanathan Sowdhamini
  - National Centre for Biological Sciences, TIFR, GKVK Campus, Bellary Road, Bangalore, 560065, India
- Michael Spedding
  - Chair NC-IUPHAR, Spedding Research Solutions SARL, 6 Rue Ampere, Le Vesinet, 78110, France
- Gert Vriend
  - Centre for Molecular and Biomolecular Informatics (CMBI), Radboud University Medical Center, Geert Grooteplein Zuid 26-28, 6525 GA Nijmegen, The Netherlands
- Patricia C Babbitt
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, California, 94158
- Alex Bateman
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom

12. Triant DA, Pearson WR. Most partial domains in proteins are alignment and annotation artifacts. Genome Biol 2015; 16:99. [PMID: 25976240 PMCID: PMC4443539 DOI: 10.1186/s13059-015-0656-7] [Received: 11/20/2014] [Accepted: 04/15/2015]
Abstract
Background: Protein domains are commonly used to assess the functional roles and evolutionary relationships of proteins and protein families. Here, we use the Pfam protein family database to examine a set of candidate partial domains. Pfam protein domains are often thought of as evolutionarily indivisible, structurally compact, units from which larger functional proteins are assembled; however, almost 4% of Pfam27 PfamA domains are shorter than 50% of their family model length, suggesting that more than half of the domain is missing at those locations. To better understand the structural nature of partial domains in proteins, we examined 30,961 partial domain regions from 136 domain families contained in a representative subset of PfamA domains (RefProtDom2 or RPD2).
Results: We characterized three types of apparent partial domains: split domains, bounded partials, and unbounded partials. We find that bounded partial domains are over-represented in eukaryotes and in lower quality protein predictions, suggesting that they often result from inaccurate genome assemblies or gene models. We also find that a large percentage of unbounded partial domains produce long alignments, which suggests that their annotation as a partial is an alignment artifact; yet some can be found as partials in other sequence contexts.
Conclusions: Partial domains are largely the result of alignment and annotation artifacts and should be viewed with caution. The presence of partial domain annotations in proteins should raise the concern that the prediction of the protein's gene may be incomplete. In general, protein domains can be considered the structural building blocks of proteins.
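The 50% criterion mentioned in the abstract can be illustrated with a short sketch. It assumes that a domain hit is described by 1-based, inclusive start and end coordinates on the family model plus the total model length; the function names and the coordinate convention are assumptions for illustration and are not taken from the paper.

```python
def model_coverage(model_start: int, model_end: int, model_length: int) -> float:
    """Fraction of the family model covered by one domain hit.

    Assumes 1-based, inclusive model coordinates (hypothetical convention).
    """
    if not 1 <= model_start <= model_end <= model_length:
        raise ValueError("coordinates must satisfy 1 <= start <= end <= model_length")
    return (model_end - model_start + 1) / model_length


def is_partial_domain(model_start: int, model_end: int, model_length: int,
                      threshold: float = 0.5) -> bool:
    """Flag a hit as partial when it covers less than `threshold` of the model."""
    return model_coverage(model_start, model_end, model_length) < threshold


# A hit covering model positions 1..40 of a 100-position model covers 40% of it.
flagged = is_partial_domain(1, 40, 100)
```

A hit covering exactly half of the model is not flagged here, since the abstract describes partials as "shorter than 50%" of the model length.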
Affiliation(s)
- Deborah A Triant
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Box 800733, Charlottesville, VA, 22908, USA.
- William R Pearson
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Box 800733, Charlottesville, VA, 22908, USA.

13. Expression and functional activity of neurotransmitter system components in sea urchins' early development. ZYGOTE 2015; 24:206-18. [PMID: 25920999 DOI: 10.1017/s0967199415000040]
Abstract
Reverse-transcription polymerase chain reaction (RT-PCR) investigation of the expression of the components supposedly taking part in serotonin regulation of the early development of Paracentrotus lividus has shown the presence of transcripts of five receptors, one of which has conserved amino acid residues characteristic of monoaminergic receptors. At the early stages of embryogenesis, expression of the serotonin transporter (SERT) and noradrenaline transporter (NET) was also detected. The activities of the enzymes of serotonin synthesis and of the serotonin transporter were shown using immunohistochemistry and incubation with para-chlorophenylalanine (PCPA) and 5-hydroxytryptophan (HTP). Pharmacological experiments have shown a preferential cytostatic activity of ligands characterized as mammalian 5-hydroxytryptamine (5-HT)1-antagonists. On the basis of the sum of the data from molecular biology and embryo physiological experiments, it is suggested that metabotropic serotonin receptors and membrane transporters take part in the regulatory processes of early sea urchin embryogenesis.
|
14
|
Yu JF, Guo J, Liu QB, Hou Y, Xiao K, Chen QL, Wang JH, Sun X. A hybrid strategy for comprehensive annotation of the protein coding genes in prokaryotic genome. Genes Genomics 2015. [DOI: 10.1007/s13258-014-0263-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/28/2022]
|
15
|
Chowanadisai W. Comparative genomic analysis of slc39a12/ZIP12: insight into a zinc transporter required for vertebrate nervous system development. PLoS One 2014; 9:e111535. [PMID: 25375179 PMCID: PMC4222902 DOI: 10.1371/journal.pone.0111535] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Received: 11/15/2013] [Accepted: 10/04/2014] [Indexed: 01/23/2023] Open
Abstract
The zinc transporter ZIP12, which is encoded by the gene slc39a12, has previously been shown to be important for neuronal differentiation in mouse Neuro-2a neuroblastoma cells and primary mouse neurons and necessary for neurulation during Xenopus tropicalis embryogenesis. However, relatively little is known about the biochemical properties, cellular regulation, or the physiological role of this gene. The hypothesis that ZIP12 is a zinc transporter important for nervous system function and development guided a comparative genetics approach to uncover the presence of ZIP12 in various genomes and identify conserved sequences and expression patterns associated with ZIP12. Ortholog detection of slc39a12 was conducted with reciprocal BLAST hits with the amino acid sequence of human ZIP12 in comparison to the human paralog ZIP4 and conserved local synteny between genomes. ZIP12 is present in the genomes of almost all vertebrates examined, from humans and other mammals to most teleost fish. However, ZIP12 appears to be absent from the zebrafish genome. The discrimination of ZIP12 compared to ZIP4 was unsuccessful or inconclusive in other invertebrate chordates and deuterostomes. Splice variation, due to the inclusion or exclusion of a conserved exon, is present in humans, rats, and cows and likely has biological significance. ZIP12 also possesses many putative di-leucine and tyrosine motifs often associated with intracellular trafficking, which may control cellular zinc uptake activity through the localization of ZIP12 within the cell. These findings highlight multiple aspects of ZIP12 at the biochemical, cellular, and physiological levels with likely biological significance. ZIP12 appears to have conserved function as a zinc uptake transporter in vertebrate nervous system development. Consequently, the role of ZIP12 may be an important link to reported congenital malformations in numerous animal models and humans that are caused by zinc deficiency.
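The reciprocal-BLAST step mentioned in the abstract above can be sketched as follows. This is a minimal illustration of reciprocal best hits only, assuming hits are already parsed into (query, subject, bit score) tuples; the sequence identifiers and scores are invented, and the conserved-synteny check the abstract also describes is not modelled here.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, str, float]  # (query id, subject id, bit score)


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each query to its highest-scoring subject."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit],
                         b_vs_a: Iterable[Hit]) -> List[Tuple[str, str]]:
    """Return pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical identifiers and scores, for illustration only.
human_vs_other = [
    ("hs_ZIP12", "x_geneA", 900.0),
    ("hs_ZIP12", "x_geneB", 400.0),
    ("hs_ZIP4", "x_geneB", 850.0),
]
other_vs_human = [
    ("x_geneA", "hs_ZIP12", 880.0),
    ("x_geneA", "hs_ZIP4", 390.0),
    ("x_geneB", "hs_ZIP4", 860.0),
]
print(reciprocal_best_hits(human_vs_other, other_vs_human))
```

Including the paralog (here the hypothetical `hs_ZIP4` query) in the same search is what lets the reciprocal test separate a candidate ZIP12 ortholog from a ZIP4-like gene.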
Affiliation(s)
- Winyoo Chowanadisai
- Department of Nutrition, University of California Davis, Davis, California, United States of America
|
16
|
Zimin AV, Cornish AS, Maudhoo MD, Gibbs RM, Zhang X, Pandey S, Meehan DT, Wipfler K, Bosinger SE, Johnson ZP, Tharp GK, Marçais G, Roberts M, Ferguson B, Fox HS, Treangen T, Salzberg SL, Yorke JA, Norgren RB. A new rhesus macaque assembly and annotation for next-generation sequencing analyses. Biol Direct 2014; 9:20. [PMID: 25319552 PMCID: PMC4214606 DOI: 10.1186/1745-6150-9-20] [Citation(s) in RCA: 136] [Impact Index Per Article: 13.6] [Received: 07/18/2014] [Accepted: 10/03/2014] [Indexed: 12/13/2022] Open
Abstract
Background The rhesus macaque (Macaca mulatta) is a key species for advancing biomedical research. Like all draft mammalian genomes, the draft rhesus assembly (rheMac2) has gaps, sequencing errors and misassemblies that have prevented automated annotation pipelines from functioning correctly. Another rhesus macaque assembly, CR_1.0, is also available but is substantially more fragmented than rheMac2 with smaller contigs and scaffolds. Annotations for these two assemblies are limited in completeness and accuracy. High quality assembly and annotation files are required for a wide range of studies including expression, genetic and evolutionary analyses. Results We report a new de novo assembly of the rhesus macaque genome (MacaM) that incorporates both the original Sanger sequences used to assemble rheMac2 and new Illumina sequences from the same animal. MacaM has a weighted average (N50) contig size of 64 kilobases, more than twice the size of the rheMac2 assembly and almost five times the size of the CR_1.0 assembly. The MacaM chromosome assembly incorporates information from previously unutilized mapping data and preliminary annotation of scaffolds. Independent assessment of the assemblies using Ion Torrent read alignments indicates that MacaM is more complete and accurate than rheMac2 and CR_1.0. We assembled messenger RNA sequences from several rhesus tissues into transcripts which allowed us to identify a total of 11,712 complete proteins representing 9,524 distinct genes. Using a combination of our assembled rhesus macaque transcripts and human transcripts, we annotated 18,757 transcripts and 16,050 genes with complete coding sequences in the MacaM assembly. Further, we demonstrate that the new annotations provide greatly improved accuracy as compared to the current annotations of rheMac2. Finally, we show that the MacaM genome provides an accurate resource for alignment of reads produced by RNA sequence expression studies. 
Conclusions The MacaM assembly and annotation files provide a substantially more complete and accurate representation of the rhesus macaque genome than rheMac2 or CR_1.0 and will serve as an important resource for investigators conducting next-generation sequencing studies with nonhuman primates. Reviewers This article was reviewed by Dr. Lutz Walter, Dr. Soojin Yi and Dr. Kateryna Makova.
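The N50 statistic quoted in the abstract above is a length-weighted summary of contig sizes. A minimal sketch, using the usual definition (the length of the contig at which the running total of contigs sorted from longest to shortest first reaches half of the assembly) and made-up contig lengths rather than any MacaM data:

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Return the N50 of a set of contig (or scaffold) lengths.

    Contigs are sorted from longest to shortest; N50 is the length of the
    contig at which the cumulative length first reaches at least half of
    the total assembly length.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Illustrative lengths only (total 290; half is 145; 80 + 70 = 150).
print(n50([80, 70, 50, 40, 30, 20]))
```

Because the statistic is weighted by length, a few long contigs raise N50 far more than many short ones, which is why it is preferred to a plain mean when comparing assemblies.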
Affiliation(s)
- Robert B Norgren
- Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, Nebraska 68198, USA.
|
17
|
Yoder AD, Chan LM, dos Reis M, Larsen PA, Campbell CR, Rasoloarison R, Barrett M, Roos C, Kappeler P, Bielawski J, Yang Z. Molecular evolutionary characterization of a V1R subfamily unique to strepsirrhine primates. Genome Biol Evol 2014; 6:213-27. [PMID: 24398377 PMCID: PMC3914689 DOI: 10.1093/gbe/evu006] [Citation(s) in RCA: 67] [Impact Index Per Article: 6.7] [Indexed: 11/14/2022] Open
Abstract
Vomeronasal receptor genes have frequently been invoked as integral to the establishment and maintenance of species boundaries among mammals due to the elaborate one-to-one correspondence between semiochemical signals and neuronal sensory inputs. Here, we report the most extensive sample of vomeronasal receptor class 1 (V1R) sequences ever generated for a diverse yet phylogenetically coherent group of mammals, the tooth-combed primates (suborder Strepsirrhini). Phylogenetic analysis confirms our intensive sampling from a single V1R subfamily, apparently unique to the strepsirrhine primates. We designate this subfamily as V1Rstrep. The subfamily retains extensive repertoires of gene copies that descend from an ancestral gene duplication that appears to have occurred prior to the diversification of all lemuriform primates excluding the basal genus Daubentonia (the aye-aye). We refer to the descendent clades as V1Rstrep-α and V1Rstrep-β. Comparison of the two clades reveals different amino acid compositions corresponding to the predicted ligand-binding site and thus potentially to altered functional profiles between the two. In agreement with previous studies of the mouse lemur (genus, Microcebus), the majority of V1Rstrep gene copies appear to be intact and under strong positive selection, particularly within transmembrane regions. Finally, despite the surprisingly high number of gene copies identified in this study, it is nonetheless probable that V1R diversity remains underestimated in these nonmodel primates and that complete characterization will be limited until high-coverage assembled genomes are available.
|
18
|
Gotoh O, Morita M, Nelson DR. Assessment and refinement of eukaryotic gene structure prediction with gene-structure-aware multiple protein sequence alignment. BMC Bioinformatics 2014; 15:189. [PMID: 24927652 PMCID: PMC4065584 DOI: 10.1186/1471-2105-15-189] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Received: 03/19/2014] [Accepted: 06/09/2014] [Indexed: 03/29/2024] Open
Abstract
Background Accurate computational identification of eukaryotic gene organization is a long-standing problem. Despite the fundamental importance of precise annotation of genes encoded in newly sequenced genomes, the accuracy of predicted gene structures has not been critically evaluated, mostly due to the scarcity of proper assessment methods. Results We present a gene-structure-aware multiple sequence alignment method for gene prediction using amino acid sequences translated from homologous genes from many genomes. The approach provides rich information concerning the reliability of each predicted gene structure. We have also devised an iterative method that attempts to improve the structures of suspiciously predicted genes based on a spliced alignment algorithm using consensus sequences or reliable homologs as templates. Application of our methods to cytochrome P450 and ribosomal proteins from 47 plant genomes indicated that 50-60% of the annotated gene structures are likely to contain some defects. Whereas more than half of the defect-containing genes may be intrinsically broken, i.e. they are pseudogenes or gene fragments, located in unfinished sequencing areas, or corresponding to non-productive isoforms, the defects found in a majority of the remaining gene candidates can be remedied by our iterative refinement method. Conclusions Refinement of eukaryotic gene structures mediated by gene-structure-aware multiple protein sequence alignment is a useful strategy to dramatically improve the overall prediction quality of a set of homologous genes. Our method will be applicable to various families of protein-coding genes if their domain structures are evolutionarily stable. It is also feasible to apply our method to gene families from all kingdoms of life, not just plants.
Affiliation(s)
- Osamu Gotoh
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), Koto-ku, Tokyo 135-0064, Japan.
|
19
|
Guo FB, Lin Y, Chen LL. Recognition of Protein-coding Genes Based on Z-curve Algorithms. Curr Genomics 2014; 15:95-103. [PMID: 24822027 PMCID: PMC4009845 DOI: 10.2174/1389202915999140328162724] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 09/15/2013] [Revised: 11/19/2013] [Accepted: 11/20/2013] [Indexed: 01/18/2023] Open
Abstract
Recognition of protein-coding genes, a classical bioinformatics issue, is an absolutely needed step for annotating newly sequenced genomes. The Z-curve algorithm, as one of the most effective methods on this issue, has been successfully applied in annotating or re-annotating many genomes, including those of bacteria, archaea and viruses. Two Z-curve based ab initio gene-finding programs have been developed: ZCURVE (for bacteria and archaea) and ZCURVE_V (for viruses and phages). ZCURVE_C (for 57 bacteria) and Zfisher (for any bacterium) are web servers for re-annotation of bacterial and archaeal genomes. The above four tools can be used for genome annotation or re-annotation, either independently or combined with the other gene-finding programs. In addition to recognizing protein-coding genes and exons, Z-curve algorithms are also effective in recognizing promoters and translation start sites. Here, we summarize the applications of Z-curve algorithms in gene finding and genome annotation.
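The abstract above does not spell out the transform itself. As a hedged illustration, the sketch below computes the standard cumulative Z-curve coordinates of a DNA sequence, where A_n, C_n, G_n and T_n are base counts in the first n positions: x_n = (A_n + G_n) - (C_n + T_n), y_n = (A_n + C_n) - (G_n + T_n), z_n = (A_n + T_n) - (C_n + G_n). The function name is hypothetical, and the gene finders named in the abstract build further features on top of this representation that are not shown here.

```python
from typing import List, Tuple


def z_curve(sequence: str) -> List[Tuple[int, int, int]]:
    """Return the cumulative Z-curve points (x_n, y_n, z_n) for n = 1..len(sequence).

    x separates purines (A, G) from pyrimidines (C, T),
    y separates amino bases (A, C) from keto bases (G, T),
    z separates weak (A, T) from strong (C, G) hydrogen-bonding bases.
    """
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    points: List[Tuple[int, int, int]] = []
    for base in sequence.upper():
        if base not in counts:
            raise ValueError(f"unsupported base: {base!r}")
        counts[base] += 1
        a, c, g, t = counts["A"], counts["C"], counts["G"], counts["T"]
        points.append((a + g - c - t, a + c - g - t, a + t - c - g))
    return points


# Toy sequence, for illustration only.
print(z_curve("ACGT"))
```

The three coordinates carry the same information as the original sequence, so the curve can be inverted back to the base string; that one-to-one mapping is what makes it usable as a sequence feature.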
Affiliation(s)
- Feng-Biao Guo
- Center of Bioinformatics and Key Laboratory for NeuroInformation of the Ministry of Education, University of Electronic Science and Technology of China, Chengdu, 610054, China
- Yan Lin
- Department of Physics, Tianjin University, Tianjin 300072, China
- Ling-Ling Chen
- College of Life Science and Technology, Huazhong Agricultural University, Wuhan, 430070, China
|
20
|
Khenoussi W, Vanhoutrève R, Poch O, Thompson JD. SIBIS: a Bayesian model for inconsistent protein sequence estimation. Bioinformatics 2014; 30:2432-9. [PMID: 24825613 DOI: 10.1093/bioinformatics/btu329] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION The prediction of protein coding genes is a major challenge that depends on the quality of genome sequencing, the accuracy of the model used to elucidate the exonic structure of the genes and the complexity of the gene splicing process leading to different protein variants. As a consequence, today's protein databases contain a huge amount of inconsistency, due to both natural variants and sequence prediction errors. RESULTS We have developed a new method, called SIBIS, to detect such inconsistencies based on the evolutionary information in multiple sequence alignments. A Bayesian framework, combined with Dirichlet mixture models, is used to estimate the probability of observing specific amino acids and to detect inconsistent or erroneous sequence segments. We evaluated the performance of SIBIS on a reference set of protein sequences with experimentally validated errors and showed that the sensitivity is significantly higher than previous methods, with only a small loss of specificity. We also assessed a large set of human sequences from the UniProt database and found evidence of inconsistency in 48% of the previously uncharacterized sequences. We conclude that the integration of quality control methods like SIBIS in automatic analysis pipelines will be critical for the robust inference of structural, functional and phylogenetic information from these sequences. AVAILABILITY AND IMPLEMENTATION Source code, implemented in C on a linux system, and the datasets of protein sequences are freely available for download at http://www.lbgi.fr/~julie/SIBIS.
Affiliation(s)
- Walyd Khenoussi
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle, Strasbourg, F-67085, France
- Renaud Vanhoutrève
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle, Strasbourg, F-67085, France
- Olivier Poch
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle, Strasbourg, F-67085, France
- Julie D Thompson
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle, Strasbourg, F-67085, France
|
21
|
Nagy A, Patthy L. FixPred: a resource for correction of erroneous protein sequences. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2014; 2014:bau032. [PMID: 24705206 PMCID: PMC3975993 DOI: 10.1093/database/bau032] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 11/29/2022]
Abstract
Protein databases are heavily contaminated with erroneous (mispredicted, abnormal and incomplete) sequences and these erroneous data significantly distort the conclusions drawn from genome-scale protein sequence analyses. In our earlier work we described the MisPred resource that serves to identify erroneous sequences; here we present the FixPred computational pipeline that automatically corrects sequences identified by MisPred as erroneous. The current version of the associated FixPred database contains corrected UniProtKB/Swiss-Prot and NCBI/RefSeq sequences from Homo sapiens, Mus musculus, Rattus norvegicus, Monodelphis domestica, Gallus gallus, Xenopus tropicalis, Danio rerio, Fugu rubripes, Ciona intestinalis, Branchiostoma floridae, Drosophila melanogaster and Caenorhabditis elegans; future releases of the FixPred database will include corrected sequences of additional Metazoan species. The FixPred computational pipeline and database (http://www.fixpred.com) are easily accessible through a simple web interface coupled to a powerful query engine and a standard web service. The content is completely or partially downloadable in a variety of formats. Database URL: http://www.fixpred.com
Affiliation(s)
- László Patthy
- *Corresponding author: Tel: +361 279 3100; Fax: +361 466 5465
|
22
|
Nagy A, Patthy L. MisPred: a resource for identification of erroneous protein sequences in public databases. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2013; 2013:bat053. [PMID: 23864220 PMCID: PMC3713709 DOI: 10.1093/database/bat053] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Indexed: 11/12/2022]
Abstract
Correct prediction of the structure of protein-coding genes of higher eukaryotes is still a difficult task; therefore, public databases are heavily contaminated with mispredicted sequences. The high rate of misprediction has serious consequences because it significantly affects the conclusions that may be drawn from genome-scale sequence analyses of eukaryotic genomes. Here we present the MisPred database and computational pipeline that provide efficient means for the identification of erroneous sequences in public databases. The MisPred database contains a collection of abnormal, incomplete and mispredicted protein sequences from 19 metazoan species identified as erroneous by MisPred quality control tools in the UniProtKB/Swiss-Prot, UniProtKB/TrEMBL, NCBI/RefSeq and EnsEMBL databases. Major releases of the database are automatically generated and updated regularly. The database (http://www.mispred.com) is easily accessible through a simple web interface coupled to a powerful query engine and a standard web service. The content is completely or partially downloadable in a variety of formats. DATABASE URL: http://www.mispred.com.
Affiliation(s)
- Alinda Nagy
- Institute of Enzymology, Research Centre for Natural Sciences, Hungarian Academy of Sciences, H-1113 Budapest, Hungary
|
23
|
Light S, Elofsson A. The impact of splicing on protein domain architecture. Curr Opin Struct Biol 2013; 23:451-8. [PMID: 23562110 DOI: 10.1016/j.sbi.2013.02.013] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Received: 01/14/2013] [Revised: 02/22/2013] [Accepted: 02/28/2013] [Indexed: 10/27/2022]
Abstract
Many proteins are composed of protein domains, functional units of common descent. Multidomain forms are common in all eukaryotes making up more than half of the proteome and the evolution of novel domain architecture has been accelerated in metazoans. It is also becoming increasingly clear that alternative splicing is prevalent among vertebrates. Given that protein domains are defined as structurally, functionally and evolutionarily distinct units, one may speculate that some alternative splicing events may lead to clean excisions of protein domains, thus generating a number of different domain architectures from one gene template. However, recent findings indicate that smaller alternative splicing events, in particular in disordered regions, might be more prominent than domain architectural changes. The problem of identifying protein isoforms is, however, still not resolved. Clearly, many splice forms identified through detection of mRNA sequences appear to produce 'nonfunctional' proteins, such as proteins with missing internal secondary structure elements. Here, we review the state of the art methods for identification of functional isoforms and present a summary of what is known, thus far, about alternative splicing with regard to protein domain architectures.
Affiliation(s)
- Sara Light
- Science for Life Laboratory, Stockholm University, Box 1031 SE-171 21 Solna, Sweden
|
24
|
Abrusán G, Szilágyi A, Zhang Y, Papp B. Turning gold into 'junk': transposable elements utilize central proteins of cellular networks. Nucleic Acids Res 2013; 41:3190-200. [PMID: 23341038 PMCID: PMC3597677 DOI: 10.1093/nar/gkt011] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Indexed: 01/19/2023] Open
Abstract
The numerous discovered cases of domesticated transposable element (TE) proteins led to the recognition that TEs are a significant source of evolutionary innovation. However, much less is known about the reverse process, whether and to what degree the evolution of TEs is influenced by the genome of their hosts. We addressed this issue by searching for cases of incorporation of host genes into the sequence of TEs and examined the systems-level properties of these genes using the Saccharomyces cerevisiae and Drosophila melanogaster genomes. We identified 51 cases where the evolutionary scenario was the incorporation of a host gene fragment into a TE consensus sequence, and we show that both the yeast and fly homologues of the incorporated protein sequences have central positions in the cellular networks. An analysis of selective pressure (Ka/Ks ratio) detected significant selection in 37% of the cases. Recent research on retrovirus-host interactions shows that virus proteins preferentially target hubs of the host interaction networks enabling them to take over the host cell using only a few proteins. We propose that TEs face a similar evolutionary pressure to evolve proteins with high interacting capacities and take some of the necessary protein domains directly from their hosts.
Affiliation(s)
- György Abrusán
- Synthetic and Systems Biology Unit, Institute of Biochemistry, Biological Research Center of the Hungarian Academy of Sciences, Temesváry krt. 62. Szeged H-6701, Hungary.
|
25
|
Abstract
The study of nonhuman primates (NHP) is key to understanding human evolution, in addition to being an important model for biomedical research. NHPs are especially important for translational medicine. There are now exciting opportunities to greatly increase the utility of these models by incorporating Next Generation (NextGen) sequencing into study design. Unfortunately, the draft status of nonhuman genomes greatly constrains what can currently be accomplished with available technology. Although all genomes contain errors, draft assemblies and annotations contain so many mistakes that they make currently available nonhuman primate genomes misleading to investigators conducting evolutionary studies; and these genomes are of insufficient quality to serve as references for NextGen studies. Fortunately, NextGen sequencing can be used in the production of greatly improved genomes. Existing Sanger sequences can be supplemented with NextGen whole genome, and exomic genomic sequences to create new, more complete and correct assemblies. Additional physical mapping, and an incorporation of information about gene structure, can be used to improve assignment of scaffolds to chromosomes. In addition, mRNA-sequence data can be used to economically acquire transcriptome information, which can be used for annotation. Some highly polymorphic and complex regions, for example MHC class I and immunoglobulin loci, will require extra effort to properly assemble and annotate. However, for the vast majority of genes, a modest investment in money, and a somewhat greater investment in time, can greatly improve assemblies and annotations sufficient to produce true, reference grade nonhuman primate genomes. Such resources can reasonably be expected to transform nonhuman primate research.
Affiliation(s)
- Robert B. Norgren
- Address correspondence and reprint requests to Dr. Robert B. Norgren, Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, 985805 Nebraska Medical Center, Omaha, NE 68198.
|
26
|
Doolittle RF, McNamara K, Lin K. Correlating structure and function during the evolution of fibrinogen-related domains. Protein Sci 2012; 21:1808-23. [PMID: 23076991 PMCID: PMC3575912 DOI: 10.1002/pro.2177] [Citation(s) in RCA: 52] [Impact Index Per Article: 4.3] [Received: 08/30/2012] [Revised: 10/04/2012] [Accepted: 10/05/2012] [Indexed: 12/29/2022]
Abstract
Fibrinogen-related domains (FReDs) are found in a variety of animal proteins with widely different functions, ranging from non-self recognition to clot formation. All appear to have a common surface where binding of one sort or other occurs. An examination of 19 completed animal genomes--including a sponge and sea anemone, six protostomes, and 11 deuterostomes--has allowed phylogenies to be constructed that show where various types of FReP (proteins containing FReDs) first made their appearance. Comparisons of sequences and structures also reveal particular features that correlate with function, including the influence of neighbor-domains. A particular set of insertions in the carboxyl-terminal subdomain was involved in the transition from structures known to bind sugars to those known to bind amino-terminal peptides. Perhaps not unexpectedly, FReDs with different functions have changed at different rates, with ficolins by far the fastest changing group. Significantly, the greatest amount of change in ficolin FReDs occurs in the third subdomain ("P domain"), the very opposite of the situation in most other vertebrate FReDs. The unbalanced style of change was also observed in FReDs from non-chordates, many of which have been implicated in innate immunity.
Affiliation(s)
- Russell F Doolittle
- Department of Chemistry & Biochemistry, University of California, San Diego, La Jolla, California 92093-0314, USA.
|
27
|
Wang Q, Lei Y, Xu X, Wang G, Chen LL. Theoretical prediction and experimental verification of protein-coding genes in plant pathogen genome Agrobacterium tumefaciens strain C58. PLoS One 2012; 7:e43176. [PMID: 22984411 PMCID: PMC3439454 DOI: 10.1371/journal.pone.0043176] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Received: 02/12/2012] [Accepted: 07/18/2012] [Indexed: 11/19/2022] Open
Abstract
Agrobacterium tumefaciens strain C58 is a Gram-negative soil bacterium capable of inducing tumors (crown galls) on many dicotyledonous plants. The genome of A. tumefaciens strain C58 was re-annotated based on the Z-curve method. First, all the ‘hypothetical genes’ were re-identified, and 29 originally annotated ‘hypothetical genes’ were recognized to be non-coding open reading frames (ORFs). Theoretical evidence obtained from principal component analysis, clusters of orthologous groups of proteins occupation, and average length distribution showed that these non-coding ORFs were highly unlikely to encode proteins. Results from the reverse transcription-polymerase chain reaction (RT-PCR) experiments on three different growth stages of A. tumefaciens C58 confirmed that 23 (79%) of the identified non-coding ORFs have no transcripts in these growth stages. In addition, using theoretical prediction, 19 potential protein-coding genes were predicted to be new protein-coding genes. Fifteen (79%) of these genes were verified with RT-PCR experiments. The RT-PCR experimental results confirmed the reliability of our theoretical prediction, indicating that false-positive prediction and missing genes always exist in the annotation of A. tumefaciens C58 genome. The improved annotation will serve as a valuable resource for the research of the lifestyle, metabolism, and pathogenicity of A. tumefaciens C58. The re-annotation of A. tumefaciens C58 can be obtained from http://211.69.128.148/Atum/.
Affiliation(s)
- Qian Wang
- State Key Laboratory of Agricultural Microbiology, College of Life Science and Technology, Huazhong Agricultural University, Wuhan, People's Republic of China
- Yang Lei
- State Key Laboratory of Agricultural Microbiology, College of Life Science and Technology, Huazhong Agricultural University, Wuhan, People's Republic of China
- Center for Bioinformatics, Huazhong Agricultural University, Wuhan, People's Republic of China
- Xiwen Xu
- State Key Laboratory of Agricultural Microbiology, College of Life Science and Technology, Huazhong Agricultural University, Wuhan, People's Republic of China
- Center for Bioinformatics, Huazhong Agricultural University, Wuhan, People's Republic of China
- Gejiao Wang
- State Key Laboratory of Agricultural Microbiology, College of Life Science and Technology, Huazhong Agricultural University, Wuhan, People's Republic of China
- * E-mail: (WG); (LLC)
- Ling-Ling Chen
- State Key Laboratory of Agricultural Microbiology, College of Life Science and Technology, Huazhong Agricultural University, Wuhan, People's Republic of China
- Center for Bioinformatics, Huazhong Agricultural University, Wuhan, People's Republic of China
- * E-mail: (WG); (LLC)
|
28
|
Zhang X, Goodsell J, Norgren RB. Limitations of the rhesus macaque draft genome assembly and annotation. BMC Genomics 2012; 13:206. [PMID: 22646658 PMCID: PMC3426473 DOI: 10.1186/1471-2164-13-206] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.0] [Received: 12/06/2011] [Accepted: 05/30/2012] [Indexed: 11/30/2022] Open
Abstract
Finished genome sequences and assemblies are available for only a few vertebrates. Thus, investigators studying many species must rely on draft genomes. Using the rhesus macaque as an example, we document the effects of sequencing errors, gaps in sequence and misassemblies on one automated gene model pipeline, Gnomon. The combination of draft genome with automated gene finding software can result in spurious sequences. We estimate that approximately 50% of the rhesus gene models are missing, incomplete or incorrect. The problems identified in this work likely apply to all draft vertebrate genomes annotated with any automated gene model pipeline and thus represent a pervasive challenge to the analysis of draft genomes.
Collapse
Affiliation(s)
- Xiongfei Zhang
- Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, NE 68198, USA
|
29
|
Guo B, Zou M, Wagner A. Pervasive indels and their evolutionary dynamics after the fish-specific genome duplication. Mol Biol Evol 2012; 29:3005-22. [PMID: 22490820 DOI: 10.1093/molbev/mss108] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.8] [Indexed: 01/18/2023]
Abstract
Insertions and deletions (indels) in protein-coding genes are important sources of genetic variation. Their role in creating new proteins may be especially important after gene duplication. However, little is known about how indels affect the divergence of duplicate genes. We here study thousands of duplicate genes in five fish (teleost) species with completely sequenced genomes. The ancestor of these species has been subject to a fish-specific genome duplication (FSGD) event that occurred approximately 350 Ma. We find that duplicate genes contain at least 25% more indels than single-copy genes. These indels accumulated preferentially in the first 40 my after the FSGD. A lack of widespread asymmetric indel accumulation indicates that both members of a duplicate gene pair typically experience relaxed selection. Strikingly, we observe a 30-80% excess of deletions over insertions that is consistent for indels of various lengths and across the five genomes. We also find that indels preferentially accumulate inside loop regions of protein secondary structure and in regions where amino acids are exposed to solvent. We show that duplicate genes with high indel density also show high DNA sequence divergence. Indel density, but not amino acid divergence, can explain a large proportion of the tertiary structure divergence between proteins encoded by duplicate genes. Our observations are consistent across all five fish species. Taken together, they suggest a general pattern of duplicate gene evolution in which indels are important driving forces of evolutionary change.
Affiliation(s)
- Baocheng Guo
- Institute of Evolutionary Biology and Environmental Studies, University of Zurich, Zurich, Switzerland
|
30
|
Prosdocimi F, Linard B, Pontarotti P, Poch O, Thompson JD. Controversies in modern evolutionary biology: the imperative for error detection and quality control. BMC Genomics 2012; 13:5. [PMID: 22217008 PMCID: PMC3311146 DOI: 10.1186/1471-2164-13-5] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.8] [Received: 06/30/2011] [Accepted: 01/04/2012] [Indexed: 12/03/2022]
Abstract
Background The data from high throughput genomics technologies provide unique opportunities for studies of complex biological systems, but also pose many new challenges. The shift to the genome scale in evolutionary biology, for example, has led to many interesting, but often controversial studies. It has been suggested that part of the conflict may be due to errors in the initial sequences. Most gene sequences are predicted by bioinformatics programs and a number of quality issues have been raised, concerning DNA sequencing errors or badly predicted coding regions, particularly in eukaryotes. Results We investigated the impact of these errors on evolutionary studies and specifically on the identification of important genetic events. We focused on the detection of asymmetric evolution after duplication, which has been the subject of controversy recently. Using the human genome as a reference, we established a reliable set of 688 duplicated genes in 13 complete vertebrate genomes, where significantly different evolutionary rates are observed. We estimated the rates at which protein sequence errors occur and are accumulated in the higher-level analyses. We showed that the majority of the detected events (57%) are in fact artifacts due to the putative erroneous sequences and that these artifacts are sufficient to mask the true functional significance of the events. Conclusions Initial errors are accumulated throughout the evolutionary analysis, generating artificially high rates of event predictions and leading to substantial uncertainty in the conclusions. This study emphasizes the urgent need for error detection and quality control strategies in order to efficiently extract knowledge from the new genome data.
Affiliation(s)
- Francisco Prosdocimi
- Department of Integrated Structural Biology, IGBMC (Institut de Génétique et de Biologie Moléculaire et Cellulaire) CNRS/INSERM/Université de Strasbourg, 1 rue Laurent Fries, Illkirch, F-67404, France
|
31
|
Polymorphisms in Ly6 genes in Msq1 encoding susceptibility to mouse adenovirus type 1. Mamm Genome 2011; 23:250-8. [PMID: 22101863 DOI: 10.1007/s00335-011-9368-9] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 08/04/2011] [Accepted: 10/20/2011] [Indexed: 12/17/2022]
Abstract
Strain-specific differences in susceptibility to mouse adenovirus type 1 (MAV-1) are linked to the quantitative trait locus Msq1 on mouse chromosome 15. This region contains 14 Ly6 or Ly6-related genes, many of which are known to be expressed on the surface of immune cells, suggesting a possible role in host defense. We analyzed these genes for polymorphisms between MAV-1-susceptible and MAV-1-resistant inbred mouse strains. Sequencing of cDNAs identified 12 coding-region polymorphisms in 2010109I03Rik, Ly6e, Ly6a, Ly6c1, and Ly6c2, six of which were nonsynonymous and five of which were previously unlisted in dbSNP Build 132. We also clarified sequence discrepancies in GenBank for the coding regions of I830127L07Rik and Ly6g. Additionally, Southern blotting revealed size polymorphisms within the DNA regions of Ly6e, Ly6a, and Ly6g. Collectively, these genetic variations have implications for the structure, function, and/or expression of Ly6 and Ly6-related genes that may contribute to the observed strain-specific differences in susceptibility to MAV-1.
|
32
|
Yu JF, Xiao K, Jiang DK, Guo J, Wang JH, Sun X. An integrative method for identifying the over-annotated protein-coding genes in microbial genomes. DNA Res 2011; 18:435-49. [PMID: 21903723 PMCID: PMC3223076 DOI: 10.1093/dnares/dsr030] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Indexed: 11/29/2022]
Abstract
Falsely annotated protein-coding genes have been deemed one of the major causes of annotation errors in public databases. Although many filtering approaches have been designed for over-annotated protein-coding genes, some are questionable because they increase the number of false negatives. Furthermore, there is no webserver or software specifically devised for the problem of over-annotation. In this study, we propose an integrative algorithm for detecting over-annotated protein-coding genes in microorganisms. Overall, an average accuracy of 99.94% is achieved over 61 microbial genomes. The extremely high accuracy indicates that the presented algorithm efficiently differentiates protein-coding genes from non-coding open reading frames. Extensive analyses show that the prediction results are reliable and that the integrative algorithm is robust and convenient. Our analysis also indicates that over-annotated protein-coding genes can cause false positives in the detection of horizontal gene transfers. The webserver of the proposed algorithm is freely accessible at www.cbi.seu.edu.cn/RPGM.
Affiliation(s)
- Jia-Feng Yu
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, China.
|
33
|
Doolittle RF. The protochordate Ciona intestinalis has a protein like full-length vertebrate fibrinogen. J Innate Immun 2011; 4:219-22. [PMID: 21860218 DOI: 10.1159/000329823] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 05/05/2011] [Accepted: 05/31/2011] [Indexed: 11/19/2022]
Abstract
In a recent review, a putative fibrinogen-like protein in the protochordate Ciona intestinalis was noted. Unfortunately, computer-directed splicing had omitted several exons, mistakenly generating a single long polypeptide chain. In fact, 3 consecutive genes exist, the translated versions of which are homologous to individual vertebrate fibrinogen chains. The circulating form is likely a 6-chain covalent dimer, just as occurs in vertebrates.
Affiliation(s)
- Russell F Doolittle
- Department of Chemistry & Biochemistry, University of California, San Diego, Calif., USA.
|
34
|
Reassessing domain architecture evolution of metazoan proteins: major impact of gene prediction errors. Genes (Basel) 2011; 2:449-501. [PMID: 24710207 PMCID: PMC3927609 DOI: 10.3390/genes2030449] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Received: 05/24/2011] [Revised: 06/14/2011] [Accepted: 06/20/2011] [Indexed: 11/17/2022]
Abstract
In view of the fact that appearance of novel protein domain architectures (DA) is closely associated with biological innovations, there is a growing interest in the genome-scale reconstruction of the evolutionary history of the domain architectures of multidomain proteins. In such analyses, however, it is usually ignored that a significant proportion of Metazoan sequences analyzed is mispredicted and that this may seriously affect the validity of the conclusions. To estimate the contribution of errors in gene prediction to differences in DA of predicted proteins, we have used the high quality manually curated UniProtKB/Swiss-Prot database as a reference. For genome-scale analysis of domain architectures of predicted proteins we focused on RefSeq, EnsEMBL and NCBI's GNOMON predicted sequences of Metazoan species with completely sequenced genomes. Comparison of the DA of UniProtKB/Swiss-Prot sequences of worm, fly, zebrafish, frog, chick, mouse, rat and orangutan with those of human Swiss-Prot entries have identified relatively few cases where orthologs had different DA, although the percentage with different DA increased with evolutionary distance. In contrast with this, comparison of the DA of human, orangutan, rat, mouse, chicken, frog, zebrafish, worm and fly RefSeq, EnsEMBL and NCBI's GNOMON predicted protein sequences with those of the corresponding/orthologous human Swiss-Prot entries identified a significantly higher proportion of domain architecture differences than in the case of the comparison of Swiss-Prot entries. Analysis of RefSeq, EnsEMBL and NCBI's GNOMON predicted protein sequences with DAs different from those of their Swiss-Prot orthologs confirmed that the higher rate of domain architecture differences is due to errors in gene prediction, the majority of which could be corrected with our FixPred protocol. 
We have also demonstrated that contamination of databases with incomplete, abnormal or mispredicted sequences introduces a bias in DA differences in as much as it increases the proportion of terminal over internal DA differences. Here we have shown that in the case of RefSeq, EnsEMBL and NCBI's GNOMON predicted protein sequences of Metazoan species, the contribution of gene prediction errors to domain architecture differences of orthologs is comparable to or greater than those due to true gene rearrangements. We have also demonstrated that domain architecture comparison may serve as a useful tool for the quality control of gene predictions and may thus guide the correction of sequence errors. Our findings caution that earlier genome-scale studies based on comparison of predicted (frequently mispredicted) protein sequences may have led to some erroneous conclusions about the evolution of novel domain architectures of multidomain proteins. A reassessment of the DA evolution of orthologous and paralogous proteins is presented in an accompanying paper [1].
|
35
|
D'Angelo S, Velappan N, Mignone F, Santoro C, Sblattero D, Kiss C, Bradbury ARM. Filtering "genic" open reading frames from genomic DNA samples for advanced annotation. BMC Genomics 2011; 12 Suppl 1:S5. [PMID: 21810207 PMCID: PMC3223728 DOI: 10.1186/1471-2164-12-s1-s5] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Indexed: 11/10/2022]
Abstract
Background In order to carry out experimental gene annotation, DNA encoding open reading frames (ORFs) derived from real genes (termed "genic") in the correct frame is required. When genes are correctly assigned, isolation of genic DNA for functional annotation can be carried out by PCR. However, not all genes are correctly assigned, and even when correctly assigned, gene products are often incorrectly folded when expressed in heterologous hosts. This is a problem that can sometimes be overcome by the expression of protein fragments encoding domains, rather than full-length proteins. One possible method to isolate DNA encoding such domains would be to "filter" complex DNA (cDNA libraries, genomic and metagenomic DNA) for gene fragments that confer a selectable phenotype relying on correct folding, with all such domains present in a complex DNA sample, termed the “domainome”. Results In this paper we discuss the preparation of diverse genic ORF libraries from randomly fragmented genomic DNA using β-lactamase to filter out the open reading frames. By cloning DNA fragments between leader sequences and the mature β-lactamase gene, colonies can be selected for resistance to ampicillin, conferred by correct folding of the lactamase gene. Our experiments demonstrate that the majority of surviving colonies contain genic open reading frames, suggesting that β-lactamase is acting as a selectable folding reporter. Furthermore, different leaders (Sec, TAT and SRP), normally translocating different protein classes, filter different genic fragment subsets, indicating that their use increases the fraction of the “domainome” that is accessible. Conclusions The availability of ORF libraries, obtained with the filtering method described here, combined with screening methods such as phage display and protein-protein interaction studies, or with protein structure determination projects, can lead to the identification and structural determination of functional genic ORFs. ORF libraries represent, moreover, a useful tool to proceed towards high-throughput functional annotation of newly sequenced genomes.
Affiliation(s)
- Sara D'Angelo
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, USA
|
36
|
Williams GW, Davis PA, Rogers AS, Bieri T, Ozersky P, Spieth J. Methods and strategies for gene structure curation in WormBase. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2011; 2011:baq039. [PMID: 21543339 PMCID: PMC3092607 DOI: 10.1093/database/baq039] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Indexed: 11/17/2022]
Abstract
The Caenorhabditis elegans genome sequence was published over a decade ago; this was the first published genome of a multi-cellular organism and now the WormBase project has had a decade of experience in curating this genome's sequence and gene structures. In one of its roles as a central repository for nematode biology, WormBase continues to refine the gene structure annotations using sequence similarity and other computational methods, as well as information from the literature- and community-submitted annotations. We describe the various methods of gene structure curation that have been tried by WormBase and the problems associated with each of them. We also describe the current strategy for gene structure curation, and introduce the WormBase ‘curation tool’, which integrates different data sources in order to identify new and correct gene structures. Database URL: http://www.wormbase.org/
Affiliation(s)
- G W Williams
- WormBase Group, The Wellcome Trust Sanger Institute, Hinxton, Cambs, UK.
|
37
|
Hegyi H, Kalmar L, Horvath T, Tompa P. Verification of alternative splicing variants based on domain integrity, truncation length and intrinsic protein disorder. Nucleic Acids Res 2010; 39:1208-19. [PMID: 20972208 PMCID: PMC3045584 DOI: 10.1093/nar/gkq843] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.9] [Indexed: 01/23/2023]
Abstract
According to current estimations, ∼95% of multi-exonic human protein-coding genes undergo alternative splicing (AS). However, for 4000 human proteins in PDB, only 14 human proteins have structures of at least two alternative isoforms. Surveying these structural isoforms revealed that the maximum insertion accommodated by an isoform of a fully ordered protein domain was 5 amino acids; other instances of domain changes involved intrinsic structural disorder. After collecting 505 minor isoforms of human proteins with evidence for their existence, we analyzed their length, protein disorder and exposed hydrophobic surface. We found that strict rules govern the selection of alternative splice variants aimed at preserving the integrity of globular domains: alternative splice sites (i) tend to avoid globular domains or (ii) affect them only marginally or (iii) tend to coincide with a location where the exposed hydrophobic surface is minimal or (iv) the protein is disordered. We also observed an inverse correlation between the domain fraction lost and the full length of the minor isoform containing the domain, possibly indicating a buffering effect for the isoform protein counteracting the domain truncation effect. These observations provide the basis for a prediction method (currently under development) to predict the viability of splice variants.
Affiliation(s)
- Hedi Hegyi
- Institute of Enzymology, Biological Research Center, Hungarian Academy of Sciences, PO Box 7, 1518 Budapest, Hungary.
|
38
|
Assigning biological functions to rice genes by genome annotation, expression analysis and mutagenesis. Biotechnol Lett 2010; 32:1753-63. [PMID: 20703802 DOI: 10.1007/s10529-010-0377-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 06/14/2010] [Accepted: 07/28/2010] [Indexed: 12/17/2022]
Abstract
Rice is the first cereal genome to be completely sequenced. Since the completion of its genome sequencing, considerable progress has been made in multiple areas including whole-genome annotation, gene expression profiling and mutant collection. Here, we summarize the current status of rice genome annotation, review the methodology of assigning biological functions to hundreds of thousands of rice genes, and discuss the major limitations and future perspectives in rice functional genomics. Available data analysis shows that the rice genome encodes around 32,000 protein-coding genes. Expression analysis revealed at least 31,000 genes with expression evidence from full-length cDNA/EST collections or other transcript profiling. In addition, we have summarized various strategies to generate mutant populations, including natural, physical, chemical, T-DNA, transposon/retrotransposon or gene silencing based mutagenesis. Currently, more than 1 million mutants have been generated and 27,551 of them have flanking sequence tags. To assign biological functions to hundreds of thousands of rice genes, global cooperation is required, various genetic resources should be more easily accessible, and diverse data from transcriptomics, proteomics, epigenetics, comparative genomics and bioinformatics should be integrated to better understand the functions of these genes and their regulatory mechanisms.
|
39
|
Temeyer KB, Pruett JH, Olafson PU. Baculovirus expression, biochemical characterization and organophosphate sensitivity of rBmAChE1, rBmAChE2, and rBmAChE3 of Rhipicephalus (Boophilus) microplus. Vet Parasitol 2010; 172:114-21. [DOI: 10.1016/j.vetpar.2010.04.016] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.9] [Received: 03/05/2010] [Revised: 04/08/2010] [Accepted: 04/09/2010] [Indexed: 01/31/2023]
|
40
|
Poptsova MS, Gogarten JP. Using comparative genome analysis to identify problems in annotated microbial genomes. Microbiology (Reading) 2010; 156:1909-1917. [DOI: 10.1099/mic.0.033811-0] [Citation(s) in RCA: 80] [Impact Index Per Article: 5.7] [Indexed: 11/18/2022]
Abstract
Genome annotation is a tedious task that is mostly done by automated methods; however, the accuracy of these approaches has been questioned since the beginning of the sequencing era. Genome annotation is a multilevel process, and errors can emerge at different stages: during sequencing, as a result of gene-calling procedures, and in the process of assigning gene functions. Missed or wrongly annotated genes differentially impact different types of analyses. Here we discuss and demonstrate how the methods of comparative genome analysis can refine annotations by locating missing orthologues. We also discuss possible reasons for errors and show that the second-generation annotation systems, which combine multiple gene-calling programs with similarity-based methods, perform much better than the first annotation tools. Since old errors may propagate to the newly sequenced genomes, we emphasize that the problem of continuously updating popular public databases is an urgent and unresolved one. Due to the progress in genome-sequencing technologies, automated annotation techniques will remain the main approach in the future. Researchers need to be aware of the existing errors in the annotation of even well-studied genomes, such as Escherichia coli, and consider additional quality control for their results.
Affiliation(s)
- Maria S. Poptsova
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT 06269-3125, USA
- J. Peter Gogarten
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT 06269-3125, USA
|
41
|
Rouchka EC. Database of exact tandem repeats in the Zebrafish genome. BMC Genomics 2010; 11:347. [PMID: 20515480 PMCID: PMC2901318 DOI: 10.1186/1471-2164-11-347] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 09/10/2009] [Accepted: 06/01/2010] [Indexed: 11/23/2022]
Abstract
Background Sequencing of the approximately 1.7 billion bases of the zebrafish genome is currently underway. To date, few high resolution genetic maps exist for the zebrafish genome, based mainly on single nucleotide polymorphisms (SNPs) and short microsatellite repeats. The desire to construct a higher resolution genetic map led to the construction of a database of tandemly repeating elements within the zebrafish Zv8 assembly. Description Exact tandem repeats with a repeat length of at least three bases and a copy number of at least 10 were reported. Repeats with a total length of 250 or fewer bases and their flanking regions were masked for known vertebrate repeats. Optimal primer pairs were computationally designed in the regions flanking the detected repeats. This database of exact tandem repeats can then be used as a resource by molecular biologists with interests in experimentally testing VNTRs within a zebrafish population. Conclusions A total of 116,915 repeats with a base length of at least three nucleotides were detected. The longest of these was a 54-base repeat with fourteen tandem copies. A significant number of repeats with a base length of 18, 24, 27 and 30 were detected, many with potentially novel proline-rich coding regions. Detection of exact tandem repeats in the zebrafish genome leads to a wealth of information regarding potential polymorphic sites for VNTRs. The association of many of these repeats with potentially novel yet similar coding regions yields an exciting potential for disease associated genes. A web interface for querying repeats is available at http://bioinformatics.louisville.edu/zebrafish/. This portal allows users to search for repeats of a selected base size from any valid specified region within the 25 linkage groups.
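The repeat criteria stated in this abstract (a repeat unit of at least three bases, present in at least 10 exact tandem copies) can be illustrated with a small sketch. This is not the pipeline used to build the database: the function name `find_exact_tandem_repeats`, the `max_unit` cap, the restriction to uppercase A/C/G/T, and the filter that skips units which are themselves repeats of a shorter motif are all assumptions made for illustration.

```python
import re


def find_exact_tandem_repeats(seq, min_unit=3, max_unit=6, min_copies=10):
    """Return (start, unit, copies) tuples for exact tandem repeats in seq.

    Hypothetical helper: thresholds default to the values quoted in the
    abstract (unit >= 3 bases, >= 10 copies); max_unit is an assumed cap.
    """
    hits = []
    for k in range(min_unit, max_unit + 1):
        # One unit of k bases followed by at least (min_copies - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Assumption: skip units that are repeats of a shorter motif
            # (e.g. "AAA" or "ACGACG"); a string is such a repeat exactly
            # when it occurs inside its own doubling with the ends trimmed.
            if unit in (unit + unit)[1:-1]:
                continue
            copies = len(match.group(0)) // k
            hits.append((match.start(), unit, copies))
    return hits


if __name__ == "__main__":
    example = "TT" + "ACG" * 12 + "TT"
    for start, unit, copies in find_exact_tandem_repeats(example):
        print(start, unit, copies)
```

A regular-expression backreference keeps the sketch short; scanning a whole genome assembly would more plausibly use a dedicated repeat finder, and masking and primer design are out of scope here.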
Affiliation(s)
- Eric C Rouchka
- Department of Computer Engineering and Computer Science, Speed School of Engineering, University of Louisville, Duthie Center, Room 208, Louisville, KY, USA.
|
42
|
Bányai L, Sonderegger P, Patthy L. Agrin binds BMP2, BMP4 and TGFbeta1. PLoS One 2010; 5:e10758. [PMID: 20505824 PMCID: PMC2874008 DOI: 10.1371/journal.pone.0010758] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.6] [Received: 08/10/2009] [Accepted: 05/03/2010] [Indexed: 01/13/2023]
Abstract
The C-terminal 95 kDa fragment of some isoforms of vertebrate agrins is sufficient to induce clustering of acetylcholine receptors, but despite two decades of intense agrin research very little is known about the function of the other isoforms and the function of the larger, N-terminal part of agrins that is common to all isoforms. Since the N-terminal part of agrins contains several follistatin-domains, a domain type that is frequently implicated in binding TGFβs, we have explored the interaction of the N-terminal part of rat agrin (Agrin-Nterm) with members of the TGFβ family using surface plasmon resonance spectroscopy and reporter assays. Here we show that agrin binds BMP2, BMP4 and TGFβ1 with relatively high affinity; the KD values of the interactions calculated from SPR experiments fall in the 10⁻⁸ M to 10⁻⁷ M range. In reporter assays Agrin-Nterm inhibited the activities of BMP2 and BMP4, half maximal inhibition being achieved at ∼5×10⁻⁷ M. Paradoxically, in the case of TGFβ1 Agrin-Nterm caused a slight increase in activity in reporter assays. Our finding that agrin binds members of the TGFβ family may have important implications for the role of these growth factors in the regulation of synaptogenesis as well as for the role of agrin isoforms that are unable to induce clustering of acetylcholine receptors. We suggest that binding of these TGFβ family members to agrin may have a dual function: agrin may serve as a reservoir for these growth factors and may also inhibit their growth promoting activity. Based on analysis of the evolutionary history of agrin we suggest that agrin's growth factor binding function is more ancient than its involvement in acetylcholine receptor clustering.
Affiliation(s)
- László Bányai
- Institute of Enzymology, Biological Research Center, Hungarian Academy of Sciences, Budapest, Hungary
- Peter Sonderegger
- Department of Biochemistry, University of Zurich, Zurich, Switzerland
- László Patthy
- Institute of Enzymology, Biological Research Center, Hungarian Academy of Sciences, Budapest, Hungary
|
43
|
GenePRIMP: a gene prediction improvement pipeline for prokaryotic genomes. Nat Methods 2010; 7:455-7. [PMID: 20436475 DOI: 10.1038/nmeth.1457] [Citation(s) in RCA: 450] [Impact Index Per Article: 32.1] [Received: 01/06/2010] [Accepted: 03/26/2010] [Indexed: 11/09/2022]
Abstract
We present 'gene prediction improvement pipeline' (GenePRIMP; http://geneprimp.jgi-psf.org/), a computational process that performs evidence-based evaluation of gene models in prokaryotic genomes and reports anomalies including inconsistent start sites, missed genes and split genes. We found that manual curation of gene models using the anomaly reports generated by GenePRIMP improved their quality, and demonstrate the applicability of GenePRIMP in improving finishing quality and comparing different genome-sequencing and annotation technologies.
|
44
|
Goudenège D, Avner S, Lucchetti-Miganeh C, Barloy-Hubler F. CoBaltDB: Complete bacterial and archaeal orfeomes subcellular localization database and associated resources. BMC Microbiol 2010; 10:88. [PMID: 20331850 PMCID: PMC2850352 DOI: 10.1186/1471-2180-10-88] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.4] [Received: 12/07/2009] [Accepted: 03/23/2010] [Indexed: 11/10/2022]
Abstract
BACKGROUND The functions of proteins are strongly related to their localization in cell compartments (for example the cytoplasm or membranes) but the experimental determination of the sub-cellular localization of proteomes is laborious and expensive. A fast and low-cost alternative approach is in silico prediction, based on features of the protein primary sequences. However, biologists are confronted with a very large number of computational tools that use different methods that address various localization features with diverse specificities and sensitivities. As a result, exploiting these computer resources to predict protein localization accurately involves querying all tools and comparing every prediction output; this is a painstaking task. Therefore, we developed a comprehensive database, called CoBaltDB, that gathers all prediction outputs concerning complete prokaryotic proteomes. DESCRIPTION The current version of CoBaltDB integrates the results of 43 localization predictors for 784 complete bacterial and archaeal proteomes (2,548,292 proteins in total). CoBaltDB supplies a simple user-friendly interface for retrieving and exploring relevant information about predicted features (such as signal peptide cleavage sites and transmembrane segments). Data are organized into three work-sets ("specialized tools", "meta-tools" and "additional tools"). The database can be queried using the organism name, a locus tag or a list of locus tags and may be browsed using numerous graphical and text displays. CONCLUSIONS With its new functionalities, CoBaltDB is a novel powerful platform that provides easy access to the results of multiple localization tools and support for predicting prokaryotic protein localizations with higher confidence than previously possible. CoBaltDB is available at http://www.umr6026.univ-rennes1.fr/english/home/research/basic/software/cobalten.
Affiliation(s)
- David Goudenège
- CNRS UMR 6026, ICM, Equipe B@SIC, Université de Rennes 1, Campus de Beaulieu, 35042 Rennes, France
|
45
|
Eisenhaber B, Eisenhaber F. Prediction of posttranslational modification of proteins from their amino acid sequence. Methods Mol Biol 2010; 609:365-84. [PMID: 20221930 DOI: 10.1007/978-1-60327-241-4_21] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.0] [Indexed: 12/13/2022]
Abstract
If posttranslational modifications (PTMs) are chemical alterations of the protein primary structure during the protein's life cycle as a result of an enzymatic reaction, then the motif in the substrate protein sequence that is recognized by the enzyme can serve as basis for predictor construction that recognizes PTM sites in database sequences. The recognition motif consists generally of two regions: first, a small, central segment that enters the catalytic cleft of the enzyme and that is specific for this type of PTM and, second, a sequence environment of about 10 or more residues with linker characteristics (a trend for small and polar residues with flexible backbone) on either side of the central part that are needed to provide accessibility of the central segment to the enzyme's catalytic site. In this review, we consider predictors for cleavage of targeting signals, lipid PTMs, phosphorylation, and glycosylation.
Affiliation(s)
- Birgit Eisenhaber
- Experimental Therapeutic Centre, Bioinformatics Institute, Agency for Science, Technology, and Research, Singapore
|
46
|
Kim W, Silby MW, Purvine SO, Nicoll JS, Hixson KK, Monroe M, Nicora CD, Lipton MS, Levy SB. Proteomic detection of non-annotated protein-coding genes in Pseudomonas fluorescens Pf0-1. PLoS One 2009; 4:e8455. [PMID: 20041161 PMCID: PMC2794547 DOI: 10.1371/journal.pone.0008455] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.1] [Received: 09/22/2009] [Accepted: 12/02/2009] [Indexed: 11/18/2022]
Abstract
Genome sequences are annotated by computational prediction of coding sequences, followed by similarity searches such as BLAST, which provide a layer of possible functional information. While the existence of processes such as alternative splicing complicates matters for eukaryote genomes, the view of bacterial genomes as a linear series of closely spaced genes leads to the assumption that computational annotations that predict such arrangements completely describe the coding capacity of bacterial genomes. We undertook a proteomic study to identify proteins expressed by Pseudomonas fluorescens Pf0-1 from genes that were not predicted during the genome annotation. Mapping peptides to the Pf0-1 genome sequence identified sixteen non-annotated protein-coding regions, of which nine were antisense to predicted genes, six were intergenic, and one read in the same direction as an annotated gene but in a different frame. The expression of all but one of the newly discovered genes was verified by RT-PCR. Few clues as to the function of the new genes were gleaned from informatic analyses, but potential orthologs in other Pseudomonas genomes were identified for eight of the new genes. The 16 newly identified genes improve the quality of the Pf0-1 genome annotation, and the detection of antisense protein-coding genes indicates the under-appreciated complexity of bacterial genome organization.
Affiliation(s)
- Wook Kim
- Center for Adaptation Genetics and Drug Resistance and Department of Molecular Biology and Microbiology, Tufts University School of Medicine, Boston, Massachusetts, United States of America
- Mark W. Silby
- Center for Adaptation Genetics and Drug Resistance and Department of Molecular Biology and Microbiology, Tufts University School of Medicine, Boston, Massachusetts, United States of America
- Sam O. Purvine
- Pacific Northwest National Laboratory, Richland, Washington, United States of America
- Julie S. Nicoll
- Center for Adaptation Genetics and Drug Resistance and Department of Molecular Biology and Microbiology, Tufts University School of Medicine, Boston, Massachusetts, United States of America
- Kim K. Hixson
- Pacific Northwest National Laboratory, Richland, Washington, United States of America
- Matt Monroe
- Pacific Northwest National Laboratory, Richland, Washington, United States of America
- Carrie D. Nicora
- Pacific Northwest National Laboratory, Richland, Washington, United States of America
- Mary S. Lipton
- Pacific Northwest National Laboratory, Richland, Washington, United States of America
- Stuart B. Levy
- Center for Adaptation Genetics and Drug Resistance and Department of Molecular Biology and Microbiology, Tufts University School of Medicine, Boston, Massachusetts, United States of America
|
47
|
Yang Y, Gilbert D, Kim S. Annotation confidence score for genome annotation: a genome comparison approach. Bioinformatics 2009; 26:22-9. [PMID: 19855104 DOI: 10.1093/bioinformatics/btp613] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/14/2022]
Abstract
MOTIVATION The massively parallel sequencing technology can be used by small research labs to generate genome sequences of their research interest. However, annotation of genomes still relies on the manual process, which becomes a serious bottleneck to the high-throughput genome projects. Recently, automatic annotation methods are increasingly more accurate, but there are several issues. One important challenge in using automatic annotation methods is to distinguish annotation quality of ORFs or genes. The availability of such annotation quality of genes can reduce the human labor cost dramatically since manual inspection can focus only on genes with low-annotation quality scores. RESULTS In this article, we propose a novel annotation quality or confidence scoring scheme, called Annotation Confidence Score (ACS), using a genome comparison approach. The scoring scheme is computed by combining sequence and textual annotation similarity using a modified version of a logistic curve. The most important feature of the proposed scoring scheme is to generate a score that reflects the excellence in annotation quality of genes by automatically adjusting the number of genomes used to compute the score and their phylogenetic distance. Extensive experiments with bacterial genomes showed that the proposed scoring scheme generated scores for annotation quality according to the quality of annotation regardless of the number of reference genomes and their phylogenetic distance. AVAILABILITY http://microbial.informatics.indiana.edu/acs
Affiliation(s)
- Youngik Yang
- School of Informatics and Computing, Indiana University, Bloomington, IN 47408, USA
|
48
|
Vallender EJ. Bioinformatic approaches to identifying orthologs and assessing evolutionary relationships. Methods 2009; 49:50-5. [PMID: 19467333 PMCID: PMC2732758 DOI: 10.1016/j.ymeth.2009.05.010] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Received: 02/17/2009] [Revised: 04/27/2009] [Accepted: 05/18/2009] [Indexed: 01/26/2023]
Abstract
Non-human primate genetic research defines itself through comparisons to humans; few other species require the implicit comparative genomics approaches. Because of this, errors in the identification of non-human primate orthologs can have profound effects. Gene prediction algorithms can and have produced false transcripts that have become incorporated into commonly used databases and genomics portals. These false transcripts can arise from deficiencies in the algorithms themselves as well as through gaps and other problems in the genome assembly. Putative genes generated can not only miss microexons, but improperly incorporate non-coding sequence resulting in pseudogenes or other transcripts without biological relevance. False transcripts then become identified as orthologs to established human genes and are too often taken as gospel by unwary researchers. Here, the processes through which these errors propagate are isolated and methods are described for identifying false orthologs in databases with several representative errors illustrated. Through these steps any researcher seeking to make use of non-human primate genetic information will have the tools at their disposal to ascertain where errors exist and to remedy them once encountered.
Affiliation(s)
- Eric J Vallender
- Division of Neurosciences, New England Primate Research Center, Harvard Medical School, Pine Hill Drive, Southborough Campus, Southborough, MA 01772, USA.
|
49
|
Alternative splicing of transcription factors' genes: beyond the increase of proteome diversity. Comp Funct Genomics 2009:905894. [PMID: 19609452 PMCID: PMC2709715 DOI: 10.1155/2009/905894] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Received: 12/25/2008] [Revised: 04/06/2009] [Accepted: 05/18/2009] [Indexed: 11/29/2022]
Abstract
Functional modification of transcription regulators may lead to developmental changes and phenotypical differences between species. In this work, we study the influence of alternative splicing on transcription factors in human and mouse. Our results show that the impact of alternative splicing on transcription factors is similar in both species, meaning that the ways to increase variability should also be similar. However, when looking at the expression patterns of transcription factors, we observe that they tend to diverge regardless of the role of alternative splicing. Finally, we hypothesise that transcription regulation of alternatively spliced transcription factors could play an important role in the phenotypical differences between species, without discarding other phenomena or functional families.
|
50
|
Tipney HJ, Schuyler RP, Hunter L. Consistent visualizations of changing knowledge. SUMMIT ON TRANSLATIONAL BIOINFORMATICS 2009; 2009:129-32. [PMID: 21347184 PMCID: PMC3041575] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
Abstract
Networks are increasingly used in biology to represent complex data in uncomplicated symbolic form. However, as biological knowledge is continually evolving, so must those networks representing this knowledge. Capturing and presenting this type of knowledge change over time is particularly challenging due to the intimate manner in which researchers customize those networks they come into contact with. The effective visualization of this knowledge is important as it creates insight into complex systems and stimulates hypothesis generation and biological discovery. Here we highlight how the retention of user customizations, and the collection and visualization of knowledge associated provenance supports effective and productive network exploration. We also present an extension of the Hanalyzer system, ReOrient, which supports network exploration and analysis in the presence of knowledge change.
|