1
Hu Z, Scott HS, Qin G, Zheng G, Chu X, Xie L, Adelson DL, Oftedal BE, Venugopal P, Babic M, Hahn CN, Zhang B, Wang X, Li N, Wei C. Revealing Missing Human Protein Isoforms Based on Ab Initio Prediction, RNA-seq and Proteomics. Sci Rep 2015; 5:10940. [PMID: 26156868 PMCID: PMC4496727 DOI: 10.1038/srep10940] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.2] [Received: 01/21/2015] [Accepted: 05/05/2015] [Indexed: 01/02/2023]
Abstract
Biological and biomedical research relies on comprehensive understanding of protein-coding transcripts. However, the total number of human proteins is still unknown due to the prevalence of alternative splicing. In this paper, we detected 31,566 novel transcripts with coding potential by filtering our ab initio predictions with 50 RNA-seq datasets from diverse tissues/cell lines. PCR followed by MiSeq sequencing showed that at least 84.1% of these predicted novel splice sites could be validated. In contrast to known transcripts, the expression of these novel transcripts was highly tissue-specific. Based on these novel transcripts, at least 36 novel proteins were detected from shotgun proteomics data of 41 breast samples. We also showed that L1 retrotransposons have a more significant impact on the origin of new transcripts/genes than previously thought. Furthermore, we found that alternative splicing is extraordinarily widespread for genes involved in specific biological functions like protein binding, nucleoside binding, neuron projection, membrane organization and cell adhesion. In the end, the total number of human transcripts with protein-coding potential was estimated to be at least 204,950.
Affiliation(s)
- Zhiqiang Hu
- [1] School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China [2] Shanghai Center for Bioinformation Technology, 1278 Keyuan Road, Pudong District, Shanghai 201203, China
- Hamish S Scott
- [1] Department of Genetics and Molecular Pathology, Centre for Cancer Biology, Frome Road, Adelaide, SA 5000 Australia [2] School of Biological Sciences, University of Adelaide, SA 5005, Australia [3] School of Medicine, University of Adelaide, North Terrace, Adelaide, SA 5000, Australia [4] School of Pharmacy and Medical Sciences, Division of Health Sciences, University of South Australia, SA, Australia [5] ACRF Cancer Genomics Facility, Centre for Cancer Biology, SA Pathology, Frome Road, Adelaide, SA 5000, Australia
- Guangrong Qin
- Shanghai Center for Bioinformation Technology, 1278 Keyuan Road, Pudong District, Shanghai 201203, China
- Guangyong Zheng
- [1] Shanghai Center for Bioinformation Technology, 1278 Keyuan Road, Pudong District, Shanghai 201203, China [2] CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, 320 Yueyang Road, Shanghai 200031, China
- Xixia Chu
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China
- Lu Xie
- Shanghai Center for Bioinformation Technology, 1278 Keyuan Road, Pudong District, Shanghai 201203, China
- David L Adelson
- School of Biological Sciences, University of Adelaide, SA 5005, Australia
- Bergithe E Oftedal
- [1] Department of Genetics and Molecular Pathology, Centre for Cancer Biology, Frome Road, Adelaide, SA 5000 Australia [2] Department of Biomedical Informatics (DBMI), Vanderbilt University Medical Center (VUMC), 2525 West End Ave, Suite 800, Nashville, TN 37203, USA
- Parvathy Venugopal
- [1] Department of Genetics and Molecular Pathology, Centre for Cancer Biology, Frome Road, Adelaide, SA 5000 Australia [2] School of Biological Sciences, University of Adelaide, SA 5005, Australia
- Milena Babic
- Department of Genetics and Molecular Pathology, Centre for Cancer Biology, Frome Road, Adelaide, SA 5000 Australia
- Christopher N Hahn
- [1] Department of Genetics and Molecular Pathology, Centre for Cancer Biology, Frome Road, Adelaide, SA 5000 Australia [2] School of Biological Sciences, University of Adelaide, SA 5005, Australia [3] School of Medicine, University of Adelaide, North Terrace, Adelaide, SA 5000, Australia
- Bing Zhang
- Department of Biomedical Informatics (DBMI), Vanderbilt University Medical Center (VUMC), 2525 West End Ave, Suite 800, Nashville, TN 37203, USA
- Xiaojing Wang
- Department of Biomedical Informatics (DBMI), Vanderbilt University Medical Center (VUMC), 2525 West End Ave, Suite 800, Nashville, TN 37203, USA
- Nan Li
- Institute of Immunology, Second Military Medical University, 800 Xiangyin Road, Shanghai 200433, China
- Chaochun Wei
- [1] School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China [2] Shanghai Center for Bioinformation Technology, 1278 Keyuan Road, Pudong District, Shanghai 201203, China
2
Steijger T, Abril JF, Engström PG, Kokocinski F, Hubbard TJ, Guigó R, Harrow J, Bertone P. Assessment of transcript reconstruction methods for RNA-seq. Nat Methods 2013; 10:1177-84. [PMID: 24185837 PMCID: PMC3851240 DOI: 10.1038/nmeth.2714] [Citation(s) in RCA: 462] [Impact Index Per Article: 42.0] [Received: 03/31/2013] [Accepted: 09/23/2013] [Indexed: 11/09/2022]
Abstract
We evaluated 25 protocol variants of 14 independent computational methods for exon identification, transcript reconstruction and expression-level quantification from RNA-seq data. Our results show that most algorithms are able to identify discrete transcript components with high success rates but that assembly of complete isoform structures poses a major challenge even when all constituent elements are identified. Expression-level estimates also varied widely across methods, even when based on similar transcript models. Consequently, the complexity of higher eukaryotic genomes imposes severe limitations on transcript recall and splice product discrimination that are likely to remain limiting factors for the analysis of current-generation RNA-seq data.
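The gap between identifying transcript components and assembling complete isoforms can be illustrated with a minimal Python sketch. The coordinates, the exact-match criterion and the function names below are assumptions made for illustration only; they are not the evaluation protocol used in the study.

```python
from typing import List, Set, Tuple

Exon = Tuple[int, int]          # (start, end), hypothetical coordinates
Transcript = Tuple[Exon, ...]   # ordered exons of one isoform


def exon_set(transcripts: List[Transcript]) -> Set[Exon]:
    """Collect the distinct exons used by a list of transcripts."""
    return {exon for tx in transcripts for exon in tx}


def recall(reference: set, predicted: set) -> float:
    """Fraction of reference items that also appear in the prediction."""
    return len(reference & predicted) / len(reference) if reference else 0.0


# Hypothetical reference: two isoforms that differ by one skipped exon.
reference = [
    ((100, 200), (300, 400), (500, 600)),
    ((100, 200), (500, 600)),
]
# Hypothetical prediction: every exon is found, but only one isoform.
predicted = [
    ((100, 200), (300, 400), (500, 600)),
]

exon_recall = recall(exon_set(reference), exon_set(predicted))
transcript_recall = recall(set(reference), set(predicted))
print(exon_recall, transcript_recall)
```

In this toy case every exon is recovered while only half of the isoform structures are, which mirrors the kind of discrepancy the abstract describes.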
Affiliation(s)
- Tamara Steijger
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
- Josep F Abril
- Departament de Genètica, Facultat de Biologia, Universitat de Barcelona, Barcelona, Spain
- Pär G Engström
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
- Roderic Guigó
- Center for Genomic Regulation, Barcelona, Spain
- Universitat Pompeu Fabra, Barcelona, Spain
- Paul Bertone
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
- Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- Developmental Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- Wellcome Trust - Medical Research Council Cambridge Stem Cell Institute, University of Cambridge, Cambridge, UK
3
Abstract
Protein sequence databases do not contain just the sequence of the protein itself but also annotation that reflects our knowledge of its function and contributing residues. In this chapter, we will discuss various public protein sequence databases, with a focus on those that are generally applicable. Special attention is paid to issues related to the reliability of both sequence and annotation, as those are fundamental to many questions researchers will ask. Using both well-annotated and scarcely annotated human proteins as examples, it will be shown what information about the targets can be collected from freely available Internet resources and how this information can be used. The results are summarized in a simple graphical model of the protein's sequence architecture that highlights its structural and functional modules.
Affiliation(s)
- Michael Rebhan
- Head Bioinformatics Support, Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland
4
Praz V, Bucher P. CleanEx: new data extraction and merging tools based on MeSH term annotation. Nucleic Acids Res 2009; 37:D880-4. [PMID: 19073704 PMCID: PMC2686468 DOI: 10.1093/nar/gkn878] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/23/2022]
Abstract
The CleanEx expression database (http://www.cleanex.isb-sib.ch) provides access to public gene expression data via unique gene names as well as via the biomedical characteristics of experiments. To achieve this, a dual annotation of both sequences and experiments has been generated. First, the system links official gene symbols to any kind of sequence used for gene expression measurements (cDNA, Affymetrix, oligonucleotide arrays, SAGE or MPSS tags, Expressed Sequence Tags or other mRNA sequences, etc.). For the biomedical annotation, we re-annotate each experiment from the CleanEx database with the MeSH (Medical Subject Headings) terms, primarily used by NLM (National Library of Medicine) for indexing articles for the MEDLINE/PubMed database. This annotation allows a fast and easy retrieval of expression data with common biological or medical features. The numerical data can then be exported as matrix-like tab-delimited text files. Data can be extracted from either one dataset or from heterogeneous datasets.
Affiliation(s)
- Viviane Praz
- ISREC, Swiss Institute of Bioinformatics, Boveresses 155, Epalinges, VD 1066, Switzerland.
5
Hornshøj H, Bendixen E, Conley LN, Andersen PK, Hedegaard J, Panitz F, Bendixen C. Transcriptomic and proteomic profiling of two porcine tissues using high-throughput technologies. BMC Genomics 2009; 10:30. [PMID: 19152685 PMCID: PMC2633351 DOI: 10.1186/1471-2164-10-30] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.5] [Received: 07/14/2008] [Accepted: 01/19/2009] [Indexed: 02/03/2023]
Abstract
Background The recent development within high-throughput technologies for expression profiling has allowed for parallel analysis of transcriptomes and proteomes in biological systems such as comparative analysis of transcript and protein levels of tissue regulated genes. Until now, such studies have only included microarray or short-length sequence tags for transcript profiling. Furthermore, most comparisons of transcript and protein levels have been based on absolute expression values from within the same tissue and not relative expression values based on tissue ratios. Results Presented here is a novel study of two porcine tissues based on integrative analysis of data from expression profiling of identical samples using cDNA microarray, 454-sequencing and iTRAQ-based proteomics. Sequence homology identified 2,541 unique transcripts that are detectable by both microarray hybridizations and 454-sequencing of 1.2 million cDNA tags. Both transcript-based technologies showed high reproducibility between sample replicates of the same tissue, but the correlation across these two technologies was modest. Thousands of genes being differentially expressed were identified with microarray. Out of the 306 differentially expressed genes identified by 454-sequencing, 198 (65%) were also found by microarray. The relationship between the regulation of transcript and protein levels was analyzed by integrating iTRAQ-based proteomics data. Protein expression ratios were determined for 354 genes, of which 148 could be mapped to both microarray and 454-sequencing data. A comparison of the expression ratios from the three technologies revealed that differences in transcript and protein levels across heart and muscle tissues are positively correlated. Conclusion We show that the reproducibility within cDNA microarray and 454-sequencing is high, but that the agreement across these two technologies is modest. We demonstrate that the regulation of transcript and protein levels across identical tissue samples is positively correlated when the tissue expression ratios are used for comparison. The results presented are of interest in systems biology research in terms of integration and analysis of high-throughput expression data from mammalian tissues.
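Comparing tissue expression ratios rather than absolute values can be sketched in a few lines of Python. All gene names and expression values below are hypothetical, and the choice of log2 ratios with a Pearson coefficient is an assumption for illustration, not necessarily the statistic used in the study.

```python
import math
from typing import Dict, List, Tuple

Levels = Dict[str, Tuple[float, float]]  # gene -> (heart level, muscle level)


def log2_ratio(heart: float, muscle: float) -> float:
    """Relative expression between the two tissues on a log2 scale."""
    return math.log2(heart / muscle)


def pearson(xs: List[float], ys: List[float]) -> float:
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical measurements on very different absolute scales.
transcript_levels: Levels = {
    "geneA": (80.0, 10.0), "geneB": (20.0, 20.0), "geneC": (5.0, 40.0),
}
protein_levels: Levels = {
    "geneA": (4000.0, 1000.0), "geneB": (900.0, 1000.0), "geneC": (300.0, 1200.0),
}

# Only genes measured by both technologies can be compared.
genes = sorted(set(transcript_levels) & set(protein_levels))
transcript_ratios = [log2_ratio(*transcript_levels[g]) for g in genes]
protein_ratios = [log2_ratio(*protein_levels[g]) for g in genes]
r = pearson(transcript_ratios, protein_ratios)
print(f"correlation of tissue ratios: {r:.3f}")
```

Working with ratios removes the technology-specific scale of each measurement, so only the direction and magnitude of the tissue difference are compared.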
Affiliation(s)
- Henrik Hornshøj
- Department of Genetics and Biotechnology, Faculty of Agricultural Sciences, Aarhus University, Tjele, Denmark.
6
Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. Biotechniques 2008; 45:81-94. [PMID: 18611170 DOI: 10.2144/000112900] [Citation(s) in RCA: 259] [Impact Index Per Article: 16.2] [Indexed: 12/17/2022]
Abstract
Sequence-based methods for transcriptome characterization have typically relied on generation of either serial analysis of gene expression tags or expressed sequence tags. Although such approaches have the potential to enumerate transcripts by counting sequence tags derived from them, they typically do not robustly survey the majority of transcripts along their entire length. Here we show that massively parallel sequencing of randomly primed cDNAs, using a next-generation sequencing-by-synthesis technology, offers the potential to generate relative measures of mRNA and individual exon abundance while simultaneously profiling the prevalence of both annotated and novel exons and exon-splicing events. This technique identifies known single nucleotide polymorphisms (SNPs) as well as novel single-base variants. Analysis of these variants, and previously unannotated splicing events in the HeLa S3 cell line, reveals an overrepresentation of gene categories including those previously implicated in cancer.
7
A general definition and nomenclature for alternative splicing events. PLoS Comput Biol 2008; 4:e1000147. [PMID: 18688268 PMCID: PMC2467475 DOI: 10.1371/journal.pcbi.1000147] [Citation(s) in RCA: 168] [Impact Index Per Article: 10.5] [Received: 12/26/2007] [Accepted: 07/01/2008] [Indexed: 11/19/2022]
Abstract
Understanding the molecular mechanisms responsible for the regulation of the transcriptome present in eukaryotic cells is one of the most challenging tasks in the postgenomic era. In this regard, alternative splicing (AS) is a key phenomenon contributing to the production of different mature transcripts from the same primary RNA sequence. As a plethora of different transcript forms is available in databases, a first step to uncover the biology that drives AS is to identify the different types of reflected splicing variation. In this work, we present a general definition of the AS event along with a notation system that involves the relative positions of the splice sites. This nomenclature univocally and dynamically assigns a specific “AS code” to every possible pattern of splicing variation. On the basis of this definition and the corresponding codes, we have developed a computational tool (AStalavista) that automatically characterizes the complete landscape of AS events in a given transcript annotation of a genome, thus providing a platform to investigate the transcriptome diversity across genes, chromosomes, and species. Our analysis reveals that a substantial part (in human more than a quarter) of the observed splicing variations are ignored in common classification pipelines. We have used AStalavista to investigate and to compare the AS landscape of different reference annotation sets in human and in other metazoan species and found that proportions of AS events change substantially depending on the annotation protocol, species-specific attributes, and coding constraints acting on the transcripts. The AStalavista system therefore provides a general framework to conduct specific studies investigating the occurrence, impact, and regulation of AS.
The genome sequence is said to be an organism's blueprint, a set of instructions driving the organism's biology. The unfolding of these instructions (the so-called genes) is initiated by the transcription of DNA into RNA molecules, which subsequently are processed before they can take their functional role. During this processing step, initially identical RNA molecules may result in different products through a process known as alternative splicing (AS). AS therefore allows for widening the diversity from the limited repertoire of genes, and it is often postulated as an explanation for the apparent paradox that complex and simple organisms resemble in their number of genes; it characterizes species, individuals, and developmental and cellular conditions. Comparing the differences of AS products between cells may help to reveal the broad molecular basis underlying phenotypic differences, for instance, between a cancer and a normal cell. An obstacle for such comparisons has been that, so far, no paradigm existed to delineate each single quantum of AS, so-called AS events. Here, we describe a possibility of exhaustively decomposing AS complements into qualitatively different groups of events and a nomenclature to unequivocally denote them. This typological catalogue of AS events along with their observed frequencies represent the AS landscape, and we propose a procedure to automatically identify such landscapes. We use it to describe the human AS landscape and to investigate how it has changed throughout evolution.
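A notation based on the relative positions of splice sites can be sketched as follows. This is a simplified illustration, not the AStalavista implementation: the symbols (`^` for a donor, `-` for an acceptor, `0` for a variant without variable sites), the ordering rule, the forward-strand assumption and the function name `as_code` are all assumptions made for this example, and flanking-site constraints are ignored.

```python
from typing import Set, Tuple

Site = Tuple[int, str]  # (genomic position, "donor" or "acceptor")
SYMBOL = {"donor": "^", "acceptor": "-"}  # assumed symbols


def as_code(sites_a: Set[Site], sites_b: Set[Site]) -> str:
    """Encode the splicing difference between two transcripts.

    Only sites used by exactly one transcript are variable; they are
    numbered by position, and each transcript lists its own numbered sites.
    """
    variable = sorted(sites_a ^ sites_b)
    rank = {site: i + 1 for i, site in enumerate(variable)}

    def describe(sites: Set[Site]) -> str:
        own = sorted(s for s in sites if s in rank)
        return "".join(f"{rank[s]}{SYMBOL[s[1]]}" for s in own) or "0"

    def first_rank(sites: Set[Site]) -> int:
        own = [rank[s] for s in sites if s in rank]
        return min(own) if own else 0

    # Assumed ordering: the variant whose first variable site comes first
    # (or that has none) is written first.
    ordered = sorted((sites_a, sites_b), key=first_rank)
    return ",".join(describe(s) for s in ordered)


# Hypothetical exon-skipping example: the middle exon spans 300-400.
included = {(200, "donor"), (300, "acceptor"), (400, "donor"), (500, "acceptor")}
skipped = {(200, "donor"), (500, "acceptor")}
print(as_code(included, skipped))
```

Because the code depends only on the relative order and type of the variable sites, two events with the same pattern at different genomic locations receive the same string.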
8
Côté RG, Jones P, Martens L, Kerrien S, Reisinger F, Lin Q, Leinonen R, Apweiler R, Hermjakob H. The Protein Identifier Cross-Referencing (PICR) service: reconciling protein identifiers across multiple source databases. BMC Bioinformatics 2007; 8:401. [PMID: 17945017 PMCID: PMC2151082 DOI: 10.1186/1471-2105-8-401] [Citation(s) in RCA: 103] [Impact Index Per Article: 6.1] [Received: 05/30/2007] [Accepted: 10/18/2007] [Indexed: 11/28/2022]
Abstract
Background Each major protein database uses its own conventions when assigning protein identifiers. Resolving the various, potentially unstable, identifiers that refer to identical proteins is a major challenge. This is a common problem when attempting to unify datasets that have been annotated with proteins from multiple data sources or querying data providers with one flavour of protein identifiers when the source database uses another. Partial solutions for protein identifier mapping exist but they are limited to specific species or techniques and to a very small number of databases. As a result, we have not found a solution that is generic enough and broad enough in mapping scope to suit our needs. Results We have created the Protein Identifier Cross-Reference (PICR) service, a web application that provides interactive and programmatic (SOAP and REST) access to a mapping algorithm that uses the UniProt Archive (UniParc) as a data warehouse to offer protein cross-references based on 100% sequence identity to proteins from over 70 distinct source databases loaded into UniParc. Mappings can be limited by source database, taxonomic ID and activity status in the source database. Users can copy/paste or upload files containing protein identifiers or sequences in FASTA format to obtain mappings using the interactive interface. Search results can be viewed in simple or detailed HTML tables or downloaded as comma-separated values (CSV) or Microsoft Excel (XLS) files suitable for use in a local database or a spreadsheet. Alternatively, a SOAP interface is available to integrate PICR functionality in other applications, as is a lightweight REST interface. Conclusion We offer a publicly available service that can interactively map protein identifiers and protein sequences to the majority of commonly used protein databases. Programmatic access is available through a standards-compliant SOAP interface or a lightweight REST interface. 
The PICR interface, documentation and code examples are available at .
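The idea of cross-referencing identifiers through 100% sequence identity can be illustrated with a small, self-contained Python sketch. It does not call the PICR service or reproduce its interface; the database names, identifiers, sequences and function names are hypothetical, and the use of a SHA-256 digest as the lookup key is an assumption made for this example.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, str, str]  # (database, identifier, sequence)


def sequence_key(sequence: str) -> str:
    """Digest of the normalized sequence, used as an exact-identity key."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def build_index(records: List[Record]) -> Dict[str, List[Tuple[str, str]]]:
    """Group (database, identifier) pairs by identical sequence."""
    index: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for database, identifier, sequence in records:
        index[sequence_key(sequence)].append((database, identifier))
    return index


def cross_references(records: List[Record], database: str,
                     identifier: str) -> List[Tuple[str, str]]:
    """Return identifiers from other entries that share the same sequence."""
    index = build_index(records)
    for db, ident, sequence in records:
        if (db, ident) == (database, identifier):
            hits = index[sequence_key(sequence)]
            return [hit for hit in hits if hit != (database, identifier)]
    return []


# Hypothetical entries; the third differs in its last residue.
records = [
    ("DB_A", "A0001", "MKTAYIAKQR"),
    ("DB_B", "B-17", "mktayiakqr"),
    ("DB_C", "C_9", "MKTAYIAKQK"),
]
print(cross_references(records, "DB_A", "A0001"))
```

Keying on the sequence itself sidesteps each database's identifier conventions, at the cost of missing entries that differ by even a single residue.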
Affiliation(s)
- Richard G Côté
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1QY, UK.
9
Pagni M, Ioannidis V, Cerutti L, Zahn-Zabal M, Jongeneel CV, Hau J, Martin O, Kuznetsov D, Falquet L. MyHits: improvements to an interactive resource for analyzing protein sequences. Nucleic Acids Res 2007; 35:W433-7. [PMID: 17545200 PMCID: PMC1933190 DOI: 10.1093/nar/gkm352] [Citation(s) in RCA: 152] [Impact Index Per Article: 8.9] [Indexed: 11/13/2022]
Abstract
The MyHits web site (http://myhits.isb-sib.ch) is an integrated service dedicated to the analysis of protein sequences. Since its first description in 2004, both the user interface and the back end of the server were improved. A number of tools (e.g. MAFFT, Jacop, Dotlet, Jalview, ESTScan) were added or updated to improve the usability of the service. The MySQL schema and its associated API were revamped and the database engine (HitKeeper) was separated from the web interface. This paper summarizes the current status of the server, with an emphasis on the new services.
Affiliation(s)
- Marco Pagni, Vassilios Ioannidis, Lorenzo Cerutti, Monique Zahn-Zabal, C. Victor Jongeneel, Jörg Hau, Olivier Martin, Dmitri Kuznetsov, Laurent Falquet
- Swiss Institute of Bioinformatics (SIB), Vital-IT Group, UNIL-Génopode, CH-1015 Lausanne, Swiss Institute of Bioinformatics (SIB), EMBnet Group, UNIL-Génopode, CH-1015 Lausanne, Swiss Institute of Bioinformatics (SIB), Swiss-Prot Group, UNIGE-CMU, CH-1211 Genève 4, Ludwig Institute for Cancer Research, UNIL-Génopode, CH-1015 Lausanne and Nestlé Research Center, Department of BioAnalytical Science, PO Box 44, CH-1000 Lausanne 26, Switzerland
- *To whom correspondence should be addressed (Marco Pagni). +41-21-692-40-38; +41-21-692-40-65
10
Kim YC, Jung YC, Xuan Z, Dong H, Zhang MQ, Wang SM. Pan-genome isolation of low abundance transcripts using SAGE tag. FEBS Lett 2006; 580:6721-9. [PMID: 17113583 PMCID: PMC1791009 DOI: 10.1016/j.febslet.2006.11.013] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 09/01/2006] [Revised: 10/31/2006] [Accepted: 11/03/2006] [Indexed: 11/24/2022]
Abstract
The SAGE (serial analysis of gene expression) method is sensitive at detecting lower-abundance transcripts. More than a third of the human SAGE tags identified are novel and represent unknown low-abundance transcripts. Using the GLGI method (generation of longer 3' EST from SAGE tag for gene identification), we converted 1009 low-copy, human X chromosome-specific SAGE tags into 10210 3' ESTs. We identified 3418 unique 3' ESTs, 46% of which are novel and originated from the lower-abundance transcripts. However, nearly all 3' ESTs were mapped to various regions across the genome but not to the X chromosome. Detailed analysis indicates that those 3' ESTs were isolated by SAGE tag mis-priming to the non-parent transcripts. Replacing SAGE tags with non-transcribed genomic DNA tags resulted in poor amplification, indicating that the sequence similarity between different transcripts contributed to the amplification. Our study shows the prevalence of novel low-abundance transcripts that can be isolated efficiently through SAGE tag mis-priming.
Affiliation(s)
- Yeong Cheol Kim
- Center for Functional Genomics, Division of Medical Genetics, Department of Medicine, ENH Research Institute, Northwestern University, Evanston, IL 60201, USA
11
Viatte S, Alves PM, Romero P. Reverse immunology approach for the identification of CD8 T-cell-defined antigens: advantages and hurdles. Immunol Cell Biol 2006; 84:318-30. [PMID: 16681829 DOI: 10.1111/j.1440-1711.2006.01447.x] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.7] [Indexed: 01/06/2023]
Abstract
One of the challenges of tumour immunology remains the identification of strongly immunogenic tumour antigens for vaccination. Reverse immunology, that is, the procedure to predict and identify immunogenic peptides from the sequence of a gene product of interest, has been postulated to be a particularly efficient, high-throughput approach for tumour antigen discovery. Over one decade after this concept was born, we discuss the reverse immunology approach in terms of costs and efficacy: data mining with bioinformatic algorithms, molecular methods to identify tumour-specific transcripts, prediction and determination of proteasomal cleavage sites, peptide-binding prediction to HLA molecules and experimental validation, assessment of the in vitro and in vivo immunogenic potential of selected peptide antigens, isolation of specific cytolytic T lymphocyte clones and final validation in functional assays of tumour cell recognition. We conclude that the overall low sensitivity and yield of every prediction step often requires a compensatory up-scaling of the initial number of candidate sequences to be screened, rendering reverse immunology an unexpectedly complex approach.
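The compensatory up-scaling mentioned in the conclusion follows from multiplying the yields of successive steps. The short Python sketch below works through that arithmetic; the per-step yields are hypothetical values chosen for illustration and are not taken from the review.

```python
import math
from typing import List


def candidates_needed(target_hits: int, step_yields: List[float]) -> int:
    """Initial candidates required so that, on average, `target_hits`
    survive a pipeline whose steps each retain a fraction of the input."""
    overall_yield = math.prod(step_yields)
    return math.ceil(target_hits / overall_yield)


# Hypothetical fractions retained by four consecutive steps, for example
# cleavage prediction, HLA-binding prediction, in vitro immunogenicity
# and tumour cell recognition.
hypothetical_yields = [0.5, 0.25, 0.125, 0.5]

needed = candidates_needed(1, hypothetical_yields)
print(needed)  # 128
```

Even moderate losses at each step compound quickly, which is why the initial candidate pool has to grow as steps are added to the pipeline.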
Affiliation(s)
- Sebastien Viatte
- Division of Clinical Onco-Immunology, Ludwig Institute for Cancer Research, Lausanne branch, University Hospital, CHUV, and National Center for Competence in Research, NCCR, Molecular Oncology, Lausanne, Switzerland
12
Sperisen P, Schmid CD, Bucher P, Zilian O. Stealth proteins: in silico identification of a novel protein family rendering bacterial pathogens invisible to host immune defense. PLoS Comput Biol 2005; 1:e63. [PMID: 16299590 PMCID: PMC1285062 DOI: 10.1371/journal.pcbi.0010063] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.9] [Received: 07/11/2005] [Accepted: 10/20/2005] [Indexed: 01/24/2023]
Abstract
There are a variety of bacterial defense strategies to survive in a hostile environment. Generation of extracellular polysaccharides has proved to be a simple but effective strategy against the host's innate immune system. A comparative genomics approach led us to identify a new protein family termed Stealth, most likely involved in the synthesis of extracellular polysaccharides. This protein family is characterized by a series of domains conserved across phylogeny from bacteria to eukaryotes. In bacteria, Stealth (previously characterized as SacB, XcbA, or WefC) is encoded by subsets of strains mainly colonizing multicellular organisms, with evidence for a protective effect against the host innate immune defense. More specifically, integrating all the available information about Stealth proteins in bacteria, we propose that Stealth is a D-hexose-1-phosphoryl transferase involved in the synthesis of polysaccharides. In the animal kingdom, Stealth is strongly conserved across evolution from social amoebas to simple and complex multicellular organisms, such as Dictyostelium discoideum, hydra, and human. Based on the occurrence of Stealth in most Eukaryotes and a subset of Prokaryotes together with its potential role in extracellular polysaccharide synthesis, we propose that metazoan Stealth functions to regulate the innate immune system. Moreover, there is good reason to speculate that the acquisition and spread of Stealth could be responsible for future epidemic outbreaks of infectious diseases caused by a large variety of eubacterial pathogens. Our in silico identification of a homologous protein in the human host will help to elucidate the causes of Stealth-dependent virulence. At a more basic level, the characterization of the molecular and cellular function of Stealth proteins may shed light on fundamental mechanisms of innate immune defense against microbial invasion.
Affiliation(s)
- Peter Sperisen
- Swiss Institute of Bioinformatics, Epalinges, Switzerland
- Philipp Bucher
- Swiss Institute of Bioinformatics, Epalinges, Switzerland
- Swiss Institute for Experimental Cancer Research, Epalinges, Switzerland
- * To whom correspondence should be addressed. E-mail:
- Olav Zilian
- Swiss Institute for Experimental Cancer Research, Epalinges, Switzerland
|
13
|
Jongeneel CV, Delorenzi M, Iseli C, Zhou D, Haudenschild CD, Khrebtukova I, Kuznetsov D, Stevenson BJ, Strausberg RL, Simpson AJG, Vasicek TJ. An atlas of human gene expression from massively parallel signature sequencing (MPSS). Genome Res 2005; 15:1007-14. [PMID: 15998913 PMCID: PMC1172045 DOI: 10.1101/gr.4041005] [Citation(s) in RCA: 133] [Impact Index Per Article: 7.0] [Indexed: 11/24/2022]
Abstract
We have used massively parallel signature sequencing (MPSS) to sample the transcriptomes of 32 normal human tissues to an unprecedented depth, thus documenting the patterns of expression of almost 20,000 genes with high sensitivity and specificity. The data confirm the widely held belief that differences in gene expression between cell and tissue types are largely determined by transcripts derived from a limited number of tissue-specific genes, rather than by combinations of more promiscuously expressed genes. Expression of a little more than half of all known human genes seems to account for both the common requirements and the specific functions of the tissues sampled. A classification of tissues based on patterns of gene expression largely reproduces classifications based on anatomical and biochemical properties. The unbiased sampling of the human transcriptome achieved by MPSS supports the idea that most human genes have been mapped, if not functionally characterized. This data set should prove useful for the identification of tissue-specific genes, for the study of global changes induced by pathological conditions, and for the definition of a minimal set of genes necessary for basic cell maintenance. The data are available on the Web at http://mpss.licr.org and http://sgb.lynxgen.com.
Affiliation(s)
- C Victor Jongeneel
- Office of Information Technology, Ludwig Institute for Cancer Research, and Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland.
|
14
|
Naef F, Huelsken J. Cell-type-specific transcriptomics in chimeric models using transcriptome-based masks. Nucleic Acids Res 2005; 33:e111. [PMID: 16030348 PMCID: PMC1178007 DOI: 10.1093/nar/gni104] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.3] [Indexed: 01/15/2023] Open
Abstract
Regulatory networks involving different cell types control inflammation, morphogenesis and tissue homeostasis. Cell-type-specific transcriptional profiling offers a powerful tool for analyzing such cross-talk but is often hampered by mingling of cells within a tissue. Here, we present a novel method that performs cell-type-specific expression measurements without prior cell separation. This involves inter-species transplantation or chimeric co-culture models among which the human-mouse system is frequently used. Here, we exploit the sufficiently divergent transcriptomes of human and mouse in conjunction with high-density oligonucleotide arrays. This required a masking procedure based on transcriptome databases and exhaustive fuzzy mapping of oligonucleotide probes onto these data. The approach was tested in a human-mouse experiment, demonstrating that we can efficiently measure species-specific transcriptional profiles in chimeric RNA samples without physically separating cells. Our results stress the importance of transcriptome databases with accurate 3' mRNA termination for computational prediction of accurate probe masks. We find that most human and mouse 3'-untranslated regions contain unique stretches to allow for an effective control of cross-hybridization between the two species. This approach can be applied to xenograft models studying tumor-host interactions, morphogenesis or immune responses.
Affiliation(s)
- Felix Naef
- Swiss Institute for Experimental Cancer Research (ISREC), NCCR Molecular Oncology, Chemin des Boveresses 155, 1066 Epalinges, Switzerland
- Swiss Institute of Bioinformatics, Chemin des Boveresses 155, 1066 Epalinges, Switzerland
- Joerg Huelsken
- Swiss Institute for Experimental Cancer Research (ISREC), NCCR Molecular Oncology, Chemin des Boveresses 155, 1066 Epalinges, Switzerland
- To whom correspondence should be addressed. Tel: +41 (0)21 692 58 58; Fax: +41 (0)21 652 69 33;
|
15
|
Pagni M, Ioannidis V, Cerutti L, Zahn-Zabal M, Jongeneel CV, Falquet L. MyHits: a new interactive resource for protein annotation and domain identification. Nucleic Acids Res 2004; 32:W332-5. [PMID: 15215405 PMCID: PMC441617 DOI: 10.1093/nar/gkh479] [Citation(s) in RCA: 75] [Impact Index Per Article: 3.8] [Indexed: 11/15/2022] Open
Abstract
The MyHits web server (http://myhits.isb-sib.ch) is a new integrated service dedicated to the annotation of protein sequences and to the analysis of their domains and signatures. Guest users can use the system anonymously, with full access to (i) standard bioinformatics programs (e.g. PSI-BLAST, ClustalW, T-Coffee, Jalview); (ii) a large number of protein sequence databases, including standard (Swiss-Prot, TrEMBL) and locally developed databases (splice variants); (iii) databases of protein motifs (Prosite, Interpro); (iv) a precomputed list of matches ('hits') between the sequence and motif databases. All databases are updated on a weekly basis and the hit list is kept up to date incrementally. The MyHits server also includes a new collection of tools to generate graphical representations of pairwise and multiple sequence alignments including their annotated features. Free registration enables users to upload their own sequences and motifs to private databases. These are then made available through the same web interface and the same set of analytical tools. Registered users can manage their own sequences and annotations using only web tools and freeze their data in their private database for publication purposes.
Affiliation(s)
- Marco Pagni
- Swiss Institute of Bioinformatics, CH-1066 Epalinges/Lausanne, Switzerland.
|
16
|
Schmid CD, Praz V, Delorenzi M, Périer R, Bucher P. The Eukaryotic Promoter Database EPD: the impact of in silico primer extension. Nucleic Acids Res 2004; 32:D82-5. [PMID: 14681364 PMCID: PMC308856 DOI: 10.1093/nar/gkh122] [Citation(s) in RCA: 74] [Impact Index Per Article: 3.7] [Indexed: 11/14/2022] Open
Abstract
The Eukaryotic Promoter Database (EPD) is an annotated non-redundant collection of eukaryotic POL II promoters, experimentally defined by a transcription start site (TSS). There may be multiple promoter entries for a single gene. The underlying experimental evidence comes from journal articles and, starting from release 73, from 5' ESTs of full-length cDNA clones used for so-called in silico primer extension. Access to promoter sequences is provided by pointers to TSS positions in nucleotide sequence entries. The annotation part of an EPD entry includes a description of the type and source of the initiation site mapping data, links to other biological databases and bibliographic references. EPD is structured in a way that facilitates dynamic extraction of biologically meaningful promoter subsets for comparative sequence analysis. Web-based interfaces have been developed that enable the user to view EPD entries in different formats, to select and extract promoter sequences according to a variety of criteria and to navigate to related databases exploiting different cross-references. Tools for analysing sequence motifs around TSSs defined in EPD are provided by the signal search analysis server. EPD can be accessed at http://www.epd.isb-sib.ch.
Affiliation(s)
- Christoph D Schmid
- Swiss Institute of Bioinformatics, Ch. des Boveresses 155, 1066 Epalinges s/Lausanne, Switzerland
|