1
Jain S, Bakolitsa C, Brenner SE, Radivojac P, Moult J, Repo S, Hoskins RA, Andreoletti G, Barsky D, Chellapan A, Chu H, Dabbiru N, Kollipara NK, Ly M, Neumann AJ, Pal LR, Odell E, Pandey G, Peters-Petrulewicz RC, Srinivasan R, Yee SF, Yeleswarapu SJ, Zuhl M, Adebali O, Patra A, Beer MA, Hosur R, Peng J, Bernard BM, Berry M, Dong S, Boyle AP, Adhikari A, Chen J, Hu Z, Wang R, Wang Y, Miller M, Wang Y, Bromberg Y, Turina P, Capriotti E, Han JJ, Ozturk K, Carter H, Babbi G, Bovo S, Di Lena P, Martelli PL, Savojardo C, Casadio R, Cline MS, De Baets G, Bonache S, Díez O, Gutiérrez-Enríquez S, Fernández A, Montalban G, Ootes L, Özkan S, Padilla N, Riera C, De la Cruz X, Diekhans M, Huwe PJ, Wei Q, Xu Q, Dunbrack RL, Gotea V, Elnitski L, Margolin G, Fariselli P, Kulakovskiy IV, Makeev VJ, Penzar DD, Vorontsov IE, Favorov AV, Forman JR, Hasenahuer M, Fornasari MS, Parisi G, Avsec Z, Çelik MH, Nguyen TYD, Gagneur J, Shi FY, Edwards MD, Guo Y, Tian K, Zeng H, Gifford DK, Göke J, Zaucha J, Gough J, Ritchie GRS, Frankish A, Mudge JM, Harrow J, Young EL, Yu Y, Huff CD, Murakami K, Nagai Y, Imanishi T, Mungall CJ, Jacobsen JOB, Kim D, Jeong CS, Jones DT, Li MJ, Guthrie VB, Bhattacharya R, Chen YC, Douville C, Fan J, Kim D, Masica D, Niknafs N, Sengupta S, Tokheim C, Turner TN, Yeo HTG, Karchin R, Shin S, Welch R, Keles S, Li Y, Kellis M, Corbi-Verge C, Strokach AV, Kim PM, Klein TE, Mohan R, Sinnott-Armstrong NA, Wainberg M, Kundaje A, Gonzaludo N, Mak ACY, Chhibber A, Lam HYK, Dahary D, Fishilevich S, Lancet D, Lee I, Bachman B, Katsonis P, Lua RC, Wilson SJ, Lichtarge O, Bhat RR, Sundaram L, Viswanath V, Bellazzi R, Nicora G, Rizzo E, Limongelli I, Mezlini AM, Chang R, Kim S, Lai C, O’Connor R, Topper S, van den Akker J, Zhou AY, Zimmer AD, Mishne G, Bergquist TR, Breese MR, Guerrero RF, Jiang Y, Kiga N, Li B, Mort M, Pagel KA, Pejaver V, Stamboulian MH, Thusberg J, Mooney SD, Teerakulkittipong N, Cao C, Kundu K, Yin Y, Yu CH, Kleyman M, Lin CF, Stackpole M, Mount SM, Eraslan 
G, Mueller NS, Naito T, Rao AR, Azaria JR, Brodie A, Ofran Y, Garg A, Pal D, Hawkins-Hooker A, Kenlay H, Reid J, Mucaki EJ, Rogan PK, Schwarz JM, Searls DB, Lee GR, Seok C, Krämer A, Shah S, Huang CV, Kirsch JF, Shatsky M, Cao Y, Chen H, Karimi M, Moronfoye O, Sun Y, Shen Y, Shigeta R, Ford CT, Nodzak C, Uppal A, Shi X, Joseph T, Kotte S, Rana S, Rao A, Saipradeep VG, Sivadasan N, Sunderam U, Stanke M, Su A, Adzhubey I, Jordan DM, Sunyaev S, Rousseau F, Schymkowitz J, Van Durme J, Tavtigian SV, Carraro M, Giollo M, Tosatto SCE, Adato O, Carmel L, Cohen NE, Fenesh T, Holtzer T, Juven-Gershon T, Unger R, Niroula A, Olatubosun A, Väliaho J, Yang Y, Vihinen M, Wahl ME, Chang B, Chong KC, Hu I, Sun R, Wu WKK, Xia X, Zee BC, Wang MH, Wang M, Wu C, Lu Y, Chen K, Yang Y, Yates CM, Kreimer A, Yan Z, Yosef N, Zhao H, Wei Z, Yao Z, Zhou F, Folkman L, Zhou Y, Daneshjou R, Altman RB, Inoue F, Ahituv N, Arkin AP, Lovisa F, Bonvini P, Bowdin S, Gianni S, Mantuano E, Minicozzi V, Novak L, Pasquo A, Pastore A, Petrosino M, Puglisi R, Toto A, Veneziano L, Chiaraluce R, Ball MP, Bobe JR, Church GM, Consalvi V, Cooper DN, Buckley BA, Sheridan MB, Cutting GR, Scaini MC, Cygan KJ, Fredericks AM, Glidden DT, Neil C, Rhine CL, Fairbrother WG, Alontaga AY, Fenton AW, Matreyek KA, Starita LM, Fowler DM, Löscher BS, Franke A, Adamson SI, Graveley BR, Gray JW, Malloy MJ, Kane JP, Kousi M, Katsanis N, Schubach M, Kircher M, Mak ACY, Tang PLF, Kwok PY, Lathrop RH, Clark WT, Yu GK, LeBowitz JH, Benedicenti F, Bettella E, Bigoni S, Cesca F, Mammi I, Marino-Buslje C, Milani D, Peron A, Polli R, Sartori S, Stanzial F, Toldo I, Turolla L, Aspromonte MC, Bellini M, Leonardi E, Liu X, Marshall C, McCombie WR, Elefanti L, Menin C, Meyn MS, Murgia A, Nadeau KCY, Neuhausen SL, Nussbaum RL, Pirooznia M, Potash JB, Dimster-Denk DF, Rine JD, Sanford JR, Snyder M, Cote AG, Sun S, Verby MW, Weile J, Roth FP, Tewhey R, Sabeti PC, Campagna J, Refaat MM, Wojciak J, Grubb S, Schmitt N, Shendure J, Spurdle AB, 
Stavropoulos DJ, Walton NA, Zandi PP, Ziv E, Burke W, Chen F, Carr LR, Martinez S, Paik J, Harris-Wai J, Yarborough M, Fullerton SM, Koenig BA, McInnes G, Shigaki D, Chandonia JM, Furutsuki M, Kasak L, Yu C, Chen R, Friedberg I, Getz GA, Cong Q, Kinch LN, Zhang J, Grishin NV, Voskanian A, Kann MG, Tran E, Ioannidis NM, Hunter JM, Udani R, Cai B, Morgan AA, Sokolov A, Stuart JM, Minervini G, Monzon AM, Batzoglou S, Butte AJ, Greenblatt MS, Hart RK, Hernandez R, Hubbard TJP, Kahn S, O’Donnell-Luria A, Ng PC, Shon J, Veltman J, Zook JM. CAGI, the Critical Assessment of Genome Interpretation, establishes progress and prospects for computational genetic variant interpretation methods. Genome Biol 2024; 25:53. [PMID: 38389099 PMCID: PMC10882881 DOI: 10.1186/s13059-023-03113-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2023] [Accepted: 11/17/2023] [Indexed: 02/24/2024] Open
Abstract
BACKGROUND The Critical Assessment of Genome Interpretation (CAGI) aims to advance the state-of-the-art for computational prediction of genetic variant impact, particularly where relevant to disease. The five complete editions of the CAGI community experiment comprised 50 challenges, in which participants made blind predictions of phenotypes from genetic data, and these were evaluated by independent assessors. RESULTS Performance was particularly strong for clinical pathogenic variants, including some difficult-to-diagnose cases, and extends to interpretation of cancer-related variants. Missense variant interpretation methods were able to estimate biochemical effects with increasing accuracy. Assessment of methods for regulatory variants and complex trait disease risk was less definitive and indicates performance potentially suitable for auxiliary use in the clinic. CONCLUSIONS Results show that while current methods are imperfect, they have major utility for research and clinical applications. Emerging methods and increasingly large, robust datasets for training and assessment promise further progress ahead.
2
Barker M, Chue Hong NP, Katz DS, Lamprecht AL, Martinez-Ortiz C, Psomopoulos F, Harrow J, Castro LJ, Gruenpeter M, Martinez PA, Honeyman T. Introducing the FAIR Principles for research software. Sci Data 2022; 9:622. [PMID: 36241754 PMCID: PMC9562067 DOI: 10.1038/s41597-022-01710-x] [Citation(s) in RCA: 33] [Impact Index Per Article: 16.5] [Received: 03/21/2022] [Accepted: 09/21/2022] [Indexed: 11/09/2022] Open
Abstract
Research software is a fundamental and vital part of research, yet significant challenges to discoverability, productivity, quality, reproducibility, and sustainability exist. Improving the practice of scholarship is a common goal of the open science, open source, and FAIR (Findable, Accessible, Interoperable and Reusable) communities and research software is now being understood as a type of digital object to which FAIR should be applied. This emergence reflects a maturation of the research community to better understand the crucial role of FAIR research software in maximising research value. The FAIR for Research Software (FAIR4RS) Working Group has adapted the FAIR Guiding Principles to create the FAIR Principles for Research Software (FAIR4RS Principles). The contents and context of the FAIR4RS Principles are summarised here to provide the basis for discussion of their adoption. Examples of implementation by organisations are provided to share information on how to maximise the value of research outputs, and to encourage others to amplify the importance and impact of this work.
Affiliation(s)
- Neil P Chue Hong
- Software Sustainability Institute & EPCC, University of Edinburgh, 47 Potterrow, Edinburgh, EH8 9BT, UK
- Daniel S Katz
- NCSA & CS & ECE & iSchool, University of Illinois at Urbana-Champaign, 1205 W Clark St., Urbana, IL, 61801, USA
- Anna-Lena Lamprecht
- Institute of Computer Science, University of Potsdam, An der Bahn 2, 14476, Potsdam, Germany
- Fotis Psomopoulos
- Institute of Applied Biosciences, Centre for Research and Technology Hellas, Thessaloniki, 57001, Greece
- Jennifer Harrow
- ELIXIR Hub, South Building, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- Leyla Jael Castro
- Semantic Technologies team, ZB MED Information Centre for Life Sciences, Gleueler Strasse 60, 50931, Cologne, Germany
- Paula Andrea Martinez
- Research Software Alliance/Australian Research Data Commons, Level 6, Duhig Tower, The University of Queensland, Brisbane, QLD 4072, Australia
- Tom Honeyman
- Australian Research Data Commons, University of Technology Sydney Library, Ultimo, NSW, 2007, Australia
3
Martinez-Ortiz C, Goble C, Katz D, Honeyman T, Martinez P, Barker M, Castro LJ, Chue Hong N, Gruenpeter M, Harrow J, Lamprecht AL, Psomopoulos F. How does software fit into the FDO landscape? RIO 2022. [DOI: 10.3897/rio.8.e95724] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022] Open
Abstract
In academic research virtually every field has increased its use of digital and computational technology, leading to new scientific discoveries, and this trend is likely to continue. Reliable and efficient scholarly research requires researchers to be able to validate and extend previously generated research results. In the digital era, this implies that digital objects (Kahn and Wilensky 2006) used in research should be Findable, Accessible, Interoperable and Reusable (FAIR). These objects include (but are not limited to) data, software, models (for example, machine learning), representations of physical objects, virtual research environments, workflows, etc. Leaving any of these digital objects out of the FAIR process may result in a loss of academic rigor and may have severe consequences in the long term for the field, such as a reproducibility crisis. In this extended abstract, we focus on research software as a FAIR digital object (FDO).
The FDO framework (De Smedt et al. 2020) describes FDOs as being actionable units of knowledge, which can be aggregated, analyzed, and processed by different types of algorithms. Such algorithms must be implemented by software in one form or another. The framework also describes large software stacks supporting FDOs enabling responsible data science and increasing reproducibility. This implies that software is a key ingredient of the FDO framework, and should adhere to the FAIR principles. Software plays multiple roles: it is a DO itself, it is responsible for creating new FDOs (e.g., data) and it helps to make them available to the public (e.g., via repositories and registries). However, there is a need to specify in more detail how non-data DOs, in particular software, fit in this framework.
Different classes of digital objects have different intrinsic properties and ways to relate to other DOs. This means that while they, in principle, are subject to the high-level FAIR principles, there are also differences depending on their type and properties, requiring an adaptation so FAIR implementations are more aligned to the digital object itself. This holds true in particular to software. Software has intrinsic properties (executability, composite nature, development practices, continuous evolution and versioning, and packaging and distribution) and specific needs that must be considered by the FDO framework. For example, open source software is typically developed in the open on social coding platforms, where releases are distributed through package management systems, unlike data that is typically published in archival repositories. These social coding platforms do not provide long term archiving, permanent identifiers, or metadata, and package management systems, while somewhat better, similarly do not make a commitment to long term archiving, do not use identifiers that fit the scholarly publication system well, and provide metadata that may be missing key elements. The FAIR for research software (FAIR4RS, Chue Hong et al. 2021) working group has dedicated significant effort in building a community consensus around developing FAIR principles that are customized for research software, providing methods for researchers to understand and address these gaps.
In this presentation we will highlight the importance of software for the FAIR landscape and why different (but related) FAIR principles are needed for software (vs those originally developed for data). Our goal here is to contribute to building an FDO landscape together, where we consider all different types of digital objects that are essential in today's research, and we are enthusiastic about contributing our expertise on research software in helping shape this landscape.
4
Harrow J, Drysdale R, Smith A, Repo S, Lanfear J, Blomberg N. ELIXIR: Providing a Sustainable Infrastructure for Life Science Data at European Scale. Bioinformatics 2021; 37:2506-2511. [PMID: 34175941 PMCID: PMC8388016 DOI: 10.1093/bioinformatics/btab481] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 12/17/2020] [Revised: 02/19/2021] [Accepted: 06/25/2021] [Indexed: 11/12/2022] Open
Affiliation(s)
- Jennifer Harrow
- ELIXIR Hub, South Building, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Rachel Drysdale
- ELIXIR Hub, South Building, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Andrew Smith
- ELIXIR Hub, South Building, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Susanna Repo
- ELIXIR Hub, South Building, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Jerry Lanfear
- ELIXIR Hub, South Building, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Niklas Blomberg
- ELIXIR Hub, South Building, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
5
Abstract
The large diversity of experimental methods in proteomics as well as their increasing usage across biological and clinical research has led to the development of hundreds if not thousands of software tools to aid in the analysis and interpretation of the resulting data. Detailed information about these tools needs to be collected, categorized, and validated to guarantee their optimal utilization. A tools registry like bio.tools enables users and developers to identify new tools with more powerful algorithms or to find tools with similar functions for comparison. Here we present the content of the registry, which now comprises more than 1000 proteomics tool entries. Furthermore, we discuss future applications and engagement with other community efforts resulting in a high impact on the bioinformatics landscape.
Affiliation(s)
- Veit Schwämmle
- Department of Biochemistry and Molecular Biology, University of Southern Denmark, Campusvej 55, 5230 Odense, Denmark
- Jennifer Harrow
- ELIXIR-Hub, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
- Hans Ienasescu
- National Life Science Supercomputing Center, Technical University of Denmark, Building 208, DK-2800 Kongens Lyngby, Denmark
6
Harrow J, Hancock J, Blomberg N. ELIXIR-EXCELERATE: establishing Europe's data infrastructure for the life science research of the future. EMBO J 2021; 40:e107409. [PMID: 33565128 PMCID: PMC7957415 DOI: 10.15252/embj.2020107409] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 11/27/2020] [Revised: 12/14/2020] [Accepted: 12/16/2020] [Indexed: 02/06/2023] Open
Abstract
A new inter-governmental research infrastructure, ELIXIR, aims to unify bioinformatics resources and life science data across Europe, thereby facilitating their mining and (re-)use.
Affiliation(s)
- John Hancock
- ELIXIR Hub, Wellcome Genome Campus, Hinxton, Cambridge, UK
7
Lamprecht AL, Garcia L, Kuzak M, Martinez C, Arcila R, Martin Del Pico E, Dominguez Del Angel V, van de Sandt S, Ison J, Martinez PA, McQuilton P, Valencia A, Harrow J, Psomopoulos F, Gelpi JL, Chue Hong N, Goble C, Capella-Gutierrez S. Towards FAIR principles for research software. Data Science 2020. [DOI: 10.3233/ds-190026] [Citation(s) in RCA: 84] [Impact Index Per Article: 21.0] [Indexed: 11/15/2022]
Affiliation(s)
- Leyla Garcia
- ZBMED Information Centre for Life Sciences, Germany
- Mateusz Kuzak
- Netherlands eScience Center, The Netherlands
- Dutch Techcentre for Life Sciences, The Netherlands
- Jon Ison
- National Life Science Supercomputing Center, Technical University of Denmark, Denmark
- Alfonso Valencia
- Barcelona Supercomputing Center (BSC), Spain
- Institució Catalana de Recerca i Estudis Avançats (ICREA), Spain
- Josep Ll. Gelpi
- Barcelona Supercomputing Center (BSC), Spain
- University of Barcelona, Spain
- Neil Chue Hong
- Software Sustainability Institute, UK
- EPCC, University of Edinburgh, UK
8
Barnes IHA, Ibarra-Soria X, Fitzgerald S, Gonzalez JM, Davidson C, Hardy MP, Manthravadi D, Van Gerven L, Jorissen M, Zeng Z, Khan M, Mombaerts P, Harrow J, Logan DW, Frankish A. Expert curation of the human and mouse olfactory receptor gene repertoires identifies conserved coding regions split across two exons. BMC Genomics 2020; 21:196. [PMID: 32126975 PMCID: PMC7055050 DOI: 10.1186/s12864-020-6583-3] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 10/31/2019] [Accepted: 02/17/2020] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Olfactory receptor (OR) genes are the largest multi-gene family in the mammalian genome, with 874 in human and 1483 loci in mouse (including pseudogenes). The expansion of the OR gene repertoire has occurred through numerous duplication events followed by diversification, resulting in a large number of highly similar paralogous genes. These characteristics have made the annotation of the complete OR gene repertoire a complex task. Most OR genes have been predicted in silico and are typically annotated as intronless coding sequences. RESULTS Here we have developed an expert curation pipeline to analyse and annotate every OR gene in the human and mouse reference genomes. By combining evidence from structural features, evolutionary conservation and experimental data, we have unified the annotation of these gene families, and have systematically determined the protein-coding potential of each locus. We have defined the non-coding regions of many OR genes, enabling us to generate full-length transcript models. We found that 13 human and 41 mouse OR loci have coding sequences that are split across two exons. These split OR genes are conserved across mammals, and are expressed at the same level as protein-coding OR genes with an intronless coding region. Our findings challenge the long-standing and widespread notion that the coding region of a vertebrate OR gene is contained within a single exon. CONCLUSIONS This work provides the most comprehensive curation effort of the human and mouse OR gene repertoires to date. The complete annotation has been integrated into the GENCODE reference gene set, for immediate availability to the research community.
Affiliation(s)
- If H A Barnes
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
- Ximena Ibarra-Soria
- Cancer Research UK Cambridge Institute, University of Cambridge, Li Ka Shing Centre, Robinson Way, Cambridge, CB2 0RE, UK.
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.
- Stephen Fitzgerald
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
- Jose M Gonzalez
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Claire Davidson
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Matthew P Hardy
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Laura Van Gerven
- Department of ENT-HNS, UZ Leuven, Herestraat 49, 3000, Leuven, Belgium
- Mark Jorissen
- Department of ENT-HNS, UZ Leuven, Herestraat 49, 3000, Leuven, Belgium
- Zhen Zeng
- Max Planck Research Unit for Neurogenetics, Max von-Laue-Strasse 4, 60438, Frankfurt, Germany
- Mona Khan
- Max Planck Research Unit for Neurogenetics, Max von-Laue-Strasse 4, 60438, Frankfurt, Germany
- Peter Mombaerts
- Max Planck Research Unit for Neurogenetics, Max von-Laue-Strasse 4, 60438, Frankfurt, Germany
- Jennifer Harrow
- ELIXIR, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Darren W Logan
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
- Monell Chemical Senses Center, Philadelphia, PA, 19104, USA
- Waltham Petcare Science Institute, Leicestershire, LE14 4RT, UK
- Adam Frankish
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
9
Lilue J, Doran AG, Fiddes IT, Abrudan M, Armstrong J, Bennett R, Chow W, Collins J, Collins S, Czechanski A, Danecek P, Diekhans M, Dolle DD, Dunn M, Durbin R, Earl D, Ferguson-Smith A, Flicek P, Flint J, Frankish A, Fu B, Gerstein M, Gilbert J, Goodstadt L, Harrow J, Howe K, Ibarra-Soria X, Kolmogorov M, Lelliott C, Logan DW, Loveland J, Mathews CE, Mott R, Muir P, Nachtweide S, Navarro FC, Odom DT, Park N, Pelan S, Pham SK, Quail M, Reinholdt L, Romoth L, Shirley L, Sisu C, Sjoberg-Herrera M, Stanke M, Steward C, Thomas M, Threadgold G, Thybert D, Torrance J, Wong K, Wood J, Yalcin B, Yang F, Adams DJ, Paten B, Keane TM. Sixteen diverse laboratory mouse reference genomes define strain-specific haplotypes and novel functional loci. Nat Genet 2018; 50:1574-1583. [PMID: 30275530 PMCID: PMC6205630 DOI: 10.1038/s41588-018-0223-8] [Citation(s) in RCA: 119] [Impact Index Per Article: 19.8] [Received: 02/19/2018] [Accepted: 08/02/2018] [Indexed: 12/11/2022]
Abstract
We report full-length draft de novo genome assemblies for 16 widely used inbred mouse strains and find extensive strain-specific haplotype variation. We identify and characterize 2,567 regions on the current mouse reference genome exhibiting the greatest sequence diversity. These regions are enriched for genes involved in pathogen defence and immunity and exhibit enrichment of transposable elements and signatures of recent retrotransposition events. Combinations of alleles and genes unique to an individual strain are commonly observed at these loci, reflecting distinct strain phenotypes. We used these genomes to improve the mouse reference genome, resulting in the completion of 10 new gene structures. Also, 62 new coding loci were added to the reference genome annotation. These genomes identified a large, previously unannotated, gene (Efcab3-like) encoding 5,874 amino acids. Mutant Efcab3-like mice display anomalies in multiple brain regions, suggesting a possible role for this gene in the regulation of brain development.
MESH Headings
- Animals
- Animals, Laboratory
- Chromosome Mapping/veterinary
- Genetic Loci
- Genome
- Haplotypes/genetics
- Mice
- Mice, Inbred BALB C/genetics
- Mice, Inbred C3H/genetics
- Mice, Inbred C57BL/genetics
- Mice, Inbred CBA/genetics
- Mice, Inbred DBA/genetics
- Mice, Inbred NOD/genetics
- Mice, Inbred Strains/classification
- Mice, Inbred Strains/genetics
- Molecular Sequence Annotation
- Phylogeny
- Polymorphism, Single Nucleotide
- Species Specificity
Affiliation(s)
- Jingtao Lilue
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Anthony G. Doran
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Ian T. Fiddes
- Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Monica Abrudan
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Joel Armstrong
- Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Ruth Bennett
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
- William Chow
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Joanna Collins
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Stephan Collins
- Institut de Génétique et de Biologie Moléculaire et Cellulaire, Centre National de la Recherche Scientifique UMR7104, Institut National de la Santé et de la Recherche Médicale U964, Université de Strasbourg, 67404 Illkirch, France
- Centre des Sciences du Goût et de l’Alimentation, University of Bourgogne Franche-Comté, 21000 Dijon, France
- Anne Czechanski
- The Jackson Laboratory, 600 Main Street, Bar Harbor, ME 04609, USA
- Petr Danecek
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Mark Diekhans
- Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Dirk-Dominik Dolle
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Matt Dunn
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Richard Durbin
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Department of Genetics, University of Cambridge, Downing Site, Cambridge CB2 3EH, UK
- Dent Earl
- Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Anne Ferguson-Smith
- Department of Genetics, University of Cambridge, Downing Site, Cambridge CB2 3EH, UK
- Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Jonathan Flint
- Brain Research Institute, University of California, 695 Charles E Young Dr S, Los Angeles, CA 90095, USA
- Adam Frankish
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Beiyuan Fu
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Mark Gerstein
- Yale Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
- James Gilbert
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Leo Goodstadt
- OxFORD Asset Management, OxAM House, 6 George Street, Oxford OX1 2BW
- Jennifer Harrow
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Kerstin Howe
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Mikhail Kolmogorov
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093, USA
- Chris Lelliott
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Darren W. Logan
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Jane Loveland
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Clayton E. Mathews
- Department of Pathology, Immunology, and Laboratory Medicine, University of Florida, Gainesville, FL, USA
- Richard Mott
- Genetics Institute, University College London, Gower Street, London WC1E 6BT, UK
- Paul Muir
- Yale Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
- Stefanie Nachtweide
- Institute of Mathematics and Computer Science, University of Greifswald, Domstraße 11, 17489 Greifswald, Germany
- Fabio C.P. Navarro
- Yale Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
- Duncan T. Odom
- Cancer Research UK Cambridge Institute, University of Cambridge, Robinson Way, Cambridge, CB2 0RE, UK
- German Cancer Research Center (DKFZ), Division Signaling and Functional Genomics, 69120 Heidelberg, Germany
- Naomi Park
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Sarah Pelan
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Son K Pham
- BioTuring Inc., San Diego, California, CA92121
- Mike Quail
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Laura Reinholdt
- The Jackson Laboratory, 600 Main Street, Bar Harbor, ME 04609, USA
- Lars Romoth
- Institute of Mathematics and Computer Science, University of Greifswald, Domstraße 11, 17489 Greifswald, Germany
- Lesley Shirley
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Cristina Sisu
- Yale Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
- Department of Bioscience, Brunel University London, Uxbridge UB8 3PH, UK
- Marcela Sjoberg-Herrera
- Departamento de Biología Celular y Molecular, Facultad de Ciencias Biológicas, Pontificia Universidad Católica de Chile, Santiago 8331150, Chile
- Mario Stanke
- Institute of Mathematics and Computer Science, University of Greifswald, Domstraße 11, 17489 Greifswald, Germany
- Charles Steward
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Mark Thomas
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Glen Threadgold
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- David Thybert
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Earlham Institute, Norwich Research Park, Norwich NR4 7UZ, UK
- James Torrance
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Kim Wong
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Jonathan Wood
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Binnaz Yalcin
- Institut de Génétique et de Biologie Moléculaire et Cellulaire, Centre National de la Recherche Scientifique UMR7104, Institut National de la Santé et de la Recherche Médicale U964, Université de Strasbourg, 67404 Illkirch, France
- Fengtang Yang
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- David J. Adams
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Benedict Paten
- Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, CA 95064, USA
- Thomas M. Keane
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- School of Life Sciences, University of Nottingham, Nottingham, UK
| |
|
10
|
Lagarde J, Uszczynska-Ratajczak B, Carbonell S, Pérez-Lluch S, Abad A, Davis C, Gingeras TR, Frankish A, Harrow J, Guigo R, Johnson R. High-throughput annotation of full-length long noncoding RNAs with capture long-read sequencing. Nat Genet 2017; 49:1731-1740. [PMID: 29106417 PMCID: PMC5709232 DOI: 10.1038/ng.3988] [Citation(s) in RCA: 166] [Impact Index Per Article: 23.7] [Received: 02/01/2017] [Accepted: 10/11/2017] [Indexed: 12/20/2022]
Abstract
Accurate annotation of genes and their transcripts is a foundation of genomics, but currently no annotation technique combines throughput and accuracy. As a result, reference gene collections remain incomplete: many gene models are fragmentary, and thousands more remain uncataloged, particularly for long noncoding RNAs (lncRNAs). To accelerate lncRNA annotation, the GENCODE consortium has developed RNA Capture Long Seq (CLS), which combines targeted RNA capture with third-generation long-read sequencing. Here we present an experimental reannotation of the GENCODE intergenic lncRNA populations in matched human and mouse tissues that resulted in novel transcript models for 3,574 and 561 gene loci, respectively. CLS approximately doubled the annotated complexity of targeted loci, outperforming existing short-read techniques. Full-length transcript models produced by CLS enabled us to definitively characterize the genomic features of lncRNAs, including promoter and gene structure, and protein-coding potential. Thus, CLS removes a long-standing bottleneck in transcriptome annotation and generates manual-quality full-length transcript models at high-throughput scales.
Affiliation(s)
- Julien Lagarde
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, 08003 Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Barbara Uszczynska-Ratajczak
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, 08003 Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Silvia Carbonell
- R&D Department, Quantitative Genomic Medicine Laboratories (qGenomics), Barcelona, Spain
| | - Sílvia Pérez-Lluch
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, 08003 Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Amaya Abad
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, 08003 Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Carrie Davis
- Functional Genomics Group, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, New York 11724, USA
| | - Thomas R. Gingeras
- Functional Genomics Group, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, New York 11724, USA
| | - Adam Frankish
- Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, UK CB10 1HH
| | - Jennifer Harrow
- Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, UK CB10 1HH
| | - Roderic Guigo
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, 08003 Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Rory Johnson
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, 08003 Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| |
|
11
|
Abstract
The Human Genome Project and advances in DNA sequencing technologies have revolutionized the identification of genetic disorders through the use of clinical exome sequencing. However, in a considerable number of patients, the genetic basis remains unclear. As clinicians begin to consider whole-genome sequencing, an understanding of the processes and tools involved and the factors to consider in the annotation of the structure and function of genomic elements that might influence variant identification is crucial. Here, we discuss and illustrate the strengths and weaknesses of approaches for the annotation and classification of important elements of protein-coding genes, other genomic elements such as pseudogenes and the non-coding genome, comparative-genomic approaches for inferring gene function, and new technologies for aiding genome annotation, as a practical guide for clinicians when considering pathogenic sequence variation. Complete and accurate annotation of structure and function of genome features has the potential to reduce both false-negative (from missing annotation) and false-positive (from incorrect annotation) errors in causal variant identification in exome and genome sequences. Re-analysis of unsolved cases will be necessary as newer technology improves genome annotation, potentially improving the rate of diagnosis.
Affiliation(s)
- Charles A Steward
- Congenica Ltd, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1DR, UK
- The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | | | - Berge A Minassian
- Department of Pediatrics (Neurology), University of Texas Southwestern, Dallas, TX, USA
- Program in Genetics and Genome Biology and Department of Paediatrics (Neurology), The Hospital for Sick Children and University of Toronto, Toronto, Canada
| | - Sanjay M Sisodiya
- Department of Clinical and Experimental Epilepsy, UCL Institute of Neurology, London, WC1N 3BG, UK
- Chalfont Centre for Epilepsy, Chesham Lane, Chalfont St Peter, Buckinghamshire, SL9 0RJ, UK
| | - Adam Frankish
- The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Jennifer Harrow
- The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
- Illumina Inc, Great Chesterford, Essex, CB10 1XL, UK
| |
|
12
|
Abstract
A genome sequence is worthless if it cannot be deciphered; therefore, efforts to describe - or 'annotate' - genes began as soon as DNA sequences became available. Whereas early work focused on individual protein-coding genes, the modern genomic ocean is a complex maelstrom of alternative splicing, non-coding transcription and pseudogenes. Scientists - from clinicians to evolutionary biologists - need to navigate these waters, and this has led to the design of high-throughput, computationally driven annotation projects. The catalogues that are being produced are key resources for genome exploration, especially as they become integrated with expression, epigenomic and variation data sets. Their creation, however, remains challenging.
Affiliation(s)
- Jonathan M Mudge
- Department of Computational Genomics, Wellcome Trust Sanger Institute, Hinxton CB10 1SA, UK
| | - Jennifer Harrow
- Department of Computational Genomics, Wellcome Trust Sanger Institute, Hinxton CB10 1SA, UK
- Illumina Cambridge Ltd, Chesterford Research Park, Little Chesterford, Saffron Walden CB10 1XL, UK
| |
|
13
|
Lagarde J, Uszczynska-Ratajczak B, Santoyo-Lopez J, Gonzalez JM, Tapanari E, Mudge JM, Steward CA, Wilming L, Tanzer A, Howald C, Chrast J, Vela-Boza A, Rueda A, Lopez-Domingo FJ, Dopazo J, Reymond A, Guigó R, Harrow J. Extension of human lncRNA transcripts by RACE coupled with long-read high-throughput sequencing (RACE-Seq). Nat Commun 2016; 7:12339. [PMID: 27531712 PMCID: PMC4992054 DOI: 10.1038/ncomms12339] [Citation(s) in RCA: 53] [Impact Index Per Article: 6.6] [Received: 04/20/2016] [Accepted: 06/23/2016] [Indexed: 12/22/2022] [Open Access]
Abstract
Long non-coding RNAs (lncRNAs) constitute a large, yet mostly uncharacterized fraction of the mammalian transcriptome. Such characterization requires a comprehensive, high-quality annotation of their gene structure and boundaries, which is currently lacking. Here we describe RACE-Seq, an experimental workflow designed to address this based on RACE (rapid amplification of cDNA ends) and long-read RNA sequencing. We apply RACE-Seq to 398 human lncRNA genes in seven tissues, leading to the discovery of 2,556 on-target, novel transcripts. About 60% of the targeted loci are extended in either 5′ or 3′, often reaching genomic hallmarks of gene boundaries. Analysis of the novel transcripts suggests that lncRNAs are as long, have as many exons and undergo as much alternative splicing as protein-coding genes, contrary to current assumptions. Overall, we show that RACE-Seq is an effective tool to annotate an organism's deep transcriptome, and that it compares favourably to other targeted sequencing techniques. Long non-coding RNAs are increasingly recognised to be important factors in regulating cellular processes and comprise a large fraction of the transcriptome; however, most are uncharacterised. Here the authors present RACE-Seq, a tool to improve and extend the annotation of low-expression transcripts.
Affiliation(s)
- Julien Lagarde
- Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Dr Aiguader 88, 08003 Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Barbara Uszczynska-Ratajczak
- Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Dr Aiguader 88, 08003 Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | | | | | - Electra Tapanari
- Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1HH, UK
| | - Jonathan M Mudge
- Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1HH, UK
| | - Charles A Steward
- Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1HH, UK
| | - Laurens Wilming
- Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1HH, UK
| | - Andrea Tanzer
- Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Dr Aiguader 88, 08003 Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Cédric Howald
- Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland
| | - Jacqueline Chrast
- Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland
| | - Alicia Vela-Boza
- Genomics and Bioinformatics Platform of Andalusia (GBPA), 41092 Seville, Spain
- Roche Diagnostics, 08174 Sant Cugat Del Vallès, Barcelona, Spain
| | - Antonio Rueda
- Genomics and Bioinformatics Platform of Andalusia (GBPA), 41092 Seville, Spain
| | | | - Joaquin Dopazo
- Genomics and Bioinformatics Platform of Andalusia (GBPA), 41092 Seville, Spain
- Computational Genomics Department, Centro de Investigación Príncipe Felipe, 46012 Valencia, Spain
- Functional Genomics Node (INB), Centro de Investigación Príncipe Felipe, 46012 Valencia, Spain
| | - Alexandre Reymond
- Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland
| | - Roderic Guigó
- Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Dr Aiguader 88, 08003 Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Jennifer Harrow
- Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1HH, UK
| |
|
14
|
Bruce I, Harrow J, Obolenskaya P. Blind and partially sighted people’s perceptions of their inclusion by family and friends. British Journal of Visual Impairment 2016. [DOI: 10.1177/0264619607071778] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Indexed: 11/17/2022]
Abstract
Blind and partially sighted people’s perceptions of inclusion by family and friends are examined in a major survey of over 900 adults with low vision in the UK. Findings demonstrate a complex picture, reporting high levels of severe lack of social support in comparison to the general population especially among men, and lack of social support expressed extensively by those who were rarely or never visited by family or neighbours. Levels of reported social support were not related to the degree of severity of sight loss or age; and economically inactive respondents of working age reported lower levels of social support than those who were working. Correlation between respondents’ having hobbies and going shopping and rising levels of social support was shown. With 40% of respondents living alone, having someone visiting as little as at least once a month meant that respondents were less likely to express severe lack of social support. The concept of ‘inclusion’ is recognized as more associated with formal ideas of citizenship and participation in community life than with informal support. It is suggested that increased focus should be given in public policy development and service provision to enabling greater levels of informal inclusion for people with visual impairments. Implications for general services development are noted.
|
15
|
Yates A, Akanni W, Amode MR, Barrell D, Billis K, Carvalho-Silva D, Cummins C, Clapham P, Fitzgerald S, Gil L, Girón CG, Gordon L, Hourlier T, Hunt SE, Janacek SH, Johnson N, Juettemann T, Keenan S, Lavidas I, Martin FJ, Maurel T, McLaren W, Murphy DN, Nag R, Nuhn M, Parker A, Patricio M, Pignatelli M, Rahtz M, Riat HS, Sheppard D, Taylor K, Thormann A, Vullo A, Wilder SP, Zadissa A, Birney E, Harrow J, Muffato M, Perry E, Ruffier M, Spudich G, Trevanion SJ, Cunningham F, Aken BL, Zerbino DR, Flicek P. Ensembl 2016. Nucleic Acids Res 2016; 44:D710-6. [PMID: 26687719 PMCID: PMC4702834 DOI: 10.1093/nar/gkv1157] [Citation(s) in RCA: 1066] [Impact Index Per Article: 133.3] [Received: 09/19/2015] [Revised: 10/19/2015] [Accepted: 10/19/2015] [Indexed: 01/17/2023] [Open Access]
Abstract
The Ensembl project (http://www.ensembl.org) is a system for genome annotation, analysis, storage and dissemination designed to facilitate the access of genomic annotation from chordates and key model organisms. It provides access to data from 87 species across our main and early access Pre! websites. This year we introduced three newly annotated species and released numerous updates across our supported species with a concentration on data for the latest genome assemblies of human, mouse, zebrafish and rat. We also provided two data updates for the previous human assembly, GRCh37, through a dedicated website (http://grch37.ensembl.org). Our tools, in particular the VEP, have been improved significantly through integration of additional third party data. REST is now capable of larger-scale analysis and our regulatory data BioMart can deliver faster results. The website is now capable of displaying long-range interactions such as those found in cis-regulated datasets. Finally we have launched a website optimized for mobile devices providing views of genes, variants and phenotypes. Our data is made available without restriction and all code is available from our GitHub organization site (http://github.com/Ensembl) under an Apache 2.0 license.
Affiliation(s)
- Andrew Yates
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Wasiu Akanni
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - M Ridwan Amode
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Daniel Barrell
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Konstantinos Billis
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Denise Carvalho-Silva
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Carla Cummins
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Peter Clapham
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Stephen Fitzgerald
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Laurent Gil
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Carlos García Girón
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Leo Gordon
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Thibaut Hourlier
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Sarah E Hunt
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Sophie H Janacek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Nathan Johnson
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Thomas Juettemann
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Stephen Keenan
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Ilias Lavidas
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Fergal J Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Thomas Maurel
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - William McLaren
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Daniel N Murphy
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Rishi Nag
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Michael Nuhn
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Anne Parker
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Mateus Patricio
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Miguel Pignatelli
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Matthew Rahtz
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Harpreet Singh Riat
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Daniel Sheppard
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Kieron Taylor
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Anja Thormann
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Alessandro Vullo
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Steven P Wilder
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Amonida Zadissa
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Ewan Birney
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Jennifer Harrow
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Matthieu Muffato
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Emily Perry
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Magali Ruffier
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Giulietta Spudich
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Stephen J Trevanion
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Fiona Cunningham
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Bronwen L Aken
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Daniel R Zerbino
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| |
|
16
|
Abstract
A report on the Wellcome Trust retreat on devising a consensus framework for the validation of novel human protein coding loci, held in Hinxton, U.K., May 11-13, 2015.
Affiliation(s)
- Elspeth A Bruford
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI) , Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, United Kingdom
| | - Lydie Lane
- SIB Swiss Institute of Bioinformatics and University of Geneva, Faculty of Medicine, CMU, Michel Servet 1, 1211 Geneva 4, Switzerland
| | - Jennifer Harrow
- Wellcome Trust Sanger Institute , Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SA, United Kingdom
| |
|
17
|
Abstract
Annotation of the reference genome of the C57BL/6J mouse has been an ongoing project ever since the draft genome was first published. Initially, the principal focus was on the identification of all protein-coding genes, although today the importance of describing long non-coding RNAs, small RNAs, and pseudogenes is recognized. Here, we describe the progress of the GENCODE mouse annotation project, which combines manual annotation from the HAVANA group with Ensembl computational annotation, alongside experimental and in silico validation pipelines from other members of the consortium. We discuss the more recent incorporation of next-generation sequencing datasets into this workflow, including the usage of mass-spectrometry data to potentially identify novel protein-coding genes. Finally, we outline how the C57BL/6J genebuild can be used to gain insights into the variant sites that distinguish different mouse strains and species.
|
18
|
Frankish A, Uszczynska B, Ritchie GRS, Gonzalez JM, Pervouchine D, Petryszak R, Mudge JM, Fonseca N, Brazma A, Guigo R, Harrow J. Comparison of GENCODE and RefSeq gene annotation and the impact of reference geneset on variant effect prediction. BMC Genomics 2015; 16 Suppl 8:S2. [PMID: 26110515 PMCID: PMC4502323 DOI: 10.1186/1471-2164-16-s8-s2] [Citation(s) in RCA: 58] [Impact Index Per Article: 6.4] [Indexed: 01/02/2023] [Open Access]
Abstract
Background: A vast amount of DNA variation is being identified by increasingly large-scale exome and genome sequencing projects. To be useful, variants require accurate functional annotation and a wide range of tools are available to this end. McCarthy et al. recently demonstrated the large differences in prediction of loss-of-function (LoF) variation when RefSeq and Ensembl transcripts are used for annotation, highlighting the importance of the reference transcripts on which variant functional annotation is based. Results: We describe a detailed analysis of the similarities and differences between the gene and transcript annotation in the GENCODE and RefSeq genesets. We demonstrate that the GENCODE Comprehensive set is richer in alternative splicing, novel CDSs, novel exons and has higher genomic coverage than RefSeq, while the GENCODE Basic set is very similar to RefSeq. Using RNAseq data we show that exons and introns unique to one geneset are expressed at a similar level to those common to both. We present evidence that the differences in gene annotation lead to large differences in variant annotation where GENCODE and RefSeq are used as reference transcripts, although this is predominantly confined to non-coding transcripts and UTR sequence, with at most ~30% of LoF variants annotated discordantly. We also describe an investigation of dominant transcript expression, showing that it both supports the utility of the GENCODE Basic set in providing a smaller set of more highly expressed transcripts and provides a useful, biologically-relevant filter for further reducing the complexity of the transcriptome. Conclusions: The reference transcripts selected for variant functional annotation do have a large effect on the outcome. The GENCODE Comprehensive transcripts contain more exons, have greater genomic coverage and capture many more variants than RefSeq in both genome and exome datasets, while the GENCODE Basic set shows a higher degree of concordance with RefSeq and has fewer unique features. We propose that the GENCODE Comprehensive set has great utility for the discovery of new variants with functional potential, while the GENCODE Basic set is more suitable for applications demanding less complex interpretation of functional variants.
|
19
|
Cunningham F, Amode MR, Barrell D, Beal K, Billis K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fitzgerald S, Gil L, Girón CG, Gordon L, Hourlier T, Hunt SE, Janacek SH, Johnson N, Juettemann T, Kähäri AK, Keenan S, Martin FJ, Maurel T, McLaren W, Murphy DN, Nag R, Overduin B, Parker A, Patricio M, Perry E, Pignatelli M, Riat HS, Sheppard D, Taylor K, Thormann A, Vullo A, Wilder SP, Zadissa A, Aken BL, Birney E, Harrow J, Kinsella R, Muffato M, Ruffier M, Searle SMJ, Spudich G, Trevanion SJ, Yates A, Zerbino DR, Flicek P. Ensembl 2015. Nucleic Acids Res 2014; 43:D662-9. [PMID: 25352552 PMCID: PMC4383879 DOI: 10.1093/nar/gku1010] [Citation(s) in RCA: 961] [Impact Index Per Article: 96.1] [Indexed: 01/01/2023] [Open Access]
Abstract
Ensembl (http://www.ensembl.org) is a genomic interpretation system providing the most up-to-date annotations, querying tools and access methods for chordates and key model organisms. This year we released updated annotation (gene models, comparative genomics, regulatory regions and variation) on the new human assembly, GRCh38, although we continue to support researchers using the GRCh37.p13 assembly through a dedicated site (http://grch37.ensembl.org). Our Regulatory Build has been revamped to identify regulatory regions of interest and to efficiently highlight their activity across disparate epigenetic data sets. A number of new interfaces allow users to perform large-scale comparisons of their data against our annotations. The REST server (http://rest.ensembl.org), which allows programs written in any language to query our databases, has moved to a full service alongside our upgraded website tools. Our online Variant Effect Predictor tool has been updated to process more variants and calculate summary statistics. Lastly, the WiggleTools package enables users to summarize large collections of data sets and view them as single tracks in Ensembl. The Ensembl code base itself is more accessible: it is now hosted on our GitHub organization page (https://github.com/Ensembl) under an Apache 2.0 open source license.
Affiliation(s)
- Fiona Cunningham
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - M Ridwan Amode
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Daniel Barrell
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Kathryn Beal
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Konstantinos Billis
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Simon Brent
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Denise Carvalho-Silva
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Peter Clapham
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Guy Coates
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Stephen Fitzgerald
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Laurent Gil
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Carlos García Girón
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Leo Gordon
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Thibaut Hourlier
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Sarah E Hunt
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Sophie H Janacek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Nathan Johnson
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Thomas Juettemann
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Andreas K Kähäri
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Stephen Keenan
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Fergal J Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Thomas Maurel
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - William McLaren
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Daniel N Murphy
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Rishi Nag
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Bert Overduin
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Anne Parker
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Mateus Patricio
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Emily Perry
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Miguel Pignatelli
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Harpreet Singh Riat
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Daniel Sheppard
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Kieron Taylor
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Anja Thormann
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Alessandro Vullo
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Steven P Wilder
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Amonida Zadissa
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Bronwen L Aken
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Ewan Birney
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Jennifer Harrow
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Rhoda Kinsella
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Matthieu Muffato
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Magali Ruffier
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Stephen M J Searle
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Giulietta Spudich
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Stephen J Trevanion
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Andy Yates
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Daniel R Zerbino
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| |
|
20
|
Petrov AI, Kay SJE, Gibson R, Kulesha E, Staines D, Bruford EA, Wright MW, Burge S, Finn RD, Kersey PJ, Cochrane G, Bateman A, Griffiths-Jones S, Harrow J, Chan PP, Lowe TM, Zwieb CW, Wower J, Williams KP, Hudson CM, Gutell R, Clark MB, Dinger M, Quek XC, Bujnicki JM, Chua NH, Liu J, Wang H, Skogerbø G, Zhao Y, Chen R, Zhu W, Cole JR, Chai B, Huang HD, Huang HY, Cherry JM, Hatzigeorgiou A, Pruitt KD. RNAcentral: an international database of ncRNA sequences. Nucleic Acids Res 2014; 43:D123-9. [PMID: 25352543 PMCID: PMC4384043 DOI: 10.1093/nar/gku991] [Citation(s) in RCA: 86] [Impact Index Per Article: 8.6] [Indexed: 01/23/2023]
Abstract
The field of non-coding RNA biology has been hampered by the lack of availability of a comprehensive, up-to-date collection of accessioned RNA sequences. Here we present the first release of RNAcentral, a database that collates and integrates information from an international consortium of established RNA sequence databases. The initial release contains over 8.1 million sequences, including representatives of all major functional classes. A web portal (http://rnacentral.org) provides free access to data, search functionality, cross-references, source code and an integrated genome browser for selected species.
|
21
|
Ezkurdia I, Juan D, Rodriguez JM, Frankish A, Diekhans M, Harrow J, Vazquez J, Valencia A, Tress ML. Multiple evidence strands suggest that there may be as few as 19,000 human protein-coding genes. Hum Mol Genet 2014; 23:5866-78. [PMID: 24939910 PMCID: PMC4204768 DOI: 10.1093/hmg/ddu309] [Citation(s) in RCA: 320] [Impact Index Per Article: 32.0] [Indexed: 12/27/2022]
Abstract
Determining the full complement of protein-coding genes is a key goal of genome annotation. The most powerful approach for confirming protein-coding potential is the detection of cellular protein expression through peptide mass spectrometry (MS) experiments. Here, we mapped peptides detected in seven large-scale proteomics studies to almost 60% of the protein-coding genes in the GENCODE annotation of the human genome. We found a strong relationship between detection in proteomics experiments and both gene family age and cross-species conservation. Most of the genes for which we detected peptides were highly conserved. We found peptides for >96% of genes that evolved before bilateria. At the opposite end of the scale, we identified almost no peptides for genes that have appeared since primates, for genes that did not have any protein-like features or for genes with poor cross-species conservation. These results motivated us to describe a set of 2001 potential non-coding genes based on features such as weak conservation, a lack of protein features, or ambiguous annotations from major databases, all of which correlated with low peptide detection across the seven experiments. We identified peptides for just 3% of these genes. We show that many of these genes behave more like non-coding genes than protein-coding genes and suggest that most are unlikely to code for proteins under normal circumstances. We believe that their inclusion in the human protein-coding gene catalogue should be revised as part of the ongoing human genome annotation effort.
Affiliation(s)
| | - David Juan
- Structural Biology and Bioinformatics Programme and
| | - Jose Manuel Rodriguez
- National Bioinformatics Institute (INB), Spanish National Cancer Research Centre (CNIO), Melchor Fernández Almagro, 3, 28029, Madrid, Spain
| | - Adam Frankish
- Wellcome Trust Sanger Institute, Wellcome Trust Campus, Hinxton, Cambridge CB10 1SA, UK and
| | - Mark Diekhans
- Center for Biomolecular Science and Engineering, School of Engineering, University of California Santa Cruz (UCSC), 1156 High Street, Santa Cruz, CA 95064, USA
| | - Jennifer Harrow
- Wellcome Trust Sanger Institute, Wellcome Trust Campus, Hinxton, Cambridge CB10 1SA, UK and
| | - Jesus Vazquez
- Laboratorio de Proteómica Cardiovascular, Centro Nacional de Investigaciones Cardiovasculares, CNIC, Melchor Fernández Almagro, 3, 28029, Madrid, Spain
| | - Alfonso Valencia
- Structural Biology and Bioinformatics Programme and, National Bioinformatics Institute (INB), Spanish National Cancer Research Centre (CNIO), Melchor Fernández Almagro, 3, 28029, Madrid, Spain,
| | | |
|
22
|
Deelen J, Beekman M, Uh HW, Broer L, Ayers KL, Tan Q, Kamatani Y, Bennet AM, Tamm R, Trompet S, Guðbjartsson DF, Flachsbart F, Rose G, Viktorin A, Fischer K, Nygaard M, Cordell HJ, Crocco P, van den Akker EB, Böhringer S, Helmer Q, Nelson CP, Saunders GI, Alver M, Andersen-Ranberg K, Breen ME, van der Breggen R, Caliebe A, Capri M, Cevenini E, Collerton JC, Dato S, Davies K, Ford I, Gampe J, Garagnani P, de Geus EJC, Harrow J, van Heemst D, Heijmans BT, Heinsen FA, Hottenga JJ, Hofman A, Jeune B, Jonsson PV, Lathrop M, Lechner D, Martin-Ruiz C, Mcnerlan SE, Mihailov E, Montesanto A, Mooijaart SP, Murphy A, Nohr EA, Paternoster L, Postmus I, Rivadeneira F, Ross OA, Salvioli S, Sattar N, Schreiber S, Stefánsson H, Stott DJ, Tiemeier H, Uitterlinden AG, Westendorp RGJ, Willemsen G, Samani NJ, Galan P, Sørensen TIA, Boomsma DI, Jukema JW, Rea IM, Passarino G, de Craen AJM, Christensen K, Nebel A, Stefánsson K, Metspalu A, Magnusson P, Blanché H, Christiansen L, Kirkwood TBL, van Duijn CM, Franceschi C, Houwing-Duistermaat JJ, Slagboom PE. Genome-wide association meta-analysis of human longevity identifies a novel locus conferring survival beyond 90 years of age. Hum Mol Genet 2014; 23:4420-32. [PMID: 24688116 PMCID: PMC4103672 DOI: 10.1093/hmg/ddu139] [Citation(s) in RCA: 173] [Impact Index Per Article: 17.3] [Indexed: 11/18/2022]
Abstract
The genetic contribution to the variation in human lifespan is ∼25%. Despite the large number of identified disease-susceptibility loci, it is not known which loci influence population mortality. We performed a genome-wide association meta-analysis of 7729 long-lived individuals of European descent (≥85 years) and 16 121 younger controls (<65 years) followed by replication in an additional set of 13 060 long-lived individuals and 61 156 controls. In addition, we performed a subset analysis in cases aged ≥90 years. We observed genome-wide significant association with longevity, as reflected by survival to ages beyond 90 years, at a novel locus, rs2149954, on chromosome 5q33.3 (OR = 1.10, P = 1.74 × 10⁻⁸). We also confirmed association of rs4420638 on chromosome 19q13.32 (OR = 0.72, P = 3.40 × 10⁻³⁶), representing the TOMM40/APOE/APOC1 locus. In a prospective meta-analysis (n = 34 103), the minor allele of rs2149954 (T) on chromosome 5q33.3 associates with increased survival (HR = 0.95, P = 0.003). This allele has previously been reported to associate with low blood pressure in middle age. Interestingly, the minor allele (T) associates with decreased cardiovascular mortality risk, independent of blood pressure. We report on the first GWAS-identified longevity locus on chromosome 5q33.3 influencing survival in the general European population. The minor allele of this locus associates with low blood pressure in middle age, although the contribution of this allele to survival may be less dependent on blood pressure. Hence, the pleiotropic mechanisms by which this intragenic variation contributes to lifespan regulation have to be elucidated.
Affiliation(s)
- Joris Deelen
- Department of Molecular Epidemiology, Netherlands Consortium for Healthy Ageing
| | - Marian Beekman
- Department of Molecular Epidemiology, Netherlands Consortium for Healthy Ageing
| | - Hae-Won Uh
- Department of Medical Statistics and Bioinformatics
| | - Linda Broer
- Netherlands Consortium for Healthy Ageing, Department of Epidemiology and
| | - Kristin L Ayers
- Institute of Genetic Medicine, International Centre for Life, Newcastle University, Newcastle upon Tyne NE1 3BZ, UK
| | - Qihua Tan
- Epidemiology, Institute of Public Health and Department of Clinical Genetics and
| | | | - Anna M Bennet
- Department of Medical Epidemiology and Biostatistics, Karolinska Institute, Stockholm SE-171 77, Sweden
| | - Riin Tamm
- Estonian Genome Center and Institute of Molecular and Cell Biology, University of Tartu, Tartu 51010, Estonia
| | - Stella Trompet
- Department of Cardiology and Department of Gerontology and Geriatrics, Leiden University Medical Center, Leiden 2300 RC, The Netherlands
| | | | | | - Giuseppina Rose
- Department of Biology, Ecology and Earth Science, University of Calabria, Rende 87036, Italy
| | - Alexander Viktorin
- Department of Medical Epidemiology and Biostatistics, Karolinska Institute, Stockholm SE-171 77, Sweden
| | | | - Marianne Nygaard
- Epidemiology, Institute of Public Health and Department of Clinical Genetics and
| | - Heather J Cordell
- Institute of Genetic Medicine, International Centre for Life, Newcastle University, Newcastle upon Tyne NE1 3BZ, UK
| | - Paolina Crocco
- Department of Biology, Ecology and Earth Science, University of Calabria, Rende 87036, Italy
| | - Erik B van den Akker
- Department of Molecular Epidemiology, Delft Bioinformatics Lab, Delft University of Technology, Delft 2600 GA, The Netherlands
| | | | | | - Christopher P Nelson
- Department of Cardiovascular Sciences, University of Leicester, Leicester LE3 9QP, UK National Institute for Health Research Leicester Cardiovascular Biomedical Research Unit, Glenfield Hospital, Leicester LE3 9QP, UK
| | - Gary I Saunders
- Human and Vertebrate Analysis and Annotation, The Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
| | - Maris Alver
- Estonian Genome Center and Institute of Molecular and Cell Biology, University of Tartu, Tartu 51010, Estonia
| | | | - Marie E Breen
- School of Medicine, Dentistry and Biomedical Science, Queens University Belfast, Belfast BT9 7BL, UK Department of Psychiatry, University of Iowa, Iowa City, IA 52242, USA
| | | | - Amke Caliebe
- Institute of Medical Informatics and Statistics, Christian-Albrechts-University, Kiel 24105, Germany
| | - Miriam Capri
- Department of Experimental, Diagnostic and Specialty Medicine and
| | - Elisa Cevenini
- Department of Experimental, Diagnostic and Specialty Medicine and
| | - Joanna C Collerton
- Institute for Ageing and Health, Newcastle University, Campus for Ageing and Vitality, Newcastle upon Tyne NE4 5PL, UK
| | - Serena Dato
- Department of Biology, Ecology and Earth Science, University of Calabria, Rende 87036, Italy
| | - Karen Davies
- Institute for Ageing and Health, Newcastle University, Campus for Ageing and Vitality, Newcastle upon Tyne NE4 5PL, UK
| | - Ian Ford
- Robertson Center for Biostatistics and
| | - Jutta Gampe
- Laboratory of Statistical Demography, Max Planck Institute for Demographic Research, Rostock 18057, Germany
| | - Paolo Garagnani
- Department of Experimental, Diagnostic and Specialty Medicine and
| | - Eco J C de Geus
- Department of Biological Psychology, VU University Amsterdam, Amsterdam 1081 BT, The Netherlands EMGO Institute for Health and Care Research, VU University Medical Center, Amsterdam 1081 BT, The Netherlands
| | - Jennifer Harrow
- Human and Vertebrate Analysis and Annotation, The Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
| | - Diana van Heemst
- Department of Gerontology and Geriatrics, Leiden University Medical Center, Leiden 2300 RC, The Netherlands
| | - Bastiaan T Heijmans
- Department of Molecular Epidemiology, Netherlands Consortium for Healthy Ageing
| | | | - Jouke-Jan Hottenga
- Department of Biological Psychology, VU University Amsterdam, Amsterdam 1081 BT, The Netherlands
| | - Albert Hofman
- Netherlands Consortium for Healthy Ageing, Department of Epidemiology and
| | | | - Palmi V Jonsson
- Geriatrics, Landspitali University Hospital, Reykjavik 101, Iceland Faculty of Medicine, University of Iceland, Reykjavik 101, Iceland
| | - Mark Lathrop
- Fondation Jean Dausset-CEPH, Paris 75010, France EMGO Institute for Health and Care Research, VU University Medical Center, Amsterdam 1081 BT, The Netherlands McGill University and Génome Québec Innovation Centre, Montréal, Québec, Canada H3A 1A4
| | | | - Carmen Martin-Ruiz
- Institute for Ageing and Health, Newcastle University, Campus for Ageing and Vitality, Newcastle upon Tyne NE4 5PL, UK
| | - Susan E Mcnerlan
- School of Medicine, Dentistry and Biomedical Science, Queens University Belfast, Belfast BT9 7BL, UK Cytogenetics Laboratory, Belfast Health and Social Care Trust, Belfast BT8 8BH, UK
| | - Evelin Mihailov
- Estonian Genome Center and Estonian Biocentre, Tartu 51010, Estonia
| | - Alberto Montesanto
- Department of Biology, Ecology and Earth Science, University of Calabria, Rende 87036, Italy
| | - Simon P Mooijaart
- Netherlands Consortium for Healthy Ageing, Department of Gerontology and Geriatrics, Leiden University Medical Center, Leiden 2300 RC, The Netherlands
| | - Anne Murphy
- School of Medicine, Dentistry and Biomedical Science, Queens University Belfast, Belfast BT9 7BL, UK
| | - Ellen A Nohr
- Section for Epidemiology, Department of Public Health, Aarhus University, Aarhus C DK-8000, Denmark Department of Gynecology and Obstetrics, Institute of Clinical Research, University of Southern Denmark, Odense C DK-5000, Denmark
| | - Lavinia Paternoster
- MRC Centre for Causal Analyses in Translational Epidemiology, School of Social and Community Medicine, University of Bristol, Bristol BS8 2BN, UK
| | - Iris Postmus
- Netherlands Consortium for Healthy Ageing, Department of Gerontology and Geriatrics, Leiden University Medical Center, Leiden 2300 RC, The Netherlands
| | - Fernando Rivadeneira
- Netherlands Consortium for Healthy Ageing, Department of Epidemiology and Department of Internal Medicine, Erasmus Medical Center, Rotterdam 3000 CA, The Netherlands
| | - Owen A Ross
- School of Medicine, Dentistry and Biomedical Science, Queens University Belfast, Belfast BT9 7BL, UK Department of Neuroscience, Mayo Clinic, Jacksonville, FL 32224, USA
| | - Stefano Salvioli
- Department of Experimental, Diagnostic and Specialty Medicine and
| | - Naveed Sattar
- BHF Glasgow Cardiovascular Research Centre, Faculty of Medicine, University of Glasgow, Glasgow G12 8TA, UK
| | - Stefan Schreiber
- Institute of Clinical Molecular Biology and PopGen Biobank, Christian-Albrechts-University and University Hospital Schleswig-Holstein, Kiel 24105, Germany
| | | | - David J Stott
- Institute of Cardiovascular and Medical Sciences, University of Glasgow, Glasgow G12 8QQ, UK
| | - Henning Tiemeier
- Netherlands Consortium for Healthy Ageing, Department of Epidemiology and Department of Child and Adolescent Psychiatry, Erasmus Medical Center-Sophia Children's Hospital, Rotterdam 3000 CA, The Netherlands
| | - André G Uitterlinden
- Netherlands Consortium for Healthy Ageing, Department of Epidemiology and Department of Internal Medicine, Erasmus Medical Center, Rotterdam 3000 CA, The Netherlands
| | - Rudi G J Westendorp
- Netherlands Consortium for Healthy Ageing, Department of Gerontology and Geriatrics, Leiden University Medical Center, Leiden 2300 RC, The Netherlands
| | - Gonneke Willemsen
- Department of Biological Psychology, VU University Amsterdam, Amsterdam 1081 BT, The Netherlands
| | - Nilesh J Samani
- Department of Cardiovascular Sciences, University of Leicester, Leicester LE3 9QP, UK National Institute for Health Research Leicester Cardiovascular Biomedical Research Unit, Glenfield Hospital, Leicester LE3 9QP, UK
| | - Pilar Galan
- Université Sorbonne Paris Cité-UREN (Unité de Recherche en Epidémiologie Nutritionnelle), U557 Inserm; U1125 Inra; Cnam; Université Paris 13, CRNH IdF, Bobigny 93017, France
| | - Thorkild I A Sørensen
- Novo Nordisk Foundation Center for Basic Metabolic Research, Section on Metabolic Genetics, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen N DK-2200, Denmark Institute of Preventive Medicine, Bispebjerg and Frederiksberg University Hospitals, Frederiksberg DK-2000, Denmark
| | - Dorret I Boomsma
- Department of Biological Psychology, VU University Amsterdam, Amsterdam 1081 BT, The Netherlands
| | - J Wouter Jukema
- Department of Cardiology and Interuniversity Cardiology Institute of the Netherlands, Utrecht 3501 DG, The Netherlands
| | - Irene Maeve Rea
- School of Medicine, Dentistry and Biomedical Science, Queens University Belfast, Belfast BT9 7BL, UK
| | - Giuseppe Passarino
- Department of Biology, Ecology and Earth Science, University of Calabria, Rende 87036, Italy
| | - Anton J M de Craen
- Department of Gerontology and Geriatrics, Leiden University Medical Center, Leiden 2300 RC, The Netherlands
| | - Kaare Christensen
- Epidemiology, Institute of Public Health and Department of Clinical Genetics and Clinical Biochemistry and Pharmacology, Odense University Hospital, Odense C DK-5000, Denmark
| | | | - Kári Stefánsson
- Population Genomics, deCODE Genetics, Reykjavík 101, Iceland
| | - Andres Metspalu
- Estonian Genome Center and Institute of Molecular and Cell Biology, University of Tartu, Tartu 51010, Estonia Estonian Biocentre, Tartu 51010, Estonia
| | - Patrik Magnusson
- Department of Medical Epidemiology and Biostatistics, Karolinska Institute, Stockholm SE-171 77, Sweden
| | | | - Lene Christiansen
- Epidemiology, Institute of Public Health and Department of Clinical Genetics and
| | - Thomas B L Kirkwood
- Institute for Ageing and Health, Newcastle University, Campus for Ageing and Vitality, Newcastle upon Tyne NE4 5PL, UK
| | | | - Claudio Franceschi
- Department of Experimental, Diagnostic and Specialty Medicine and Interdepartmental Centre 'L. Galvani', University of Bologna, Bologna 40126, Italy IRCCS Institute of Neurological Science, Bellaria Hospital, Bologna 40139, Italy CNR-ISOF, Bologna 40129, Italy
| | | | - P Eline Slagboom
- Department of Molecular Epidemiology, Netherlands Consortium for Healthy Ageing,
| |
|
23
|
Howe K, Clark MD, Torroja CF, Torrance J, Berthelot C, Muffato M, Collins JE, Humphray S, McLaren K, Matthews L, McLaren S, Sealy I, Caccamo M, Churcher C, Scott C, Barrett JC, Koch R, Rauch GJ, White S, Chow W, Kilian B, Quintais LT, Guerra-Assunção JA, Zhou Y, Gu Y, Yen J, Vogel JH, Eyre T, Banerjee R, Chi J, Fu B, Langley E, Maguire SF, Laird G, Lloyd D, Kenyon E, Donaldson S, Sehra H, Almeida-King J, Loveland J, Trevanion S, Jones M, Quail M, Willey D, Hunt A, Burton J, Sims S, McLay K, Plumb B, Davis J, Clee C, Oliver K, Clark R, Riddle C, Elliott D, Threadgold G, Harden G, Ware D, Begum S, Mortimore B, Kerry G, Heath P, Phillimore B, Tracey A, Corby N, Dunn M, Johnson C, Wood J, Clark S, Pelan S, Griffiths G, Smith M, Glithero R, Howden P, Barker N, Lloyd C, Stevens C, Harley J, Holt K, Panagiotidis G, Lovell J, Beasley H, Henderson C, Gordon D, Auger K, Wright D, Collins J, Raisen C, Dyer L, Leung K, Robertson L, Ambridge K, Leongamornlert D, McGuire S, Gilderthorp R, Griffiths C, Manthravadi D, Nichol S, Barker G, Whitehead S, Kay M, Brown J, Murnane C, Gray E, Humphries M, Sycamore N, Barker D, Saunders D, Wallis J, Babbage A, Hammond S, Mashreghi-Mohammadi M, Barr L, Martin S, Wray P, Ellington A, Matthews N, Ellwood M, Woodmansey R, Clark G, Cooper JD, Tromans A, Grafham D, Skuce C, Pandian R, Andrews R, Harrison E, Kimberley A, Garnett J, Fosker N, Hall R, Garner P, Kelly D, Bird C, Palmer S, Gehring I, Berger A, Dooley C, Ersan-Ürün Z, Eser C, Geiger H, Geisler M, Karotki L, Kirn A, Konantz J, Konantz M, Oberländer M, Rudolph-Geiger S, Teucke M, Lanz C, Raddatz G, Osoegawa K, Zhu B, Rapp A, Widaa S, Langford C, Yang F, Schuster SC, Carter NP, Harrow J, Ning Z, Herrero J, Searle SMJ, Enright A, Geisler R, Plasterk RHA, Lee C, Westerfield M, de Jong PJ, Zon LI, Postlethwait JH, Volhard CN, Hubbard TJP, Crollius HR, Rogers J, Stemple DL. Erratum: Corrigendum: The zebrafish reference genome sequence and its relationship to the human genome. Nature 2013. 
[DOI: 10.1038/nature12813] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Indexed: 11/09/2022]
|
24
|
Flicek P, Amode MR, Barrell D, Beal K, Billis K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fitzgerald S, Gil L, Girón CG, Gordon L, Hourlier T, Hunt S, Johnson N, Juettemann T, Kähäri AK, Keenan S, Kulesha E, Martin FJ, Maurel T, McLaren WM, Murphy DN, Nag R, Overduin B, Pignatelli M, Pritchard B, Pritchard E, Riat HS, Ruffier M, Sheppard D, Taylor K, Thormann A, Trevanion SJ, Vullo A, Wilder SP, Wilson M, Zadissa A, Aken BL, Birney E, Cunningham F, Harrow J, Herrero J, Hubbard TJ, Kinsella R, Muffato M, Parker A, Spudich G, Yates A, Zerbino DR, Searle SM. Ensembl 2014. Nucleic Acids Res 2013; 42:D749-55. [PMID: 24316576 PMCID: PMC3964975 DOI: 10.1093/nar/gkt1196] [Citation(s) in RCA: 1056] [Impact Index Per Article: 96.0] [Indexed: 02/06/2023]
Abstract
Ensembl (http://www.ensembl.org) creates tools and data resources to facilitate genomic analysis in chordate species with an emphasis on human, major vertebrate model organisms and farm animals. Over the past year we have increased the number of species that we support to 77 and expanded our genome browser with a new scrollable overview and improved variation and phenotype views. We also report updates to our core datasets and improvements to our gene homology relationships from the addition of new species. Our REST service has been extended with additional support for comparative genomics and ontology information. Finally, we provide updated information about our methods for data access and resources for user training.
Affiliation(s)
- Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
- *To whom correspondence should be addressed. Tel: +44 1223 492 581; Fax: +44 1223 494 494;
| | - M. Ridwan Amode
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Daniel Barrell
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Kathryn Beal
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Konstantinos Billis
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Simon Brent
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Denise Carvalho-Silva
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Peter Clapham
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Guy Coates
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Stephen Fitzgerald
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Laurent Gil
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Carlos García Girón
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Leo Gordon
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Thibaut Hourlier
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Sarah Hunt
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Nathan Johnson
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Thomas Juettemann
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Andreas K. Kähäri
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Stephen Keenan
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Eugene Kulesha
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Fergal J. Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Thomas Maurel
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - William M. McLaren
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Daniel N. Murphy
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Rishi Nag
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Bert Overduin
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Miguel Pignatelli
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Bethan Pritchard
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Emily Pritchard
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Harpreet S. Riat
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Magali Ruffier
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Daniel Sheppard
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Kieron Taylor
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Anja Thormann
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Stephen J. Trevanion
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Alessandro Vullo
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Steven P. Wilder
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Mark Wilson
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Amonida Zadissa
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Bronwen L. Aken
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Ewan Birney
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Fiona Cunningham
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Jennifer Harrow
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Javier Herrero
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Tim J.P. Hubbard
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Rhoda Kinsella
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Matthieu Muffato
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Anne Parker
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Giulietta Spudich
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Andy Yates
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Daniel R. Zerbino
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - Stephen M.J. Searle
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD and Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
25
Steijger T, Abril JF, Engström PG, Kokocinski F, Hubbard TJ, Guigó R, Harrow J, Bertone P. Assessment of transcript reconstruction methods for RNA-seq. Nat Methods 2013; 10:1177-84. [PMID: 24185837 PMCID: PMC3851240 DOI: 10.1038/nmeth.2714] [Citation(s) in RCA: 447] [Impact Index Per Article: 40.6] [Received: 03/31/2013] [Accepted: 09/23/2013] [Indexed: 11/09/2022]
Abstract
We evaluated 25 protocol variants of 14 independent computational methods for exon identification, transcript reconstruction and expression-level quantification from RNA-seq data. Our results show that most algorithms are able to identify discrete transcript components with high success rates but that assembly of complete isoform structures poses a major challenge even when all constituent elements are identified. Expression-level estimates also varied widely across methods, even when based on similar transcript models. Consequently, the complexity of higher eukaryotic genomes imposes severe limitations on transcript recall and splice product discrimination that are likely to remain limiting factors for the analysis of current-generation RNA-seq data.
Affiliation(s)
- Tamara Steijger
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
| | - Josep F Abril
- Departament de Genètica, Facultat de Biologia, Universitat de Barcelona, Barcelona, Spain
| | - Pär G Engström
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
| | | | | | - Roderic Guigó
- Center for Genomic Regulation, Barcelona, Spain
- Universitat Pompeu Fabra, Barcelona, Spain
| | | | - Paul Bertone
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
- Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- Developmental Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- Wellcome Trust - Medical Research Council Cambridge Stem Cell Institute, University of Cambridge, Cambridge, UK
26
Abstract
The last decade has seen tremendous effort committed to the annotation of the human genome sequence, most notably perhaps in the form of the ENCODE project. One of the major findings of ENCODE, and other genome analysis projects, is that the human transcriptome is far larger and more complex than previously thought. This complexity manifests, for example, as alternative splicing within protein-coding genes, as well as in the discovery of thousands of long noncoding RNAs. It is also possible that significant numbers of human transcripts have not yet been described by annotation projects, while existing transcript models are frequently incomplete. The question as to what proportion of this complexity is truly functional remains open, however, and this ambiguity presents a serious challenge to genome scientists. In this article, we will discuss the current state of human transcriptome annotation, drawing on our experience gained in generating the GENCODE gene annotation set. We highlight the gaps in our knowledge of transcript functionality that remain, and consider the potential computational and experimental strategies that can be used to help close them. We propose that an understanding of the true overlap between transcriptional complexity and functionality will not be gained in the short term. However, significant steps toward obtaining this knowledge can now be taken by using an integrated strategy, combining all of the experimental resources at our disposal.
Affiliation(s)
- Jonathan M Mudge
- Department of Informatics, Wellcome Trust Sanger Institute, Hinxton CB10 1SA, United Kingdom
27
Gonzàlez-Porta M, Frankish A, Rung J, Harrow J, Brazma A. Transcriptome analysis of human tissues and cell lines reveals one dominant transcript per gene. Genome Biol 2013; 14:R70. [PMID: 23815980 PMCID: PMC4053754 DOI: 10.1186/gb-2013-14-7-r70] [Citation(s) in RCA: 183] [Impact Index Per Article: 16.6] [Received: 01/17/2013] [Accepted: 07/01/2013] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND: RNA sequencing has opened new avenues for the study of transcriptome composition. Significant evidence has accumulated showing that the human transcriptome contains in excess of a hundred thousand different transcripts. However, it is still not clear to what extent this diversity prevails when considering the relative abundances of different transcripts from the same gene. RESULTS: Here we show that, in a given condition, most protein coding genes have one major transcript expressed at significantly higher level than others, that in human tissues the major transcripts contribute almost 85 percent to the total mRNA from protein coding loci, and that often the same major transcript is expressed in many tissues. We detect a high degree of overlap between the set of major transcripts and a recently published set of alternatively spliced transcripts that are predicted to be translated utilizing proteomic data. Thus, we hypothesize that although some minor transcripts may play a functional role, the major ones are likely to be the main contributors to the proteome. However, we still detect a non-negligible fraction of protein coding genes for which the major transcript does not code a protein. CONCLUSIONS: Overall, our findings suggest that the transcriptome from protein coding loci is dominated by one transcript per gene and that not all the transcripts that contribute to transcriptome diversity are equally likely to contribute to protein diversity. This observation can help to prioritize candidate targets in proteomics research and to predict the functional impact of the detected changes in variation studies.
28
Derrien T, Johnson R, Bussotti G, Tanzer A, Djebali S, Tilgner H, Guernec G, Martin D, Merkel A, Knowles DG, Lagarde J, Veeravalli L, Ruan X, Ruan Y, Lassmann T, Carninci P, Brown JB, Lipovich L, Gonzalez JM, Thomas M, Davis CA, Shiekhattar R, Gingeras TR, Hubbard TJ, Notredame C, Harrow J, Guigó R. The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome Res 2012; 22:1775-89. [PMID: 22955988 PMCID: PMC3431493 DOI: 10.1101/gr.132159.111] [Citation(s) in RCA: 3733] [Impact Index Per Article: 339.4] [Indexed: 11/24/2022]
Abstract
The human genome contains many thousands of long noncoding RNAs (lncRNAs). While several studies have demonstrated compelling biological and disease roles for individual examples, analytical and experimental approaches to investigate these genes have been hampered by the lack of comprehensive lncRNA annotation. Here, we present and analyze the most complete human lncRNA annotation to date, produced by the GENCODE consortium within the framework of the ENCODE project and comprising 9277 manually annotated genes producing 14,880 transcripts. Our analyses indicate that lncRNAs are generated through pathways similar to that of protein-coding genes, with similar histone-modification profiles, splicing signals, and exon/intron lengths. In contrast to protein-coding genes, however, lncRNAs display a striking bias toward two-exon transcripts, they are predominantly localized in the chromatin and nucleus, and a fraction appear to be preferentially processed into small RNAs. They are under stronger selective pressure than neutrally evolving sequences—particularly in their promoter regions, which display levels of selection comparable to protein-coding genes. Importantly, about one-third seem to have arisen within the primate lineage. Comprehensive analysis of their expression in multiple human organs and brain regions shows that lncRNAs are generally lower expressed than protein-coding genes, and display more tissue-specific expression patterns, with a large fraction of tissue-specific lncRNAs expressed in the brain. Expression correlation analysis indicates that lncRNAs show particularly striking positive correlation with the expression of antisense coding genes. This GENCODE annotation represents a valuable resource for future studies of lncRNAs.
Affiliation(s)
- Thomas Derrien
- Bioinformatics and Genomics, Centre for Genomic Regulation and UPF, 08003 Barcelona, Catalonia, Spain
29
Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL, Barrell D, Zadissa A, Searle S, Barnes I, Bignell A, Boychenko V, Hunt T, Kay M, Mukherjee G, Rajan J, Despacio-Reyes G, Saunders G, Steward C, Harte R, Lin M, Howald C, Tanzer A, Derrien T, Chrast J, Walters N, Balasubramanian S, Pei B, Tress M, Rodriguez JM, Ezkurdia I, van Baren J, Brent M, Haussler D, Kellis M, Valencia A, Reymond A, Gerstein M, Guigó R, Hubbard TJ. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res 2012; 22:1760-74. [PMID: 22955987 PMCID: PMC3431492 DOI: 10.1101/gr.135350.111] [Citation(s) in RCA: 3491] [Impact Index Per Article: 317.4] [Indexed: 02/07/2023]
Abstract
The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two exon models indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers.
Affiliation(s)
- Jennifer Harrow
- Wellcome Trust Sanger Institute, Wellcome Trust Campus, Hinxton, Cambridge CB10 1SA, United Kingdom.
30
Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley S, Fitzgerald S, Gil L, García-Girón C, Gordon L, Hourlier T, Hunt S, Juettemann T, Kähäri AK, Keenan S, Komorowska M, Kulesha E, Longden I, Maurel T, McLaren WM, Muffato M, Nag R, Overduin B, Pignatelli M, Pritchard B, Pritchard E, Riat HS, Ritchie GRS, Ruffier M, Schuster M, Sheppard D, Sobral D, Taylor K, Thormann A, Trevanion S, White S, Wilder SP, Aken BL, Birney E, Cunningham F, Dunham I, Harrow J, Herrero J, Hubbard TJP, Johnson N, Kinsella R, Parker A, Spudich G, Yates A, Zadissa A, Searle SMJ. Ensembl 2013. Nucleic Acids Res 2012. [PMID: 23203987 PMCID: PMC3531136 DOI: 10.1093/nar/gks1236] [Citation(s) in RCA: 787] [Impact Index Per Article: 65.6] [Indexed: 11/12/2022] Open
Abstract
The Ensembl project (http://www.ensembl.org) provides genome information for sequenced chordate genomes with a particular focus on human, mouse, zebrafish and rat. Our resources include evidenced-based gene sets for all supported species; large-scale whole genome multiple species alignments across vertebrates and clade-specific alignments for eutherian mammals, primates, birds and fish; variation data resources for 17 species and regulation annotations based on ENCODE and other data sets. Ensembl data are accessible through the genome browser at http://www.ensembl.org and through other tools and programmatic interfaces.
Affiliation(s)
- Paul Flicek
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton Cambridge CB10 1SD, UK.
31
Pei B, Sisu C, Frankish A, Howald C, Habegger L, Mu XJ, Harte R, Balasubramanian S, Tanzer A, Diekhans M, Reymond A, Hubbard TJ, Harrow J, Gerstein MB. The GENCODE pseudogene resource. Genome Biol 2012; 13:R51. [PMID: 22951037 PMCID: PMC3491395 DOI: 10.1186/gb-2012-13-9-r51] [Citation(s) in RCA: 253] [Impact Index Per Article: 21.1] [Received: 03/23/2012] [Revised: 05/30/2012] [Accepted: 06/25/2012] [Indexed: 12/11/2022] Open
Abstract
Background: Pseudogenes have long been considered as nonfunctional genomic sequences. However, recent evidence suggests that many of them might have some form of biological activity, and the possibility of functionality has increased interest in their accurate annotation and integration with functional genomics data. Results: As part of the GENCODE annotation of the human genome, we present the first genome-wide pseudogene assignment for protein-coding genes, based on both large-scale manual annotation and in silico pipelines. A key aspect of this coupled approach is that it allows us to identify pseudogenes in an unbiased fashion as well as untangle complex events through manual evaluation. We integrate the pseudogene annotations with the extensive ENCODE functional genomics information. In particular, we determine the expression level, transcription-factor and RNA polymerase II binding, and chromatin marks associated with each pseudogene. Based on their distribution, we develop simple statistical models for each type of activity, which we validate with large-scale RT-PCR-Seq experiments. Finally, we compare our pseudogenes with conservation and variation data from primate alignments and the 1000 Genomes project, producing lists of pseudogenes potentially under selection. Conclusions: At one extreme, some pseudogenes possess conventional characteristics of functionality; these may represent genes that have recently died. On the other hand, we find interesting patterns of partial activity, which may suggest that dead genes are being resurrected as functioning non-coding RNAs. The activity data of each pseudogene are stored in an associated resource, psiDR, which will be useful for the initial identification of potentially functional pseudogenes.
Affiliation(s)
- Baikang Pei
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
32
Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi AM, Tanzer A, Lagarde J, Lin W, Schlesinger F, Xue C, Marinov GK, Khatun J, Williams BA, Zaleski C, Rozowsky J, Röder M, Kokocinski F, Abdelhamid RF, Alioto T, Antoshechkin I, Baer MT, Bar NS, Batut P, Bell K, Bell I, Chakrabortty S, Chen X, Chrast J, Curado J, Derrien T, Drenkow J, Dumais E, Dumais J, Duttagupta R, Falconnet E, Fastuca M, Fejes-Toth K, Ferreira P, Foissac S, Fullwood MJ, Gao H, Gonzalez D, Gordon A, Gunawardena H, Howald C, Jha S, Johnson R, Kapranov P, King B, Kingswood C, Luo OJ, Park E, Persaud K, Preall JB, Ribeca P, Risk B, Robyr D, Sammeth M, Schaffer L, See LH, Shahab A, Skancke J, Suzuki AM, Takahashi H, Tilgner H, Trout D, Walters N, Wang H, Wrobel J, Yu Y, Ruan X, Hayashizaki Y, Harrow J, Gerstein M, Hubbard T, Reymond A, Antonarakis SE, Hannon G, Giddings MC, Ruan Y, Wold B, Carninci P, Guigó R, Gingeras TR. Landscape of transcription in human cells. Nature 2012; 489:101-8. [PMID: 22955620 PMCID: PMC3684276 DOI: 10.1038/nature11233] [Citation(s) in RCA: 3716] [Impact Index Per Article: 309.7] [Received: 12/10/2011] [Accepted: 05/15/2012] [Indexed: 02/07/2023]
Abstract
Eukaryotic cells make many types of primary and processed RNAs that are found either in specific subcellular compartments or throughout the cells. A complete catalogue of these RNAs is not yet available and their characteristic subcellular localizations are also poorly understood. Because RNA represents the direct output of the genetic information encoded by genomes and a significant proportion of a cell's regulatory capabilities are focused on its synthesis, processing, transport, modification and translation, the generation of such a catalogue is crucial for understanding genome function. Here we report evidence that three-quarters of the human genome is capable of being transcribed, as well as observations about the range and levels of expression, localization, processing fates, regulatory regions and modifications of almost all currently annotated and thousands of previously unannotated RNAs. These observations, taken together, prompt a redefinition of the concept of a gene.
Affiliation(s)
- Sarah Djebali
- Centre for Genomic Regulation (CRG) and UPF, Doctor Aiguader, 88 . Barcelona, Catalunya, Spain 08003
| | - Carrie A. Davis
- Cold Spring Harbor Laboratory, Functional Genomics, 1 Bungtown Rd. Cold Spring Harbor, NY, USA 11742
| | - Angelika Merkel
- Centre for Genomic Regulation (CRG) and UPF, Doctor Aiguader, 88 . Barcelona, Catalunya, Spain 08003
| | - Alex Dobin
- Cold Spring Harbor Laboratory, Functional Genomics, 1 Bungtown Rd. Cold Spring Harbor, NY, USA 11742
| | - Timo Lassmann
- RIKEN Yokohama Institute, RIKEN Omics Science Center, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa Japan 230-0045
| | - Ali M. Mortazavi
- California Institute of Technology, Division of Biology, 91125. 2 Beckman Institute, Pasadena, CA USA 91125
- University of California Irvine, Dept. of Developmental and Cell Biology, 2300 Biological Sciences III, Irvine, CA USA 92697
| | - Andrea Tanzer
- Centre for Genomic Regulation (CRG) and UPF, Doctor Aiguader, 88 . Barcelona, Catalunya, Spain 08003
| | - Julien Lagarde
- Centre for Genomic Regulation (CRG) and UPF, Doctor Aiguader, 88 . Barcelona, Catalunya, Spain 08003
| | - Wei Lin
- Cold Spring Harbor Laboratory, Functional Genomics, 1 Bungtown Rd. Cold Spring Harbor, NY, USA 11742
| | - Felix Schlesinger
- Cold Spring Harbor Laboratory, Functional Genomics, 1 Bungtown Rd. Cold Spring Harbor, NY, USA 11742
| | - Chenghai Xue
- Cold Spring Harbor Laboratory, Functional Genomics, 1 Bungtown Rd. Cold Spring Harbor, NY, USA 11742
| | - Georgi K. Marinov
- California Institute of Technology, Division of Biology, 91125. 2 Beckman Institute, Pasadena, CA USA 91125
| | - Jainab Khatun
- Boise State University, College of Arts & Sciences, 1910 University Dr. Boise, ID USA 83725
| | - Brian A. Williams
- California Institute of Technology, Division of Biology, 91125. 2 Beckman Institute, Pasadena, CA USA 91125
| | - Chris Zaleski
- Cold Spring Harbor Laboratory, Functional Genomics, 1 Bungtown Rd. Cold Spring Harbor, NY, USA 11742
| | - Joel Rozowsky
- Program in Computational Biology and Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, CT 06520
- Department of Molecular Biophysics and Biochemistry, Yale University, Bass 432, 266 Whitney Avenue, New Haven, CT 06520
| | - Maik Röder
- Centre for Genomic Regulation (CRG) and UPF, Doctor Aiguader, 88 . Barcelona, Catalunya, Spain 08003
| | - Felix Kokocinski
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire United Kingdom CB10 1SA
| | - Rehab F. Abdelhamid
- RIKEN Yokohama Institute, RIKEN Omics Science Center, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa Japan 230-0045
| | - Tyler Alioto
- Centre for Genomic Regulation (CRG) and UPF, Doctor Aiguader, 88. Barcelona, Catalunya, Spain 08003
| | - Igor Antoshechkin
- California Institute of Technology, Division of Biology, 91125. 2 Beckman Institute, Pasadena, CA USA 91125
| | - Michael T. Baer
- Cold Spring Harbor Laboratory, Functional Genomics, 1 Bungtown Rd. Cold Spring Harbor, NY, USA 11742
| | - Nadav S. Bar
- Department of Chemical Engineering, Norwegian University of Science and Technology (NTNU), Trondheim, Norway
| | - Philippe Batut
- Cold Spring Harbor Laboratory, Functional Genomics, 1 Bungtown Rd. Cold Spring Harbor, NY, USA 11742
| | - Kimberly Bell
- Cold Spring Harbor Laboratory, Functional Genomics, 1 Bungtown Rd. Cold Spring Harbor, NY, USA 11742
| | - Ian Bell
- Affymetrix, Inc, 3380 Central Expressway, Santa Clara, CA. USA 95051
| | - Sudipto Chakrabortty
- Cold Spring Harbor Laboratory, Functional Genomics, 1 Bungtown Rd. Cold Spring Harbor, NY, USA 11742
| | - Xian Chen
- University of North Carolina at Chapel Hill, Department of Biochemistry & Biophysics, 120 Mason Farm Rd., Chapel Hill, NC USA 27599
| | - Jacqueline Chrast
- University of Lausanne, Center for Integrative Genomics, Genopode building, Lausanne, Switzerland 1015
| | - Joao Curado
- Centre for Genomic Regulation (CRG) and UPF, Doctor Aiguader, 88. Barcelona, Catalunya, Spain 08003
| | - Thomas Derrien
- Centre for Genomic Regulation (CRG) and UPF, Doctor Aiguader, 88. Barcelona, Catalunya, Spain 08003
| | - Jorg Drenkow
- Cold Spring Harbor Laboratory, Functional Genomics, 1 Bungtown Rd. Cold Spring Harbor, NY, USA 11742
| | - Erica Dumais
- Affymetrix, Inc, 3380 Central Expressway, Santa Clara, CA. USA 95051
| | - Jacqueline Dumais
- Affymetrix, Inc, 3380 Central Expressway, Santa Clara, CA. USA 95051
| | - Radha Duttagupta
- Affymetrix, Inc, 3380 Central Expressway, Santa Clara, CA. USA 95051
| | - Emilie Falconnet
- University of Geneva Medical School, Department of Genetic Medicine and Development and iGE3 Institute of Genetics and Genomics of Geneva, 1 rue Michel-Servet, Geneva, Switzerland 1015
| | - Meagan Fastuca
- Cold Spring Harbor Laboratory, Functional Genomics, 1 Bungtown Rd. Cold Spring Harbor, NY, USA 11742
| | - Kata Fejes-Toth
- Cold Spring Harbor Laboratory, Functional Genomics, 1 Bungtown Rd. Cold Spring Harbor, NY, USA 11742
| | - Pedro Ferreira
- Centre for Genomic Regulation (CRG) and UPF, Doctor Aiguader, 88. Barcelona, Catalunya, Spain 08003
| | - Sylvain Foissac
- Affymetrix, Inc, 3380 Central Expressway, Santa Clara, CA. USA 95051
| | - Melissa J. Fullwood
- Genome Institute of Singapore, Genome Technology and Biology, 60 Biopolis Street, #02-01, Genome, Singapore, Singapore 138672
| | - Hui Gao
- Affymetrix, Inc, 3380 Central Expressway, Santa Clara, CA. USA 95051
| | - David Gonzalez
- Centre for Genomic Regulation (CRG) and UPF, Doctor Aiguader, 88. Barcelona, Catalunya, Spain 08003
| | - Assaf Gordon
- Cold Spring Harbor Laboratory, Functional Genomics, 1 Bungtown Rd. Cold Spring Harbor, NY, USA 11742
| | - Harsha Gunawardena
- University of North Carolina at Chapel Hill, Department of Biochemistry & Biophysics, 120 Mason Farm Rd., Chapel Hill, NC USA 27599
| | - Cedric Howald
- University of Lausanne, Center for Integrative Genomics, Genopode building, Lausanne, Switzerland 1015
| | - Sonali Jha
- Cold Spring Harbor Laboratory, Functional Genomics, 1 Bungtown Rd. Cold Spring Harbor, NY, USA 11742
| | - Rory Johnson
- Centre for Genomic Regulation (CRG) and UPF, Doctor Aiguader, 88. Barcelona, Catalunya, Spain 08003
| | - Philipp Kapranov
- Affymetrix, Inc, 3380 Central Expressway, Santa Clara, CA. USA 95051
- St. Laurent Institute, One Kendall Square, Cambridge, MA
| | - Brandon King
- California Institute of Technology, Division of Biology, 91125. 2 Beckman Institute, Pasadena, CA USA 91125
| | - Colin Kingswood
- Centre for Genomic Regulation (CRG) and UPF, Doctor Aiguader, 88. Barcelona, Catalunya, Spain 08003
| | - Oscar J. Luo
- Genome Institute of Singapore, Genome Technology and Biology, 60 Biopolis Street, #02-01, Genome, Singapore, Singapore 138672
| | - Eddie Park
- University of California Irvine, Dept. of Developmental and Cell Biology, 2300 Biological Sciences III, Irvine, CA USA 92697
| | - Kimberly Persaud
- Cold Spring Harbor Laboratory, Functional Genomics, 1 Bungtown Rd. Cold Spring Harbor, NY, USA 11742
| | - Jonathan B. Preall
- Cold Spring Harbor Laboratory, Functional Genomics, 1 Bungtown Rd. Cold Spring Harbor, NY, USA 11742
| | - Paolo Ribeca
- Centre for Genomic Regulation (CRG) and UPF, Doctor Aiguader, 88. Barcelona, Catalunya, Spain 08003
| | - Brian Risk
- Boise State University, College of Arts & Sciences, 1910 University Dr. Boise, ID USA 83725
| | - Daniel Robyr
- University of Geneva Medical School, Department of Genetic Medicine and Development and iGE3 Institute of Genetics and Genomics of Geneva, 1 rue Michel-Servet, Geneva, Switzerland 1015
| | - Michael Sammeth
- Centre for Genomic Regulation (CRG) and UPF, Doctor Aiguader, 88. Barcelona, Catalunya, Spain 08003
| | - Lorian Schaffer
- California Institute of Technology, Division of Biology, 91125. 2 Beckman Institute, Pasadena, CA USA 91125
| | - Lei-Hoon See
- Cold Spring Harbor Laboratory, Functional Genomics, 1 Bungtown Rd. Cold Spring Harbor, NY, USA 11742
| | - Atif Shahab
- Genome Institute of Singapore, Genome Technology and Biology, 60 Biopolis Street, #02-01, Genome, Singapore, Singapore 138672
| | - Jorgen Skancke
- Centre for Genomic Regulation (CRG) and UPF, Doctor Aiguader, 88. Barcelona, Catalunya, Spain 08003
- Department of Chemical Engineering, Norwegian University of Science and Technology (NTNU), Trondheim, Norway
| | - Ana Maria Suzuki
- RIKEN Yokohama Institute, RIKEN Omics Science Center, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa Japan 230-0045
| | - Hazuki Takahashi
- RIKEN Yokohama Institute, RIKEN Omics Science Center, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa Japan 230-0045
| | - Hagen Tilgner
- Centre for Genomic Regulation (CRG) and UPF, Doctor Aiguader, 88. Barcelona, Catalunya, Spain 08003
| | - Diane Trout
- California Institute of Technology, Division of Biology, 91125. 2 Beckman Institute, Pasadena, CA USA 91125
| | - Nathalie Walters
- University of Lausanne, Center for Integrative Genomics, Genopode building, Lausanne, Switzerland 1015
| | - Huaien Wang
- Cold Spring Harbor Laboratory, Functional Genomics, 1 Bungtown Rd. Cold Spring Harbor, NY, USA 11742
| | - John Wrobel
- Boise State University, College of Arts & Sciences, 1910 University Dr. Boise, ID USA 83725
| | - Yanbao Yu
- University of North Carolina at Chapel Hill, Department of Biochemistry & Biophysics, 120 Mason Farm Rd., Chapel Hill, NC USA 27599
| | - Xiaoan Ruan
- Genome Institute of Singapore, Genome Technology and Biology, 60 Biopolis Street, #02-01, Genome, Singapore, Singapore 138672
| | - Yoshihide Hayashizaki
- RIKEN Yokohama Institute, RIKEN Omics Science Center, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa Japan 230-0045
| | - Jennifer Harrow
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire United Kingdom CB10 1SA
| | - Mark Gerstein
- Program in Computational Biology and Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, CT 06520
- Department of Molecular Biophysics and Biochemistry, Yale University, Bass 432, 266 Whitney Avenue, New Haven, CT 06520
- Department of Computer Science, Yale University, Bass 432, 266 Whitney Avenue, New Haven, CT 06520
| | - Tim Hubbard
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire United Kingdom CB10 1SA
| | - Alexandre Reymond
- University of Lausanne, Center for Integrative Genomics, Genopode building, Lausanne, Switzerland 1015
| | - Stylianos E. Antonarakis
- University of Geneva Medical School, Department of Genetic Medicine and Development and iGE3 Institute of Genetics and Genomics of Geneva, 1 rue Michel-Servet, Geneva, Switzerland 1015
| | - Gregory Hannon
- Cold Spring Harbor Laboratory, Functional Genomics, 1 Bungtown Rd. Cold Spring Harbor, NY, USA 11742
| | - Morgan C. Giddings
- Boise State University, College of Arts & Sciences, 1910 University Dr. Boise, ID USA 83725
- University of North Carolina at Chapel Hill, Department of Biochemistry & Biophysics, 120 Mason Farm Rd., Chapel Hill, NC USA 27599
| | - Yijun Ruan
- Genome Institute of Singapore, Genome Technology and Biology, 60 Biopolis Street, #02-01, Genome, Singapore, Singapore 138672
| | - Barbara Wold
- California Institute of Technology, Division of Biology, 91125. 2 Beckman Institute, Pasadena, CA USA 91125
| | - Piero Carninci
- RIKEN Yokohama Institute, RIKEN Omics Science Center, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa Japan 230-0045
| | - Roderic Guigó
- Centre for Genomic Regulation (CRG) and UPF, Doctor Aiguader, 88. Barcelona, Catalunya, Spain 08003
| | - Thomas R. Gingeras
- Cold Spring Harbor Laboratory, Functional Genomics, 1 Bungtown Rd. Cold Spring Harbor, NY, USA 11742
- Affymetrix, Inc, 3380 Central Expressway, Santa Clara, CA. USA 95051
|
33
|
Howald C, Tanzer A, Chrast J, Kokocinski F, Derrien T, Walters N, Gonzalez JM, Frankish A, Aken BL, Hourlier T, Vogel JH, White S, Searle S, Harrow J, Hubbard TJ, Guigó R, Reymond A. Combining RT-PCR-seq and RNA-seq to catalog all genic elements encoded in the human genome. Genome Res 2012; 22:1698-710. [PMID: 22955982 PMCID: PMC3431487 DOI: 10.1101/gr.134478.111] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.8] [Received: 11/07/2011] [Accepted: 05/01/2012] [Indexed: 12/21/2022]
Abstract
Within the ENCODE Consortium, GENCODE aimed to accurately annotate all protein-coding genes, pseudogenes, and noncoding transcribed loci in the human genome through manual curation and computational methods. Annotated transcript structures were assessed, and less well-supported loci were systematically, experimentally validated. Predicted exon-exon junctions were evaluated by RT-PCR amplification followed by highly multiplexed sequencing readout, a method we called RT-PCR-seq. Seventy-nine percent of all assessed junctions are confirmed by this evaluation procedure, demonstrating the high quality of the GENCODE gene set. RT-PCR-seq also proved efficient for screening gene models predicted using the Human Body Map (HBM) RNA-seq data. We validated 73% of these predictions, thus confirming 1168 novel genes, mostly noncoding, which will further complement the GENCODE annotation. Our novel experimental validation pipeline is extremely sensitive, far more than unbiased transcriptome profiling through RNA sequencing, which is becoming the norm. For example, exon-exon junctions unique to GENCODE annotated transcripts are five times more likely to be corroborated with our targeted approach than with extensive large human transcriptome profiling. Data sets such as the HBM and ENCODE RNA-seq data fail to sample low-expressed transcripts. Our RT-PCR-seq targeted approach also has the advantage of identifying novel exons of known genes, as we discovered unannotated exons in ~11% of assessed introns. We thus estimate that at least 18% of known loci have yet-unannotated exons. Our work demonstrates that the cataloging of all of the genic elements encoded in the human genome will necessitate a coordinated effort between unbiased and targeted approaches, like RNA-seq and RT-PCR-seq.
Affiliation(s)
- Cédric Howald
- Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland
- Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Andrea Tanzer
- Centre de Regulacio Genomica, Grup de Recerca en Informatica Biomedica, E-08003 Barcelona, Spain
| | - Jacqueline Chrast
- Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland
| | - Felix Kokocinski
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom
| | - Thomas Derrien
- Centre de Regulacio Genomica, Grup de Recerca en Informatica Biomedica, E-08003 Barcelona, Spain
| | - Nathalie Walters
- Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland
| | - Jose M. Gonzalez
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom
| | - Adam Frankish
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom
| | - Bronwen L. Aken
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom
| | - Thibaut Hourlier
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom
| | - Jan-Hinnerk Vogel
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom
| | - Simon White
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom
| | - Stephen Searle
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom
| | - Jennifer Harrow
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom
| | - Tim J. Hubbard
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom
| | - Roderic Guigó
- Centre de Regulacio Genomica, Grup de Recerca en Informatica Biomedica, E-08003 Barcelona, Spain
| | - Alexandre Reymond
- Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland
|
34
|
Ezkurdia I, del Pozo A, Frankish A, Rodriguez JM, Harrow J, Ashman K, Valencia A, Tress ML. Comparative proteomics reveals a significant bias toward alternative protein isoforms with conserved structure and function. Mol Biol Evol 2012; 29:2265-83. [PMID: 22446687 PMCID: PMC3424414 DOI: 10.1093/molbev/mss100] [Citation(s) in RCA: 66] [Impact Index Per Article: 5.5] [Indexed: 01/19/2023] Open
Abstract
Advances in high-throughput mass spectrometry are making proteomics an increasingly important tool in genome annotation projects. Peptides detected in mass spectrometry experiments can be used to validate gene models and verify the translation of putative coding sequences (CDSs). Here, we have identified peptides that cover 35% of the genes annotated by the GENCODE consortium for the human genome as part of a comprehensive analysis of experimental spectra from two large publicly available mass spectrometry databases. We detected the translation to protein of “novel” and “putative” protein-coding transcripts as well as transcripts annotated as pseudogenes and nonsense-mediated decay targets. We provide a detailed overview of the population of alternatively spliced protein isoforms that are detectable by peptide identification methods. We found that 150 genes expressed multiple alternative protein isoforms. This constitutes the largest set of reliably confirmed alternatively spliced proteins yet discovered. Three groups of genes were highly overrepresented. We detected alternative isoforms for 10 of the 25 possible heterogeneous nuclear ribonucleoproteins, proteins with a key role in the splicing process. Alternative isoforms generated from interchangeable homologous exons and from short indels were also significantly enriched, both in human experiments and in parallel analyses of mouse and Drosophila proteomics experiments. Our results show that a surprisingly high proportion (almost 25%) of the detected alternative isoforms are only subtly different from their constitutive counterparts. Many of the alternative splicing events that give rise to these alternative isoforms are conserved in mouse. It was striking that very few of these conserved splicing events broke Pfam functional domains or would damage globular protein structures. 
This evidence of a strong bias toward subtle differences in CDS and likely conserved cellular function and structure is remarkable and strongly suggests that the translation of alternative transcripts may be subject to selective constraints.
Affiliation(s)
- Iakes Ezkurdia
- Structural Biology and Biocomputing Programme, Spanish National Cancer Research Centre, Madrid, Spain
|
35
|
Abstract
While alternative splicing (AS) can potentially expand the functional repertoire of vertebrate genomes, relatively few AS transcripts have been experimentally characterized. We describe our detailed manual annotation of vertebrate genomes, which is generating a publicly available geneset rich in AS. In order to achieve this we have adopted a highly sensitive approach to annotating gene models supported by correctly mapped, canonically spliced transcriptional evidence combined with a highly cautious approach to adding unsupported extensions to models and making decisions on their functional potential. We use information about the predicted functional potential and structural properties of every AS transcript annotated at a protein-coding or non-coding locus to place them into one of eleven subclasses. We describe the incorporation of new sequencing and proteomics technologies into our annotation pipelines, which are used to identify and validate AS. Combining all data sources has led to the production of a rich geneset containing an average of 6.3 AS transcripts for every human multi-exon protein-coding gene. The datasets produced have proved very useful in providing context to studies investigating the functional potential of genes and the effect variation may have on gene structure and function. Database URL: http://www.ensembl.org/index.html, http://vega.sanger.ac.uk/index.html
Affiliation(s)
- Adam Frankish
- Human and Vertebrate Analysis and Annotation Team, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.
|
36
|
Harte RA, Farrell CM, Loveland JE, Suner MM, Wilming L, Aken B, Barrell D, Frankish A, Wallin C, Searle S, Diekhans M, Harrow J, Pruitt KD. Tracking and coordinating an international curation effort for the CCDS Project. Database (Oxford) 2012; 2012:bas008. [PMID: 22434842 PMCID: PMC3308164 DOI: 10.1093/database/bas008] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.8] [Indexed: 12/27/2022]
Abstract
The Consensus Coding Sequence (CCDS) collaboration involves curators at multiple centers with a goal of producing a conservative set of high quality, protein-coding region annotations for the human and mouse reference genome assemblies. The CCDS data set reflects a ‘gold standard’ definition of best supported protein annotations, and corresponding genes, which pass a standard series of quality assurance checks and are supported by manual curation. This data set supports use of genome annotation information by human and mouse researchers for effective experimental design, analysis and interpretation. The CCDS project consists of analysis of automated whole-genome annotation builds to identify identical CDS annotations, quality assurance testing and manual curation support. Identical CDS annotations are tracked with a CCDS identifier (ID) and any future change to the annotated CDS structure must be agreed upon by the collaborating members. CCDS curation guidelines were developed to address some aspects of curation in order to improve initial annotation consistency and to reduce time spent in discussing proposed annotation updates. Here, we present the current status of the CCDS database and details on our procedures to track and coordinate our efforts. We also present the relevant background and reasoning behind the curation standards that we have developed for CCDS database treatment of transcripts that are nonsense-mediated decay (NMD) candidates, for transcripts containing upstream open reading frames, for identifying the most likely translation start codons and for the annotation of readthrough transcripts. Examples are provided to illustrate the application of these guidelines. Database URL: http://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi
Affiliation(s)
- Rachel A Harte
- Center for Biomolecular Science and Engineering, University of California Santa Cruz (UCSC), Santa Cruz, CA 95064, USA
|
37
|
MacArthur DG, Balasubramanian S, Frankish A, Huang N, Morris J, Walter K, Jostins L, Habegger L, Pickrell JK, Montgomery SB, Albers CA, Zhang ZD, Conrad DF, Lunter G, Zheng H, Ayub Q, DePristo MA, Banks E, Hu M, Handsaker RE, Rosenfeld JA, Fromer M, Jin M, Mu XJ, Khurana E, Ye K, Kay M, Saunders GI, Suner MM, Hunt T, Barnes IHA, Amid C, Carvalho-Silva DR, Bignell AH, Snow C, Yngvadottir B, Bumpstead S, Cooper DN, Xue Y, Romero IG, Wang J, Li Y, Gibbs RA, McCarroll SA, Dermitzakis ET, Pritchard JK, Barrett JC, Harrow J, Hurles ME, Gerstein MB, Tyler-Smith C. A systematic survey of loss-of-function variants in human protein-coding genes. Science 2012; 335:823-8. [PMID: 22344438 PMCID: PMC3299548 DOI: 10.1126/science.1215040] [Citation(s) in RCA: 869] [Impact Index Per Article: 72.4] [Indexed: 01/17/2023]
Abstract
Genome-sequencing studies indicate that all humans carry many genetic variants predicted to cause loss of function (LoF) of protein-coding genes, suggesting unexpected redundancy in the human genome. Here we apply stringent filters to 2951 putative LoF variants obtained from 185 human genomes to determine their true prevalence and properties. We estimate that human genomes typically contain ~100 genuine LoF variants with ~20 genes completely inactivated. We identify rare and likely deleterious LoF alleles, including 26 known and 21 predicted severe disease-causing variants, as well as common LoF variants in nonessential genes. We describe functional and evolutionary differences between LoF-tolerant and recessive disease genes and a method for using these differences to prioritize candidate genes found in clinical sequencing studies.
|
38
|
Djebali S, Lagarde J, Kapranov P, Lacroix V, Borel C, Mudge JM, Howald C, Foissac S, Ucla C, Chrast J, Ribeca P, Martin D, Murray RR, Yang X, Ghamsari L, Lin C, Bell I, Dumais E, Drenkow J, Tress ML, Gelpí JL, Orozco M, Valencia A, van Berkum NL, Lajoie BR, Vidal M, Stamatoyannopoulos J, Batut P, Dobin A, Harrow J, Hubbard T, Dekker J, Frankish A, Salehi-Ashtiani K, Reymond A, Antonarakis SE, Guigó R, Gingeras TR. Evidence for transcript networks composed of chimeric RNAs in human cells. PLoS One 2012; 7:e28213. [PMID: 22238572 PMCID: PMC3251577 DOI: 10.1371/journal.pone.0028213] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.4] [Received: 07/06/2011] [Accepted: 11/03/2011] [Indexed: 12/03/2022] Open
Abstract
The classic organization of a gene structure has followed the Jacob and Monod bacterial gene model proposed more than 50 years ago. Since then, empirical determinations of the complexity of the transcriptomes found in yeast to human have blurred the definition and physical boundaries of genes. Using multiple analysis approaches we have characterized individual gene boundaries mapping on human chromosomes 21 and 22. Analyses of the locations of the 5′ and 3′ transcriptional termini of 492 protein coding genes revealed that for 85% of these genes the boundaries extend beyond the current annotated termini, most often connecting with exons of transcripts from other well annotated genes. The biological and evolutionary importance of these chimeric transcripts is underscored by (1) the non-random interconnections of genes involved, (2) the greater phylogenetic depth of the genes involved in many chimeric interactions, (3) the coordination of the expression of connected genes and (4) the close in vivo and three dimensional proximity of the genomic regions being transcribed and contributing to parts of the chimeric RNAs. The non-random nature of the connection of the genes involved suggests that chimeric transcripts should not be studied in isolation, but together, as an RNA network.
Affiliation(s)
- Sarah Djebali
- Bioinformatics and Genomics, Centre for Genomic Regulation and Universitat Pompeu Fabra, Barcelona, Catalonia, Spain
|
39
|
Flicek P, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley S, Fitzgerald S, Gil L, Gordon L, Hendrix M, Hourlier T, Johnson N, Kähäri AK, Keefe D, Keenan S, Kinsella R, Komorowska M, Koscielny G, Kulesha E, Larsson P, Longden I, McLaren W, Muffato M, Overduin B, Pignatelli M, Pritchard B, Riat HS, Ritchie GRS, Ruffier M, Schuster M, Sobral D, Tang YA, Taylor K, Trevanion S, Vandrovcova J, White S, Wilson M, Wilder SP, Aken BL, Birney E, Cunningham F, Dunham I, Durbin R, Fernández-Suarez XM, Harrow J, Herrero J, Hubbard TJP, Parker A, Proctor G, Spudich G, Vogel J, Yates A, Zadissa A, Searle SMJ. Ensembl 2012. Nucleic Acids Res 2011; 40:D84-90. [PMID: 22086963 PMCID: PMC3245178 DOI: 10.1093/nar/gkr991] [Citation(s) in RCA: 806] [Impact Index Per Article: 62.0] [Indexed: 12/19/2022] Open
Abstract
The Ensembl project (http://www.ensembl.org) provides genome resources for chordate genomes with a particular focus on human genome data as well as data for key model organisms such as mouse, rat and zebrafish. Five additional species were added in the last year including gibbon (Nomascus leucogenys) and Tasmanian devil (Sarcophilus harrisii) bringing the total number of supported species to 61 as of Ensembl release 64 (September 2011). Of these, 55 species appear on the main Ensembl website and six species are provided on the Ensembl preview site (Pre!Ensembl; http://pre.ensembl.org) with preliminary support. The past year has also seen improvements across the project.
Affiliation(s)
- Paul Flicek
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton Cambridge CB10 1SD, UK.
|
40
|
Mudge JM, Frankish A, Fernandez-Banet J, Alioto T, Derrien T, Howald C, Reymond A, Guigó R, Hubbard T, Harrow J. The origins, evolution, and functional potential of alternative splicing in vertebrates. Mol Biol Evol 2011; 28:2949-59. [PMID: 21551269 PMCID: PMC3176834 DOI: 10.1093/molbev/msr127] [Citation(s) in RCA: 62] [Impact Index Per Article: 4.8] [Indexed: 12/24/2022] Open
Abstract
Alternative splicing (AS) has the potential to greatly expand the functional repertoire of mammalian transcriptomes. However, few variant transcripts have been characterized functionally, making it difficult to assess the contribution of AS to the generation of phenotypic complexity and to study the evolution of splicing patterns. We have compared the AS of 309 protein-coding genes in the human ENCODE pilot regions against their mouse orthologs in unprecedented detail, utilizing traditional transcriptomic and RNAseq data. The conservation status of every transcript has been investigated, and each functionally categorized as coding (separated into coding sequence [CDS] or nonsense-mediated decay [NMD] linked) or noncoding. In total, 36.7% of human and 19.3% of mouse coding transcripts are species specific, and we observe a 3.6 times excess of human NMD transcripts compared with mouse; in contrast to previous studies, the majority of species-specific AS is unlinked to transposable elements. We observe one conserved CDS variant and one conserved NMD variant per 2.3 and 11.4 genes, respectively. Subsequently, we identify and characterize equivalent AS patterns for 22.9% of these CDS or NMD-linked events in nonmammalian vertebrate genomes, and our data indicate that functional NMD-linked AS is more widespread and ancient than previously thought. Furthermore, although we observe an association between conserved AS and elevated sequence conservation, as previously reported, we emphasize that 30% of conserved AS exons display sequence conservation below the average score for constitutive exons. In conclusion, we demonstrate the value of detailed comparative annotation in generating a comprehensive set of AS transcripts, increasing our understanding of AS evolution in vertebrates. Our data supports a model whereby the acquisition of functional AS has occurred throughout vertebrate evolution and is considered alongside amino acid change as a key mechanism in gene evolution.
Affiliation(s)
- Jonathan M Mudge
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.
|
41
|
Siddle HV, Deakin JE, Coggill P, Whilming LG, Harrow J, Kaufman J, Beck S, Belov K. The tammar wallaby major histocompatibility complex shows evidence of past genomic instability. BMC Genomics 2011; 12:421. [PMID: 21854592 PMCID: PMC3179965 DOI: 10.1186/1471-2164-12-421] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.4] [Received: 12/23/2010] [Accepted: 08/19/2011] [Indexed: 01/10/2023] Open
Abstract
BACKGROUND: The major histocompatibility complex (MHC) is a group of genes with a variety of roles in the innate and adaptive immune responses. MHC genes form a genetically linked cluster in eutherian mammals, an organization that is thought to confer functional and evolutionary advantages to the immune system. The tammar wallaby (Macropus eugenii), an Australian marsupial, provides a unique model for understanding MHC gene evolution, as many of its antigen presenting genes are not linked to the MHC, but are scattered around the genome.
RESULTS: Here we describe the 'core' tammar wallaby MHC region on chromosome 2q by ordering and sequencing 33 BAC clones, covering over 4.5 MB and containing 129 genes. When compared to the MHC region of the South American opossum, eutherian mammals and non-mammals, the wallaby MHC has a novel gene organization. The wallaby has undergone an expansion of MHC class II genes, which are separated into two clusters by the class III genes. The antigen processing genes have undergone duplication, resulting in two copies of TAP1 and three copies of TAP2. Notably, Kangaroo Endogenous Retroviral Elements are present within the region and may have contributed to the genomic instability.
CONCLUSIONS: The wallaby MHC has been extensively remodeled since the American and Australian marsupials last shared a common ancestor. The instability is characterized by the movement of antigen presenting genes away from the core MHC, most likely via the presence and activity of retroviral elements. We propose that the movement of class II genes away from the ancestral class II region has allowed this gene family to expand and diversify in the wallaby. The duplication of TAP genes in the wallaby MHC makes this species a unique model organism for studying the relationship between MHC gene organization and function.
Affiliation(s)
- Hannah V Siddle
  - Faculty of Veterinary Science, University of Sydney, NSW 2006, Australia
  - University of Cambridge, Department of Pathology, Cambridge CB2 1QP, UK
- Janine E Deakin
  - ARC Centre of Excellence for Kangaroo Genomics, Research School of Biological Sciences, Australian National University, Canberra, ACT 0200, Australia
- Penny Coggill
  - Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton Hall, Hinxton, Cambridgeshire, CB10 1SA, UK
- Laurens G Whilming
  - Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton Hall, Hinxton, Cambridgeshire, CB10 1SA, UK
- Jennifer Harrow
  - Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton Hall, Hinxton, Cambridgeshire, CB10 1SA, UK
- Jim Kaufman
  - University of Cambridge, Department of Pathology, Cambridge CB2 1QP, UK
- Stephan Beck
  - UCL Cancer Institute, University College London, London WC1E 6BT, UK
- Katherine Belov
  - Faculty of Veterinary Science, University of Sydney, NSW 2006, Australia
|
42
|
Brosch M, Saunders GI, Frankish A, Collins MO, Yu L, Wright J, Verstraten R, Adams DJ, Harrow J, Choudhary JS, Hubbard T. Shotgun proteomics aids discovery of novel protein-coding genes, alternative splicing, and "resurrected" pseudogenes in the mouse genome. Genome Res 2011; 21:756-67. [PMID: 21460061 DOI: 10.1101/gr.114272.110] [Citation(s) in RCA: 87] [Impact Index Per Article: 6.7] [Indexed: 12/16/2022]
Abstract
Recent advances in proteomic mass spectrometry (MS) offer the chance to marry high-throughput peptide sequencing to transcript models, allowing the validation, refinement, and identification of new protein-coding loci. We present a novel pipeline that integrates highly sensitive and statistically robust peptide spectrum matching with genome-wide protein-coding predictions to perform large-scale gene validation and discovery in the mouse genome for the first time. In searching an excess of 10 million spectra, we have been able to validate 32%, 17%, and 7% of all protein-coding genes, exons, and splice boundaries, respectively. Moreover, we present strong evidence for the identification of multiple alternatively spliced translations from 53 genes and have uncovered 10 entirely novel protein-coding genes, which are not covered in any mouse annotation data sources. One such novel protein-coding gene is a fusion protein that spans the Ins2 and Igf2 loci to produce a transcript encoding the insulin II and the insulin-like growth factor 2-derived peptides. We also report nine processed pseudogenes that have unique peptide hits, demonstrating, for the first time, that they are not just transcribed but are translated and are therefore resurrected into new coding loci. This work not only highlights an important utility for MS data in genome annotation but also provides unique insights into the gene structure and propagation in the mouse genome. All these data have been subsequently used to improve the publicly available mouse annotation available in both the Vega and Ensembl genome browsers (http://vega.sanger.ac.uk).
Affiliation(s)
- Markus Brosch
  - The Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom
|
43
|
Balasubramanian S, Habegger L, Frankish A, MacArthur D, Harte R, Tyler-Smith C, Harrow J, Gerstein M. Defining the human reference protein-coding gene set. Genome Biol 2010. [PMCID: PMC3026232 DOI: 10.1186/gb-2010-11-s1-o5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/04/2022] Open
|
44
|
Amid C, Frankish A, Aken B, Ezkurdia I, Kokocinski F, Gilbert J, White S, Carninci P, Gingeras T, Guigo R, Searle S, Tress ML, Harrow J, Hubbard T. From identification to validation to gene count. Genome Biol 2010. [PMCID: PMC3026224 DOI: 10.1186/gb-2010-11-s1-o1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022] Open
|
45
|
Abstract
BACKGROUND As genome sequences are determined for increasing numbers of model organisms, demand has grown for better tools to facilitate unified genome annotation efforts by communities of biologists. Typically this process involves numerous experts from the field and the use of data from dispersed sources as evidence. This kind of collaborative annotation project requires specialized software solutions for efficient data tracking and processing. RESULTS As part of the scale-up phase of the ENCODE project (Encyclopedia of DNA Elements), the aim of the GENCODE project is to produce a highly accurate evidence-based reference gene annotation for the human genome. The AnnoTrack software system was developed to aid this effort. It integrates data from multiple distributed sources, highlights conflicts and facilitates the quick identification, prioritisation and resolution of problems during the process of genome annotation. CONCLUSIONS AnnoTrack has been in use for the last year and has proven a very valuable tool for large-scale genome annotation. Designed to interface with standard bioinformatics components, such as DAS servers and Ensembl databases, it is easy to set up and configure for different genome projects. The source code is available at http://annotrack.sanger.ac.uk.
Affiliation(s)
- Felix Kokocinski
  - Vertebrate Genome Analysis, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1HH, UK.
|
46
|
Madupu R, Brinkac LM, Harrow J, Wilming LG, Böhme U, Lamesch P, Hannick LI. Meeting report: a workshop on Best Practices in Genome Annotation. Database (Oxford) 2010; 2010:baq001. [PMID: 20428316 PMCID: PMC2860899 DOI: 10.1093/database/baq001] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.2] [Received: 10/12/2009] [Revised: 01/08/2010] [Accepted: 01/11/2010] [Indexed: 01/28/2023]
Abstract
Efforts to annotate the genomes of a wide variety of model organisms are currently carried out by sequencing centers, model organism databases and academic/institutional laboratories around the world. Different annotation methods and tools have been developed over time to meet the needs of biologists faced with the task of annotating biological data. While standardized methods are essential for consistent curation within each annotation group, methods and tools can differ between groups, especially when the groups are curating different organisms. Biocurators from several institutes met at the Third International Biocuration Conference in Berlin, Germany, April 2009 and hosted the ‘Best Practices in Genome Annotation: Inference from Evidence’ workshop to share their strategies, pipelines, standards and tools. This article documents the material presented in the workshop.
Affiliation(s)
- Ramana Madupu
  - Informatics, J. Craig Venter Institute, Rockville, MD 20850 USA, Wellcome Trust Sanger Institute, Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, UK and The Arabidopsis Information Resource, Carnegie Institution of Washington, Stanford, CA 94305 USA
|
47
|
Searle S, Frankish A, Bignell A, Aken B, Derrien T, Diekhans M, Harte R, Howald C, Kokocinski F, Lin M, Tress M, Van Baren M, Barnes I, Hunt T, Carvalho-Silva D, Davidson C, Donaldson S, Gilbert J, Kay M, Lloyd D, Loveland J, Mudge J, Snow C, Vamathevan J, Wilming L, Brent M, Gerstein M, Guigó R, Kellis M, Reymond A, Zadissa A, Valencia A, Harrow J, Hubbard T. The GENCODE human gene set. Genome Biol 2010. [PMCID: PMC3026266 DOI: 10.1186/gb-2010-11-s1-p36] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/12/2022] Open
|
48
|
Boles MK, Wilkinson BM, Wilming LG, Liu B, Probst FJ, Harrow J, Grafham D, Hentges KE, Woodward LP, Maxwell A, Mitchell K, Risley MD, Johnson R, Hirschi K, Lupski JR, Funato Y, Miki H, Marin-Garcia P, Matthews L, Coffey AJ, Parker A, Hubbard TJ, Rogers J, Bradley A, Adams DJ, Justice MJ. Discovery of candidate disease genes in ENU-induced mouse mutants by large-scale sequencing, including a splice-site mutation in nucleoredoxin. PLoS Genet 2009; 5:e1000759. [PMID: 20011118 PMCID: PMC2782131 DOI: 10.1371/journal.pgen.1000759] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.6] [Received: 05/22/2009] [Accepted: 11/09/2009] [Indexed: 12/13/2022] Open
Abstract
An accurate and precisely annotated genome assembly is a fundamental requirement for functional genomic analysis. Here, the complete DNA sequence and gene annotation of mouse Chromosome 11 was used to test the efficacy of large-scale sequencing for mutation identification. We re-sequenced the 14,000 annotated exons and boundaries from over 900 genes in 41 recessive mutant mouse lines that were isolated in an N-ethyl-N-nitrosourea (ENU) mutation screen targeted to mouse Chromosome 11. Fifty-nine sequence variants were identified in 55 genes from 31 mutant lines. 39% of the lesions lie in coding sequences and create primarily missense mutations. The other 61% lie in noncoding regions, many of them in highly conserved sequences. A lesion in the perinatal lethal line l11Jus13 alters a consensus splice site of nucleoredoxin (Nxn), inserting 10 amino acids into the resulting protein. We conclude that point mutations can be accurately and sensitively recovered by large-scale sequencing, and that conserved noncoding regions should be included for disease mutation identification. Only seven of the candidate genes we report have been previously targeted by mutation in mice or rats, showing that despite ongoing efforts to functionally annotate genes in the mammalian genome, an enormous gap remains between phenotype and function. Our data show that the classical positional mapping approach of disease mutation identification can be extended to large target regions using high-throughput sequencing.
Affiliation(s)
- Melissa K. Boles
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Bonney M. Wilkinson
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Laurens G. Wilming
  - The Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, United Kingdom
- Bin Liu
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Frank J. Probst
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Jennifer Harrow
  - The Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, United Kingdom
- Darren Grafham
  - The Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, United Kingdom
- Kathryn E. Hentges
  - Faculty of Life Sciences, University of Manchester, Manchester, United Kingdom
- Lanette P. Woodward
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Andrea Maxwell
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
- Karen Mitchell
  - Faculty of Life Sciences, University of Manchester, Manchester, United Kingdom
- Michael D. Risley
  - Faculty of Life Sciences, University of Manchester, Manchester, United Kingdom
- Randy Johnson
  - The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America
- Karen Hirschi
  - Department of Pediatrics, Baylor College of Medicine, Houston, Texas, United States of America
- James R. Lupski
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
  - Department of Pediatrics, Baylor College of Medicine, Houston, Texas, United States of America
  - Texas Children's Hospital, Houston, Texas, United States of America
- Yosuke Funato
  - Laboratory of Intracellular Signaling, Institute for Protein Research, Osaka University, Osaka, Japan
- Hiroaki Miki
  - Laboratory of Intracellular Signaling, Institute for Protein Research, Osaka University, Osaka, Japan
- Pablo Marin-Garcia
  - The Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, United Kingdom
- Lucy Matthews
  - The Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, United Kingdom
- Alison J. Coffey
  - The Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, United Kingdom
- Anne Parker
  - The Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, United Kingdom
- Tim J. Hubbard
  - The Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, United Kingdom
- Jane Rogers
  - The Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, United Kingdom
- Allan Bradley
  - The Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, United Kingdom
- David J. Adams
  - The Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, United Kingdom
  - * E-mail: (MJJ); (DJA)
- Monica J. Justice
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, United States of America
  - * E-mail: (MJJ); (DJA)
|
49
|
Siddle HV, Deakin JE, Coggill P, Hart E, Cheng Y, Wong ES, Harrow J, Beck S, Belov K. MHC-linked and un-linked class I genes in the wallaby. BMC Genomics 2009; 10:310. [PMID: 19602235 PMCID: PMC2719672 DOI: 10.1186/1471-2164-10-310] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.9] [Received: 10/22/2008] [Accepted: 07/14/2009] [Indexed: 01/15/2023] Open
Abstract
BACKGROUND MHC class I antigens are encoded by a rapidly evolving gene family comprising classical and non-classical genes that are found in all vertebrates and involved in diverse immune functions. However, there is a fundamental difference between the organization of class I genes in mammals and non-mammals. Non-mammals have a single classical gene responsible for antigen presentation, which is linked to the antigen processing genes, including TAP. This organization allows co-evolution of advantageous class Ia/TAP haplotypes. In contrast, mammals have multiple classical genes within the MHC, which are separated from the antigen processing genes by class III genes. It has been hypothesized that separation of classical class I genes from antigen processing genes in mammals allowed them to duplicate. We investigated this hypothesis by characterizing the class I genes of the tammar wallaby, a model marsupial that has a novel MHC organization, with class I genes located within the MHC and 10 other chromosomal locations. RESULTS Sequence analysis of 14 BACs containing 15 class I genes revealed that nine class I genes, including one to three classical class I, are not linked to the MHC but are scattered throughout the genome. Kangaroo Endogenous Retroviruses (KERVs) were identified flanking the MHC un-linked class I. The wallaby MHC contains four non-classical class I, interspersed with antigen processing genes. Clear orthologs of non-classical class I are conserved in distant marsupial lineages. CONCLUSION We demonstrate that classical class I genes are not linked to antigen processing genes in the wallaby and provide evidence that retroviral elements were involved in their movement. The presence of retroviral elements most likely facilitated the formation of recombination hotspots and subsequent diversification of class I genes. The classical class I have moved away from antigen processing genes in eutherian mammals and the wallaby independently, but both lineages appear to have benefited from this loss of linkage by increasing the number of classical genes, perhaps enabling response to a wider range of pathogens. The discovery of non-classical orthologs between distantly related marsupial species is unusual for the rapidly evolving class I genes and may indicate an important marsupial specific function.
Affiliation(s)
- Hannah V Siddle
  - Faculty of Veterinary Science, University of Sydney, NSW 2006, Australia.
|
50
|
Pruitt KD, Harrow J, Harte RA, Wallin C, Diekhans M, Maglott DR, Searle S, Farrell CM, Loveland JE, Ruef BJ, Hart E, Suner MM, Landrum MJ, Aken B, Ayling S, Baertsch R, Fernandez-Banet J, Cherry JL, Curwen V, Dicuccio M, Kellis M, Lee J, Lin MF, Schuster M, Shkeda A, Amid C, Brown G, Dukhanina O, Frankish A, Hart J, Maidak BL, Mudge J, Murphy MR, Murphy T, Rajan J, Rajput B, Riddick LD, Snow C, Steward C, Webb D, Weber JA, Wilming L, Wu W, Birney E, Haussler D, Hubbard T, Ostell J, Durbin R, Lipman D. The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes. Genome Res 2009; 19:1316-23. [PMID: 19498102 PMCID: PMC2704439 DOI: 10.1101/gr.080531.108] [Citation(s) in RCA: 401] [Impact Index Per Article: 26.7] [Received: 12/05/2008] [Accepted: 04/20/2009] [Indexed: 11/25/2022]
Abstract
Effective use of the human and mouse genomes requires reliable identification of genes and their products. Although multiple public resources provide annotation, different methods are used that can result in similar but not identical representation of genes, transcripts, and proteins. The collaborative consensus coding sequence (CCDS) project tracks identical protein annotations on the reference mouse and human genomes with a stable identifier (CCDS ID), and ensures that they are consistently represented on the NCBI, Ensembl, and UCSC Genome Browsers. Importantly, the project coordinates on manually reviewing inconsistent protein annotations between sites, as well as annotations for which new evidence suggests a revision is needed, to progressively converge on a complete protein-coding set for the human and mouse reference genomes, while maintaining a high standard of reliability and biological accuracy. To date, the project has identified 20,159 human and 17,707 mouse consensus coding regions from 17,052 human and 16,893 mouse genes. Three evaluation methods indicate that the entries in the CCDS set are highly likely to represent real proteins, more so than annotations from contributing groups not included in CCDS. The CCDS database thus centralizes the function of identifying well-supported, identically-annotated, protein-coding regions.
Affiliation(s)
- Kim D Pruitt
  - National Center for Biotechnology Information, National Library of Medicine, Bethesda, Maryland 20894, USA.
|