1
Tao S, Hou Y, Diao L, Hu Y, Xu W, Xie S, Xiao Z. Long noncoding RNA study: Genome-wide approaches. Genes Dis 2023; 10:2491-2510. [PMID: 37554208 PMCID: PMC10404890 DOI: 10.1016/j.gendis.2022.10.024] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 06/11/2022] [Revised: 10/09/2022] [Accepted: 10/23/2022] [Indexed: 11/30/2022]
Abstract
Long noncoding RNAs (lncRNAs) have been confirmed to play a crucial role in various biological processes across several species. Though many efforts have been devoted to the expansion of the lncRNAs landscape, much about lncRNAs is still unknown due to their great complexity. The development of high-throughput technologies and the constantly improved bioinformatic methods have resulted in a rapid expansion of lncRNA research and relevant databases. In this review, we introduced genome-wide research of lncRNAs in three parts: (i) novel lncRNA identification by high-throughput sequencing and computational pipelines; (ii) functional characterization of lncRNAs by expression atlas profiling, genome-scale screening, and the research of cancer-related lncRNAs; (iii) mechanism research by large-scale experimental technologies and computational analysis. Besides, primary experimental methods and bioinformatic pipelines related to these three parts are summarized. This review aimed to provide a comprehensive and systemic overview of lncRNA genome-wide research strategies and indicate a genome-wide lncRNA research system.
Affiliation(s)
- Shuang Tao
  - The Biotherapy Center, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, Guangdong 510630, China
- Yarui Hou
  - The Biotherapy Center, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, Guangdong 510630, China
- Liting Diao
  - The Biotherapy Center, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, Guangdong 510630, China
- Yanxia Hu
  - The Biotherapy Center, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, Guangdong 510630, China
- Wanyi Xu
  - The Biotherapy Center, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, Guangdong 510630, China
- Shujuan Xie
  - The Biotherapy Center, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, Guangdong 510630, China
  - Institute of Vaccine, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, Guangdong 510630, China
- Zhendong Xiao
  - The Biotherapy Center, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, Guangdong 510630, China

2
Johnson GE, Parker DJ, Lalanne JB, Parker ML, Li GW. BaM-seq and TBaM-seq, highly multiplexed and targeted RNA-seq protocols for rapid, low-cost library generation from bacterial samples. NAR Genom Bioinform 2023; 5:lqad017. [PMID: 36879903 PMCID: PMC9985320 DOI: 10.1093/nargab/lqad017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2022] [Revised: 01/13/2023] [Accepted: 02/27/2023] [Indexed: 03/07/2023]
Abstract
The ability to profile transcriptomes and characterize global gene expression changes has been greatly enabled by the development of RNA sequencing technologies (RNA-seq). However, the process of generating sequencing-compatible cDNA libraries from RNA samples can be time-consuming and expensive, especially for bacterial mRNAs which lack poly(A)-tails that are often used to streamline this process for eukaryotic samples. Compared to the increasing throughput and decreasing cost of sequencing, library preparation has had limited advances. Here, we describe bacterial-multiplexed-seq (BaM-seq), an approach that enables simple barcoding of many bacterial RNA samples that decreases the time and cost of library preparation. We also present targeted-bacterial-multiplexed-seq (TBaM-seq) that allows for differential expression analysis of specific gene panels with over 100-fold enrichment in read coverage. In addition, we introduce the concept of transcriptome redistribution based on TBaM-seq that dramatically reduces the required sequencing depth while still allowing for quantification of both highly and lowly abundant transcripts. These methods accurately measure gene expression changes with high technical reproducibility and agreement with gold standard, lower throughput approaches. Together, use of these library preparation protocols allows for fast, affordable generation of sequencing libraries.
Affiliation(s)
- Grace E Johnson
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA
- Darren J Parker
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA
- Jean-Benoit Lalanne
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Department of Physics, Massachusetts Institute of Technology, Cambridge, MA, USA
- Mirae L Parker
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Computational & Systems Biology Graduate Program, Massachusetts Institute of Technology, Cambridge, MA, USA
- Gene-Wei Li
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA

3
Li M, Liang C. LncDC: a machine learning-based tool for long non-coding RNA detection from RNA-Seq data. Sci Rep 2022; 12:19083. [PMID: 36351980 PMCID: PMC9646749 DOI: 10.1038/s41598-022-22082-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2022] [Accepted: 10/10/2022] [Indexed: 11/11/2022]
Abstract
Long non-coding RNAs (lncRNAs) play an essential role in diverse biological processes and disease development. Accurate classification of lncRNAs and mRNAs is important for the identification of tissue- or disease-specific lncRNAs. Here, we present our tool LncDC (Long non-coding RNA detection) that is able to accurately predict lncRNAs with an XGBoost model using features extracted from RNA sequences, secondary structures, and translated proteins. Benchmarking experiments showed that LncDC consistently outperformed six state-of-the-art tools in distinguishing lncRNAs from mRNAs. Notably, the use of sequence and secondary structure (SASS) k-mer score features and flexible ORF features improved the classification capability of LncDC. We anticipate that LncDC will definitely promote the discovery of more and novel disease-specific lncRNAs. LncDC is implemented in Python and freely available at https://github.com/lim74/LncDC .
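To make the kind of sequence-derived features mentioned in this abstract more concrete, the following is a minimal sketch of two generic features often used to separate lncRNAs from mRNAs: k-mer frequencies and the length of the longest open reading frame. It is an illustration only and is not taken from LncDC; the function names `kmer_frequencies` and `longest_orf_length` are hypothetical, and LncDC's actual SASS k-mer scores, flexible ORF features, and XGBoost model are not reproduced here.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}


def kmer_frequencies(seq, k=3):
    """Return the relative frequency of every DNA k-mer in seq (hypothetical helper)."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing ambiguous bases such as N
            counts[kmer] += 1
            total += 1
    return {kmer: (c / total if total else 0.0) for kmer, c in counts.items()}


def longest_orf_length(seq):
    """Length in nucleotides of the longest ATG...stop ORF on the forward strand."""
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)  # include the stop codon
                start = None
    return best
```

A feature vector built this way (for example, the 64 trinucleotide frequencies plus the ORF length) could be passed to any gradient-boosted classifier; which features and model settings LncDC uses is described in the paper and its repository, not here.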
Affiliation(s)
- Minghua Li
  - Department of Biology, Miami University, Oxford, OH 45056, USA
- Chun Liang
  - Department of Biology, Miami University, Oxford, OH 45056, USA

4
Santos F, Capela AM, Mateus F, Nóbrega-Pereira S, Bernardes de Jesus B. Non-coding antisense transcripts: fine regulation of gene expression in cancer. Comput Struct Biotechnol J 2022; 20:5652-5660. [PMID: 36284703 PMCID: PMC9579725 DOI: 10.1016/j.csbj.2022.10.009] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 08/18/2022] [Revised: 10/03/2022] [Accepted: 10/04/2022] [Indexed: 11/14/2022]
Abstract
Natural antisense transcripts (NATs) are coding or non-coding RNA sequences transcribed in the opposite direction from the same genomic locus. NATs are widely distributed throughout the human genome and seem to play crucial roles in physiological and pathological processes, through newly described and targeted mechanisms. NATs represent the intricate complexity of the genome organization and constitute another layer of potential targets in disease. Here, we focus on the interesting and unique role of non-coding NATs in cancer, paying particular attention to those acting as miRNA sponges.
Affiliation(s)
- Bruno Bernardes de Jesus
  - Corresponding author at: Department of Medical Sciences and Institute of Biomedicine – iBiMED, University of Aveiro, 3810-193 Aveiro, Portugal.

5
HER2-PI9 and HER2-I12: two novel and functionally active splice variants of the oncogene HER2 in breast cancer. J Cancer Res Clin Oncol 2021; 147:2893-2912. [PMID: 34136934 PMCID: PMC8397700 DOI: 10.1007/s00432-021-03689-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2021] [Accepted: 06/05/2021] [Indexed: 11/03/2022]
Abstract
In this study, two novel alternative splice variants of HER2, named HER2-PI9 and HER2-I12, were identified in breast cancer cell lines and breast tumour tissues. Whilst HER2-PI9 arises from the inclusion of a 117 bp cassette-exon of intron 9 of HER2, HER2-I12 results from intron 12 inclusion. In silico analyses were performed to predict the amino acid sequences of these two HER2 novel variants. To confirm their protein expression, plasmid vectors were generated and transfected into the HER2 negative breast cancer cell line, MCF-7. Additionally, their functional properties in oncogenic signalling were confirmed. Expression of HER2-PI9 and HER2-I12 was successful and matched the in silico predictions. Importantly, these splice variants can modulate the phosphorylation levels of extracellular signal-related kinase 1/2 (ERK1/2) and Akt/protein kinase B (Akt) signalling in MCF-7 breast cancer cells. Enhanced cellular proliferation, migration and invasion were observed in the case of the HER2-I12 expressing model. In human tissues and breast carcinoma tumours both variants were present. This study reveals two novel splice variants of HER2. Additionally, the potential biological activity for HER2-PI9 and HER2-I12 in breast cancer cells is also reported.

6
Frankish A, Diekhans M, Jungreis I, Lagarde J, Loveland JE, Mudge JM, Sisu C, Wright JC, Armstrong J, Barnes I, Berry A, Bignell A, Boix C, Carbonell Sala S, Cunningham F, Di Domenico T, Donaldson S, Fiddes IT, García Girón C, Gonzalez JM, Grego T, Hardy M, Hourlier T, Howe KL, Hunt T, Izuogu OG, Johnson R, Martin FJ, Martínez L, Mohanan S, Muir P, Navarro FCP, Parker A, Pei B, Pozo F, Riera FC, Ruffier M, Schmitt BM, Stapleton E, Suner MM, Sycheva I, Uszczynska-Ratajczak B, Wolf MY, Xu J, Yang YT, Yates A, Zerbino D, Zhang Y, Choudhary JS, Gerstein M, Guigó R, Hubbard TJP, Kellis M, Paten B, Tress ML, Flicek P. GENCODE 2021. Nucleic Acids Res 2021; 49:D916-D923. [PMID: 33270111 PMCID: PMC7778937 DOI: 10.1093/nar/gkaa1087] [Citation(s) in RCA: 576] [Impact Index Per Article: 192.0] [Received: 09/21/2020] [Revised: 10/21/2020] [Accepted: 10/24/2020] [Indexed: 12/14/2022]
Abstract
The GENCODE project annotates human and mouse genes and transcripts supported by experimental data with high accuracy, providing a foundational resource that supports genome biology and clinical genomics. GENCODE annotation processes make use of primary data and bioinformatic tools and analysis generated both within the consortium and externally to support the creation of transcript structures and the determination of their function. Here, we present improvements to our annotation infrastructure, bioinformatics tools, and analysis, and the advances they support in the annotation of the human and mouse genomes including: the completion of first pass manual annotation for the mouse reference genome; targeted improvements to the annotation of genes associated with SARS-CoV-2 infection; collaborative projects to achieve convergence across reference annotation databases for the annotation of human and mouse protein-coding genes; and the first GENCODE manually supervised automated annotation of lncRNAs. Our annotation is accessible via Ensembl, the UCSC Genome Browser and https://www.gencodegenes.org.
Affiliation(s)
- Adam Frankish
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Mark Diekhans
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA 95064, USA
- Irwin Jungreis
  - MIT Computer Science and Artificial Intelligence Laboratory, 32 Vassar St, Cambridge, MA 02139, USA
  - Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142, USA
- Julien Lagarde
  - Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Dr. Aiguader 88, Barcelona, E-08003 Catalonia, Spain
- Jane E Loveland
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Jonathan M Mudge
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Cristina Sisu
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
  - Department of Bioscience, Brunel University London, Uxbridge UB8 3PH, UK
- James C Wright
  - Functional Proteomics, Division of Cancer Biology, Institute of Cancer Research, 237 Fulham Road, London SW3 6JB, UK
- Joel Armstrong
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA 95064, USA
- If Barnes
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Andrew Berry
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Alexandra Bignell
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Carles Boix
  - MIT Computer Science and Artificial Intelligence Laboratory, 32 Vassar St, Cambridge, MA 02139, USA
  - Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142, USA
  - Computational and Systems Biology Program, Massachusetts Institute of Technology, Cambridge, MA, USA
- Silvia Carbonell Sala
  - Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Dr. Aiguader 88, Barcelona, E-08003 Catalonia, Spain
- Fiona Cunningham
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Tomás Di Domenico
  - Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- Sarah Donaldson
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Ian T Fiddes
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA 95064, USA
- Carlos García Girón
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Jose Manuel Gonzalez
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Tiago Grego
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Matthew Hardy
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Thibaut Hourlier
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Kevin L Howe
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Toby Hunt
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Osagie G Izuogu
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Rory Johnson
  - Department of Medical Oncology, Inselspital, University Hospital, University of Bern, Bern, Switzerland
  - Department of Biomedical Research (DBMR), University of Bern, Bern, Switzerland
- Fergal J Martin
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Laura Martínez
  - Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- Shamika Mohanan
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Paul Muir
  - Department of Molecular, Cellular & Developmental Biology, Yale University, New Haven, CT 06520, USA
  - Systems Biology Institute, Yale University, West Haven, CT 06516, USA
- Fabio C P Navarro
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
- Anne Parker
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Baikang Pei
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
- Fernando Pozo
  - Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- Ferriol Calvet Riera
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Magali Ruffier
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Bianca M Schmitt
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Eloise Stapleton
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Marie-Marthe Suner
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Irina Sycheva
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Maxim Y Wolf
  - Department of Biomedical Informatics at Harvard Medical School, 10 Shattuck Street, Suite 514, Boston, MA 02115, USA
- Jinuri Xu
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
- Yucheng T Yang
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
  - Program in Computational Biology & Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, CT 06520, USA
- Andrew Yates
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Daniel Zerbino
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Yan Zhang
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
- Jyoti S Choudhary
  - Functional Proteomics, Division of Cancer Biology, Institute of Cancer Research, 237 Fulham Road, London SW3 6JB, UK
- Mark Gerstein
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
  - Program in Computational Biology & Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, CT 06520, USA
  - Department of Computer Science, Yale University, Bass 432, 266 Whitney Avenue, New Haven, CT 06520, USA
- Roderic Guigó
  - Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Dr. Aiguader 88, Barcelona, E-08003 Catalonia, Spain
  - Universitat Pompeu Fabra (UPF), Barcelona, E-08003 Catalonia, Spain
- Tim J P Hubbard
  - Department of Medical and Molecular Genetics, King's College London, Guys Hospital, Great Maze Pond, London SE1 9RT, UK
- Manolis Kellis
  - MIT Computer Science and Artificial Intelligence Laboratory, 32 Vassar St, Cambridge, MA 02139, USA
  - Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142, USA
- Benedict Paten
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA 95064, USA
- Michael L Tress
  - Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- Paul Flicek
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK

7

8
Cheng Q, Xiao H, Xiong Q. Conserved exitrons of FLAGELLIN-SENSING 2 (FLS2) across dicot plants and their functions. Plant Sci 2020; 296:110507. [PMID: 32540022 DOI: 10.1016/j.plantsci.2020.110507] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 11/30/2019] [Revised: 03/20/2020] [Accepted: 04/22/2020] [Indexed: 06/11/2023]
Abstract
The alternative splicing of pattern recognition receptor genes regulates immune signalling in mammals, but in plants its role is still unknown. Here, we detected alternatively spliced introns (exitrons) in the first annotated exons of FLAGELLIN-SENSING 2 (FLS2) genes in all the examined dicot plants across nine families. The 5' splice site (SS) regions were conserved and with rare synonymous substitutions. Point mutations and gene swaps indicated that the position and efficiency of exitron splicing primarily depended on the nucleotide sequences of FLS2 genes. Single-nucleotide mutations in the invariable codon carrying 5' SS dramatically altered the accumulation of poplar and tomato FLS2 transcripts, indicating the 5'-proximal exitrons of FLS2 function as stimulatory introns on gene expression. The 3' SSs of exitrons are diverse and can be changed by 1-2 nucleotide mutations in Salicaceae FLS2. The alternative transcripts (ATs) of poplar and tobacco FLS2, which encode small secreted proteins, were specifically induced by flg22, and one such AT from tobacco FLS2 suppressed flg22-induced response. Our results indicated that the exitrons of FLS2 genes regulate the accumulation of transcripts by an intron mediated enhancement (IME) mechanism and some ATs have the potential to encode suppressors for FLS2 pathway.
Affiliation(s)
- Qiang Cheng
  - Key Laboratory of Forest Genetics & Biotechnology of Ministry of Education, Co-Innovation Center for Sustainable Forestry in Southern China, Nanjing Forestry University, Nanjing, 210037, China
- Hongju Xiao
  - Key Laboratory of Forest Genetics & Biotechnology of Ministry of Education, Co-Innovation Center for Sustainable Forestry in Southern China, Nanjing Forestry University, Nanjing, 210037, China
- Qin Xiong
  - Key Laboratory of Forest Genetics & Biotechnology of Ministry of Education, Co-Innovation Center for Sustainable Forestry in Southern China, Nanjing Forestry University, Nanjing, 210037, China

9
Characterization of splice-altering mutations in inherited predisposition to cancer. Proc Natl Acad Sci U S A 2019; 116:26798-26807. [PMID: 31843900 DOI: 10.1073/pnas.1915608116] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Indexed: 12/19/2022]
Abstract
Mutations responsible for inherited disease may act by disrupting normal transcriptional splicing. Such mutations can be difficult to detect, and their effects difficult to characterize, because many lie deep within exons or introns where they may alter splice enhancers or silencers or introduce new splice acceptors or donors. Multiple mutation-specific and genome-wide approaches have been developed to evaluate these classes of mutations. We introduce a complementary experimental approach, cBROCA, which yields qualitative and quantitative assessments of the effects of genomic mutations on transcriptional splicing of tumor suppressor genes. cBROCA analysis is undertaken by deriving complementary DNA (cDNA) from puromycin-treated patient lymphoblasts, hybridizing the cDNA to the BROCA panel of tumor suppressor genes, and then multiplex sequencing to very high coverage. At each splice junction suggested by split sequencing reads, read depths of test and control samples are compared. Significant Z scores indicate altered transcripts, over and above naturally occurring minor transcripts, and comparisons of read depths indicate relative abundances of mutant and normal transcripts. BROCA analysis of genomic DNA suggested 120 rare mutations from 150 families with cancers of the breast, ovary, uterus, or colon, in >600 informative genotyped relatives. cBROCA analysis of their transcripts revealed a wide variety of consequences of abnormal splicing in tumor suppressor genes, including whole or partial exon skipping, exonification of intronic sequence, loss or gain of exonic and intronic splicing enhancers and silencers, complete intron retention, hypomorphic alleles, and combinations of these alterations. Combined with pedigree analysis, cBROCA sequencing contributes to understanding the clinical consequences of rare inherited mutations.
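The per-junction comparison of test and control read depths described in this abstract can be illustrated with a small sketch. The abstract does not specify how depths are normalized or how the Z score is computed, so the choices here are assumptions: each sample's junction depth is expressed as a fraction of its total reads, and the Z score uses the mean and sample standard deviation of the control fractions. The function name `junction_z_score` is hypothetical and this is not the cBROCA implementation.

```python
from statistics import mean, stdev


def junction_z_score(test_fraction, control_fractions):
    """Z score of a test sample's junction usage relative to controls.

    test_fraction: fraction of reads supporting the junction in the test sample
                   (assumed normalization, not taken from the paper).
    control_fractions: the same quantity for at least two control samples.
    """
    mu = mean(control_fractions)
    sigma = stdev(control_fractions)  # sample standard deviation
    if sigma == 0:
        raise ValueError("controls show no variation; Z score is undefined")
    return (test_fraction - mu) / sigma


# Example: a junction used far more often in the test sample than in controls.
controls = [0.01, 0.02, 0.03]
z = junction_z_score(0.05, controls)
```

A large positive Z score would flag a junction as more abundant in the test sample than expected from naturally occurring minor transcripts; the significance threshold is left open because the abstract does not state one.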

10
Frankish A, Diekhans M, Ferreira AM, Johnson R, Jungreis I, Loveland J, Mudge JM, Sisu C, Wright J, Armstrong J, Barnes I, Berry A, Bignell A, Carbonell Sala S, Chrast J, Cunningham F, Di Domenico T, Donaldson S, Fiddes IT, García Girón C, Gonzalez JM, Grego T, Hardy M, Hourlier T, Hunt T, Izuogu OG, Lagarde J, Martin FJ, Martínez L, Mohanan S, Muir P, Navarro FC, Parker A, Pei B, Pozo F, Ruffier M, Schmitt BM, Stapleton E, Suner MM, Sycheva I, Uszczynska-Ratajczak B, Xu J, Yates A, Zerbino D, Zhang Y, Aken B, Choudhary JS, Gerstein M, Guigó R, Hubbard TJ, Kellis M, Paten B, Reymond A, Tress ML, Flicek P. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res 2019; 47:D766-D773. [PMID: 30357393 PMCID: PMC6323946 DOI: 10.1093/nar/gky955] [Citation(s) in RCA: 1778] [Impact Index Per Article: 355.6] [Received: 08/15/2018] [Revised: 09/20/2018] [Accepted: 10/08/2018] [Indexed: 02/06/2023]
Abstract
The accurate identification and description of the genes in the human and mouse genomes is a fundamental requirement for high quality analysis of data informing both genome biology and clinical genomics. Over the last 15 years, the GENCODE consortium has been producing reference quality gene annotations to provide this foundational resource. The GENCODE consortium includes both experimental and computational biology groups who work together to improve and extend the GENCODE gene annotation. Specifically, we generate primary data, create bioinformatics tools and provide analysis to support the work of expert manual gene annotators and automated gene annotation pipelines. In addition, manual and computational annotation workflows use any and all publicly available data and analysis, along with the research literature to identify and characterise gene loci to the highest standard. GENCODE gene annotations are accessible via the Ensembl and UCSC Genome Browsers, the Ensembl FTP site, Ensembl Biomart, Ensembl Perl and REST APIs as well as https://www.gencodegenes.org.
Affiliation(s)
- Adam Frankish
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Mark Diekhans
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA 95064, USA
- Anne-Maud Ferreira
  - Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland
- Rory Johnson
  - Department of Medical Oncology, Inselspital, University Hospital, University of Bern, Bern, Switzerland
  - Department of Biomedical Research (DBMR), University of Bern, Bern, Switzerland
- Irwin Jungreis
  - MIT Computer Science and Artificial Intelligence Laboratory, 32 Vassar St, Cambridge, MA 02139, USA
  - Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142, USA
- Jane Loveland
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Jonathan M Mudge
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Cristina Sisu
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
  - Department of Bioscience, Brunel University London, Uxbridge UB8 3PH, UK
- James Wright
  - Functional Proteomics, Division of Cancer Biology, Institute of Cancer Research, 123 Old Brompton Road, London SW7 3RP, UK
- Joel Armstrong
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA 95064, USA
- If Barnes
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Andrew Berry
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Alexandra Bignell
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Silvia Carbonell Sala
  - Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Dr. Aiguader 88, Barcelona, E-08003 Catalonia, Spain
- Jacqueline Chrast
  - Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland
- Fiona Cunningham
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Tomás Di Domenico
  - Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- Sarah Donaldson
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Ian T Fiddes
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA 95064, USA
- Carlos García Girón
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Jose Manuel Gonzalez
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Tiago Grego
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Matthew Hardy
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Thibaut Hourlier
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Toby Hunt
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Osagie G Izuogu
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Julien Lagarde
  - Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Dr. Aiguader 88, Barcelona, E-08003 Catalonia, Spain
- Fergal J Martin
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Laura Martínez
  - Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- Shamika Mohanan
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Paul Muir
  - Department of Molecular, Cellular & Developmental Biology, Yale University, New Haven, CT 06520, USA
  - Systems Biology Institute, Yale University, West Haven, CT 06516, USA
- Fabio C P Navarro
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
- Anne Parker
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Baikang Pei
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
- Fernando Pozo
  - Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- Magali Ruffier
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Bianca M Schmitt
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Eloise Stapleton
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Marie-Marthe Suner
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Irina Sycheva
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | | | - Jinuri Xu
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
| | - Andrew Yates
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Daniel Zerbino
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Yan Zhang
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
| | - Bronwen Aken
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Jyoti S Choudhary
- Functional Proteomics, Division of Cancer Biology, Institute of Cancer Research, 123 Old Brompton Road, London SW7 3RP, UK
| | - Mark Gerstein
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
- Program in Computational Biology & Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, CT 06520, USA
- Department of Computer Science, Yale University, Bass 432, 266 Whitney Avenue, New Haven, CT 06520, USA
| | - Roderic Guigó
- Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Dr. Aiguader 88, Barcelona, E-08003 Catalonia, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, E-08003 Catalonia, Spain
| | - Tim J P Hubbard
- Department of Medical and Molecular Genetics, King's College London, Guys Hospital, Great Maze Pond, London SE1 9RT, UK
| | - Manolis Kellis
- MIT Computer Science and Artificial Intelligence Laboratory, 32 Vassar St, Cambridge, MA 02139, USA
- Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142, USA
| | - Benedict Paten
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA 95064, USA
| | - Alexandre Reymond
- Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland
| | - Michael L Tress
- Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
| | - Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
|
11
|
Capturing the Alternative Cleavage and Polyadenylation Sites of 14 NAC Genes in Populus Using a Combination of 3'-RACE and High-Throughput Sequencing. Molecules 2018. [PMID: 29518015 PMCID: PMC6017670 DOI: 10.3390/molecules23030608] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/21/2022] Open
Abstract
Detection of complex splice sites (SSs) and polyadenylation sites (PASs) of eukaryotic genes is essential for the elucidation of gene regulatory mechanisms. Transcriptome-wide studies using high-throughput sequencing (HTS) have revealed prevalent alternative splicing (AS) and alternative polyadenylation (APA) in plants. However, small-scale and high-depth HTS aimed at detecting genes or gene families are very few and limited. We explored a convenient and flexible method for profiling SSs and PASs, which combines rapid amplification of 3′-cDNA ends (3′-RACE) and HTS. Fourteen NAC (NAM, ATAF1/2, CUC2) transcription factor genes of Populus trichocarpa were analyzed by 3′-RACE-seq. Based on experimental reproducibility, boundary sequence analysis and reverse transcription PCR (RT-PCR) verification, only canonical SSs were considered to be authentic. Based on stringent criteria, candidate PASs without any internal priming features were chosen as authentic PASs and assumed to be PAS-rich markers. Thirty-four novel canonical SSs, six intronic/internal exons and thirty 3′-UTR PAS-rich markers were revealed by 3′-RACE-seq. Using 3′-RACE and real-time PCR, we confirmed that three APA transcripts ending in/around PAS-rich markers were differentially regulated in response to plant hormones. Our results indicate that 3′-RACE-seq is a robust and cost-effective method to discover SSs and label active regions subjected to APA for genes or gene families. The method is suitable for small-scale AS and APA research in the initial stage.
|
12
|
Merrick BA, Chang JS, Phadke DP, Bostrom MA, Shah RR, Wang X, Gordon O, Wright GM. HAfTs are novel lncRNA transcripts from aflatoxin exposure. PLoS One 2018; 13:e0190992. [PMID: 29351317 PMCID: PMC5774710 DOI: 10.1371/journal.pone.0190992] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 09/22/2017] [Accepted: 12/22/2017] [Indexed: 12/28/2022] Open
Abstract
The transcriptome can reveal insights into precancer biology. We recently conducted RNA-Seq analysis on liver RNA from male rats exposed to the carcinogen, aflatoxin B1 (AFB1), for 90 days prior to liver tumor onset. Among >1,000 differentially expressed transcripts, several novel, unannotated Cufflinks-assembled transcripts, or HAfTs (Hepatic Aflatoxin Transcripts) were found. We hypothesized PCR-cloning and RACE (rapid amplification of cDNA ends) could further HAfT identification. Sanger data was obtained for 6 transcripts by PCR and 16 transcripts by 5’- and 3’-RACE. BLAST alignments showed, with two exceptions, HAfT transcripts were lncRNAs, >200nt without apparent long open reading frames. Six rat HAfT transcripts were classified as ‘novel’ without RefSeq annotation. Sequence alignment and genomic synteny showed each rat lncRNA had a homologous locus in the mouse genome and over half had homologous loci in the human genome, including at least two loci (and possibly three others) that were previously unannotated. While HAfT functions are not yet clear, coregulatory roles may be possible from their adjacent orientation to known coding genes with altered expression that include 8 HAfT-gene pairs. For example, a unique rat HAfT, homologous to Pvt1, was adjacent to known genes controlling cell proliferation. Additionally, PCR and RACE Sanger sequencing showed many alternative splice variants and refinements of exon sequences compared to Cufflinks assembled transcripts and gene prediction algorithms. Presence of multiple splice variants and short tandem repeats found in some HAfTs may be consequential for secondary structure, transcriptional regulation, and function. In summary, we report novel, differentially expressed lncRNAs after exposure to the genotoxicant, AFB1, prior to neoplastic lesions. Complete cloning and sequencing of such transcripts could pave the way for a new set of sensitive and early prediction markers for chemical hepatocarcinogens.
Affiliation(s)
- B. Alex Merrick
- Biomolecular Screening Branch, Division National Toxicology Program, National Institute of Environmental Health Sciences, Research Triangle Park, North Carolina, United States of America
| | - Justin S. Chang
- Biomolecular Screening Branch, Division National Toxicology Program, National Institute of Environmental Health Sciences, Research Triangle Park, North Carolina, United States of America
| | - Dhiral P. Phadke
- Sciome, LLC, Research Triangle Park, North Carolina, United States of America
| | - Meredith A. Bostrom
- Genomics Laboratory, David H. Murdock Research Institute, Kannapolis, North Carolina, United States of America
| | - Ruchir R. Shah
- Sciome, LLC, Research Triangle Park, North Carolina, United States of America
| | - Xinguo Wang
- Genomics Laboratory, David H. Murdock Research Institute, Kannapolis, North Carolina, United States of America
| | - Oksana Gordon
- Genomics Laboratory, David H. Murdock Research Institute, Kannapolis, North Carolina, United States of America
| | - Garron M. Wright
- Genomics Laboratory, David H. Murdock Research Institute, Kannapolis, North Carolina, United States of America
|
13
|
Serba DD, Uppalapati SR, Krom N, Mukherjee S, Tang Y, Mysore KS, Saha MC. Transcriptome analysis in switchgrass discloses ecotype difference in photosynthetic efficiency. BMC Genomics 2016; 17:1040. [PMID: 27986076 PMCID: PMC5162099 DOI: 10.1186/s12864-016-3377-8] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 05/24/2016] [Accepted: 12/05/2016] [Indexed: 11/28/2022] Open
Abstract
BACKGROUND Switchgrass, a warm-season perennial grass studied as a potential dedicated biofuel feedstock, is classified into two main taxa - lowland and upland ecotypes - that differ in morphology and habitat of adaptation. But there is limited information on their inherent molecular variations. RESULTS Transcriptome analysis by RNA-sequencing (RNA-Seq) was conducted for lowland and upland ecotypes to document their gene expression variations. Mapping of transcriptome to the reference genome (Panicum virgatum v1.1) revealed that the lowland and upland ecotypes differ substantially in sets of genes transcribed as well as levels of expression. Differential gene expression analysis exhibited that transcripts related to photosynthesis efficiency and development and photosystem reaction center subunits were upregulated in lowlands compared to upland genotype. On the other hand, catalase isozymes, helix-loop-helix, late embryogenesis abundant group I, photosulfokinases, and S-adenosyl methionine synthase gene transcripts were upregulated in the upland compared to the lowlands. At ≥100x coverage and ≥5% minor allele frequency, a total of 25,894 and 16,979 single nucleotide polymorphism (SNP) markers were discovered for VS16 (upland ecotype) and K5 (lowland ecotype) against the reference genome. The allele combination of the SNPs revealed that the transition mutations are more prevalent than the transversion mutations. CONCLUSIONS The gene ontology (GO) analysis of the transcriptome indicated lowland ecotype had significantly higher representation for cellular components associated with photosynthesis machinery controlling carbon fixation. In addition, using the transcriptome data, SNP markers were detected, which were distributed throughout the genome. 
The differentially expressed genes and SNP markers detected in this study would be useful resources for traits mapping and gene transfer across ecotypes in switchgrass breeding for increased biomass yield for biofuel conversion.
Affiliation(s)
- Desalegn D. Serba
- Forage Improvement Division, The Samuel Roberts Noble Foundation, 2510 Sam Noble Parkway, Ardmore, OK 73401 USA
- Present Address: Agricultural Research Center-Hays, Kansas State University, 1232 240th Avenue, Hays, KS 67601 USA
| | - Srinivasa Rao Uppalapati
- Plant Biology Division, The Samuel Roberts Noble Foundation, Ardmore, OK 73401 USA
- Present Address: DuPont Crop Protection, Stine-Haskell Research Center, Newark, DE 19711 USA
| | - Nick Krom
- Computing Services, The Samuel Roberts Noble Foundation, Ardmore, OK 73401 USA
| | - Shreyartha Mukherjee
- Computing Services, The Samuel Roberts Noble Foundation, Ardmore, OK 73401 USA
- Present Address: Syngenta, Stanton, MN 55018 USA
| | - Yuhong Tang
- Plant Biology Division, The Samuel Roberts Noble Foundation, Ardmore, OK 73401 USA
| | - Kirankumar S. Mysore
- Plant Biology Division, The Samuel Roberts Noble Foundation, Ardmore, OK 73401 USA
| | - Malay C. Saha
- Forage Improvement Division, The Samuel Roberts Noble Foundation, 2510 Sam Noble Parkway, Ardmore, OK 73401 USA
|
14
|
TACO produces robust multisample transcriptome assemblies from RNA-seq. Nat Methods 2016; 14:68-70. [PMID: 27869815 PMCID: PMC5199618 DOI: 10.1038/nmeth.4078] [Citation(s) in RCA: 115] [Impact Index Per Article: 14.4] [Received: 05/17/2016] [Accepted: 10/17/2016] [Indexed: 01/01/2023]
Abstract
Accurate transcript structure and abundance inference from RNA-Seq data is foundational for molecular discovery. Here we present TACO, a computational method to reconstruct a consensus transcriptome from multiple RNA-Seq datasets. TACO employs novel change-point detection to demarcate transcript start and end sites, leading to dramatically improved reconstruction accuracy compared to other tools in its class. The tool is available at http://tacorna.github.io and can be readily incorporated into RNA-Seq analysis workflows.
|
15
|
Zhao J, Song X, Wang K. lncScore: alignment-free identification of long noncoding RNA from assembled novel transcripts. Sci Rep 2016; 6:34838. [PMID: 27708423 PMCID: PMC5052565 DOI: 10.1038/srep34838] [Citation(s) in RCA: 35] [Impact Index Per Article: 4.4] [Received: 07/07/2016] [Accepted: 09/21/2016] [Indexed: 12/21/2022] Open
Abstract
RNA-Seq based transcriptome assembly has been widely used to identify novel lncRNAs. However, the best-performing transcript reconstruction methods merely identified 21% of full-length protein-coding transcripts from H. sapiens. Those partial-length protein-coding transcripts are more likely to be classified as lncRNAs due to their incomplete CDS, leading to a higher false positive rate for lncRNA identification. Furthermore, potential sequencing or assembly errors that gain or abolish stop codons also complicate ORF-based prediction of lncRNAs. Therefore, it remains a challenge to identify lncRNAs from the assembled transcripts, particularly the partial-length ones. Here, we present a novel alignment-free tool, lncScore, which uses a logistic regression model with 11 carefully selected features. Compared to other state-of-the-art alignment-free tools (e.g. CPAT, CNCI, and PLEK), lncScore outperforms them in accurately distinguishing lncRNAs from mRNAs, especially partial-length mRNAs in the human and mouse datasets. In addition, lncScore also performed well on transcripts from five other species (Zebrafish, Fly, C. elegans, Rat, and Sheep). To speed up the prediction, multithreading is implemented within lncScore, and it took only 2 minutes to classify 64,756 transcripts and 54 seconds to train a new model with 21,000 transcripts using 12 threads, which is much faster than other tools. lncScore is available at https://github.com/WGLab/lncScore.
Affiliation(s)
- Jian Zhao
- Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
- Zilkha Neurogenetic Institute, Keck School of Medicine, University of Southern California, Los Angeles, CA 90089, USA
| | - Xiaofeng Song
- Department of Biomedical Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
| | - Kai Wang
- Zilkha Neurogenetic Institute, Keck School of Medicine, University of Southern California, Los Angeles, CA 90089, USA
- Division of Bioinformatics, Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA 90089, USA
- Institute for Genomic Medicine, Columbia University Medical Center, New York, NY 10032, USA
- Department of Biomedical Informatics, Columbia University Medical Center, New York, NY 10032, USA
|
16
|
Plant N. Can a systems approach produce a better understanding of mood disorders? Biochim Biophys Acta Gen Subj 2016; 1861:3335-3344. [PMID: 27565355 DOI: 10.1016/j.bbagen.2016.08.016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/17/2016] [Revised: 07/29/2016] [Accepted: 08/22/2016] [Indexed: 10/21/2022]
Abstract
BACKGROUND One in twenty-five people suffers from a mood disorder. Current treatments are sub-optimal with poor patient response and uncertain modes-of-action. There is thus a need to better understand underlying mechanisms that determine mood, and how these go wrong in affective disorders. Systems biology approaches have yielded important biological discoveries for other complex diseases such as cancer, and their potential in affective disorders will be reviewed. SCOPE OF REVIEW This review will provide a general background to affective disorders, plus an outline of experimental and computational systems biology. The current application of these approaches in understanding affective disorders will be considered, and future recommendations made. MAJOR CONCLUSIONS Experimental systems biology has been applied to the study of affective disorders, especially at the genome and transcriptomic levels. However, data generation has been slowed by a lack of human tissue or suitable animal models. At present, computational systems biology has only been applied to understanding affective disorders on a few occasions. These studies provide sufficient novel biological insight to motivate further use of computational biology in this field. GENERAL SIGNIFICANCE In common with many complex diseases, much time and money has been spent on the generation of large-scale experimental datasets. The next step is to use the emerging computational approaches, predominantly developed in the field of oncology, to leverage the most biological insight from these datasets. This will lead to the critical breakthroughs required for more effective diagnosis, stratification and treatment of affective disorders.
Affiliation(s)
- Nick Plant
- School of Bioscience and Medicine, Faculty of Health and Medical Science, University of Surrey, Guildford GU2 7XH, UK.
|
17
|
Lagarde J, Uszczynska-Ratajczak B, Santoyo-Lopez J, Gonzalez JM, Tapanari E, Mudge JM, Steward CA, Wilming L, Tanzer A, Howald C, Chrast J, Vela-Boza A, Rueda A, Lopez-Domingo FJ, Dopazo J, Reymond A, Guigó R, Harrow J. Extension of human lncRNA transcripts by RACE coupled with long-read high-throughput sequencing (RACE-Seq). Nat Commun 2016; 7:12339. [PMID: 27531712 PMCID: PMC4992054 DOI: 10.1038/ncomms12339] [Citation(s) in RCA: 53] [Impact Index Per Article: 6.6] [Received: 04/20/2016] [Accepted: 06/23/2016] [Indexed: 12/22/2022] Open
Abstract
Long non-coding RNAs (lncRNAs) constitute a large, yet mostly uncharacterized fraction of the mammalian transcriptome. Such characterization requires a comprehensive, high-quality annotation of their gene structure and boundaries, which is currently lacking. Here we describe RACE-Seq, an experimental workflow designed to address this based on RACE (rapid amplification of cDNA ends) and long-read RNA sequencing. We apply RACE-Seq to 398 human lncRNA genes in seven tissues, leading to the discovery of 2,556 on-target, novel transcripts. About 60% of the targeted loci are extended in either 5′ or 3′, often reaching genomic hallmarks of gene boundaries. Analysis of the novel transcripts suggests that lncRNAs are as long, have as many exons and undergo as much alternative splicing as protein-coding genes, contrary to current assumptions. Overall, we show that RACE-Seq is an effective tool to annotate an organism's deep transcriptome, and compares favourably to other targeted sequencing techniques. Long non-coding RNAs are increasingly recognised to be important factors in regulating cellular processes and comprise a large fraction of the transcriptome; however, most are uncharacterised. Here the authors present RACE-Seq, a tool to improve and extend the annotation of low-expression transcripts.
Affiliation(s)
- Julien Lagarde
- Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Dr Aiguader 88, 08003 Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Barbara Uszczynska-Ratajczak
- Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Dr Aiguader 88, 08003 Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | | | | | - Electra Tapanari
- Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1HH, UK
| | - Jonathan M Mudge
- Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1HH, UK
| | - Charles A Steward
- Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1HH, UK
| | - Laurens Wilming
- Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1HH, UK
| | - Andrea Tanzer
- Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Dr Aiguader 88, 08003 Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Cédric Howald
- Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland
| | - Jacqueline Chrast
- Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland
| | - Alicia Vela-Boza
- Genomics and Bioinformatics Platform of Andalusia (GBPA), 41092 Seville, Spain
- Roche Diagnostics, 08174 Sant Cugat Del Vallès, Barcelona, Spain
| | - Antonio Rueda
- Genomics and Bioinformatics Platform of Andalusia (GBPA), 41092 Seville, Spain
| | | | - Joaquin Dopazo
- Genomics and Bioinformatics Platform of Andalusia (GBPA), 41092 Seville, Spain
- Computational Genomics Department, Centro de Investigación Príncipe Felipe, 46012 Valencia, Spain
- Functional Genomics Node (INB), Centro de Investigación Príncipe Felipe, 46012 Valencia, Spain
| | - Alexandre Reymond
- Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland
| | - Roderic Guigó
- Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Dr Aiguader 88, 08003 Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Jennifer Harrow
- Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1HH, UK
|
18
|
Abstract
Mammals have at least 210 histologically diverse cell types (Alberts, Molecular biology of the cell. Garland Science, New York, 2008) and the number would be even higher if functional differences are taken into account. The genome in each of these cell types is differentially programmed to express the specific set of genes needed to fulfill the phenotypical requirements of the cell. Furthermore, in each of these cell types, the gene program can be differentially modulated by exposure to external signals such as hormones or nutrients. The basis for the distinct gene programs relies on cell type-selective activation of transcriptional enhancers, which in turn are particularly sensitive to modulation. Until recently we had only fragmented insight into the regulation of a few of these enhancers; however, the recent advances in high-throughput sequencing technologies have enabled the development of a large number of technologies that can be used to obtain genome-wide insight into how genomes are reprogrammed during development and in response to specific external signals. By applying such technologies, we have begun to reveal the cross-talk between metabolism and the genome, i.e., how genomes are reprogrammed in response to metabolites, and how the regulation of metabolic networks is coordinated at the genomic level.
Affiliation(s)
- Alexander Rauch
- Department of Biochemistry and Molecular Biology, University of Southern Denmark, 5230, Odense M, Denmark
| | - Susanne Mandrup
- Department of Biochemistry and Molecular Biology, University of Southern Denmark, 5230, Odense M, Denmark.
|
19
|
Hong SE, Nho KJ, Song HK, Kim DH. Deep sequencing-generated modules demonstrate coherent expression patterns for various cardiac diseases. Gene 2015; 574:53-60. [PMID: 26232333 DOI: 10.1016/j.gene.2015.07.080] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 03/05/2015] [Revised: 06/19/2015] [Accepted: 07/24/2015] [Indexed: 11/30/2022]
Abstract
As sequencing technology rapidly develops, gene annotations have also become increasingly sophisticated with incorporation of information regarding the temporal-spatial context of alternative splicing patterns, developmental stages, and tissue specificity. The present study aimed to identify the heart-enriched genes based on next-generation sequencing data and to investigate the gene modules demonstrating coherent expression patterns for various cardiac disease-related perturbations. Seven gene modules, including 382 heart-enriched genes, were identified. At least two modules containing differentially expressed genes were experimentally confirmed to be highly sensitive to various cardiac diseases. Transcription factors regulating the gene modules were then analyzed based on knowledgebase information; the expression of eight transcription factors changed significantly during pressure-overload cardiac hypertrophy, suggesting possible regulation of the modules by the identified transcription factors. Collectively, our results contribute to the classification of heart-enriched genes and their modules and would aid in identification of the transcription factors involved in cardiac pathogenesis in the future.
Affiliation(s)
- Seong-Eui Hong
- College of Life Sciences and Systems Biology Research Center, Gwangju Institute of Science and Technology (GIST), 123 Cheomdangwagi-ro, Buk-gu, Gwangju 500-712, Republic of Korea.
| | - Kyoung Jin Nho
- College of Life Sciences and Systems Biology Research Center, Gwangju Institute of Science and Technology (GIST), 123 Cheomdangwagi-ro, Buk-gu, Gwangju 500-712, Republic of Korea.
| | - Hong Ki Song
- College of Life Sciences and Systems Biology Research Center, Gwangju Institute of Science and Technology (GIST), 123 Cheomdangwagi-ro, Buk-gu, Gwangju 500-712, Republic of Korea.
| | - Do Han Kim
- College of Life Sciences and Systems Biology Research Center, Gwangju Institute of Science and Technology (GIST), 123 Cheomdangwagi-ro, Buk-gu, Gwangju 500-712, Republic of Korea.
|
20
|
Mudge JM, Harrow J. Creating reference gene annotation for the mouse C57BL6/J genome assembly. Mamm Genome 2015; 26:366-78. [PMID: 26187010 PMCID: PMC4602055 DOI: 10.1007/s00335-015-9583-x] [Citation(s) in RCA: 168] [Impact Index Per Article: 18.7] [Received: 04/02/2015] [Accepted: 06/18/2015] [Indexed: 12/14/2022]
Abstract
Annotation on the reference genome of the C57BL6/J mouse has been an ongoing project ever since the draft genome was first published. Initially, the principal focus was on the identification of all protein-coding genes, although today the importance of describing long non-coding RNAs, small RNAs, and pseudogenes is recognized. Here, we describe the progress of the GENCODE mouse annotation project, which combines manual annotation from the HAVANA group with Ensembl computational annotation, alongside experimental and in silico validation pipelines from other members of the consortium. We discuss the more recent incorporation of next-generation sequencing datasets into this workflow, including the usage of mass-spectrometry data to potentially identify novel protein-coding genes. Finally, we will outline how the C57BL6/J genebuild can be used to gain insights into the variant sites that distinguish different mouse strains and species.
|
21
|
Ziats MN, Grosvenor LP, Rennert OM. Functional genomics of human brain development and implications for autism spectrum disorders. Transl Psychiatry 2015; 5:e665. [PMID: 26506051 PMCID: PMC4930130 DOI: 10.1038/tp.2015.153] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 02/27/2014] [Revised: 09/03/2015] [Accepted: 09/06/2015] [Indexed: 12/13/2022] Open
Abstract
Transcription of the inherited DNA sequence into copies of messenger RNA is the most fundamental process by which the genome functions to guide development. Encoded sequence information, inherited epigenetic marks and environmental influences all converge at the level of mRNA gene expression to allow for cell-type-specific, tissue-specific, spatial and temporal patterns of expression. Thus, the transcriptome represents a complex interplay between inherited genomic structure, dynamic experiential demands and external signals. This property makes transcriptome studies uniquely positioned to provide insight into complex genetic-epigenetic-environmental processes such as human brain development, and disorders with non-Mendelian genetic etiologies such as autism spectrum disorders. In this review, we describe recent studies exploring the unique functional genomics profile of the human brain during neurodevelopment. We then highlight two emerging areas of research with great potential to increase our understanding of functional neurogenomics: non-coding RNA expression and gene interaction networks. Finally, we review previous functional genomics studies of autism spectrum disorder in this context, and discuss how investigations at the level of functional genomics are beginning to identify convergent molecular mechanisms underlying this genetically heterogeneous disorder.
Affiliation(s)
- M N Ziats
- Laboratory of Clinical and Developmental Genomics, National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD, USA; University of Cambridge, Robinson College, Cambridgeshire, UK; Baylor College of Medicine MSTP, One Baylor Plaza, Houston, TX, USA; Laboratory of Clinical and Developmental Genomics, National Institute of Child Health and Human Development, National Institutes of Health, 49 Convent Drive, Building 49, Room 2C08, Bethesda, MD 20814, USA
- L P Grosvenor
- Pediatrics and Developmental Neuroscience Branch, National Institute of Mental Health, National Institutes of Health, Bethesda, MD, USA
- O M Rennert
- Laboratory of Clinical and Developmental Genomics, National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD, USA
|
22
|
Frankish A, Uszczynska B, Ritchie GRS, Gonzalez JM, Pervouchine D, Petryszak R, Mudge JM, Fonseca N, Brazma A, Guigo R, Harrow J. Comparison of GENCODE and RefSeq gene annotation and the impact of reference geneset on variant effect prediction. BMC Genomics 2015; 16 Suppl 8:S2. [PMID: 26110515 PMCID: PMC4502323 DOI: 10.1186/1471-2164-16-s8-s2] [Citation(s) in RCA: 59] [Impact Index Per Article: 6.6] [Indexed: 01/02/2023] Open
Abstract
Background A vast amount of DNA variation is being identified by increasingly large-scale exome and genome sequencing projects. To be useful, variants require accurate functional annotation and a wide range of tools are available to this end. McCarthy et al recently demonstrated the large differences in prediction of loss-of-function (LoF) variation when RefSeq and Ensembl transcripts are used for annotation, highlighting the importance of the reference transcripts on which variant functional annotation is based. Results We describe a detailed analysis of the similarities and differences between the gene and transcript annotation in the GENCODE and RefSeq genesets. We demonstrate that the GENCODE Comprehensive set is richer in alternative splicing, novel CDSs, novel exons and has higher genomic coverage than RefSeq, while the GENCODE Basic set is very similar to RefSeq. Using RNAseq data we show that exons and introns unique to one geneset are expressed at a similar level to those common to both. We present evidence that the differences in gene annotation lead to large differences in variant annotation where GENCODE and RefSeq are used as reference transcripts, although this is predominantly confined to non-coding transcripts and UTR sequence, with at most ~30% of LoF variants annotated discordantly. We also describe an investigation of dominant transcript expression, showing that it both supports the utility of the GENCODE Basic set in providing a smaller set of more highly expressed transcripts and provides a useful, biologically-relevant filter for further reducing the complexity of the transcriptome. Conclusions The reference transcripts selected for variant functional annotation do have a large effect on the outcome. 
The GENCODE Comprehensive transcripts contain more exons, have greater genomic coverage and capture many more variants than RefSeq in both genome and exome datasets, while the GENCODE Basic set shows a higher degree of concordance with RefSeq and has fewer unique features. We propose that the GENCODE Comprehensive set has great utility for the discovery of new variants with functional potential, while the GENCODE Basic set is more suitable for applications demanding less complex interpretation of functional variants.
|
23
|
Systematic transcriptome analysis reveals tumor-specific isoforms for ovarian cancer diagnosis and therapy. Proc Natl Acad Sci U S A 2015; 112:E3050-7. [PMID: 26015570 PMCID: PMC4466751 DOI: 10.1073/pnas.1508057112] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.0] [Indexed: 02/05/2023] Open
Abstract
Tumor-specific molecules are needed across diverse areas of oncology for use in early detection, diagnosis, prognosis and therapy. Large and growing public databases of transcriptome sequencing data (RNA-seq) derived from tumors and normal tissues hold the potential of yielding tumor-specific molecules, but because the data are new they have not been fully explored for this purpose. We have developed custom bioinformatic algorithms and used them with 296 high-grade serous ovarian (HGS-OvCa) tumor and 1,839 normal RNA-seq datasets to identify mRNA isoforms with tumor-specific expression. We rank prioritized isoforms by likelihood of being expressed in HGS-OvCa tumors and not in normal tissues and analyzed 671 top-ranked isoforms by high-throughput RT-qPCR. Six of these isoforms were expressed in a majority of the 12 tumors examined but not in 18 normal tissues. An additional 11 were expressed in most tumors and only one normal tissue, which in most cases was fallopian or colon. Of the 671 isoforms, the topmost 5% (n = 33) ranked based on having tumor-specific or highly restricted normal tissue expression by RT-qPCR analysis are enriched for oncogenic, stem cell/cancer stem cell, and early development loci, including ETV4, FOXM1, LSR, CD9, RAB11FIP4, and FGFRL1. Many of the 33 isoforms are predicted to encode proteins with unique amino acid sequences, which would allow them to be specifically targeted for one or more therapeutic strategies, including monoclonal antibodies and T-cell-based vaccines. The systematic process described herein is readily and rapidly applicable to the more than 30 additional tumor types for which sufficient amounts of RNA-seq already exist.
|
24
|
Li P, Spolski R, Liao W, Leonard WJ. Complex interactions of transcription factors in mediating cytokine biology in T cells. Immunol Rev 2015; 261:141-56. [PMID: 25123282 DOI: 10.1111/imr.12199] [Citation(s) in RCA: 89] [Impact Index Per Article: 9.9] [Indexed: 12/22/2022]
Abstract
T-helper (Th) cells play critical roles within the mammalian immune system, and the differentiation of naive CD4(+) T cells into distinct T-helper subsets is critical for normal immunoregulation and host defense. These carefully regulated differentiation processes are controlled by networks of cytokines, transcription factors, and epigenetic modifications, resulting in the generation of multiple CD4(+) T-cell subsets, including Th1, Th2, Th9, Th17, Treg, and Tfh cells. In this review, we discuss the roles of transcription factors in determining the specific type of differentiation and in particular the role of interleukin-2 (IL-2) in promoting or inhibiting Th differentiation. In addition to discussing master regulators and subset-specific transcription factors for distinct T-helper cell populations, we focus on signal transducer and activator of transcription (STAT) proteins and on the cooperative action of interferon regulatory factor 4 (IRF4) with activator protein 1 (AP-1) family proteins and STAT3 in the assembly of complexes that broadly influence T-cell differentiation.
Affiliation(s)
- Peng Li
- Laboratory of Molecular Immunology and Immunology Center, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, MD, USA
|
25
|
Cui H, Dhroso A, Johnson N, Korkin D. The variation game: Cracking complex genetic disorders with NGS and omics data. Methods 2015; 79-80:18-31. [PMID: 25944472 DOI: 10.1016/j.ymeth.2015.04.018] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 12/13/2014] [Revised: 03/27/2015] [Accepted: 04/17/2015] [Indexed: 12/14/2022] Open
Abstract
Tremendous advances in Next Generation Sequencing (NGS) and high-throughput omics methods have brought us one step closer towards mechanistic understanding of the complex disease at the molecular level. In this review, we discuss four basic regulatory mechanisms implicated in complex genetic diseases, such as cancer, neurological disorders, heart disease, diabetes, and many others. The mechanisms, including genetic variations, copy-number variations, posttranscriptional variations, and epigenetic variations, can be detected using a variety of NGS methods. We propose that malfunctions detected in these mechanisms are not necessarily independent, since these malfunctions are often found associated with the same disease and targeting the same gene, group of genes, or functional pathway. As an example, we discuss possible rewiring effects of the cancer-associated genetic, structural, and posttranscriptional variations on the protein-protein interaction (PPI) network centered around P53 protein. The review highlights multi-layered complexity of common genetic disorders and suggests that integration of NGS and omics data is a critical step in developing new computational methods capable of deciphering this complexity.
Affiliation(s)
- Hongzhu Cui
- Department of Computer Science, Worcester Polytechnic Institute, 100 Institute Road, Worcester, MA 01609, United States
- Andi Dhroso
- Department of Computer Science, Worcester Polytechnic Institute, 100 Institute Road, Worcester, MA 01609, United States
- Nathan Johnson
- Department of Computer Science, Worcester Polytechnic Institute, 100 Institute Road, Worcester, MA 01609, United States
- Dmitry Korkin
- Department of Computer Science, Worcester Polytechnic Institute, 100 Institute Road, Worcester, MA 01609, United States; Bioinformatics and Computational Biology Program, Worcester Polytechnic Institute, 100 Institute Road, Worcester, MA 01609, United States
|
26
|
Cheng Q, Wang H, Xu B, Zhu S, Hu L, Huang M. Discovery of a novel small secreted protein family with conserved N-terminal IGY motif in Dikarya fungi. BMC Genomics 2014; 15:1151. [PMID: 25526808 PMCID: PMC4367982 DOI: 10.1186/1471-2164-15-1151] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 08/21/2014] [Accepted: 12/12/2014] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND Small secreted proteins (SSPs) are employed by plant pathogenic fungi as essential strategic tools for their successful colonization. SSPs are often species-specific and so far only a few widely phylogenetically distributed SSPs have been identified. RESULTS A novel fungal SSP family consisting of 107 members was identified in the poplar tree fungal pathogen Marssonina brunnea, which accounts for over 17% of its secretome. We named these proteins IGY proteins (IGYPs) based on the conserved three amino acids at the N-terminus. In spite of overall low sequence similarity among IGYPs, they showed conserved N- and C-terminal motifs and a unified gene structure. By RT-PCR-seq, we analyzed the IGYP gene models and validated their expressions as active genes during infection. IGYP homologues were also found in 25 other Dikarya fungal species, all of which shared conserved motifs and the same gene structure. Furthermore, 18 IGYPs from 11 fungi also shared similar genomic contexts. Real-time RT-PCR showed that 8 MbIGYPs were highly expressed in the biotrophic stage. Interestingly, transient assay of 12 MbIGYPs showed that the MbIGYP13 protein induced cell death in resistant poplar clones. CONCLUSIONS In total, 154 IGYPs in 26 fungi of the Dikarya subkingdom were discovered. Gene structure and genomic context analyses indicated that IGYPs originated from a common ancestor. In M. brunnea, the expansion of highly divergent MbIGYPs is possibly associated with a plant-pathogen arms race.
Affiliation(s)
- Qiang Cheng
- Jiangsu Key Laboratory for Poplar Germplasm Enhancement and Variety Improvement, Nanjing Forestry University, Nanjing, 210037 China
- Haoran Wang
- Jiangsu Key Laboratory for Poplar Germplasm Enhancement and Variety Improvement, Nanjing Forestry University, Nanjing, 210037 China
- Bin Xu
- College of Prataculture Science, Nanjing Agricultural University, Nanjing, China
- Sheng Zhu
- Jiangsu Key Laboratory for Poplar Germplasm Enhancement and Variety Improvement, Nanjing Forestry University, Nanjing, 210037 China
- Lanxi Hu
- Jiangsu Key Laboratory for Poplar Germplasm Enhancement and Variety Improvement, Nanjing Forestry University, Nanjing, 210037 China
- Minren Huang
- Jiangsu Key Laboratory for Poplar Germplasm Enhancement and Variety Improvement, Nanjing Forestry University, Nanjing, 210037 China
|
27
|
Hong SE, Song HK, Kim DH. Identification of tissue-enriched novel transcripts and novel exons in mice. BMC Genomics 2014; 15:592. [PMID: 25017872 PMCID: PMC4111849 DOI: 10.1186/1471-2164-15-592] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 04/03/2014] [Accepted: 07/03/2014] [Indexed: 01/27/2023] Open
Abstract
BACKGROUND RNA sequencing (RNA-seq) has revolutionized the detection of transcriptomic signatures due to its high-throughput sequencing ability. Therefore, genomic annotations on different animal species have been rapidly updated using information from tissue-enriched novel transcripts and novel exons. RESULTS 34 putative novel transcripts and 236 putative tissue-enriched exons were identified using RNA-Seq datasets representing six tissues available in mouse databases. RT-PCR results indicated that expression of 21 and 2 novel transcripts were enriched in testes and liver, respectively, while 31 of the 39 selected novel exons were detected in the testes or heart. The novel isoforms containing the identified novel exons exhibited more dominant expression than the known isoforms in heart and testes. We also identified an example of pathology-associated exclusion of heart-enriched novel exons such as Sorbs1 and Cluh during pressure-overload cardiac hypertrophy. CONCLUSION The present study depicted tissue-enriched novel transcripts, a tissue-specific isoform switch, and pathology-associated alternative splicing in a mouse model, suggesting tissue-specific genomic diversity and plasticity.
Affiliation(s)
- Do Han Kim
- School of Life Sciences and Systems Biology Research Center, Gwangju Institute of Science and Technology (GIST), 123 Cheomdangwagi-ro (Oryong-dong), Buk-gu, Gwangju 500-712, Korea.
|
28
|
Abstract
RNA sequencing (RNAseq) samples the majority of expressed genes infrequently, owing to the large size, complex splicing and wide dynamic range of eukaryotic transcriptomes. This results in sparse sequencing coverage that can hinder robust isoform assembly and quantification. RNA capture sequencing (CaptureSeq) addresses this challenge by using oligonucleotide probes to capture selected genes or regions of interest for targeted sequencing. Targeted RNAseq provides enhanced coverage for sensitive gene discovery, robust transcript assembly and accurate gene quantification. Here we describe a detailed protocol for all stages of RNA CaptureSeq, from initial probe design considerations and capture of targeted genes to final assembly and quantification of captured transcripts. Initial probe design and final analysis can take less than 1 d, whereas the central experimental capture stage requires ∼7 d.
|
29
|
Soreq L, Guffanti A, Salomonis N, Simchovitz A, Israel Z, Bergman H, Soreq H. Long non-coding RNA and alternative splicing modulations in Parkinson's leukocytes identified by RNA sequencing. PLoS Comput Biol 2014; 10:e1003517. [PMID: 24651478 PMCID: PMC3961179 DOI: 10.1371/journal.pcbi.1003517] [Citation(s) in RCA: 136] [Impact Index Per Article: 13.6] [Received: 09/17/2013] [Accepted: 01/31/2014] [Indexed: 12/22/2022] Open
Abstract
The continuously prolonged human lifespan is accompanied by increase in neurodegenerative diseases incidence, calling for the development of inexpensive blood-based diagnostics. Analyzing blood cell transcripts by RNA-Seq is a robust means to identify novel biomarkers that rapidly becomes a commonplace. However, there is lack of tools to discover novel exons, junctions and splicing events and to precisely and sensitively assess differential splicing through RNA-Seq data analysis and across RNA-Seq platforms. Here, we present a new and comprehensive computational workflow for whole-transcriptome RNA-Seq analysis, using an updated version of the software AltAnalyze, to identify both known and novel high-confidence alternative splicing events, and to integrate them with both protein-domains and microRNA binding annotations. We applied the novel workflow on RNA-Seq data from Parkinson's disease (PD) patients' leukocytes pre- and post- Deep Brain Stimulation (DBS) treatment and compared to healthy controls. Disease-mediated changes included decreased usage of alternative promoters and N-termini, 5′-end variations and mutually-exclusive exons. The PD regulated FUS and HNRNP A/B included prion-like domains regulated regions. We also present here a workflow to identify and analyze long non-coding RNAs (lncRNAs) via RNA-Seq data. We identified reduced lncRNA expression and selective PD-induced changes in 13 of over 6,000 detected leukocyte lncRNAs, four of which were inversely altered post-DBS. These included the U1 spliceosomal lncRNA and RP11-462G22.1, each entailing sequence complementarity to numerous microRNAs. Analysis of RNA-Seq from PD and unaffected controls brains revealed over 7,000 brain-expressed lncRNAs, of which 3,495 were co-expressed in the leukocytes including U1, which showed both leukocyte and brain increases. 
Furthermore, qRT-PCR validations confirmed these co-increases in PD leukocytes and two brain regions, the amygdala and substantia-nigra, compared to controls. This novel workflow allows deep multi-level inspection of RNA-Seq datasets and provides a comprehensive new resource for understanding disease transcriptome modifications in PD and other neurodegenerative diseases. Long non-coding RNAs (lncRNAs) comprise a novel, fascinating class of RNAs with largely unknown biological functions. Parkinson's-disease (PD) is the most frequent motor disorder, and Deep-brain-stimulation (DBS) treatment alleviates the symptoms, but early disease biomarkers are still unknown and new future genetic interference targets are urgently needed. Using RNA-sequencing technology and a novel computational workflow for in-depth exploration of whole-transcriptome RNA-seq datasets, we detected and analyzed lncRNAs in sequenced libraries from PD patients' leukocytes pre and post-treatment and the brain, adding this full profile resource of over 7,000 lncRNAs to the few human tissues-derived lncRNA datasets that are currently available. Our study includes sample-specific database construction, detecting disease-derived changes in known and novel lncRNAs, exons and junctions and predicting corresponding changes in Polyadenylation choices, protein domains and miRNA binding sites. We report widespread transcript structure variations at the splice junction and exons levels, including novel exons and junctions and alteration of lncRNAs followed by experimental validation in PD leukocytes and two PD brain regions compared with controls. Our results suggest lncRNAs involvement in neurodegenerative diseases, and specifically PD. This comprehensive workflow will be of use to the increasing number of laboratories producing RNA-Seq data in a wide range of biomedical studies.
Affiliation(s)
- Lilach Soreq
- Department of Medical Neurobiology, IMRIC, The Hebrew University-Hadassah Medical School, Jerusalem, Israel
- Alessandro Guffanti
- Department of Biological Chemistry, The Life Sciences Institute, The Hebrew University of Jerusalem, Jerusalem, Israel
- Genomnia srl, Lainate, Milan, Italy
- Nathan Salomonis
- Department of Pediatrics, Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, United States of America
- Zvi Israel
- The Center for Functional and Restorative Neurosurgery, Department of Neurosurgery, Hadassah University Hospital, Jerusalem, Israel
- Hagai Bergman
- Department of Medical Neurobiology, IMRIC, The Hebrew University-Hadassah Medical School, Jerusalem, Israel
- The Edmond and Lily Safra Center for Brain Sciences (ELSC), The Hebrew University of Jerusalem, Jerusalem, Israel
- Hermona Soreq
- Department of Biological Chemistry, The Life Sciences Institute, The Hebrew University of Jerusalem, Jerusalem, Israel
- The Edmond and Lily Safra Center for Brain Sciences (ELSC), The Hebrew University of Jerusalem, Jerusalem, Israel
|
30
|
Eteleeb AM, Flight RM, Harrison BJ, Petruska JC, Rouchka EC. An Island-Based Approach for Differential Expression Analysis. 2013 ACM Conference on Bioinformatics, Computational Biology and Biomedical Informatics: ACM-BCB 2013, Washington, D.C., U.S.A., September 22-25, 2013. 2013; 2013:419-429. [PMID: 25632406 DOI: 10.1145/2506583.2506589] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
Abstract
High-throughput mRNA sequencing (also known as RNA-Seq) promises to be the technique of choice for studying transcriptome profiles. This technique provides the ability to develop precise methodologies for transcript and gene expression quantification, novel transcript and exon discovery, and splice variant detection. One of the limitations of current RNA-Seq methods is the dependency on annotated biological features (e.g. exons, transcripts, genes) to detect expression differences across samples. This restricts the identification of expression levels and the detection of significant changes to known genomic regions. Any significant changes that occur in unannotated regions will not be captured. To overcome this limitation, we developed a novel segmentation approach, Island-Based (IB), for analyzing differential expression in RNA-Seq and targeted sequencing (exome capture) data without specific knowledge of an isoform. The IB segmentation determines individual islands of expression based on windowed read counts that can be compared across experimental conditions to determine differential island expression. In order to detect differentially expressed genes, the island significance values (p-values) are combined using Fisher's method. We tested and evaluated the performance of our approach by comparing it to the existing differentially expressed gene (DEG) methods CuffDiff, DESeq, and edgeR, using two benchmark MAQC RNA-Seq datasets. The IB algorithm outperforms all three methods in both datasets as illustrated by an increased auROC.
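The two ideas named in this abstract, segmenting windowed read counts into islands and combining per-island p-values with Fisher's method, can be illustrated with a minimal Python sketch. This is not the authors' IB implementation: the function names `find_islands` and `fisher_combine`, the `min_count` threshold, and the example numbers are all hypothetical, and the island rule used here (maximal runs of windows at or above a threshold) is only an assumed simplification. Fisher's method itself is standard: for k independent p-values, X = -2 Σ ln p follows a chi-square distribution with 2k degrees of freedom under the null.

```python
import math


def find_islands(window_counts, min_count=1):
    """Return (start, end) window-index pairs, end exclusive, for maximal runs
    of consecutive windows whose read count is at least min_count.
    Hypothetical simplification of an island-calling step."""
    islands = []
    start = None
    for i, count in enumerate(window_counts):
        if count >= min_count:
            if start is None:
                start = i  # open a new island
        elif start is not None:
            islands.append((start, i))  # close the current island
            start = None
    if start is not None:
        islands.append((start, len(window_counts)))
    return islands


def fisher_combine(pvalues):
    """Combine independent p-values with Fisher's method.

    Uses the closed-form chi-square survival function for even degrees of
    freedom 2k: P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # equals X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i  # (x/2)^i / i!, built incrementally
        total += term
    return math.exp(-half_stat) * total


# Illustrative use with made-up numbers: islands in one gene, then a
# gene-level p-value from hypothetical per-island p-values.
islands = find_islands([0, 3, 5, 0, 0, 2], min_count=1)
gene_p = fisher_combine([0.01, 0.01])
```

In practice an established routine such as SciPy's `scipy.stats.combine_pvalues` would normally be preferred over a hand-written combination; the closed form is used here only to keep the sketch dependency-free.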
Affiliation(s)
- Abdallah M Eteleeb
- Department of Computer Engineering and Computer Science, University of Louisville, Louisville, KY, USA
- Robert M Flight
- Department of Chemistry, University of Louisville, Louisville, KY, USA
- Benjamin J Harrison
- Department of Anatomical Sciences and Neurobiology, University of Louisville, Louisville, KY, USA
- Jeffrey C Petruska
- Department of Anatomical Sciences and Neurobiology, University of Louisville, Louisville, KY, USA
- Eric C Rouchka
- Department of Computer Engineering and Computer Science, University of Louisville, Louisville, KY, USA
|
31
|
Barry G. Lamarckian evolution explains human brain evolution and psychiatric disorders. Front Neurosci 2013; 7:224. [PMID: 24324395 PMCID: PMC3840504 DOI: 10.3389/fnins.2013.00224] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Received: 04/25/2013] [Accepted: 11/05/2013] [Indexed: 01/05/2023] Open
Affiliation(s)
- Guy Barry
- Neuroscience Division, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia
|
32
|
Pang CNI, Tay AP, Aya C, Twine NA, Harkness L, Hart-Smith G, Chia SZ, Chen Z, Deshpande NP, Kaakoush NO, Mitchell HM, Kassem M, Wilkins MR. Tools to covisualize and coanalyze proteomic data with genomes and transcriptomes: validation of genes and alternative mRNA splicing. J Proteome Res 2013; 13:84-98. [PMID: 24152167 DOI: 10.1021/pr400820p] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Indexed: 12/19/2022]
Abstract
Direct links between proteomic and genomic/transcriptomic data are not frequently made, partly because of lack of appropriate bioinformatics tools. To help address this, we have developed the PG Nexus pipeline. The PG Nexus allows users to covisualize peptides in the context of genomes or genomic contigs, along with RNA-seq reads. This is done in the Integrated Genome Viewer (IGV). A Results Analyzer reports the precise base position where LC-MS/MS-derived peptides cover genes or gene isoforms, on the chromosomes or contigs where this occurs. In prokaryotes, the PG Nexus pipeline facilitates the validation of genes, where annotation or gene prediction is available, or the discovery of genes using a "virtual protein"-based unbiased approach. We illustrate this with a comprehensive proteogenomics analysis of two strains of Campylobacter concisus. For higher eukaryotes, the PG Nexus facilitates gene validation and supports the identification of mRNA splice junction boundaries and splice variants that are protein-coding. This is illustrated with an analysis of splice junctions covered by human phosphopeptides, and other examples of relevance to the Chromosome-Centric Human Proteome Project. The PG Nexus is open-source and available from https://github.com/IntersectAustralia/ap11_Samifier. It has been integrated into Galaxy and made available in the Galaxy tool shed.
Affiliation(s)
- Chi Nam Ignatius Pang
- Systems Biology Initiative, The University of New South Wales, Sydney, New South Wales 2052, Australia
|
33
|
Abstract
The last decade has seen tremendous effort committed to the annotation of the human genome sequence, most notably perhaps in the form of the ENCODE project. One of the major findings of ENCODE, and other genome analysis projects, is that the human transcriptome is far larger and more complex than previously thought. This complexity manifests, for example, as alternative splicing within protein-coding genes, as well as in the discovery of thousands of long noncoding RNAs. It is also possible that significant numbers of human transcripts have not yet been described by annotation projects, while existing transcript models are frequently incomplete. The question as to what proportion of this complexity is truly functional remains open, however, and this ambiguity presents a serious challenge to genome scientists. In this article, we will discuss the current state of human transcriptome annotation, drawing on our experience gained in generating the GENCODE gene annotation set. We highlight the gaps in our knowledge of transcript functionality that remain, and consider the potential computational and experimental strategies that can be used to help close them. We propose that an understanding of the true overlap between transcriptional complexity and functionality will not be gained in the short term. However, significant steps toward obtaining this knowledge can now be taken by using an integrated strategy, combining all of the experimental resources at our disposal.
Affiliation(s)
- Jonathan M Mudge
- Department of Informatics, Wellcome Trust Sanger Institute, Hinxton CB10 1SA, United Kingdom
|
34
|
|
35
|
Spicuglia S, Maqbool MA, Puthier D, Andrau JC. An update on recent methods applied for deciphering the diversity of the noncoding RNA genome structure and function. Methods 2013; 63:3-17. [DOI: 10.1016/j.ymeth.2013.04.003] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 02/04/2013] [Revised: 04/02/2013] [Accepted: 04/04/2013] [Indexed: 12/17/2022] Open
|
36
|
Zubarev RA. The challenge of the proteome dynamic range and its implications for in-depth proteomics. Proteomics 2013; 13:723-6. [PMID: 23307342 DOI: 10.1002/pmic.201200451] [Citation(s) in RCA: 127] [Impact Index Per Article: 11.5] [Received: 10/03/2012] [Revised: 11/22/2012] [Accepted: 12/04/2012] [Indexed: 12/25/2022]
Abstract
The dynamic range of the cellular proteome approaches seven orders of magnitude, from one copy per cell to ten million copies per cell. Since a proteome's abundance distribution represents a nearly symmetric bell-shaped curve on the logarithmic copy number scale, detection of half of the expressed cellular proteome, i.e. approximately 5000 proteins, should be a relatively straightforward task with modern mass spectrometric instrumentation that exhibits four orders of magnitude of dynamic range, while deeper proteome analysis should be progressively more difficult. Indeed, metaanalysis of 15 recent papers that claim detection of >5000 protein groups reveals that half-proteome analysis currently requires ≈5 h of chromatographic separation, while deeper analyses yield on average ≤20 new proteins per hour of chromatographic gradient. Therefore, a typical proteomics experiment consists of a "high-content" part, with a detection rate of approximately 1000 proteins/h, and a "low-content" tail with a much lower rate of discovery and, respectively, lower cost efficiency. This result calls for disruptive innovation in deep proteomics analysis.
Affiliation(s)
- Roman A Zubarev
- Division of Physiological Chemistry I, Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Stockholm, Sweden.
|
37
|
Luo H, Sun S, Li P, Bu D, Cao H, Zhao Y. Comprehensive characterization of 10,571 mouse large intergenic noncoding RNAs from whole transcriptome sequencing. PLoS One 2013; 8:e70835. [PMID: 23951020 PMCID: PMC3741367 DOI: 10.1371/journal.pone.0070835] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.4] [Received: 03/21/2013] [Accepted: 06/22/2013] [Indexed: 11/18/2022] Open
Abstract
Large intergenic noncoding RNAs (lincRNAs) have been recognized in recent years to constitute a significant portion of the mammalian transcriptome, yet their biological functions remain largely elusive. This is partly due to an incomplete annotation of tissue-specific lincRNAs in essential model organisms, particularly in mice, which has hindered the genetic annotation and functional characterization of these novel transcripts. In this report, we performed ab initio assembly of 1.9 billion tissue-specific RNA-sequencing reads across six tissue types, and identified 3,965 novel expressed lincRNAs in mice. Combining these with 6,606 documented lincRNAs, we established a comprehensive catalog of 10,571 transcribed lincRNAs. We then systemically analyzed all mouse lincRNAs to reveal that some of them are evolutionally conserved and that they exhibit striking tissue-specific expression patterns. We also discovered that mouse lincRNAs carry unique genomic signatures, and that their expression level is correlated with that of neighboring protein-coding transcripts. Finally, we predicted that a large portion of tissue-specific lincRNAs are functionally associated with essential biological processes including the cell cycle and cell development, and that they could play a key role in regulating tissue development and functionality. Our analyses provide a framework for continued discovery and annotation of tissue-specific lincRNAs in model organisms, and our transcribed mouse lincRNA catalog will serve as a roadmap for functional analyses of lincRNAs in genetic mouse models.
Affiliation(s)
- Haitao Luo
- Bioinformatics Research Group, Advanced Computing Research Laboratory, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
- Silong Sun
- Bioinformatics Research Group, Advanced Computing Research Laboratory, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
- Ping Li
- National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- Dechao Bu
- Bioinformatics Research Group, Advanced Computing Research Laboratory, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
- Haiming Cao
- National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- * E-mail: (YZ); (HC)
- Yi Zhao
- Bioinformatics Research Group, Advanced Computing Research Laboratory, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
- * E-mail: (YZ); (HC)
|
38
|
Kim T, Reitmair A. Non-Coding RNAs: Functional Aspects and Diagnostic Utility in Oncology. Int J Mol Sci 2013; 14:4934-68. [PMID: 23455466 PMCID: PMC3634484 DOI: 10.3390/ijms14034934] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Received: 12/10/2012] [Revised: 02/09/2013] [Accepted: 02/18/2013] [Indexed: 02/06/2023] Open
Abstract
Noncoding RNAs (ncRNAs) have been found to have roles in a large variety of biological processes. Recent studies indicate that ncRNAs are far more abundant and important than initially imagined, holding great promise for use in diagnostic, prognostic, and therapeutic applications. Within ncRNAs, microRNAs (miRNAs) are the most widely studied and characterized. They have been implicated in initiation and progression of a variety of human malignancies, including major pathologies such as cancers, arthritis, neurodegenerative disorders, and cardiovascular diseases. Their surprising stability in serum and other bodily fluids led to their rapid ascent as a novel class of biomarkers. For example, several properties of stable miRNAs, and perhaps other classes of ncRNAs, make them good candidate biomarkers for early cancer detection and for determining which preneoplastic lesions are likely to progress to cancer. Of particular interest is the identification of biomarker signatures, which may include traditional protein-based biomarkers, to improve risk assessment, detection, and prognosis. Here, we offer a comprehensive review of the ncRNA biomarker literature and discuss state-of-the-art technologies for their detection. Furthermore, we address the challenges present in miRNA detection and quantification, and outline future perspectives for development of next-generation biodetection assays employing multicolor alternating-laser excitation (ALEX) fluorescence spectroscopy.
Affiliation(s)
- Taiho Kim
- Nesher Technologies, Inc., 2100 W. 3rd St. Los Angeles, CA 90057, USA.
|
39
|
Bidirectional promoters as important drivers for the emergence of species-specific transcripts. PLoS One 2013; 8:e57323. [PMID: 23460838 PMCID: PMC3583895 DOI: 10.1371/journal.pone.0057323] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Received: 09/06/2012] [Accepted: 01/21/2013] [Indexed: 11/23/2022] Open
Abstract
The diversification of gene functions has been largely attributed to the process of gene duplication. Novel examples of genes originating from previously untranscribed regions have been recently described without regard to a unifying functional mechanism for their emergence. Here we propose a model mechanism that could generate a large number of lineage-specific novel transcripts in vertebrates through the activation of bidirectional transcription from unidirectional promoters. We examined this model in silico using human transcriptomic and genomic data and identified evidence consistent with the emergence of more than 1,000 primate-specific transcripts. These are transcripts with low coding potential and virtually no functional annotation. They initiate at less than 1 kb upstream of an oppositely transcribed conserved protein coding gene, in agreement with the generally accepted definition of bidirectional promoters. We found that the genomic regions upstream of ancestral promoters, where the novel transcripts in our dataset reside, are characterized by preferential accumulation of transposable elements. This enhances the sequence diversity of regions located upstream of ancestral promoters, further highlighting their evolutionary importance for the emergence of transcriptional novelties. By applying a newly developed test for positive selection to transposable element-derived fragments in our set of novel transcripts, we found evidence of adaptive evolution in the human lineage in nearly 3% of the novel transcripts in our dataset. These findings indicate that at least some novel transcripts could become functionally relevant, and thus highlight the evolutionary importance of promoters, through their capacity for bidirectional transcription, for the emergence of novel genes.
|
40
|
Abstract
This issue of Genome Research presents new results, methods, and tools from The ENCODE Project (ENCyclopedia of DNA Elements), which collectively represents an important step in moving beyond a parts list of the genome and promises to shape the future of genomic research. This collection sheds light on basic biological questions and frames the current debate over the optimization of tools and methodological challenges necessary to compare and interpret large complex data sets focused on how the genome is organized and regulated. In a number of instances, the authors have highlighted the strengths and limitations of current computational and technical approaches, providing the community with useful standards, which should stimulate development of new tools. In many ways, these papers will ripple through the scientific community, as those in pursuit of understanding the “regulatory genome” will heavily traverse the maps and tools. Similarly, the work should have a substantive impact on how genetic variation contributes to specific diseases and traits by providing a compendium of functional elements for follow-up study. The success of these papers should not only be measured by the scope of the scientific insights and tools but also by their ability to attract new talent to mine existing and future data.
Affiliation(s)
- Stephen Chanock
- Laboratory of Translational Genomics, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Advanced Technology Center, Bethesda, Maryland 20892-4605, USA.
|
41
|
Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL, Barrell D, Zadissa A, Searle S, Barnes I, Bignell A, Boychenko V, Hunt T, Kay M, Mukherjee G, Rajan J, Despacio-Reyes G, Saunders G, Steward C, Harte R, Lin M, Howald C, Tanzer A, Derrien T, Chrast J, Walters N, Balasubramanian S, Pei B, Tress M, Rodriguez JM, Ezkurdia I, van Baren J, Brent M, Haussler D, Kellis M, Valencia A, Reymond A, Gerstein M, Guigó R, Hubbard TJ. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res 2012; 22:1760-74. [PMID: 22955987 PMCID: PMC3431492 DOI: 10.1101/gr.135350.111] [Citation(s) in RCA: 3249] [Impact Index Per Article: 295.4] [Indexed: 02/07/2023]
Abstract
The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two exon models indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers.
Affiliation(s)
- Jennifer Harrow
- Wellcome Trust Sanger Institute, Wellcome Trust Campus, Hinxton, Cambridge CB10 1SA, United Kingdom.
|
42
|
Abstract
In its first production phase, The ENCODE Project Consortium (ENCODE) has generated thousands of genome-scale data sets, resulting in a genomic “parts list” that encompasses transcripts, sites of transcription factor binding, and other functional features that now number in the millions of distinct elements. These data are reshaping many long-held beliefs concerning the information content of the human and other complex genomes, including the very definition of the gene. Here I discuss and place in context many of the leading findings of ENCODE, as well as trends that are shaping the generation and interpretation of ENCODE data. Finally, I consider prospects for the future, including maximizing the accuracy, completeness, and utility of ENCODE data for the community.
Affiliation(s)
- John A Stamatoyannopoulos
- Departments of Genome Sciences and Medicine, University of Washington School of Medicine, Seattle, Washington 98195, USA.
|
43
|
Affiliation(s)
- Kelly A Frazer
- Moores UCSD Cancer Center, Department of Pediatrics and Rady Children's Hospital, University of California at San Diego, La Jolla, California 92093, USA.
|
44
|
Abstract
The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.
|