1
Jousheghani ZZ, Patro R. Oarfish: Enhanced probabilistic modeling leads to improved accuracy in long read transcriptome quantification. bioRxiv 2024:2024.02.28.582591. [PMID: 38464200 PMCID: PMC10925290 DOI: 10.1101/2024.02.28.582591] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/12/2024]
Abstract
Motivation: Long read sequencing technology is becoming an increasingly indispensable tool in genomic and transcriptomic analysis. In transcriptomics in particular, long reads offer the possibility of sequencing full-length isoforms, which can vastly simplify the identification of novel transcripts and transcript quantification. However, despite this promise, the focus of much long read method development to date has been on transcript identification, with comparatively little attention paid to quantification. Yet, due to differences in the underlying protocols and technologies, lower throughput (i.e. fewer reads sequenced per sample compared to short read technologies), as well as technical artifacts, long read quantification remains a challenge, motivating the continued development and assessment of quantification methods tailored to this increasingly prevalent type of data. Results: We introduce a new method and software tool for long read transcript quantification called oarfish. Our model incorporates a novel coverage score, which affects the conditional probability of fragment assignment in the underlying probabilistic model. We demonstrate that by accounting for this coverage information, oarfish is able to produce more accurate quantification estimates than existing long read quantification methods, particularly when one considers the primary isoforms present in a particular cell line or tissue type. Availability and Implementation: Oarfish is implemented in the Rust programming language, and is made available as free and open-source software under the BSD 3-clause license. The source code is available at https://www.github.com/COMBINE-lab/oarfish.
Affiliation(s)
- Zahra Zare Jousheghani
  - Department of Electrical and Computer Engineering, University of Maryland, College Park, 20742, Maryland, USA
- Rob Patro
  - Department of Computer Science, University of Maryland, College Park, 20742, Maryland, USA
2
Deshpande D, Chhugani K, Chang Y, Karlsberg A, Loeffler C, Zhang J, Muszyńska A, Munteanu V, Yang H, Rotman J, Tao L, Balliu B, Tseng E, Eskin E, Zhao F, Mohammadi P, Łabaj PP, Mangul S. RNA-seq data science: From raw data to effective interpretation. Front Genet 2023; 14:997383. [PMID: 36999049 PMCID: PMC10043755 DOI: 10.3389/fgene.2023.997383] [Citation(s) in RCA: 18] [Impact Index Per Article: 18.0] [Received: 07/18/2022] [Accepted: 02/24/2023] [Indexed: 03/14/2023] Open
Abstract
RNA sequencing (RNA-seq) has become an exemplary technology in modern biology and clinical science. Its immense popularity is due in large part to the continuous efforts of the bioinformatics community to develop accurate and scalable computational tools to analyze the enormous amounts of transcriptomic data that it produces. RNA-seq analysis enables genes and their corresponding transcripts to be probed for a variety of purposes, such as detecting novel exons or whole transcripts, assessing expression of genes and alternative transcripts, and studying alternative splicing structure. It can be a challenge, however, to obtain meaningful biological signals from raw RNA-seq data because of the enormous scale of the data as well as the inherent limitations of different sequencing technologies, such as amplification bias or biases of library preparation. The need to overcome these technical challenges has pushed the rapid development of novel computational tools, which have evolved and diversified in accordance with technological advancements, leading to the current myriad of RNA-seq tools. These tools, combined with the diverse computational skill sets of biomedical researchers, help to unlock the full potential of RNA-seq. The purpose of this review is to explain basic concepts in the computational analysis of RNA-seq data and define discipline-specific jargon.
Affiliation(s)
- Dhrithi Deshpande
  - Department of Pharmacology and Pharmaceutical Sciences, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, United States
- Karishma Chhugani
  - Department of Pharmacology and Pharmaceutical Sciences, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, United States
- Yutong Chang
  - Department of Pharmacology and Pharmaceutical Sciences, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, United States
- Aaron Karlsberg
  - Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, United States
- Caitlin Loeffler
  - Department of Computer Science, University of California, Los Angeles, CA, United States
- Jinyang Zhang
  - Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing, China
- Agata Muszyńska
  - Małopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland
  - Institute of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
- Viorel Munteanu
  - Department of Computers, Informatics and Microelectronics, Technical University of Moldova, Chisinau, Moldova
- Harry Yang
  - Department of Microbiology, Immunology and Molecular Genetics, University of California Los Angeles, Los Angeles, CA, United States
- Jeremy Rotman
  - Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, United States
- Laura Tao
  - Department of Computational Medicine, David Geffen School of Medicine at UCLA, CHS, Los Angeles, CA, United States
- Brunilda Balliu
  - Department of Computational Medicine, David Geffen School of Medicine at UCLA, CHS, Los Angeles, CA, United States
- Eleazar Eskin
  - Department of Computer Science, University of California, Los Angeles, CA, United States
  - Department of Computational Medicine, David Geffen School of Medicine at UCLA, CHS, Los Angeles, CA, United States
  - Department of Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, CA, United States
- Fangqing Zhao
  - Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing, China
  - Key Laboratory of Systems Biology, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou, China
- Pejman Mohammadi
  - Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, United States
- Paweł P. Łabaj
  - Małopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland
  - Department of Biotechnology, Boku University Vienna, Vienna, Austria
- Serghei Mangul
  - Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, United States
  - Department of Quantitative and Computational Biology, USC Dornsife College of Letters, Arts and Sciences, Los Angeles, CA, United States
  - *Correspondence: Serghei Mangul,
3
Sibbesen JA, Eizenga JM, Novak AM, Sirén J, Chang X, Garrison E, Paten B. Haplotype-aware pantranscriptome analyses using spliced pangenome graphs. Nat Methods 2023; 20:239-247. [PMID: 36646895 DOI: 10.1101/2021.03.26.437240] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 06/18/2021] [Accepted: 11/28/2022] [Indexed: 05/24/2023]
Abstract
Pangenomics is emerging as a powerful computational paradigm in bioinformatics. This field uses population-level genome reference structures, typically consisting of a sequence graph, to mitigate reference bias and facilitate analyses that were challenging with previous reference-based methods. In this work, we extend these methods into transcriptomics to analyze sequencing data using the pantranscriptome: a population-level transcriptomic reference. Our toolchain, which consists of additions to the VG toolkit and a standalone tool, RPVG, can construct spliced pangenome graphs, map RNA sequencing data to these graphs, and perform haplotype-aware expression quantification of transcripts in a pantranscriptome. We show that this workflow improves accuracy over state-of-the-art RNA sequencing mapping methods, and that it can efficiently quantify haplotype-specific transcript expression without needing to characterize the haplotypes of a sample beforehand.
Affiliation(s)
- Adam M Novak
  - UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Jouni Sirén
  - UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Xian Chang
  - UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Erik Garrison
  - University of Tennessee Health Science Center, Memphis, TN, USA
4
Terrón-Camero LC, Gordillo-González F, Salas-Espejo E, Andrés-León E. Comparison of Metagenomics and Metatranscriptomics Tools: A Guide to Making the Right Choice. Genes (Basel) 2022; 13:2280. [PMID: 36553546 PMCID: PMC9777648 DOI: 10.3390/genes13122280] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 10/29/2022] [Revised: 11/28/2022] [Accepted: 12/01/2022] [Indexed: 12/09/2022] Open
Abstract
The study of microorganisms is a field of great interest due to their environmental (e.g., soil contamination) and biomedical (e.g., parasitic diseases, autism) importance. The advent of revolutionary next-generation sequencing techniques, and their application to the hypervariable regions of the 16S, 18S or 23S ribosomal subunits, have enabled more in-depth study of a large variety of organisms, including bacteria, archaea, eukaryotes and fungi. Additionally, together with the development of analysis software, the creation of specific databases (e.g., SILVA or RDP) has boosted the enormous growth of these studies. As the cost of sequencing per sample has continuously decreased, new protocols have also emerged, such as shotgun sequencing, which allows the profiling of all taxonomic domains in a sample. The sequencing of hypervariable regions and shotgun sequencing are technologies that enable the taxonomic classification of microorganisms from the DNA present in microbial communities. However, they are not capable of measuring what is actively expressed. Conversely, we advocate that metatranscriptomics is a "new" technology that makes the identification of the mRNAs of a microbial community possible, quantifying gene expression levels and active biological pathways. Furthermore, it can also be used to characterise symbiotic interactions between the host and its microbiome. In this manuscript, we examine the three technologies above, and discuss the implementation of different software and databases, which greatly affects the reliability of the results. Finally, we have developed two easy-to-use pipelines leveraging Nextflow technology. These aim to provide everything required for an average user to perform a metagenomic analysis of marker genes with QIIME2 and a metatranscriptomic study using Kraken2/Bracken.
Affiliation(s)
- Laura C. Terrón-Camero
  - Bioinformatics Unit, Institute of Parasitology and Biomedicine “López-Neyra”, CSIC (IPBLN-CSIC), 18016 Granada, Spain
- Fernando Gordillo-González
  - Bioinformatics Unit, Institute of Parasitology and Biomedicine “López-Neyra”, CSIC (IPBLN-CSIC), 18016 Granada, Spain
- Eduardo Salas-Espejo
  - Department of Biochemistry and Molecular Biology, Faculty of Sciences, University of Granada, 18071 Granada, Spain
- Eduardo Andrés-León
  - Bioinformatics Unit, Institute of Parasitology and Biomedicine “López-Neyra”, CSIC (IPBLN-CSIC), 18016 Granada, Spain
5
Baaijens JA, Zulli A, Ott IM, Nika I, van der Lugt MJ, Petrone ME, Alpert T, Fauver JR, Kalinich CC, Vogels CBF, Breban MI, Duvallet C, McElroy KA, Ghaeli N, Imakaev M, Mckenzie-Bennett MF, Robison K, Plocik A, Schilling R, Pierson M, Littlefield R, Spencer ML, Simen BB, Hanage WP, Grubaugh ND, Peccia J, Baym M. Lineage abundance estimation for SARS-CoV-2 in wastewater using transcriptome quantification techniques. Genome Biol 2022; 23:236. [PMID: 36348471 PMCID: PMC9643916 DOI: 10.1186/s13059-022-02805-9] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 10/03/2021] [Accepted: 10/25/2022] [Indexed: 11/09/2022] Open
Abstract
Effectively monitoring the spread of SARS-CoV-2 mutants is essential to efforts to counter the ongoing pandemic. Predicting lineage abundance from wastewater, however, is technically challenging. We show that by sequencing SARS-CoV-2 RNA in wastewater and applying algorithms initially used for transcriptome quantification, we can estimate lineage abundance in wastewater samples. We find high variability in signal among individual samples, but the overall trends match those observed from sequencing clinical samples. Thus, while clinical sequencing remains a more sensitive technique for population surveillance, wastewater sequencing can be used to monitor trends in mutant prevalence in situations where clinical sequencing is unavailable.
Affiliation(s)
- Jasmijn A Baaijens
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
  - Department of Intelligent Systems, Delft University of Technology, Delft, Netherlands
- Alessandro Zulli
  - Department of Chemical and Environmental Engineering, Yale University, New Haven, CT, USA
- Isabel M Ott
  - Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
- Ioanna Nika
  - Department of Intelligent Systems, Delft University of Technology, Delft, Netherlands
- Mart J van der Lugt
  - Department of Intelligent Systems, Delft University of Technology, Delft, Netherlands
- Mary E Petrone
  - Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
- Tara Alpert
  - Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
- Joseph R Fauver
  - Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
  - Department of Epidemiology, University of Nebraska Medical Center, Omaha, NE, USA
- Chaney C Kalinich
  - Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
- Chantal B F Vogels
  - Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
- Mallery I Breban
  - Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
- William P Hanage
  - Center for Communicable Disease Dynamics and Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Nathan D Grubaugh
  - Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
  - Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, USA
- Jordan Peccia
  - Department of Chemical and Environmental Engineering, Yale University, New Haven, CT, USA
- Michael Baym
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
6
Qi W, Fu H, Luo X, Ren Y, Liu X, Dai H, Zheng Q, Liang F. Electroacupuncture at PC6 (Neiguan) Attenuates Angina Pectoris in Rats with Myocardial Ischemia-Reperfusion Injury Through Regulating the Alternative Splicing of the Major Inhibitory Neurotransmitter Receptor GABRG2. J Cardiovasc Transl Res 2022; 15:1176-1191. [PMID: 35377129 DOI: 10.1007/s12265-022-10245-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/26/2021] [Accepted: 03/25/2022] [Indexed: 11/27/2022]
Abstract
Angina pectoris is the most common manifestation of coronary heart disease, causing suffering in patients. Electroacupuncture at PC6 can effectively alleviate angina by regulating the expression of genes, but whether the alternative splicing (AS) of genes is affected by acupuncture is still unknown. We established a rat model of myocardial ischemia-reperfusion (I/R) by coronary artery ligation and confirmed that electroacupuncture alleviated the abnormal discharge caused by angina pectoris, as measured by electromyography (EMG). Analysis of the GSE61840 dataset established that AS events were altered after I/R and regulated by electroacupuncture. I/R decreased the expression of the splicing factor Nova1, while electroacupuncture rescued it. Further experiments in dorsal root ganglion cells showed that Nova1 regulated the AS of GABRG2, specifically at its exon 9, where an important phosphorylation site is present. In vivo results also showed that electroacupuncture can restore the AS of GABRG2. Our results proved that electroacupuncture alleviates angina by regulating alternative splicing.
Affiliation(s)
- Wenchuan Qi
  - Chengdu University of Traditional Chinese Medicine, Chengdu, 610075, Sichuan, China
- Hongjuan Fu
  - Chengdu University of Traditional Chinese Medicine, Chengdu, 610075, Sichuan, China
- Xinye Luo
  - Chengdu University of Traditional Chinese Medicine, Chengdu, 610075, Sichuan, China
- Yanrong Ren
  - Chengdu University of Traditional Chinese Medicine, Chengdu, 610075, Sichuan, China
  - Shanxi University of Traditional Chinese Medicine, Jinzhong, 030002, Shanxi, China
- Xueying Liu
  - Chengdu University of Traditional Chinese Medicine, Chengdu, 610075, Sichuan, China
  - Shanxi University of Traditional Chinese Medicine, Jinzhong, 030002, Shanxi, China
- Hongyuan Dai
  - College of Life Sciences, Sichuan University, Chengdu, 610065, Sichuan, China
- Qianhua Zheng
  - Chengdu University of Traditional Chinese Medicine, Chengdu, 610075, Sichuan, China
- Fanrong Liang
  - Chengdu University of Traditional Chinese Medicine, Chengdu, 610075, Sichuan, China
7
Schaap-Johansen AL, Vujović M, Borch A, Hadrup SR, Marcatili P. T Cell Epitope Prediction and Its Application to Immunotherapy. Front Immunol 2021; 12:712488. [PMID: 34603286 PMCID: PMC8479193 DOI: 10.3389/fimmu.2021.712488] [Citation(s) in RCA: 33] [Impact Index Per Article: 11.0] [Received: 05/20/2021] [Accepted: 07/12/2021] [Indexed: 12/13/2022] Open
Abstract
T cells play a crucial role in controlling and driving the immune response with their ability to discriminate peptides derived from healthy as well as pathogenic proteins. In this review, we focus on the currently available computational tools for epitope prediction, with a particular focus on tools aimed at identifying neoepitopes, i.e. cancer-specific peptides and their potential for use in immunotherapy for cancer treatment. This review will cover how these tools work, what kind of data they use, as well as pros and cons in their respective applications.
Affiliation(s)
- Milena Vujović
  - Department of Health Technology, Technical University of Denmark, Lyngby, Denmark
- Annie Borch
  - Department of Health Technology, Technical University of Denmark, Lyngby, Denmark
- Sine Reker Hadrup
  - Department of Health Technology, Technical University of Denmark, Lyngby, Denmark
- Paolo Marcatili
  - Department of Health Technology, Technical University of Denmark, Lyngby, Denmark
8
Alser M, Rotman J, Deshpande D, Taraszka K, Shi H, Baykal PI, Yang HT, Xue V, Knyazev S, Singer BD, Balliu B, Koslicki D, Skums P, Zelikovsky A, Alkan C, Mutlu O, Mangul S. Technology dictates algorithms: recent developments in read alignment. Genome Biol 2021; 22:249. [PMID: 34446078 PMCID: PMC8390189 DOI: 10.1186/s13059-021-02443-7] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Received: 10/15/2020] [Accepted: 07/28/2021] [Indexed: 01/08/2023] Open
Abstract
Aligning sequencing reads onto a reference is an essential step of the majority of genomic analysis pipelines. Computational algorithms for read alignment have evolved in accordance with technological advances, leading to today's diverse array of alignment methods. We provide a systematic survey of algorithmic foundations and methodologies across 107 alignment methods, for both short and long reads. We provide a rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on speed and efficiency of read alignment. We discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology.
Affiliation(s)
- Mohammed Alser
  - Computer Science Department, ETH Zürich, 8092, Zürich, Switzerland
  - Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
  - Information Technology and Electrical Engineering Department, ETH Zürich, Zürich, 8092, Switzerland
- Jeremy Rotman
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Dhrithi Deshpande
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, 90089, USA
- Kodi Taraszka
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Huwenbo Shi
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
- Pelin Icer Baykal
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- Harry Taegyun Yang
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
  - Bioinformatics Interdepartmental Ph.D. Program, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Victor Xue
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Sergey Knyazev
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- Benjamin D Singer
  - Division of Pulmonary and Critical Care Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
  - Department of Biochemistry & Molecular Genetics, Northwestern University Feinberg School of Medicine, Chicago, USA
  - Simpson Querrey Institute for Epigenetics, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
- Brunilda Balliu
  - Department of Computational Medicine, University of California Los Angeles, Los Angeles, CA, 90095, USA
- David Koslicki
  - Computer Science and Engineering, Pennsylvania State University, University Park, PA, 16801, USA
  - Biology Department, Pennsylvania State University, University Park, PA, 16801, USA
  - The Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA, 16801, USA
- Pavel Skums
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- Alex Zelikovsky
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
  - The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, 119991, Russia
- Can Alkan
  - Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
  - Bilkent-Hacettepe Health Sciences and Technologies Program, Ankara, Turkey
- Onur Mutlu
  - Computer Science Department, ETH Zürich, 8092, Zürich, Switzerland
  - Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
  - Information Technology and Electrical Engineering Department, ETH Zürich, Zürich, 8092, Switzerland
- Serghei Mangul
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, 90089, USA
9
Knyazev S, Tsyvina V, Shankar A, Melnyk A, Artyomenko A, Malygina T, Porozov YB, Campbell EM, Switzer WM, Skums P, Mangul S, Zelikovsky A. Accurate assembly of minority viral haplotypes from next-generation sequencing through efficient noise reduction. Nucleic Acids Res 2021; 49:e102. [PMID: 34214168 PMCID: PMC8464054 DOI: 10.1093/nar/gkab576] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 10/09/2020] [Revised: 05/25/2021] [Accepted: 06/18/2021] [Indexed: 12/21/2022] Open
Abstract
Rapidly evolving RNA viruses continuously produce minority haplotypes that can become dominant if they are drug-resistant or can better evade the immune system. Therefore, early detection and identification of minority viral haplotypes may help to promptly adjust the patient’s treatment plan, preventing potential disease complications. Minority haplotypes can be identified using next-generation sequencing, but sequencing noise hinders accurate identification. The elimination of sequencing noise is a non-trivial task that remains open. Here we propose CliqueSNV, a method based on extracting pairs of statistically linked mutations from noisy reads. This effectively reduces sequencing noise and enables the identification of minority haplotypes with frequencies below the sequencing error rate. We comparatively assess the performance of CliqueSNV using an in vitro mixture of nine haplotypes that were derived from the mutation profile of an existing HIV patient. We show that CliqueSNV can accurately assemble viral haplotypes with frequencies as low as 0.1% and maintains consistent performance across short- and long-read sequencing platforms.
Affiliation(s)
- Sergey Knyazev
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
  - Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
  - Oak Ridge Institute for Science and Education, Oak Ridge, TN 37830, USA
- Viachaslau Tsyvina
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Anupama Shankar
  - Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- Andrew Melnyk
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Tatiana Malygina
  - International Scientific and Research Institute of Bioengineering, ITMO University, St. Petersburg 197101, Russia
- Yuri B Porozov
  - World-Class Research Center "Digital biodesign and personalized healthcare", I.M. Sechenov First Moscow State Medical University, Moscow 119991, Russia
  - Department of Computational Biology, Sirius University of Science and Technology, Sochi 354340, Russia
- Ellsworth M Campbell
  - Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- William M Switzer
  - Division of HIV Prevention, Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
- Pavel Skums
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
- Serghei Mangul
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA 90089, USA
- Alex Zelikovsky
  - Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA
  - World-Class Research Center "Digital biodesign and personalized healthcare", I.M. Sechenov First Moscow State Medical University, Moscow 119991, Russia
10
Hu Y, Fang L, Chen X, Zhong JF, Li M, Wang K. LIQA: long-read isoform quantification and analysis. Genome Biol 2021; 22:182. [PMID: 34140043 PMCID: PMC8212471 DOI: 10.1186/s13059-021-02399-8] [Citation(s) in RCA: 45] [Impact Index Per Article: 15.0] [Received: 09/18/2020] [Accepted: 06/04/2021] [Indexed: 11/10/2022] Open
Abstract
Long-read RNA sequencing (RNA-seq) technologies can sequence full-length transcripts, facilitating the exploration of isoform-specific gene expression over short-read RNA-seq. We present LIQA to quantify isoform expression and detect differential alternative splicing (DAS) events using long-read direct mRNA sequencing or cDNA sequencing data. LIQA incorporates base pair quality score and isoform-specific read length information in a survival model to assign different weights across reads, and uses an expectation-maximization algorithm for parameter estimation. We apply LIQA to long-read RNA-seq data from the Universal Human Reference, acute myeloid leukemia, and esophageal squamous epithelial cells and demonstrate its high accuracy in profiling alternative splicing events.
Affiliation(s)
- Yu Hu
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
- Li Fang
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
- Xuelian Chen
  - Department of Otolaryngology, Keck School of Medicine, University of Southern California, Los Angeles, CA, 90033, USA
- Jiang F Zhong
  - Department of Otolaryngology, Keck School of Medicine, University of Southern California, Los Angeles, CA, 90033, USA
- Mingyao Li
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Kai Wang
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
  - Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
11
Simoneau J, Gosselin R, Scott MS. Factorial study of the RNA-seq computational workflow identifies biases as technical gene signatures. NAR Genom Bioinform 2021; 2:lqaa043. [PMID: 33575596 PMCID: PMC7671328 DOI: 10.1093/nargab/lqaa043] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/30/2020] [Revised: 05/15/2020] [Accepted: 06/05/2020] [Indexed: 12/12/2022] Open
Abstract
RNA-seq is a modular experimental and computational approach aimed at identifying and quantifying RNA molecules. The modularity of the RNA-seq technology enables adaptation of the protocol to develop new ways to explore RNA biology, but this modularity also brings forth the importance of methodological thoroughness. Liberty of approach comes with the responsibility of choices, and such choices must be informed. Here, we present an approach that identifies gene group-specific quantification biases in current RNA-seq software and references by processing datasets using diverse RNA-seq computational pipelines, and by decomposing these expression datasets with an independent component analysis matrix factorization method. By exploring the RNA-seq pipeline using this systemic approach, we identify genome annotations as a design choice that affects quantification results to the same extent as the choice of aligners and quantifiers. We also show that the different choices in RNA-seq methodology are not independent, identifying interactions between genome annotations and quantification software. Genes were mainly affected by differences in their sequence, by overlapping genes and by genes with similar sequence. Our approach offers an explanation for the observed biases by identifying the common features used differently by the software and references, therefore providing leads for the betterment of RNA-seq methodology.
Affiliation(s)
- Joël Simoneau
- Department of Biochemistry and Functional Genomics, Faculty of Medicine and Health Sciences, Université de Sherbrooke, Sherbrooke, Québec, J1K 2R1, Canada
- Ryan Gosselin
- Department of Chemical & Biotechnological Engineering, Faculty of Engineering, Université de Sherbrooke, Sherbrooke, Québec, J1K 2R1, Canada
- Michelle S Scott
- Department of Biochemistry and Functional Genomics, Faculty of Medicine and Health Sciences, Université de Sherbrooke, Sherbrooke, Québec, J1K 2R1, Canada

12
Hounkpe BW, Chenou F, de Lima F, De Paula E. HRT Atlas v1.0 database: redefining human and mouse housekeeping genes and candidate reference transcripts by mining massive RNA-seq datasets. Nucleic Acids Res 2021; 49:D947-D955. [PMID: 32663312 PMCID: PMC7778946 DOI: 10.1093/nar/gkaa609] [Citation(s) in RCA: 111] [Impact Index Per Article: 37.0] [Received: 06/19/2020] [Accepted: 07/08/2020] [Indexed: 12/18/2022] Open
Abstract
Housekeeping (HK) genes are constitutively expressed genes that are required for the maintenance of basic cellular functions. Despite their importance in the calibration of gene expression, as well as the understanding of many genomic and evolutionary features, important discrepancies have been observed in studies that previously identified these genes. Here, we present the Housekeeping and Reference Transcript Atlas (HRT Atlas v1.0, www.housekeeping.unicamp.br), a web-based database that addresses some of the previously observed limitations in the identification of these genes, and offers a more accurate database of human and mouse HK genes and transcripts. The database was generated by mining massive human and mouse RNA-seq data sets, including 11 281 and 507 high-quality RNA-seq samples from 52 human non-disease tissues/cells and 14 healthy tissues/cells of C57BL/6 wild type mouse, respectively. Users can visualize the expression and download lists of 2158 human HK transcripts from 2176 HK genes and 3024 mouse HK transcripts from 3277 mouse HK genes. HRT Atlas also offers the most stable and suitable tissue-selective candidate reference transcripts for normalization of qPCR experiments. Specific primers and predicted modifiers of gene expression for some of these HK transcripts are also proposed. HRT Atlas has also been integrated with a regulatory elements resource from the Epiregio server.
Affiliation(s)
- Francine Chenou
- School of Medical Sciences, University of Campinas, Campinas, SP, Brazil
- Franciele de Lima
- School of Medical Sciences, University of Campinas, Campinas, SP, Brazil
- Erich Vinicius De Paula
- School of Medical Sciences, University of Campinas, Campinas, SP, Brazil
- Hematology and Hemotherapy Center, University of Campinas, Campinas, SP, Brazil

13
Melsted P, Ntranos V, Pachter L. The barcode, UMI, set format and BUStools. Bioinformatics 2020; 35:4472-4473. [PMID: 31073610 DOI: 10.1093/bioinformatics/btz279] [Citation(s) in RCA: 76] [Impact Index Per Article: 19.0] [Received: 11/26/2018] [Revised: 02/15/2019] [Accepted: 04/13/2019] [Indexed: 11/13/2022] Open
Abstract
SUMMARY We introduce the Barcode-UMI-Set format (BUS) for representing pseudoalignments of reads from single-cell RNA-seq experiments. The format can be used with all single-cell RNA-seq technologies, and we show that BUS files can be efficiently generated. BUStools is a suite of tools for working with BUS files and facilitates rapid quantification and analysis of single-cell RNA-seq data. The BUS format therefore makes possible the development of modular, technology-specific and robust workflows for single-cell RNA-seq analysis. AVAILABILITY AND IMPLEMENTATION http://BUStools.github.io/ and http://pachterlab.github.io/kallisto/singlecell.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Páll Melsted
- Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavik, Iceland
- Vasilis Ntranos
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
- Lior Pachter
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA; Department of Computing & Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA

14
Deschamps-Francoeur G, Simoneau J, Scott MS. Handling multi-mapped reads in RNA-seq. Comput Struct Biotechnol J 2020; 18:1569-1576. [PMID: 32637053 PMCID: PMC7330433 DOI: 10.1016/j.csbj.2020.06.014] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 02/29/2020] [Revised: 06/06/2020] [Accepted: 06/07/2020] [Indexed: 11/07/2022] Open
Abstract
Many eukaryotic genomes harbour large numbers of duplicated sequences, of diverse biotypes, resulting from several mechanisms including recombination, whole genome duplication and retro-transposition. Such repeated sequences complicate gene/transcript quantification during RNA-seq analysis due to reads mapping to more than one locus, sometimes involving genes embedded in other genes. Genes of different biotypes have dissimilar levels of sequence duplication, with long-noncoding RNAs and messenger RNAs sharing less sequence similarity to other genes than biotypes encoding shorter RNAs. Many strategies have been elaborated to handle these multi-mapped reads, resulting in increased accuracy in gene/transcript quantification, although separate tools are typically used to estimate the abundance of short and long genes due to their dissimilar characteristics. This review discusses the mechanisms leading to sequence duplication, the biotypes affected, the computational strategies employed to deal with multi-mapped reads and the challenges that still remain to be overcome.
Affiliation(s)
- Gabrielle Deschamps-Francoeur
- Département de Biochimie et Génomique Fonctionnelle, Faculté de médecine et des sciences de la santé, Université de Sherbrooke, Sherbrooke, QC J1E 4K8, Canada
- Joël Simoneau
- Département de Biochimie et Génomique Fonctionnelle, Faculté de médecine et des sciences de la santé, Université de Sherbrooke, Sherbrooke, QC J1E 4K8, Canada
- Michelle S. Scott
- Département de Biochimie et Génomique Fonctionnelle, Faculté de médecine et des sciences de la santé, Université de Sherbrooke, Sherbrooke, QC J1E 4K8, Canada

15
Lachmann A, Clarke DJB, Torre D, Xie Z, Ma'ayan A. Interoperable RNA-Seq analysis in the cloud. BIOCHIMICA ET BIOPHYSICA ACTA-GENE REGULATORY MECHANISMS 2020; 1863:194521. [PMID: 32156561 DOI: 10.1016/j.bbagrm.2020.194521] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 07/02/2019] [Revised: 03/01/2020] [Accepted: 03/01/2020] [Indexed: 12/25/2022]
Abstract
RNA-Sequencing (RNA-Seq) is currently the leading technology for genome-wide transcript quantification. Mapping the raw reads to transcript and gene level counts can be achieved by different aligners. Here we report an in-depth comparison of transcript quantification methods. Our goal is the specific use of cost-efficient RNA-Seq analysis for deployment in a cloud infrastructure composed of interacting microservices. The individual modules cover file transfer into the cloud and APIs to handle the cloud alignment jobs. We next demonstrate how newly generated RNA-Seq data can be placed in the context of thousands of previously published datasets in near real time. With in-depth benchmarks, we identify suitable gene count quantification methods to facilitate cost-effective, accurate, and cloud-based RNA-Seq analysis service. Pseudo-alignment algorithms such as kallisto and Salmon combine high read quality estimation with cost efficient runtime performance. HISAT2 is the fastest of the classical aligners with good alignment quality. This article is part of a Special Issue entitled: Transcriptional Profiles and Regulatory Gene Networks edited by Dr. Federico Manuel Giorgi and Dr. Shaun Mahony.
Affiliation(s)
- Alexander Lachmann
- Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1603, New York, NY 10029, USA; Library of Integrated Network-based Cellular Signatures, Data Coordination and Integration Center (BD2K-LINCS DCIC), USA; Knowledge Management Center for Illuminating the Druggable Genome (KMC-IDG), USA.
- Daniel J B Clarke
- Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1603, New York, NY 10029, USA; Library of Integrated Network-based Cellular Signatures, Data Coordination and Integration Center (BD2K-LINCS DCIC), USA; Knowledge Management Center for Illuminating the Druggable Genome (KMC-IDG), USA
- Denis Torre
- Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1603, New York, NY 10029, USA; Library of Integrated Network-based Cellular Signatures, Data Coordination and Integration Center (BD2K-LINCS DCIC), USA; Knowledge Management Center for Illuminating the Druggable Genome (KMC-IDG), USA
- Zhuorui Xie
- Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1603, New York, NY 10029, USA; Library of Integrated Network-based Cellular Signatures, Data Coordination and Integration Center (BD2K-LINCS DCIC), USA
- Avi Ma'ayan
- Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1603, New York, NY 10029, USA; Library of Integrated Network-based Cellular Signatures, Data Coordination and Integration Center (BD2K-LINCS DCIC), USA; Knowledge Management Center for Illuminating the Druggable Genome (KMC-IDG), USA

16
Zheng H, Brennan K, Hernaez M, Gevaert O. Benchmark of long non-coding RNA quantification for RNA sequencing of cancer samples. Gigascience 2019; 8:giz145. [PMID: 31808800 PMCID: PMC6897288 DOI: 10.1093/gigascience/giz145] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 04/08/2019] [Revised: 09/30/2019] [Accepted: 11/15/2019] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND Long non-coding RNAs (lncRNAs) are emerging as important regulators of various biological processes. While many studies have exploited public resources such as RNA sequencing (RNA-Seq) data in The Cancer Genome Atlas to study lncRNAs in cancer, it is crucial to choose the optimal method for accurate expression quantification. RESULTS In this study, we compared the performance of pseudoalignment methods Kallisto and Salmon, alignment-based transcript quantification method RSEM, and alignment-based gene quantification methods HTSeq and featureCounts, in combination with read aligners STAR, Subread, and HISAT2, in lncRNA quantification, by applying them to both un-stranded and stranded RNA-Seq datasets. Full transcriptome annotation, including protein-coding and non-coding RNAs, greatly improves the specificity of lncRNA expression quantification. Pseudoalignment methods and RSEM outperform HTSeq and featureCounts for lncRNA quantification at both sample- and gene-level comparison, regardless of RNA-Seq protocol type, choice of aligners, and transcriptome annotation. Pseudoalignment methods and RSEM detect more lncRNAs and correlate highly with simulated ground truth. On the contrary, HTSeq and featureCounts often underestimate lncRNA expression. Antisense lncRNAs are poorly quantified by alignment-based gene quantification methods, which can be improved using stranded protocols and pseudoalignment methods. CONCLUSIONS Considering the consistency with ground truth and computational resources, pseudoalignment methods Kallisto or Salmon in combination with full transcriptome annotation is our recommended strategy for RNA-Seq analysis for lncRNAs.
Affiliation(s)
- Hong Zheng
- Stanford Center for Biomedical Informatics Research, Department of Medicine, Stanford University, 1265 Welch Road, Stanford, 94305, CA, USA
- Kevin Brennan
- Stanford Center for Biomedical Informatics Research, Department of Medicine, Stanford University, 1265 Welch Road, Stanford, 94305, CA, USA
- Mikel Hernaez
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, 1206 W. Gregory Dr, Urbana, 61805, IL, USA
- Olivier Gevaert
- Stanford Center for Biomedical Informatics Research, Department of Medicine, Stanford University, 1265 Welch Road, Stanford, 94305, CA, USA
- Department of Biomedical Data Science, Stanford University, 1265 Welch Road, Stanford, 94305, CA, USA

17
Malik L, Almodaresi F, Patro R. Grouper: graph-based clustering and annotation for improved de novo transcriptome analysis. Bioinformatics 2019; 34:3265-3272. [PMID: 29746620 DOI: 10.1093/bioinformatics/bty378] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 11/20/2017] [Accepted: 05/03/2018] [Indexed: 11/14/2022] Open
Abstract
Motivation De novo transcriptome analysis using RNA-seq offers a promising means to study gene expression in non-model organisms. Yet, the difficulty of transcriptome assembly means that the contigs provided by the assembler often represent a fractured and incomplete view of the transcriptome, complicating downstream analysis. We introduce Grouper, a new method for clustering contigs from de novo assemblies that are likely to belong to the same transcripts and genes; these groups can subsequently be analyzed more robustly. When provided with access to the genome of a related organism, Grouper can transfer annotations to the de novo assembly, further improving the clustering. Results On de novo assemblies from four different species, we show that Grouper is able to accurately cluster a larger number of contigs than the existing state-of-the-art method. The Grouper pipeline is able to map greater than 10% more reads against the contigs, leading to accurate downstream differential expression analyses. The labeling module, in the presence of a closely related annotated genome, can efficiently transfer annotations to the contigs and use this information to further improve clustering. Overall, Grouper provides a complete and efficient pipeline for processing de novo transcriptomic assemblies. Availability and implementation The Grouper software is freely available at https://github.com/COMBINE-lab/grouper under the 2-clause BSD license. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Laraib Malik
- Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
- Fatemeh Almodaresi
- Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
- Rob Patro
- Department of Computer Science, Stony Brook University, Stony Brook, NY, USA

18
Nomoto Y, Kubota Y, Ohnishi Y, Kasahara K, Tomita A, Oshime T, Yamashita H, Fahmi M, Ito M. Gene Cascade Finder: A tool for identification of gene cascades and its application in Caenorhabditis elegans. PLoS One 2019; 14:e0215187. [PMID: 31504044 PMCID: PMC6736238 DOI: 10.1371/journal.pone.0215187] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 03/23/2019] [Accepted: 08/06/2019] [Indexed: 11/24/2022] Open
Abstract
Obtaining a comprehensive understanding of the gene regulatory networks, or gene cascades, involved in cell fate determination and cell lineage segregation in Caenorhabditis elegans is a long-standing challenge. Although RNA-sequencing (RNA-Seq) is a promising technique to resolve these questions, the bioinformatics tools to identify associated gene cascades from RNA-Seq data remain inadequate. To overcome these limitations, we developed Gene Cascade Finder (GCF) as a novel tool for building gene cascades by comparison of mutant and wild-type RNA-Seq data along with integrated information of protein-protein interactions, expression timing, and domains. Application of GCF to RNA-Seq data confirmed that SPN-4 and MEX-3 regulate the canonical Wnt pathway during embryonic development. Moreover, lin-35, hsp-3, and gpa-12 were found to be involved in MEX-1-dependent neurogenesis, and MEX-3 was found to control the gene cascade promoting neurogenesis through lin-35 and apl-1. Thus, GCF could be a useful tool for building gene cascades from RNA-Seq data.
Affiliation(s)
- Yusuke Nomoto
- Advanced Life Sciences Program, Graduate School of Life Sciences, Ritsumeikan University, Kusatsu, Shiga, Japan
- Yukihiro Kubota
- Department of Bioinformatics, College of Life Sciences, Ritsumeikan University, Kusatsu, Shiga, Japan
- Yuto Ohnishi
- Advanced Life Sciences Program, Graduate School of Life Sciences, Ritsumeikan University, Kusatsu, Shiga, Japan
- Kota Kasahara
- Department of Bioinformatics, College of Life Sciences, Ritsumeikan University, Kusatsu, Shiga, Japan
- Aimi Tomita
- Advanced Life Sciences Program, Graduate School of Life Sciences, Ritsumeikan University, Kusatsu, Shiga, Japan
- Takehiro Oshime
- Advanced Life Sciences Program, Graduate School of Life Sciences, Ritsumeikan University, Kusatsu, Shiga, Japan
- Hiroki Yamashita
- Advanced Life Sciences Program, Graduate School of Life Sciences, Ritsumeikan University, Kusatsu, Shiga, Japan
- Muhamad Fahmi
- Department of Bioinformatics, College of Life Sciences, Ritsumeikan University, Kusatsu, Shiga, Japan
- Masahiro Ito
- Advanced Life Sciences Program, Graduate School of Life Sciences, Ritsumeikan University, Kusatsu, Shiga, Japan
- Department of Bioinformatics, College of Life Sciences, Ritsumeikan University, Kusatsu, Shiga, Japan

19
Arefeen A, Liu J, Xiao X, Jiang T. TAPAS: tool for alternative polyadenylation site analysis. Bioinformatics 2019; 34:2521-2529. [PMID: 30052912 DOI: 10.1093/bioinformatics/bty110] [Citation(s) in RCA: 44] [Impact Index Per Article: 8.8] [Received: 04/17/2017] [Accepted: 02/22/2018] [Indexed: 01/08/2023] Open
Abstract
Motivation The length of the 3' untranslated region (3' UTR) of an mRNA is essential for many biological activities such as mRNA stability, sub-cellular localization, protein translation, protein binding and translation efficiency. Moreover, correlation between diseases and the shortening (or lengthening) of 3' UTRs has been reported in the literature. This length is largely determined by the polyadenylation cleavage site in the mRNA. As alternative polyadenylation (APA) sites are common in mammalian genes, several tools have been published recently for detecting APA sites from RNA-Seq data or performing shortening/lengthening analysis. These tools consider either up to only two APA sites in a gene or only APA sites that occur in the last exon of a gene, although a gene may generally have more than two APA sites and an APA site may sometimes occur before the last exon. Furthermore, the tools are unable to integrate the analysis of shortening/lengthening events with APA site detection. Results We propose a new tool, called TAPAS, for detecting novel APA sites from RNA-Seq data. It can deal with more than two APA sites in a gene as well as APA sites that occur before the last exon. The tool is based on an existing method for finding change points in time series data, but some filtration techniques are also adopted to remove change points that are likely false APA sites. It is then extended to identify APA sites that are expressed differently between two biological samples and genes that contain 3' UTRs with shortening/lengthening events. Our extensive experiments on simulated and real RNA-Seq data demonstrate that TAPAS outperforms the existing tools for APA site detection or shortening/lengthening analysis significantly. Availability and implementation https://github.com/arefeen/TAPAS. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ashraful Arefeen
- Department of Computer Science and Engineering, University of California, Riverside, CA, USA
- Juntao Liu
- School of Mathematics, Shandong University, Jinan, Shandong, China
- Xinshu Xiao
- Department of Integrative Biology and Physiology, University of California, Los Angeles, CA, USA
- Tao Jiang
- Department of Computer Science and Engineering, University of California, Riverside, CA, USA; Institute of Integrative Genome Biology, University of California, Riverside, CA, USA; MOE Key Lab of Bioinformatics and Bioinformatics Division, TNLIST/Department of Computer Science and Technology, Tsinghua University, Beijing, China

20
Raghupathy N, Choi K, Vincent MJ, Beane GL, Sheppard KS, Munger SC, Korstanje R, Pardo-Manuel de Villena F, Churchill GA. Hierarchical analysis of RNA-seq reads improves the accuracy of allele-specific expression. Bioinformatics 2019; 34:2177-2184. [PMID: 29444201 DOI: 10.1093/bioinformatics/bty078] [Citation(s) in RCA: 47] [Impact Index Per Article: 9.4] [Received: 09/01/2017] [Accepted: 02/09/2018] [Indexed: 02/06/2023] Open
Abstract
Motivation Allele-specific expression (ASE) refers to the differential abundance of the allelic copies of a transcript. RNA sequencing (RNA-seq) can provide quantitative estimates of ASE for genes with transcribed polymorphisms. When short-read sequences are aligned to a diploid transcriptome, read-mapping ambiguities confound our ability to directly count reads. Multi-mapping reads aligning equally well to multiple genomic locations, isoforms or alleles can comprise the majority (>85%) of reads. Discarding them can result in biases and substantial loss of information. Methods have been developed that use weighted allocation of read counts but these methods treat the different types of multi-reads equivalently. We propose a hierarchical approach to allocation of read counts that first resolves ambiguities among genes, then among isoforms, and lastly between alleles. We have implemented our model in EMASE software (Expectation-Maximization for Allele Specific Expression) to estimate total gene expression, isoform usage and ASE based on this hierarchical allocation. Results Methods that align RNA-seq reads to a diploid transcriptome incorporating known genetic variants improve estimates of ASE and total gene expression compared to methods that use reference genome alignments. Weighted allocation methods outperform methods that discard multi-reads. Hierarchical allocation of reads improves estimation of ASE even when data are simulated from a non-hierarchical model. Analysis of RNA-seq data from F1 hybrid mice using EMASE reveals widespread ASE associated with cis-acting polymorphisms and a small number of parent-of-origin effects. Availability and implementation EMASE software is available at https://github.com/churchill-lab/emase. Supplementary information Supplementary data are available at Bioinformatics online.

21
Van den Berge K, Hembach KM, Soneson C, Tiberi S, Clement L, Love MI, Patro R, Robinson MD. RNA Sequencing Data: Hitchhiker's Guide to Expression Analysis. Annu Rev Biomed Data Sci 2019. [DOI: 10.1146/annurev-biodatasci-072018-021255] [Citation(s) in RCA: 71] [Impact Index Per Article: 14.2] [Indexed: 11/09/2022]
Abstract
Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies, but also agricultural or clinical situations. In the past 10 years or so, much has been learned about the characteristics of the RNA-seq data sets, as well as the performance of the myriad of methods developed. In this review, we give an overview of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on the quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.
Affiliation(s)
- Koen Van den Berge
- Bioinformatics Institute Ghent and Department of Applied Mathematics, Computer Science and Statistics, Ghent University, 9000 Ghent, Belgium
- Katharina M. Hembach
- Institute of Molecular Life Sciences and SIB Swiss Institute of Bioinformatics, University of Zurich, 8057 Zurich, Switzerland
- Charlotte Soneson
- Institute of Molecular Life Sciences and SIB Swiss Institute of Bioinformatics, University of Zurich, 8057 Zurich, Switzerland
- Simone Tiberi
- Institute of Molecular Life Sciences and SIB Swiss Institute of Bioinformatics, University of Zurich, 8057 Zurich, Switzerland
- Lieven Clement
- Bioinformatics Institute Ghent and Department of Applied Mathematics, Computer Science and Statistics, Ghent University, 9000 Ghent, Belgium
- Michael I. Love
- Department of Biostatistics and Department of Genetics, University of North Carolina, Chapel Hill, North Carolina 27514, USA
- Rob Patro
- Department of Computer Science, Stony Brook University, Stony Brook, New York 11794, USA
- Mark D. Robinson
- Institute of Molecular Life Sciences and SIB Swiss Institute of Bioinformatics, University of Zurich, 8057 Zurich, Switzerland

22
Karimzadeh M, Ernst C, Kundaje A, Hoffman MM. Umap and Bismap: quantifying genome and methylome mappability. Nucleic Acids Res 2019; 46:e120. [PMID: 30169659 PMCID: PMC6237805 DOI: 10.1093/nar/gky677] [Citation(s) in RCA: 60] [Impact Index Per Article: 12.0] [Received: 06/14/2017] [Accepted: 07/22/2018] [Indexed: 11/14/2022] Open
Abstract
Short-read sequencing enables assessment of genetic and biochemical traits of individual genomic regions, such as the location of genetic variation, protein binding and chemical modifications. Every region in a genome assembly has a property called 'mappability', which measures the extent to which it can be uniquely mapped by sequence reads. In regions of lower mappability, estimates of genomic and epigenomic characteristics from sequencing assays are less reliable. These regions have increased susceptibility to spurious mapping from reads from other regions of the genome with sequencing errors or unexpected genetic variation. Bisulfite sequencing approaches used to identify DNA methylation exacerbate these problems by introducing large numbers of reads that map to multiple regions. Both to correct assumptions of uniformity in downstream analysis and to identify regions where the analysis is less reliable, it is necessary to know the mappability of both ordinary and bisulfite-converted genomes. We introduce the Umap software for identifying uniquely mappable regions of any genome. Its Bismap extension identifies mappability of the bisulfite-converted genome. A Umap and Bismap track hub for human genome assemblies GRCh37/hg19 and GRCh38/hg38, and mouse assemblies GRCm37/mm9 and GRCm38/mm10 is available at https://bismap.hoffmanlab.org for use with genome browsers.
Affiliation(s)
- Mehran Karimzadeh
- Princess Margaret Cancer Centre, M5G 1L7, Toronto, ON, Canada; Department of Medical Biophysics, M5G 1L7, University of Toronto, Toronto, ON, Canada; Vector Institute, M5G 1M1, Toronto, ON, Canada
- Carl Ernst
- Department of Human Genetics, McGill University, H3A 0C7, Montreal, QC, Canada
- Anshul Kundaje
- Department of Genetics, Stanford University, 94305-9025, Stanford, CA, USA; Department of Computer Science, Stanford University, 94305-5120, Stanford, CA, USA
- Michael M Hoffman
- Princess Margaret Cancer Centre, M5G 1L7, Toronto, ON, Canada; Department of Medical Biophysics, M5G 1L7, University of Toronto, Toronto, ON, Canada; Vector Institute, M5G 1M1, Toronto, ON, Canada; Department of Computer Science, University of Toronto, M5S 2E4, Toronto, ON, Canada

23
Duan JE, Jiang ZC, Alqahtani F, Mandoiu I, Dong H, Zheng X, Marjani SL, Chen J, Tian XC. Methylome Dynamics of Bovine Gametes and in vivo Early Embryos. Front Genet 2019; 10:512. [PMID: 31191619 PMCID: PMC6546829 DOI: 10.3389/fgene.2019.00512] [Citation(s) in RCA: 38] [Impact Index Per Article: 7.6] [Received: 03/20/2019] [Accepted: 05/10/2019] [Indexed: 01/12/2023] Open
Abstract
DNA methylation undergoes drastic fluctuation during early mammalian embryogenesis. The dynamics of global DNA methylation in bovine embryos, however, have mostly been studied by immunostaining. We adopted the whole genome bisulfite sequencing (WGBS) method to characterize stage-specific genome-wide DNA methylation in bovine sperm, immature oocytes, oocytes matured in vivo and in vitro, as well as in vivo developed single embryos at the 2-, 4-, 8-, and 16-cell stages. We found that the major wave of genome-wide DNA demethylation was complete by the 8-cell stage when de novo methylation became prominent. Sperm and oocytes were differentially methylated in numerous regions (DMRs), which were primarily intergenic, suggesting that these non-coding regions may play important roles in gamete specification. DMRs were also identified between in vivo and in vitro matured oocytes, suggesting environmental effects on epigenetic modifications. In addition, virtually no (less than 1.5%) DNA methylation was found in mitochondrial DNA. Finally, by using RNA-seq data generated from embryos at the same developmental stages, we revealed a weak inverse correlation between gene expression and promoter methylation. This comprehensive analysis provides insight into the critical features of the bovine embryo methylome, and serves as an important reference for embryos produced in vitro, such as by in vitro fertilization and cloning. Lastly, these data can also provide a model for the epigenetic dynamics in human early embryos.
Collapse
Affiliation(s)
- Jingyue Ellie Duan
- Department of Animal Science, University of Connecticut, Storrs, CT, United States
| | - Zongliang Carl Jiang
- School of Animal Science, AgCenter, Louisiana State University, Baton Rouge, LA, United States
| | - Fahad Alqahtani
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, United States
| | - Ion Mandoiu
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, United States
| | - Hong Dong
- Institute of Animal Science, Xinjiang Academy of Animal Sciences, Ürümqi, China
| | - Xinbao Zheng
- Institute of Animal Science, Xinjiang Academy of Animal Sciences, Ürümqi, China
| | - Sadie L Marjani
- Department of Biology, Central Connecticut State University, New Britain, CT, United States
| | - Jingbo Chen
- Institute of Animal Science, Xinjiang Academy of Animal Sciences, Ürümqi, China
| | - Xiuchun Cindy Tian
- Department of Animal Science, University of Connecticut, Storrs, CT, United States
|
24
|
Mangul S, Martin LS, Hill BL, Lam AKM, Distler MG, Zelikovsky A, Eskin E, Flint J. Systematic benchmarking of omics computational tools. Nat Commun 2019; 10:1393. [PMID: 30918265 PMCID: PMC6437167 DOI: 10.1038/s41467-019-09406-4] [Citation(s) in RCA: 86] [Impact Index Per Article: 17.2] [Received: 07/23/2018] [Accepted: 03/06/2019] [Indexed: 01/11/2023] Open
Abstract
Computational omics methods packaged as software have become essential to modern biological research. The increasing dependence of scientists on these powerful software tools creates a need for systematic assessment of these methods, known as benchmarking. Adopting a standardized benchmarking practice could help researchers who use omics data to better leverage recent technological innovations. Our review summarizes benchmarking practices from 25 recent studies and discusses the challenges, advantages, and limitations of benchmarking across various domains of biology. We also propose principles that can make computational biology benchmarking studies more sustainable and reproducible, ultimately increasing the transparency of biomedical data and results. Benchmarking studies are important for comprehensively understanding and evaluating different computational omics methods. Here, the authors review practices from 25 recent studies and propose principles to improve the quality of benchmarking studies.
Affiliation(s)
- Serghei Mangul
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA, 90095, USA.,Institute for Quantitative and Computational Biosciences, University of California Los Angeles, 611 Charles E Young Drive East, Los Angeles, CA, 90095, USA.
| | - Lana S Martin
- Institute for Quantitative and Computational Biosciences, University of California Los Angeles, 611 Charles E Young Drive East, Los Angeles, CA, 90095, USA
| | - Brian L Hill
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA, 90095, USA
| | - Angela Ka-Mei Lam
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA, 90095, USA
| | - Margaret G Distler
- Department of Psychiatry and Biobehavioral Sciences, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, 90095, USA
| | - Alex Zelikovsky
- Department of Computer Science, Georgia State University, Atlanta, GA, 30303, USA.,The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, 119991, Russia
| | - Eleazar Eskin
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA, 90095, USA.,Department of Human Genetics, University of California Los Angeles, 695 Charles E. Young, Los Angeles, CA, USA
| | - Jonathan Flint
- Department of Psychiatry and Biobehavioral Sciences, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, 90095, USA
|
25
|
Abstract
One of the most notable challenges in single cell RNA-Seq data analysis is the so-called drop-out effect, where only a fraction of the transcriptome of each cell is captured. The random nature of dropouts, however, makes it possible to consider imputation methods as a means of correcting for dropouts. In this article, we study some existing single cell RNA sequencing (scRNA-Seq) imputation methods and propose a novel iterative imputation approach based on efficiently computing highly similar cells. We then present the results of a comprehensive assessment of existing and proposed methods on real scRNA-Seq data sets with varying per cell sequencing depth.
Affiliation(s)
- Marmar Moussa
- Computer Science and Engineering Department, University of Connecticut, Storrs, Connecticut
| | - Ion I Măndoiu
- Computer Science and Engineering Department, University of Connecticut, Storrs, Connecticut
|
26
|
McCurdy SR, Ntranos V, Pachter L. Deterministic column subset selection for single-cell RNA-Seq. PLoS One 2019; 14:e0210571. [PMID: 30682053 PMCID: PMC6347249 DOI: 10.1371/journal.pone.0210571] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2018] [Accepted: 12/26/2018] [Indexed: 12/02/2022] Open
Abstract
Analysis of single-cell RNA sequencing (scRNA-Seq) data often involves filtering out uninteresting or poorly measured genes and dimensionality reduction to reduce noise and simplify data visualization. However, techniques such as principal components analysis (PCA) fail to preserve non-negativity and sparsity structures present in the original matrices, and the coordinates of projected cells are not easily interpretable. Commonly used thresholding methods to filter genes avoid those pitfalls, but ignore collinearity and covariance in the original matrix. We show that a deterministic column subset selection (DCSS) method possesses many of the favorable properties of common thresholding methods and PCA, while avoiding pitfalls from both. We derive new spectral bounds for DCSS. We apply DCSS to two measures of gene expression from two scRNA-Seq experiments with different clustering workflows, and compare to three thresholding methods. In each case study, the clusters based on the small subset of the complete gene expression profile selected by DCSS are similar to clusters produced from the full set. The resulting clusters are informative for cell type.
Affiliation(s)
- Shannon R. McCurdy
- California Institute for Quantitative Biosciences, University of California Berkeley, Berkeley, California, United States of America
| | - Vasilis Ntranos
- Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, California, United States of America
| | - Lior Pachter
- Division of Biology and Biological Engineering, Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, California, United States of America
|
27
|
A discriminative learning approach to differential expression analysis for single-cell RNA-seq. Nat Methods 2019; 16:163-166. [PMID: 30664774 DOI: 10.1038/s41592-018-0303-9] [Citation(s) in RCA: 89] [Impact Index Per Article: 17.8] [Received: 07/06/2018] [Accepted: 12/13/2018] [Indexed: 12/16/2022]
Abstract
Single-cell RNA-seq makes it possible to characterize the transcriptomes of cell types across different conditions and to identify their transcriptional signatures via differential analysis. Our method detects changes in transcript dynamics and in overall gene abundance in large numbers of cells to determine differential expression. When applied to transcript compatibility counts obtained via pseudoalignment, our approach provides a quantification-free analysis of 3' single-cell RNA-seq that can identify previously undetectable marker genes.
|
28
|
Duan JE, Flock K, Jue N, Zhang M, Jones A, Seesi SA, Mandoiu I, Pillai S, Hoffman M, O'Neill R, Zinn S, Govoni K, Reed S, Jiang H, Jiang ZC, Tian XC. Dosage Compensation and Gene Expression of the X Chromosome in Sheep. G3 (BETHESDA, MD.) 2019; 9:305-314. [PMID: 30482800 PMCID: PMC6325915 DOI: 10.1534/g3.118.200815] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 10/17/2018] [Accepted: 11/26/2018] [Indexed: 12/20/2022]
Abstract
Ohno's hypothesis predicts that the expression of the single X chromosome in males needs compensatory upregulation to balance its dosage with that of the diploid autosomes. Additionally, X chromosome inactivation ensures that quadruple expression of the two X chromosomes is avoided in females. These mechanisms have been actively studied in mice and humans, but research lags behind in domestic species. Using RNA sequencing data, we analyzed X chromosome upregulation in sheep fetal tissues from day 135 of gestation under control, over or restricted maternal diets (100%, 140% and 60% of National Research Council Total Digestible Nutrients), and in conceptuses, juvenile, and adult somatic tissues. By computing the mean expression ratio of all X-linked genes to all autosomal genes (X:A), we found that all samples displayed some level of X chromosome upregulation. The degrees of X upregulation did not differ significantly (P-value = 0.74) between ovine females and males in the same somatic tissues. Brain, however, displayed complete X upregulation. Interestingly, the male and female reproduction-related tissues exhibited divergent X dosage upregulation. Moreover, expression upregulation of the X chromosome in fetal tissues was not affected by maternal diets. Maternal nutrition, however, did change expression levels of several X-linked genes, such as the sex determination genes SOX3 and NR0B1. In summary, our results showed that X chromosome upregulation occurred in nearly all sheep somatic tissues analyzed, thus supporting Ohno's hypothesis in a new species. However, the levels of upregulation differed among subgroups of genes, such as those that are house-keeping and "dosage-sensitive".
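The X:A statistic described in this abstract can be illustrated with a minimal sketch. It assumes the ratio is taken as the mean expression of X-linked genes divided by the mean expression of autosomal genes; the abstract does not state the exact formula or any expression filter, so the function name `xa_ratio`, the chromosome labels, and the exclusion of "Y" and "MT" are illustrative assumptions rather than the authors' pipeline.

```python
from statistics import mean


def xa_ratio(expression, chromosome_of):
    """Return mean X-linked expression divided by mean autosomal expression.

    expression    -- dict mapping gene id to an expression value (e.g. TPM)
    chromosome_of -- dict mapping gene id to a chromosome label
    Assumption: "X" marks X-linked genes; "Y" and "MT" are neither
    X-linked nor autosomal and are skipped, as are unannotated genes.
    """
    x_values, autosomal_values = [], []
    for gene, value in expression.items():
        chrom = chromosome_of.get(gene)
        if chrom is None or chrom in ("Y", "MT"):
            continue
        if chrom == "X":
            x_values.append(value)
        else:
            autosomal_values.append(value)
    if not x_values or not autosomal_values:
        raise ValueError("need at least one X-linked and one autosomal gene")
    return mean(x_values) / mean(autosomal_values)


# Toy data (hypothetical gene ids): X mean is 15, autosomal mean is 30.
toy_expression = {"g1": 10.0, "g2": 20.0, "g3": 30.0, "g4": 30.0, "g5": 99.0}
toy_chromosomes = {"g1": "X", "g2": "X", "g3": "1", "g4": "2", "g5": "MT"}
toy_ratio = xa_ratio(toy_expression, toy_chromosomes)
```

Under this definition a ratio near 1 would correspond to complete upregulation of the single active X and a ratio near 0.5 to none; published analyses often use medians or restrict to expressed genes, which this sketch does not attempt.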
Affiliation(s)
| | | | - Nathanial Jue
- School of Natural Sciences, California State University, Monterey Bay, Seaside, CA 93955
| | - Mingyuan Zhang
- Department of Animal Science
- Laboratory Animal Center, Guangxi Medical University, Nanning 530021, China
| | | | - Sahar Al Seesi
- Smith College Department of Computer Science, Northampton, MA 01063
- Department of Computer Science
| | | | | | | | - Rachel O'Neill
- Department of Molecular and Cell Biology, and University of Connecticut, Storrs, CT, 06269
| | | | | | | | - Hesheng Jiang
- College of Animal Science and Technology, Guangxi University, Nanning 530004, China, and
| | - Zongliang Carl Jiang
- Department of Animal Science
- School of Animal Science, Louisiana State University, Baton Rouge, LA 70803
|
29
|
Duan JE, Shi W, Jue NK, Jiang Z, Kuo L, O'Neill R, Wolf E, Dong H, Zheng X, Chen J, Tian XC. Dosage Compensation of the X Chromosomes in Bovine Germline, Early Embryos, and Somatic Tissues. Genome Biol Evol 2019; 11:242-252. [PMID: 30566637 PMCID: PMC6354180 DOI: 10.1093/gbe/evy270] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Accepted: 12/12/2018] [Indexed: 12/15/2022] Open
Abstract
Dosage compensation of the mammalian X chromosome (X) was proposed by Susumu Ohno as a mechanism wherein the inactivation of one X in females would lead to doubling the expression of the other. This would resolve the dosage imbalance between eutherian females (XX) versus male (XY) and between a single active X versus autosome pairs (A). Expression ratio of X- and A-linked genes has been relatively well studied in humans and mice, despite controversial results over the existence of upregulation of X-linked genes. Here we report the first comprehensive test of Ohno’s hypothesis in bovine preattachment embryos, germline, and somatic tissues. Overall an incomplete dosage compensation (0.5 < X:A < 1) of expressed genes and an excess X dosage compensation (X:A > 1) of ubiquitously expressed “dosage-sensitive” genes were seen. No significant differences in X:A ratios were observed between bovine female and male somatic tissues, further supporting Ohno’s hypothesis. Interestingly, preimplantation embryos manifested a unique pattern of X dosage compensation dynamics. Specifically, X dosage decreased after fertilization, indicating that the sperm brings in an inactive X to the matured oocyte. Subsequently, the activation of the bovine embryonic genome enhanced expression of X-linked genes and increased the X dosage. As a result, an excess compensation was exhibited from the 8-cell stage to the compact morula stage. The X dosage peaked at the 16-cell stage and stabilized after the blastocyst stage. Together, our findings confirm Ohno’s hypothesis of X dosage compensation in the bovine and extend it by showing incomplete and over-compensation for expressed and “dosage-sensitive” genes, respectively.
Affiliation(s)
| | - Wei Shi
- Department of Statistics, University of Connecticut, Storrs, CT
| | - Nathaniel K Jue
- School of Natural Sciences, California State University, Monterey Bay, CA
| | - Zongliang Jiang
- School of Animal Science, Louisiana State University, Agricultural Center, Baton Rouge, LA
| | - Lynn Kuo
- Department of Statistics, University of Connecticut, Storrs, CT
| | - Rachel O'Neill
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT
| | - Eckhard Wolf
- Gene Center and Department of Biochemistry, Ludwig-Maximilians-Universität München, Germany
| | - Hong Dong
- Institute of Animal Science, Xinjiang Academy of Animal Sciences, Urumqi, Xinjiang, P.R. China
| | - Xinbao Zheng
- Institute of Animal Science, Xinjiang Academy of Animal Sciences, Urumqi, Xinjiang, P.R. China
| | - Jingbo Chen
- Institute of Animal Science, Xinjiang Academy of Animal Sciences, Urumqi, Xinjiang, P.R. China
|
30
|
Duan JE, Zhang M, Flock K, Seesi SA, Mandoiu I, Jones A, Johnson E, Pillai S, Hoffman M, McFadden K, Jiang H, Reed S, Govoni K, Zinn S, Jiang Z, Tian XC. Effects of maternal nutrition on the expression of genomic imprinted genes in ovine fetuses. Epigenetics 2018; 13:793-807. [PMID: 30051747 PMCID: PMC6224220 DOI: 10.1080/15592294.2018.1503489] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 03/20/2018] [Revised: 07/04/2018] [Accepted: 07/15/2018] [Indexed: 12/27/2022] Open
Abstract
Genomic imprinting is an epigenetic phenomenon of differential allelic expression based on parental origin. To date, 263 imprinted genes have been identified among all investigated mammalian species. However, only 21 have been described in sheep, of which 11 are annotated in the current ovine genome. Here, we aim to i) use DNA/RNA high throughput sequencing to identify new monoallelically expressed and imprinted genes in day 135 ovine fetuses and ii) determine whether maternal diet (100%, 60%, or 140% of National Research Council Total Digestible Nutrients) influences expression of imprinted genes. We also reported strategies to solve technical challenges in the data analysis pipeline. We identified 80 monoallelically expressed, 13 new putative imprinted genes, and five known imprinted genes in sheep using the 263 genes stated above as a guide. Sanger sequencing confirmed allelic expression of seven genes, CASD1, COPG2, DIRAS3, INPP5F, PLAGL1, PPP1R9A, and SLC22A18. Among the 13 putative imprinted genes, five were localized in the known sheep imprinting domains of MEST on chromosome 4, DLK1/GTL2 on chromosome 18 and KCNQ1 on chromosome 21, and three were in a novel sheep imprinted cluster on chromosome 4, known in other species as PEG10/SGCE. The expression of DIRAS3, IGF2, PHLDA2, and SLC22A18 was altered by maternal diet, albeit without allelic expression reversal. Together, our results expanded the list of sheep imprinted genes to 34 and demonstrated that while the expression levels of four imprinted genes were changed by maternal diet, the allelic expression patterns were un-changed for all imprinted genes studied.
Affiliation(s)
| | - Mingyuan Zhang
- Department of Animal Science, University of Connecticut, Storrs, CT, USA
- College of Animal Science and Technology, Guangxi University, Nanning, China
| | - Kaleigh Flock
- Department of Animal Science, University of Connecticut, Storrs, CT, USA
| | - Sahar Al Seesi
- Department of Computer Science, University of Connecticut, Storrs, CT, USA
| | - Ion Mandoiu
- Department of Computer Science, University of Connecticut, Storrs, CT, USA
| | - Amanda Jones
- Department of Animal Science, University of Connecticut, Storrs, CT, USA
| | - Elizabeth Johnson
- Department of Animal Science, University of Connecticut, Storrs, CT, USA
| | - Sambhu Pillai
- Department of Animal Science, University of Connecticut, Storrs, CT, USA
| | - Maria Hoffman
- Department of Animal Science, University of Connecticut, Storrs, CT, USA
| | - Katelyn McFadden
- Department of Animal Science, University of Connecticut, Storrs, CT, USA
| | - Hesheng Jiang
- College of Animal Science and Technology, Guangxi University, Nanning, China
| | - Sarah Reed
- Department of Animal Science, University of Connecticut, Storrs, CT, USA
| | - Kristen Govoni
- Department of Animal Science, University of Connecticut, Storrs, CT, USA
| | - Steve Zinn
- Department of Animal Science, University of Connecticut, Storrs, CT, USA
| | - Zongliang Jiang
- School of Animal Science, Louisiana State University Agricultural Center, Baton Rouge, LA, USA
|
31
|
Event Analysis: Using Transcript Events To Improve Estimates of Abundance in RNA-seq Data. G3-GENES GENOMES GENETICS 2018; 8:2923-2940. [PMID: 30021829 PMCID: PMC6118309 DOI: 10.1534/g3.118.200373] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 02/07/2023]
Abstract
Alternative splicing leverages genomic content by allowing the synthesis of multiple transcripts and, by implication, protein isoforms, from a single gene. However, estimating the abundance of transcripts produced in a given tissue from short sequencing reads is difficult and can result in both the construction of transcripts that do not exist, and the failure to identify true transcripts. An alternative approach is to catalog the events that make up isoforms (splice junctions and exons). We present here the Event Analysis (EA) approach, where we project transcripts onto the genome and identify overlapping/unique regions and junctions. In addition, all possible logical junctions are assembled into a catalog. Transcripts are filtered before quantitation based on simple measures: the proportion of the events detected, and the coverage. We find that mapping to a junction catalog is more efficient at detecting novel junctions than mapping in a splice aware manner. We identify 99.8% of true transcripts while iReckon identifies 82% of the true transcripts and creates more transcripts not included in the simulation than were initially used in the simulation. Using PacBio Iso-seq data from a mouse neural progenitor cell model, EA detects 60% of the novel junctions that are combinations of existing exons while only 43% are detected by STAR. EA further detects ∼5,000 annotated junctions missed by STAR. Filtering transcripts based on the proportion of the transcript detected and the number of reads on average supporting that transcript captures 95% of the PacBio transcriptome. Filtering the reference transcriptome before quantitation results in a more stable estimate of isoform abundance, with improved correlation between replicates.
This was particularly evident when EA was applied to an RNA-seq study of type 1 diabetes (T1D), where the coefficient of variation among subjects (n = 81) in the transcript abundance estimates was substantially reduced compared to the estimation using the full reference. EA focuses on individual transcriptional events. These events can be quantitated and analyzed directly or used to identify the probable set of expressed transcripts. Simple rules based on detected events and coverage used in filtering result in a dramatic improvement in isoform estimation without the use of ancillary data (e.g., ChIP, long reads) that may not be available for many studies.
|
32
|
Papastamoulis P, Rattray M. Bayesian estimation of differential transcript usage from RNA-seq data. Stat Appl Genet Mol Biol 2018; 16:367-386. [PMID: 29091583 DOI: 10.1515/sagmb-2017-0005] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/15/2022]
Abstract
Next generation sequencing allows the identification of genes consisting of differentially expressed transcripts, a term which usually refers to changes in the overall expression level. A specific type of differential expression is differential transcript usage (DTU) and targets changes in the relative within gene expression of a transcript. The contribution of this paper is to: (a) extend the use of cjBitSeq to the DTU context, a previously introduced Bayesian model which is originally designed for identifying changes in overall expression levels and (b) propose a Bayesian version of DRIMSeq, a frequentist model for inferring DTU. cjBitSeq is a read based model and performs fully Bayesian inference by MCMC sampling on the space of latent state of each transcript per gene. BayesDRIMSeq is a count based model and estimates the Bayes Factor of a DTU model against a null model using Laplace's approximation. The proposed models are benchmarked against the existing ones using a recent independent simulation study as well as a real RNA-seq dataset. Our results suggest that the Bayesian methods exhibit similar performance with DRIMSeq in terms of precision/recall but offer better calibration of False Discovery Rate.
|
33
|
Schaeffer L, Pimentel H, Bray N, Melsted P, Pachter L. Pseudoalignment for metagenomic read assignment. Bioinformatics 2018; 33:2082-2088. [PMID: 28334086 DOI: 10.1093/bioinformatics/btx106] [Citation(s) in RCA: 44] [Impact Index Per Article: 7.3] [Received: 10/18/2016] [Accepted: 02/17/2017] [Indexed: 12/13/2022] Open
Abstract
Motivation Read assignment is an important first step in many metagenomic analysis workflows, providing the basis for identification and quantification of species. However ambiguity among the sequences of many strains makes it difficult to assign reads at the lowest level of taxonomy, and reads are typically assigned to taxonomic levels where they are unambiguous. We explore connections between metagenomic read assignment and the quantification of transcripts from RNA-Seq data in order to develop novel methods for rapid and accurate quantification of metagenomic strains. Results We find that the recent idea of pseudoalignment introduced in the RNA-Seq context is highly applicable in the metagenomics setting. When coupled with the Expectation-Maximization (EM) algorithm, reads can be assigned far more accurately and quickly than is currently possible with state of the art software, making it possible and practical for the first time to analyze abundances of individual genomes in metagenomics projects. Availability and Implementation Pipeline and analysis code can be downloaded from http://github.com/pachterlab/metakallisto. Contact lpachter@math.berkeley.edu.
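The coupling of pseudoalignment with the Expectation-Maximization (EM) algorithm mentioned in this abstract can be sketched generically. The example below assumes reads have already been grouped into compatibility classes (the set of reference sequences each read is compatible with, plus a read count), and it ignores sequence-length normalization. The function name `em_abundances` and the input layout are hypothetical and are not taken from the metakallisto code.

```python
def em_abundances(compat_classes, n_targets, n_iter=200):
    """Estimate relative abundances from ambiguous read assignments with EM.

    compat_classes -- list of (target_indices, read_count) pairs, where
                      target_indices is a tuple of reference indices the
                      reads in that class are compatible with
    n_targets      -- number of reference sequences
    Assumption: all references have equal effective length, so abundance
    is proportional to the expected number of reads assigned.
    """
    theta = [1.0 / n_targets] * n_targets  # start from a uniform estimate
    for _ in range(n_iter):
        expected = [0.0] * n_targets
        # E-step: split each class's reads among its compatible targets
        # in proportion to the current abundance estimates.
        for targets, count in compat_classes:
            denom = sum(theta[t] for t in targets)
            if denom == 0.0:
                continue  # class has no targets with non-zero abundance
            for t in targets:
                expected[t] += count * theta[t] / denom
        # M-step: renormalize expected counts into abundances.
        total = sum(expected)
        if total == 0.0:
            break
        theta = [e / total for e in expected]
    return theta


# Toy input: 30 reads unique to strain 0, 10 unique to strain 1,
# and 60 reads compatible with both strains.
toy_classes = [((0,), 30), ((1,), 10), ((0, 1), 60)]
toy_theta = em_abundances(toy_classes, n_targets=2)
```

In this toy case the ambiguous reads are shared in the same 3:1 proportion as the unique reads, so the estimates settle at 0.75 and 0.25. Production tools additionally account for sequence lengths and use convergence criteria rather than a fixed iteration count.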
Affiliation(s)
- L Schaeffer
- Department of Molecular and Cell Biology, UC Berkeley, Berkeley, CA, USA
| | - H Pimentel
- Department of Genetics, Stanford University, Stanford, CA, USA
| | - N Bray
- Department of Molecular and Cell Biology and Innovative Genomics Institute, UC Berkeley, Berkeley, CA, USA
| | - P Melsted
- Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavik, Iceland
| | - L Pachter
- Department of Molecular and Cell Biology, UC Berkeley, Berkeley, CA, USA.,Departments of Mathematics and Computer Science, UC Berkeley, Berkeley, CA, USA
|
34
|
Mangul S, Yang HT, Strauli N, Gruhl F, Porath HT, Hsieh K, Chen L, Daley T, Christenson S, Wesolowska-Andersen A, Spreafico R, Rios C, Eng C, Smith AD, Hernandez RD, Ophoff RA, Santana JR, Levanon EY, Woodruff PG, Burchard E, Seibold MA, Shifman S, Eskin E, Zaitlen N. ROP: dumpster diving in RNA-sequencing to find the source of 1 trillion reads across diverse adult human tissues. Genome Biol 2018; 19:36. [PMID: 29548336 PMCID: PMC5857127 DOI: 10.1186/s13059-018-1403-7] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 07/19/2017] [Accepted: 02/02/2018] [Indexed: 11/22/2022] Open
Abstract
High-throughput RNA-sequencing (RNA-seq) technologies provide an unprecedented opportunity to explore the individual transcriptome. Unmapped reads are a large and often overlooked output of standard RNA-seq analyses. Here, we present Read Origin Protocol (ROP), a tool for discovering the source of all reads originating from complex RNA molecules. We apply ROP to samples across 2630 individuals from 54 diverse human tissues. Our approach can account for 99.9% of 1 trillion reads of various read length. Additionally, we use ROP to investigate the functional mechanisms underlying connections between the immune system, microbiome, and disease. ROP is freely available at https://github.com/smangul1/rop/wiki.
Affiliation(s)
- Serghei Mangul
- Department of Computer Science, University of California, Los Angeles, CA, USA.,Institute for Quantitative and Computational Biosciences, University of California, Los Angeles, CA, USA.
| | - Harry Taegyun Yang
- Department of Computer Science, University of California, Los Angeles, CA, USA
| | - Nicolas Strauli
- Biomedical Sciences Graduate Program, University of California, San Francisco, CA, USA
| | - Franziska Gruhl
- Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland.,SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Hagit T Porath
- The Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan, Israel
| | - Kevin Hsieh
- Department of Computer Science, University of California, Los Angeles, CA, USA
| | - Linus Chen
- Department of Bioengineering, University of California, Los Angeles, CA, USA
| | - Timothy Daley
- Molecular and Computational Biology, Department of Biological Sciences, University of Southern California, Los Angeles, CA, USA
| | - Stephanie Christenson
- Division of Pulmonary, Critical Care, Sleep and Allergy, Department of Medicine, and Cardiovascular Research Institute, University of California, San Francisco, CA, USA
| | | | - Roberto Spreafico
- Institute for Quantitative and Computational Biosciences, University of California, Los Angeles, CA, USA
| | - Cydney Rios
- Center for Genes, Environment, and Health, National Jewish Health, Denver, CO, USA
| | - Celeste Eng
- Department of Medicine, University of California, San Francisco, CA, USA
| | - Andrew D Smith
- Molecular and Computational Biology, Department of Biological Sciences, University of Southern California, Los Angeles, CA, USA
| | - Ryan D Hernandez
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA, USA.,Institute for Quantitative Biosciences, University of California, San Francisco, CA, USA.,Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA
| | - Roel A Ophoff
- Center for Neurobehavioral Genetics, Semel Institute for Neuroscience and Human Behavior, University California, Los Angeles, CA, USA.,Department of Human Genetics, University of California, Los Angeles, CA, USA.,Department of Psychiatry, Brain Center Rudolf Magnus, University Medical Center Utrecht, Utrecht, The Netherlands
| | | | - Erez Y Levanon
- The Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan, Israel
| | - Prescott G Woodruff
- Division of Pulmonary, Critical Care, Sleep and Allergy, Department of Medicine, and Cardiovascular Research Institute, University of California, San Francisco, CA, USA
| | - Esteban Burchard
- Schools of Pharmacy and Medicine, Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA, USA
| | - Max A Seibold
- Department of Pediatrics, National Jewish Health, Denver, CO, USA.,University of Colorado School of Medicine, Denver, CO, USA
| | - Sagiv Shifman
- Department of Genetics, The Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel
| | - Eleazar Eskin
- Department of Computer Science, University of California, Los Angeles, CA, USA.,Department of Human Genetics, University of California, Los Angeles, CA, USA
| | - Noah Zaitlen
- Division of Pulmonary, Critical Care, Sleep and Allergy, Department of Medicine, and Cardiovascular Research Institute, University of California, San Francisco, CA, USA.
|
35
|
Li HL, Lin HR, Xia JH. Differential Gene Expression Profiles and Alternative Isoform Regulations in Gill of Nile Tilapia in Response to Acute Hypoxia. MARINE BIOTECHNOLOGY (NEW YORK, N.Y.) 2017; 19:551-562. [PMID: 28920148 DOI: 10.1007/s10126-017-9774-4] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Received: 01/19/2017] [Accepted: 07/27/2017] [Indexed: 06/07/2023]
Abstract
Fish often encounter acute environmental hypoxia, either spatially or temporally. The gill plays important roles in the response to hypoxic stress in fish, yet few studies have focused on the molecular regulatory mechanisms of gills under hypoxic stress. In this study, we investigated the transcriptomic response to 12-h acute hypoxia in the gill of a hypoxia-tolerant fish, Nile tilapia (Oreochromis niloticus), through RNA sequencing (RNA-Seq). We sequenced messenger RNA from three control samples and three hypoxia-treated samples. Bioinformatics analysis identified 239 differentially expressed genes (DEG) and 34 genes (DUES) that had significant differential alternative isoform regulation events in at least one exonic region in gill in response to acute hypoxia. The spatiotemporal expression analysis in five tissues (heart, liver, brain, gill, and spleen) sampled at three time points (6, 12, and 24 h) under hypoxia treatment confirmed the significant association of differential exon usage in two DUES genes (TLDC2 and SSX2IPA) with hypoxia conditions. Further functional analysis suggested that several energy and immune response-related pathways, e.g., metabolic pathway and antigen processing and presentation, contained the most abundant DEG genes. We found that some GO biological processes for DEG genes were significantly enriched under hypoxic stress, such as glycolysis, metabolic process, generation of precursor metabolites and energy, and cholesterol metabolic process. Our findings suggest abundant differential gene expression changes and alternative isoform regulation events in genes involved in the hypoxia response in gill. Our results provide a basis for exploring the gene regulation mechanism under hypoxic stress in fish.
Affiliation(s)
- Hong Lian Li
- State Key Laboratory of Biocontrol, Institute of Aquatic Economic Animals and Guangdong Provincial Key Laboratory for Aquatic Economic Animals, College of Life Sciences, Sun Yat-Sen University, Guangzhou, 510275, People's Republic of China
- Hao Ran Lin
- State Key Laboratory of Biocontrol, Institute of Aquatic Economic Animals and Guangdong Provincial Key Laboratory for Aquatic Economic Animals, College of Life Sciences, Sun Yat-Sen University, Guangzhou, 510275, People's Republic of China
- Jun Hong Xia
- State Key Laboratory of Biocontrol, Institute of Aquatic Economic Animals and Guangdong Provincial Key Laboratory for Aquatic Economic Animals, College of Life Sciences, Sun Yat-Sen University, Guangzhou, 510275, People's Republic of China.
|
36
|
Zhang P, He D, Xu Y, Hou J, Pan BF, Wang Y, Liu T, Davis CM, Ehli EA, Tan L, Zhou F, Hu J, Yu Y, Chen X, Nguyen TM, Rosen JM, Hawke DH, Ji Z, Chen Y. Genome-wide identification and differential analysis of translational initiation. Nat Commun 2017; 8:1749. [PMID: 29170441 PMCID: PMC5701008 DOI: 10.1038/s41467-017-01981-8] [Citation(s) in RCA: 83] [Impact Index Per Article: 11.9] [Received: 03/18/2017] [Accepted: 10/31/2017] [Indexed: 01/28/2023]
Abstract
Translation is principally regulated at the initiation stage. The development of the translation initiation (TI) sequencing (TI-seq) technique has enabled the global mapping of TIs and revealed unanticipated complex translational landscapes in metazoans. Despite the wide adoption of TI-seq, there is no computational tool currently available for analyzing TI-seq data. To fill this gap, we develop a comprehensive toolkit named Ribo-TISH, which allows for detecting and quantitatively comparing TIs across conditions from TI-seq data. Ribo-TISH can also predict novel open reading frames (ORFs) from regular ribosome profiling (rRibo-seq) data and outperform several established methods in both computational efficiency and prediction accuracy. Applied to published TI-seq/rRibo-seq data sets, Ribo-TISH uncovers a novel signature of elevated mitochondrial translation during amino-acid deprivation and predicts novel ORFs in 5'UTRs, long noncoding RNAs, and introns. These successful applications demonstrate the power of Ribo-TISH in extracting biological insights from TI-seq/rRibo-seq data.
Affiliation(s)
- Peng Zhang
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
- Dandan He
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
- Yi Xu
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
- Jiakai Hou
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
- Bih-Fang Pan
- Proteomics and Metabolomics Facility, and Department of Systems Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
- Yunfei Wang
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
- Tao Liu
- Department of Biochemistry, State University of New York at Buffalo, Buffalo, NY, 14203, USA
- Erik A Ehli
- Avera Institute for Human Genetics, Sioux Falls, SD, 57108, USA
- Lin Tan
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
- Feng Zhou
- Liver Cancer Institute, Zhongshan Hospital, Key Laboratory of Carcinogenesis and Cancer Invasion, Ministry of Education, and Institutes of Biomedical Sciences, Fudan University, Shanghai, 200032, China
- Jian Hu
- Department of Cancer Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, 77054, USA
- Yonghao Yu
- Department of Biochemistry, The University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Xi Chen
- Department of Molecular and Cellular Biology, Baylor College of Medicine, Houston, TX, 77030, USA
- Tuan M Nguyen
- Department of Molecular and Cellular Biology, Baylor College of Medicine, Houston, TX, 77030, USA
- Program in Translational Biology and Molecular Medicine, Baylor College of Medicine, Houston, TX, 77030, USA
- Jeffrey M Rosen
- Department of Molecular and Cellular Biology, Baylor College of Medicine, Houston, TX, 77030, USA
- David H Hawke
- Proteomics and Metabolomics Facility, and Department of Systems Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
- Zhe Ji
- Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, MA, 02115, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
- Yiwen Chen
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA.
|
37
|
Srivastava A, Sarkar H, Gupta N, Patro R. RapMap: a rapid, sensitive and accurate tool for mapping RNA-seq reads to transcriptomes. Bioinformatics 2017; 32:i192-i200. [PMID: 27307617 PMCID: PMC4908361 DOI: 10.1093/bioinformatics/btw277] [Citation(s) in RCA: 66] [Impact Index Per Article: 9.4] [Indexed: 12/15/2022]
Abstract
Motivation: The alignment of sequencing reads to a transcriptome is a common and important step in many RNA-seq analysis tasks. When aligning RNA-seq reads directly to a transcriptome (as is common in the de novo setting or when a trusted reference annotation is available), care must be taken to report the potentially large number of multi-mapping locations per read. This can pose a substantial computational burden for existing aligners, and can considerably slow downstream analysis. Results: We introduce a novel concept, quasi-mapping, and an efficient algorithm implementing this approach for mapping sequencing reads to a transcriptome. By attempting only to report the potential loci of origin of a sequencing read, and not the base-to-base alignment by which it derives from the reference, RapMap, our tool implementing quasi-mapping, is capable of mapping sequencing reads to a target transcriptome substantially faster than existing alignment tools. The algorithm we use to implement quasi-mapping uses several efficient data structures and takes advantage of the special structure of shared sequence prevalent in transcriptomes to rapidly provide highly-accurate mapping information. We demonstrate how quasi-mapping can be successfully applied to the problems of transcript-level quantification from RNA-seq reads and the clustering of contigs from de novo assembled transcriptomes into biologically meaningful groups. Availability and implementation: RapMap is implemented in C++11 and is available as open-source software, under GPL v3, at https://github.com/COMBINE-lab/RapMap. Contact: rob.patro@cs.stonybrook.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Avi Srivastava
- Department of Computer Science, Stony Brook University, Stony Brook, New York, NY 11794-2424, USA
- Hirak Sarkar
- Department of Computer Science, Stony Brook University, Stony Brook, New York, NY 11794-2424, USA
- Nitish Gupta
- Department of Computer Science, Stony Brook University, Stony Brook, New York, NY 11794-2424, USA
- Rob Patro
- Department of Computer Science, Stony Brook University, Stony Brook, New York, NY 11794-2424, USA
|
38
|
Zakeri M, Srivastava A, Almodaresi F, Patro R. Improved data-driven likelihood factorizations for transcript abundance estimation. Bioinformatics 2017; 33:i142-i151. [PMID: 28881996 PMCID: PMC5870700 DOI: 10.1093/bioinformatics/btx262] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Indexed: 12/18/2022]
Abstract
MOTIVATION Many methods for transcript-level abundance estimation reduce the computational burden associated with the iterative algorithms they use by adopting an approximate factorization of the likelihood function they optimize. This leads to considerably faster convergence of the optimization procedure, since each round of, e.g., the EM algorithm can execute much more quickly. However, these approximate factorizations of the likelihood function simplify calculations at the expense of discarding certain information that can be useful for accurate transcript abundance estimation. RESULTS We demonstrate that model simplifications (i.e. factorizations of the likelihood function) adopted by certain abundance estimation methods can lead to a diminished ability to accurately estimate the abundances of highly related transcripts. In particular, considering factorizations based on transcript-fragment compatibility alone can result in a loss of accuracy compared to the per-fragment, unsimplified model. However, we show that such shortcomings are not an inherent limitation of approximately factorizing the underlying likelihood function. By considering the appropriate conditional fragment probabilities, and adopting improved, data-driven factorizations of this likelihood, we demonstrate that such approaches can achieve accuracy nearly indistinguishable from methods that consider the complete (i.e. per-fragment) likelihood, while retaining the computational efficiency of the compatibility-based factorizations. AVAILABILITY AND IMPLEMENTATION Our data-driven factorizations are incorporated into a branch of the Salmon transcript quantification tool: https://github.com/COMBINE-lab/salmon/tree/factorizations. CONTACT rob.patro@cs.stonybrook.edu. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
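The compatibility-based factorization mentioned in this abstract can be illustrated with a minimal sketch. This is not the Salmon implementation: the function names `equivalence_classes` and `em_abundances` are hypothetical, and the sketch assumes equal transcript lengths and ignores the conditional fragment probabilities that the paper argues should be retained. It only shows why grouping reads by their set of compatible transcripts makes each EM round cheaper: the loop runs over classes rather than over individual reads.

```python
from collections import Counter


def equivalence_classes(read_compat):
    """Group reads by their set of compatible transcripts.

    read_compat: iterable of iterables of transcript indices, one per read.
    Returns a Counter mapping frozenset(transcripts) -> number of reads.
    """
    return Counter(frozenset(c) for c in read_compat)


def em_abundances(eq_classes, n_transcripts, n_iter=200):
    """EM over equivalence-class counts (compatibility-only factorization).

    Assumption: all transcripts have equal effective length, so the
    estimate is simply the fraction of reads assigned to each transcript.
    """
    theta = [1.0 / n_transcripts] * n_transcripts
    total = sum(eq_classes.values())
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        for txs, count in eq_classes.items():
            denom = sum(theta[t] for t in txs)
            if denom == 0.0:
                continue  # no compatible transcript has support
            for t in txs:
                # E-step: split the class count in proportion to theta
                expected[t] += count * theta[t] / denom
        # M-step: renormalize expected counts into abundances
        theta = [e / total for e in expected]
    return theta


# Toy input: 6 reads unique to transcript 0, 2 ambiguous, 2 unique to transcript 1
reads = [[0]] * 6 + [[0, 1]] * 2 + [[1]] * 2
classes = equivalence_classes(reads)
theta = em_abundances(classes, n_transcripts=2)
```

In this toy input ten reads collapse into three classes, so each EM round touches three entries instead of ten; on real data the reduction is typically much larger.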
Affiliation(s)
- Mohsen Zakeri
- Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
- Avi Srivastava
- Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
- Fatemeh Almodaresi
- Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
- Rob Patro
- Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
|
39
|
Endocannabinoid system acts as a regulator of immune homeostasis in the gut. Proc Natl Acad Sci U S A 2017; 114:5005-5010. [PMID: 28439004 DOI: 10.1073/pnas.1612177114] [Citation(s) in RCA: 99] [Impact Index Per Article: 14.1] [Indexed: 02/02/2023]
Abstract
Endogenous cannabinoids (endocannabinoids) are small molecules biosynthesized from membrane glycerophospholipid. Anandamide (AEA) is an endogenous intestinal cannabinoid that controls appetite and energy balance by engagement of the enteric nervous system through cannabinoid receptors. Here, we uncover a role for AEA and its receptor, cannabinoid receptor 2 (CB2), in the regulation of immune tolerance in the gut and the pancreas. This work demonstrates a major immunological role for an endocannabinoid. The pungent molecule capsaicin (CP) has a similar effect as AEA; however, CP acts by engagement of the vanilloid receptor TRPV1, causing local production of AEA, which acts through CB2. We show that the engagement of the cannabinoid/vanilloid receptors augments the number and immune suppressive function of the regulatory CX3CR1hi macrophages (Mϕ), which express the highest levels of such receptors among the gut immune cells. Additionally, TRPV1-/- or CB2-/- mice have fewer CX3CR1hi Mϕ in the gut. Treatment of mice with CP also leads to differentiation of a regulatory subset of CD4+ cells, the Tr1 cells, in an IL-27-dependent manner in vitro and in vivo. In a functional demonstration, tolerance elicited by engagement of TRPV1 can be transferred to naïve nonobese diabetic (NOD) mice [model of type 1 diabetes (T1D)] by transfer of CD4+ T cells. Further, oral administration of AEA to NOD mice provides protection from T1D. Our study unveils a role for the endocannabinoid system in maintaining immune homeostasis in the gut/pancreas and reveals a conversation between the nervous and immune systems using distinct receptors.
|
40
|
Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 2017; 14:417-419. [PMID: 28263959 PMCID: PMC5600148 DOI: 10.1038/nmeth.4197] [Citation(s) in RCA: 5930] [Impact Index Per Article: 847.1] [Received: 08/29/2016] [Accepted: 01/22/2017] [Indexed: 12/12/2022]
Abstract
We introduce Salmon, a lightweight method for quantifying transcript abundance from RNA-seq reads. Salmon combines a new dual-phase parallel inference algorithm and feature-rich bias models with an ultra-fast read mapping procedure. It is the first transcriptome-wide quantifier to correct for fragment GC-content bias, which, as we demonstrate here, substantially improves the accuracy of abundance estimates and the sensitivity of subsequent differential expression analysis.
Affiliation(s)
- Rob Patro
- Department of Computer Science, Stony Brook University, Stony Brook, New York, USA
- Michael I Love
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Cambridge, Massachusetts, USA
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Cambridge, Massachusetts, USA
- Rafael A Irizarry
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Cambridge, Massachusetts, USA
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Cambridge, Massachusetts, USA
- Carl Kingsford
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
|
42
|
Papastamoulis P, Rattray M. A Bayesian model selection approach for identifying differentially expressed transcripts from RNA sequencing data. J R Stat Soc Ser C Appl Stat 2017; 67:3-23. [PMID: 29353941 PMCID: PMC5763373 DOI: 10.1111/rssc.12213] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 01/10/2023]
Abstract
Recent advances in molecular biology allow the quantification of the transcriptome and scoring transcripts as differentially or equally expressed between two biological conditions. Although these two tasks are closely linked, the available inference methods treat them separately: a primary model is used to estimate expression and its output is post processed by using a differential expression model. In the paper, both issues are simultaneously addressed by proposing the joint estimation of expression levels and differential expression: the unknown relative abundance of each transcript can either be equal or not between two conditions. A hierarchical Bayesian model builds on the BitSeq framework and the posterior distribution of transcript expression and differential expression is inferred by using Markov chain Monte Carlo sampling. It is shown that the model proposed enjoys conjugacy for fixed dimension variables; thus the full conditional distributions are analytically derived. Two samplers are constructed, a reversible jump Markov chain Monte Carlo sampler and a collapsed Gibbs sampler, and the latter is found to perform better. A cluster representation of the aligned reads to the transcriptome is introduced, allowing parallel estimation of the marginal posterior distribution of subsets of transcripts under reasonable computing time. Under a fixed prior probability of differential expression the clusterwise sampler has the same marginal posterior distributions as the raw sampler, but a more general prior structure is also employed. The algorithm proposed is benchmarked against alternative methods by using synthetic data sets and applied to real RNA sequencing data. Source code is available on line from https://github.com/mqbssppe/cjBitSeq.
|
43
|
Williams CR, Baccarella A, Parrish JZ, Kim CC. Empirical assessment of analysis workflows for differential expression analysis of human samples using RNA-Seq. BMC Bioinformatics 2017; 18:38. [PMID: 28095772 PMCID: PMC5240434 DOI: 10.1186/s12859-016-1457-z] [Citation(s) in RCA: 48] [Impact Index Per Article: 6.9] [Received: 09/26/2016] [Accepted: 12/31/2016] [Indexed: 02/07/2023]
Abstract
BACKGROUND RNA-Seq has supplanted microarrays as the preferred method of transcriptome-wide identification of differentially expressed genes. However, RNA-Seq analysis is still rapidly evolving, with a large number of tools available for each of the three major processing steps: read alignment, expression modeling, and identification of differentially expressed genes. Although some studies have benchmarked these tools against gold standard gene expression sets, few have evaluated their performance in concert with one another. Additionally, there is a general lack of testing of such tools on real-world, physiologically relevant datasets, which often possess qualities not reflected in tightly controlled reference RNA samples or synthetic datasets. RESULTS Here, we evaluate 219 combinatorial implementations of the most commonly used analysis tools for their impact on differential gene expression analysis by RNA-Seq. A test dataset was generated using highly purified human classical and nonclassical monocyte subsets from a clinical cohort, allowing us to evaluate the performance of 495 unique workflows, when accounting for differences in expression units and gene- versus transcript-level estimation. We find that the choice of methodologies leads to wide variation in the number of genes called significant, as well as in performance as gauged by precision and recall, calculated by comparing our RNA-Seq results to those from four previously published microarray and BeadChip analyses of the same cell populations. The method of differential gene expression identification exhibited the strongest impact on performance, with smaller impacts from the choice of read aligner and expression modeler. Many workflows were found to exhibit similar overall performance, but with differences in their calibration, with some biased toward higher precision and others toward higher recall. 
CONCLUSIONS There is significant heterogeneity in the performance of RNA-Seq workflows to identify differentially expressed genes. Among the higher performing workflows, different workflows exhibit a precision/recall tradeoff, and the ultimate choice of workflow should take into consideration how the results will be used in subsequent applications. Our analyses highlight the performance characteristics of these workflows, and the data generated in this study could also serve as a useful resource for future development of software for RNA-Seq analysis.
Affiliation(s)
- Claire R Williams
- Department of Biology, University of Washington, Seattle, WA, 98195, USA
- Alyssa Baccarella
- Division of Experimental Medicine, Department of Medicine, University of California, San Francisco, CA, 94143, USA
- Jay Z Parrish
- Department of Biology, University of Washington, Seattle, WA, 98195, USA
- Charles C Kim
- Division of Experimental Medicine, Department of Medicine, University of California, San Francisco, CA, 94143, USA. Present address: Verily, South San Francisco, CA, 94080, USA.
|
44
|
Love MI, Hogenesch JB, Irizarry RA. Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation. Nat Biotechnol 2016. [PMID: 27669167 DOI: 10.1101/025767] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Indexed: 05/02/2023]
Abstract
We find that current computational methods for estimating transcript abundance from RNA-seq data can lead to hundreds of false-positive results. We show that these systematic errors stem largely from a failure to model fragment GC content bias. Sample-specific biases associated with fragment sequence features lead to misidentification of transcript isoforms. We introduce alpine, a method for estimating sample-specific bias-corrected transcript abundance. By incorporating fragment sequence features, alpine greatly increases the accuracy of transcript abundance estimates, enabling a fourfold reduction in the number of false positives for reported changes in expression compared with Cufflinks. Using simulated data, we also show that alpine retains the ability to discover true positives, similar to other approaches. The method is available as an R/Bioconductor package that includes data visualization tools useful for bias discovery.
Affiliation(s)
- Michael I Love
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, Massachusetts, USA
- Department of Biostatistics, Harvard TH Chan School of Public Health, Boston, Massachusetts, USA
- John B Hogenesch
- Department of Pharmacology, Institute for Translational Medicine and Therapeutics, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, USA
- Rafael A Irizarry
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, Massachusetts, USA
- Department of Biostatistics, Harvard TH Chan School of Public Health, Boston, Massachusetts, USA
|
45
|
Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation. Nat Biotechnol 2016; 34:1287-1291. [PMID: 27669167 PMCID: PMC5143225 DOI: 10.1038/nbt.3682] [Citation(s) in RCA: 100] [Impact Index Per Article: 12.5] [Received: 09/01/2015] [Accepted: 08/22/2016] [Indexed: 11/17/2022]
|
46
|
Karunakaran DKP, Al Seesi S, Banday AR, Baumgartner M, Olthof A, Lemoine C, Măndoiu II, Kanadia RN. Network-based bioinformatics analysis of spatio-temporal RNA-Seq data reveals transcriptional programs underpinning normal and aberrant retinal development. BMC Genomics 2016; 17 Suppl 5:495. [PMID: 27586787 PMCID: PMC5009874 DOI: 10.1186/s12864-016-2822-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 12/13/2022]
Abstract
Background: The retina, as a model system with extensive information on genes involved in development and maintenance, is of great value for investigations employing deep sequencing to capture transcriptome change over time. This in turn could enable us to find patterns in gene expression across time that reveal transitions in biological processes. Methods: We developed a bioinformatics pipeline to categorize genes based on their differential expression and their alternative splicing status across time by binning genes based on their transcriptional kinetics. Genes within the same bins were then leveraged to query gene annotation databases to discover molecular programs employed by the developing retina. Results: Using our pipeline on RNA-Seq data obtained from fractionated (nucleus/cytoplasm) developing retina at embryonic day (E) 16 and postnatal day (P) 0, we captured, at high resolution, the difference between the cytoplasm and the nucleus at the same developmental time. We found de novo transcription of genes whose transcripts were exclusively found in the nuclear transcriptome at P0. Further analysis showed that these genes were enriched for functions that are known to be executed during postnatal development, thus showing that the P0 nuclear transcriptome is temporally ahead of that of its cytoplasm. We extended our strategy to perform temporal analysis comparing P0 data to either P21-Nrl-wildtype (WT) or P21-Nrl-knockout (KO) retinae, which predicted that the KO retina would have compromised vasculature. Indeed, histological manifestation of vasodilation has been reported at a later time point (P60). Conclusions: Thus, our approach was predictive of a phenotype before it presented histologically. Our strategy can be extended to investigating the development and/or disease progression of other tissue types. Electronic supplementary material: The online version of this article (doi:10.1186/s12864-016-2822-z) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Sahar Al Seesi
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, 06269, USA
- Abdul Rouf Banday
- Department of Physiology and Neurobiology, University of Connecticut, Storrs, CT, 06269, USA
- Marybeth Baumgartner
- Department of Physiology and Neurobiology, University of Connecticut, Storrs, CT, 06269, USA
- Anouk Olthof
- Department of Physiology and Neurobiology, University of Connecticut, Storrs, CT, 06269, USA
- Utrecht University, 3508 TC, Utrecht, The Netherlands
- Christopher Lemoine
- Department of Physiology and Neurobiology, University of Connecticut, Storrs, CT, 06269, USA
- Ion I Măndoiu
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, 06269, USA
- Rahul N Kanadia
- Department of Physiology and Neurobiology, University of Connecticut, Storrs, CT, 06269, USA.
|
47
|
Huang Y, Sanguinetti G. Statistical modeling of isoform splicing dynamics from RNA-seq time series data. Bioinformatics 2016; 32:2965-72. [PMID: 27318208 DOI: 10.1093/bioinformatics/btw364] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 02/19/2016] [Accepted: 06/05/2016] [Indexed: 01/08/2023]
Abstract
MOTIVATION Isoform quantification is an important goal of RNA-seq experiments, yet it remains problematic for genes with low expression or several isoforms. These difficulties may in principle be ameliorated by exploiting correlated experimental designs, such as time series or dosage response experiments. Time series RNA-seq experiments, in particular, are becoming increasingly popular, yet there are no methods that explicitly leverage the experimental design to improve isoform quantification. RESULTS Here, we present DICEseq, the first isoform quantification method tailored to correlated RNA-seq experiments. DICEseq explicitly models the correlations between different RNA-seq experiments to aid the quantification of isoforms across experiments. Numerical experiments on simulated datasets show that DICEseq yields more accurate results than state-of-the-art methods, an advantage that can become considerable at low coverage levels. On real datasets, our results show that DICEseq provides substantially more reproducible and robust quantifications, increasing the correlation of estimates from replicate datasets by up to 10% on genes with low or moderate expression levels (bottom third of all genes). Furthermore, DICEseq makes it possible to quantify the trade-off between temporal sampling of RNA and depth of sequencing, frequently an important choice when planning experiments. Our results have strong implications for the design of RNA-seq experiments, and offer a novel tool for improved analysis of such datasets. AVAILABILITY AND IMPLEMENTATION Python code is freely available at http://diceseq.sf.net. CONTACT G.Sanguinetti@ed.ac.uk. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yuanhua Huang
- School of Informatics, University of Edinburgh, Edinburgh EH8 9AB, UK
- Guido Sanguinetti
- School of Informatics, University of Edinburgh, Edinburgh EH8 9AB, UK
- Centre for Synthetic and Systems Biology (SynthSys), University of Edinburgh, Edinburgh EH9 3BF, UK
|
48
|
Yuan Y, Xu H, Leung RKK. An optimized protocol for generation and analysis of Ion Proton sequencing reads for RNA-Seq. BMC Genomics 2016; 17:403. [PMID: 27229683 PMCID: PMC4880854 DOI: 10.1186/s12864-016-2745-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 10/12/2015] [Accepted: 05/14/2016] [Indexed: 11/28/2022]
Abstract
BACKGROUND Previous studies compared running cost, time and other performance measures of popular sequencing platforms. However, a comprehensive assessment of library construction and analysis protocols for the Proton sequencing platform has not been carried out. Unlike reads from Illumina sequencing platforms, Proton reads are heterogeneous in length and quality. When sequencing data from different platforms are combined, this can result in reads of various lengths. Whether the commonly used software handles such data satisfactorily is unknown. RESULTS Using universal human reference RNA as the starting material, RNaseIII and chemical fragmentation methods in library construction showed similar results in the number of genes and junctions discovered and in the accuracy of expression level estimates. In contrast, sequencing quality, read length and the choice of software affected mapping rate to a much larger extent. The unspliced aligner TMAP attained the highest mapping rate (97.27% to genome, 86.46% to transcriptome), though 47.83% of mapped reads were clipped. Long reads could paradoxically reduce mapping in junctions. With a reference annotation as a guide, the mapping rate of TopHat2 significantly increased from 75.79% to 92.09%, especially for long (>150 bp) reads. Sailfish, a k-mer based gene expression quantifier, attained results highly consistent with those of the TaqMan array and the highest sensitivity. CONCLUSION We provide, for the first time, reference statistics of library preparation methods, gene detection and quantification, and junction discovery for RNA-Seq on the Ion Proton platform. Chemical fragmentation performed as well as the enzyme-based method. The optimal Ion Proton sequencing options and analysis software have been evaluated.
Collapse
Affiliation(s)
- Yongxian Yuan
  - BGI-tech, BGI-Shenzhen, Shenzhen, 518083, Guangdong, China
- Huaiqian Xu
  - BGI-tech, BGI-Wuhan, Wuhan, 430075, Hubei, China
- Ross Ka-Kit Leung
  - BGI-tech, BGI-Shenzhen, Shenzhen, 518083, Guangdong, China
  - School of Public Health, The University of Hong Kong, Hong Kong, China
  - Stanley Ho Centre for Emerging Infectious Diseases, The Chinese University of Hong Kong, Hong Kong, China
|
49
|
Ntranos V, Kamath GM, Zhang JM, Pachter L, Tse DN. Fast and accurate single-cell RNA-seq analysis by clustering of transcript-compatibility counts. Genome Biol 2016; 17:112. [PMID: 27230763 PMCID: PMC4881296 DOI: 10.1186/s13059-016-0970-8] [Citation(s) in RCA: 72] [Impact Index Per Article: 9.0] [Received: 02/24/2016] [Accepted: 04/29/2016] [Indexed: 12/17/2022]
Abstract
Current approaches to single-cell transcriptomic analysis are computationally intensive and require assay-specific modeling, which limits their scope and generality. We propose a novel method that compares and clusters cells based on their transcript-compatibility read counts rather than on the transcript or gene quantifications used in standard analysis pipelines. In the reanalysis of two landmark yet disparate single-cell RNA-seq datasets, we show that our method is up to two orders of magnitude faster than previous approaches, provides accurate and in some cases improved results, and is directly applicable to data from a wide variety of assays.
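As a rough illustration of the idea in this abstract, the sketch below counts reads per transcript-compatibility class (the set of transcripts a read is compatible with) for each cell and compares cells directly on those counts, without estimating transcript abundances. The toy data, the function names, and the choice of Jensen-Shannon divergence as the dissimilarity are assumptions made for illustration; the abstract does not specify them.

```python
from collections import Counter
from math import log2


def tcc_vector(read_compat_sets):
    """Count reads per compatibility class (hypothetical helper).

    Each read is represented by the set of transcripts it is compatible with.
    """
    return Counter(frozenset(s) for s in read_compat_sets)


def normalize(counts):
    """Turn raw class counts into a probability distribution (needs >= 1 read)."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two sparse distributions."""
    div = 0.0
    for k in set(p) | set(q):
        pk, qk = p.get(k, 0.0), q.get(k, 0.0)
        mk = 0.5 * (pk + qk)  # mixture distribution at this class
        if pk > 0:
            div += 0.5 * pk * log2(pk / mk)
        if qk > 0:
            div += 0.5 * qk * log2(qk / mk)
    return div


# Toy input (assumed): per-cell reads, each given as a set of transcript IDs.
cells = {
    "cell_a": [{"t1"}, {"t1", "t2"}, {"t1", "t2"}, {"t3"}],
    "cell_b": [{"t1"}, {"t1", "t2"}, {"t1", "t2"}, {"t3"}],
    "cell_c": [{"t4"}, {"t4"}, {"t4", "t5"}, {"t5"}],
}
profiles = {name: normalize(tcc_vector(reads)) for name, reads in cells.items()}

# Pairwise dissimilarities; a clustering algorithm could consume this matrix.
names = sorted(profiles)
distances = {
    (a, b): js_divergence(profiles[a], profiles[b]) for a in names for b in names
}
```

In this toy input, cells with identical compatibility profiles are at divergence zero, and cells sharing no compatibility class are at the maximum base-2 divergence of one. The resulting pairwise matrix is the kind of input a standard clustering method would take.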
Affiliation(s)
- Vasilis Ntranos
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA
- Govinda M Kamath
  - Department of Electrical Engineering, Stanford University, Stanford, CA, USA
- Jesse M Zhang
  - Department of Electrical Engineering, Stanford University, Stanford, CA, USA
- Lior Pachter
  - Departments of Mathematics and Molecular and Cell Biology, University of California, Berkeley, CA, USA
- David N Tse
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA
  - Department of Electrical Engineering, Stanford University, Stanford, CA, USA
|
50
|
Lin Z, Li M, Sestan N, Zhao H. A Markov random field-based approach for joint estimation of differentially expressed genes in mouse transcriptome data. Stat Appl Genet Mol Biol 2016; 15:139-50. [PMID: 26926866 PMCID: PMC5587217 DOI: 10.1515/sagmb-2015-0070] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 11/15/2022]
Abstract
The statistical methodology developed in this study was motivated by our interest in studying neurodevelopment using the mouse brain RNA-Seq data set, where gene expression levels were measured in multiple layers of the somatosensory cortex across time in both female and male samples. We aim to identify differentially expressed genes between adjacent time points, which may provide insights into the dynamics of brain development. Because of the extremely small sample size (one male and one female at each time point), simple marginal analysis may be underpowered. We propose a Markov random field (MRF)-based approach that capitalizes on the similarity between layers, the temporal dependency, and the similarity between sexes. The model parameters are estimated by an efficient EM algorithm with a mean field-like approximation. Simulation results and real data analysis suggest that the proposed model has greater power to detect differentially expressed genes than simple marginal analysis. Our method also reveals biologically interesting results in the mouse brain RNA-Seq data set.
Affiliation(s)
- Zhixiang Lin
  - Department of Statistics, Stanford University, Stanford, CA 94305, USA
  - Computational Biology and Bioinformatics, Yale University, New Haven, CT 06511, USA
- Mingfeng Li
  - Department of Neurobiology, Kavli Institute for Neuroscience, Yale School of Medicine, 06510 New Haven, CT, USA
- Nenad Sestan
  - Department of Neurobiology, Kavli Institute for Neuroscience, Yale School of Medicine, 06510 New Haven, CT, USA
- Hongyu Zhao
  - Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut 06520, USA
  - Department of Genetics, Yale School of Medicine, New Haven, Connecticut 06520, USA
|