1
|
PaintorPipe: a pipeline for genetic variant fine-mapping using functional annotations. Bioinformatics Advances 2023; 4:vbad188. [PMID: 38213821 PMCID: PMC10783948 DOI: 10.1093/bioadv/vbad188] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/30/2023] [Revised: 12/07/2023] [Accepted: 12/19/2023] [Indexed: 01/13/2024]
Abstract
Motivation: Genome-wide association studies (GWAS) have identified thousands of genetic variants associated with common diseases. These results include a mix of causal and non-causal variants related through strong linkage disequilibrium (LD, i.e. high correlation between variants). Fine-mapping methods have been developed to distinguish causal from non-causal variants using GWAS results and LD information, assigning to each variant a probability of being causal. In this field, the PAINTOR program has become a standard, one of its advantages being its ability to take functional annotations into account. This approach requires many pre- and post-processing steps. Here, we developed a Nextflow pipeline called PaintorPipe that wraps all these steps and the fine-mapping itself together. PaintorPipe uses three independent sources of information (GWAS summary statistics, LD information and functional annotations) to rank the variants according to their likelihood of being involved in disease development. The PAINTOR framework is used to calculate the posterior probability that each variant (single nucleotide polymorphism) is causal (a.k.a. Bayesian fine-mapping). The resulting credible sets of variants are annotated with their biological functions and visualized using CANVIS. This pipeline requires minimal input from users (a GWAS summary statistics file and a set of functional annotation files) and is designed to be modular and customizable, allowing for easy integration of diverse functional annotations. Availability and implementation: PaintorPipe is implemented in the Nextflow pipeline-specific language, can be run locally or on a Slurm cluster, and handles containerization using Singularity. PaintorPipe is freely available on GitHub (https://github.com/sdjebali/PaintorPipe).
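To make the notion of a credible set concrete, the following minimal Python sketch ranks variants by posterior probability and keeps the smallest set reaching a target coverage. It is an illustration only, not PaintorPipe's or PAINTOR's code: the function name `credible_set`, the 95% default coverage, the variant identifiers and the normalization step (which assumes a single causal variant per locus) are all assumptions made for the example.

```python
def credible_set(posteriors, coverage=0.95):
    """Return the smallest list of variants whose normalized posterior mass reaches `coverage`.

    `posteriors` maps a variant ID to its posterior probability of being causal.
    Assumption for this sketch: one causal variant per locus, so the
    probabilities are rescaled to sum to 1 before accumulating.
    """
    total = sum(posteriors.values())
    if total <= 0:
        raise ValueError("posterior probabilities must sum to a positive value")
    # Rank variants from most to least likely causal.
    ranked = sorted(posteriors.items(), key=lambda item: item[1], reverse=True)
    selected, mass = [], 0.0
    for variant, prob in ranked:
        selected.append(variant)
        mass += prob / total
        if mass >= coverage:
            break
    return selected


# Hypothetical locus with four variants.
example = {"rs1": 0.60, "rs2": 0.30, "rs3": 0.07, "rs4": 0.03}
print(credible_set(example))
```

When a fine-mapping method allows several causal variants per locus, posterior probabilities need not sum to 1, and the credible set is usually defined differently; the normalization above would then not apply.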
|
2
|
TAGADA: a scalable pipeline to improve genome annotations with RNA-seq data. NAR Genom Bioinform 2023; 5:lqad089. [PMID: 37850035 PMCID: PMC10578202 DOI: 10.1093/nargab/lqad089] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/25/2023] [Revised: 08/11/2023] [Accepted: 09/19/2023] [Indexed: 10/19/2023] Open
Abstract
Genome annotation plays a crucial role in providing a comprehensive catalog of genes and transcripts for a particular species. As research projects generate new transcriptome data worldwide, integrating this information into existing annotations becomes essential. However, most bioinformatics pipelines are limited in their ability to effectively and consistently update annotations using new RNA-seq data. Here we introduce TAGADA, an RNA-seq pipeline for Transcripts And Genes Assembly, Deconvolution, and Analysis. Given a genomic sequence, a reference annotation and RNA-seq reads, TAGADA enhances existing gene models by generating an improved annotation. It also computes expression values for both the reference and novel annotations, identifies long non-coding transcripts (lncRNAs), and provides a comprehensive quality control report. Developed using Nextflow DSL2, TAGADA offers user-friendly functionalities and ensures reproducibility across different computing platforms through its containerized environment. In this study, we demonstrate the efficacy of TAGADA using RNA-seq data from the GENE-SWiTCH project alongside chicken and pig genome annotations as references. Results indicate that TAGADA can substantially increase the number of annotated transcripts by approximately [Formula: see text] in these species. Furthermore, we illustrate how TAGADA can integrate Illumina NovaSeq short reads with PacBio Iso-Seq long reads, showcasing its versatility. TAGADA is available at github.com/FAANG/analysis-TAGADA.
|
3
|
A Bos taurus sequencing methods benchmark for assembly, haplotyping, and variant calling. Sci Data 2023; 10:369. [PMID: 37291142 PMCID: PMC10250393 DOI: 10.1038/s41597-023-02249-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2022] [Accepted: 05/16/2023] [Indexed: 06/10/2023] Open
Abstract
Inspired by the production of reference data sets in the Genome in a Bottle project, we sequenced one Charolais heifer with different technologies: Illumina paired-end, Oxford Nanopore, Pacific Biosciences (HiFi and CLR), 10X Genomics linked-reads, and Hi-C. In order to generate haplotypic assemblies, we also sequenced both parents with short reads. From these data, we built two haplotyped trio high-quality reference genomes and a consensus assembly, using up-to-date software packages. The assemblies obtained using PacBio HiFi reach a size of 3.2 Gb, which is significantly larger than the 2.7 Gb ARS-UCD1.2 reference. The BUSCO score of the consensus assembly reaches a completeness of 95.8% among highly conserved mammal genes. We also identified 35,866 structural variants larger than 50 base pairs. This assembly is a contribution to the bovine pangenome for the "Charolais" breed. These datasets will prove to be useful resources enabling the community to gain additional insight into sequencing technologies for applications such as SNP, indel or structural variant calling, and de novo assembly.
|
4
|
Enhancer/gene relationships: Need for more reliable genome-wide reference sets. Frontiers in Bioinformatics 2023; 3:1092853. [PMID: 36909938 PMCID: PMC9999192 DOI: 10.3389/fbinf.2023.1092853] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2022] [Accepted: 02/07/2023] [Indexed: 02/26/2023] Open
Abstract
Differences in cells' functions arise from differential activity of regulatory elements, including enhancers. Enhancers are cis-regulatory elements that cooperate with promoters through transcription factors to activate the expression of one or several genes by getting physically close to them in the 3D space of the nucleus. There is increasing evidence that genetic variants associated with common diseases are enriched in enhancers active in cell types relevant to these diseases. Identifying the enhancers associated with genes and, conversely, the sets of genes activated by each enhancer (the so-called enhancer/gene or E/G relationships) across cell types can help in understanding the genetic mechanisms underlying human diseases. There are three broad approaches for the genome-wide identification of E/G relationships in a cell type: 1) genetic link methods or eQTL, 2) functional link methods based on 1D functional data such as open chromatin, histone marks or gene expression and 3) spatial link methods based on 3D data such as Hi-C. Since 1) and 3) are costly, the current strategy is to develop functional link methods and to use data from 1) and 3) as references to evaluate them. However, there is still no consensus on the best functional link method to date, and method comparisons remain rare. Here, we compared the relative performances of three recent methods for the identification of enhancer-gene links, TargetFinder, Average-Rank, and the ABC model, using the three latest benchmarks from the field: a reference that combines 3D and eQTL data, called BENGI, and two genetic screening references, called CRiFF and CRISPRi. Overall, none of the three methods performed best on all three references. The CRiFF and CRISPRi reference sets are likely more reliable, but CRiFF is not genome-wide, and CRiFF and CRISPRi are mostly available for the K562 cancer cell line. The BENGI reference set is genome-wide but likely contains many false positives. This study therefore calls for new reliable and genome-wide E/G reference data rather than new functional link E/G identification methods.
|
5
|
RNA-Seq Data for Reliable SNP Detection and Genotype Calling: Interest for Coding Variant Characterization and Cis-Regulation Analysis by Allele-Specific Expression in Livestock Species. Front Genet 2021; 12:655707. [PMID: 34262593 PMCID: PMC8273700 DOI: 10.3389/fgene.2021.655707] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 01/19/2021] [Accepted: 06/01/2021] [Indexed: 12/19/2022] Open
Abstract
In addition to their common use in studying gene expression, RNA-seq data accumulated over the last 10 years are a yet-unexploited resource of SNPs in numerous individuals from different populations. SNP detection by RNA-seq is particularly interesting for livestock species since whole genome sequencing is expensive and exome sequencing tools are unavailable. These SNPs detected in expressed regions can be used to characterize variants affecting protein functions, and to study cis-regulated genes by analyzing allele-specific expression (ASE) in the tissue of interest. However, gene expression can be highly variable, and filters for SNP detection using the popular GATK toolkit are not yet standardized, making SNP detection and genotype calling by RNA-seq a challenging endeavor. We compared SNP calling results using GATK suggested filters on two chicken populations for which both RNA-seq and DNA-seq data were available for the same samples of the same tissue. We showed, in expressed regions, an RNA-seq precision of 91% (SNPs detected by RNA-seq and shared by DNA-seq) and we characterized the remaining 9% of SNPs. We then studied the genotype (GT) obtained by RNA-seq and the impact of two factors (GT call rate and read number per GT) on the concordance of GT with DNA-seq; we proposed thresholds for them leading to a 95% concordance. Applying these thresholds to 767 multi-tissue RNA-seq samples of 382 birds from 11 chicken populations, we found 9.5 M SNPs in total, of which ∼550,000 SNPs per tissue and population had a reliable GT (call rate ≥ 50%) and, among them, ∼340,000 had a MAF ≥ 10%. We showed that such RNA-seq data from one tissue can be used to (i) detect SNPs with a strong predicted impact on proteins, despite their scarcity in each population (16,307 SIFT deleterious missenses and 590 stop-gained), (ii) study, on a large scale, cis-regulation of gene expression, with ∼81% of protein-coding and 68% of long non-coding genes (TPM ≥ 1) that can be analyzed for ASE, and with ∼29% of them that were cis-regulated, and (iii) analyze population genetics using such SNPs located in expressed regions. This work shows that RNA-seq data can be used with good confidence to detect SNPs and associated GT within various populations and to use them for different analyses, as in GTEx studies.
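The two evaluation measures named in this abstract, precision of RNA-seq SNPs against DNA-seq and genotype concordance, can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names, the SNP key `(chromosome, position, ref, alt)` and the toy genotypes are assumptions, and real data would first pass through the GATK filters and thresholds the abstract mentions.

```python
def snp_precision(rna_snps, dna_snps):
    """Fraction of RNA-seq SNPs that are also detected by DNA-seq."""
    rna = set(rna_snps)
    if not rna:
        raise ValueError("no RNA-seq SNPs to evaluate")
    return len(rna & set(dna_snps)) / len(rna)


def genotype_concordance(rna_genotypes, dna_genotypes):
    """Fraction of identical genotype calls among (SNP, sample) keys present in both dicts."""
    shared = [key for key in rna_genotypes if key in dna_genotypes]
    if not shared:
        raise ValueError("no genotype calls shared between RNA-seq and DNA-seq")
    matches = sum(1 for key in shared if rna_genotypes[key] == dna_genotypes[key])
    return matches / len(shared)


# Hypothetical SNPs keyed by (chromosome, position, ref, alt).
rna = {("1", 100, "A", "G"), ("1", 250, "C", "T"), ("2", 40, "G", "A"), ("3", 7, "T", "C")}
dna = {("1", 100, "A", "G"), ("1", 250, "C", "T"), ("2", 40, "G", "A"), ("4", 9, "A", "C")}
print(snp_precision(rna, dna))

# Hypothetical genotype calls keyed by (SNP, sample).
rna_gt = {(("1", 100, "A", "G"), "bird1"): "0/1", (("1", 250, "C", "T"), "bird1"): "1/1"}
dna_gt = {(("1", 100, "A", "G"), "bird1"): "0/1", (("1", 250, "C", "T"), "bird1"): "0/1"}
print(genotype_concordance(rna_gt, dna_gt))
```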
|
6
|
Author Correction: An integrative atlas of chicken long non-coding genes and their annotations across 25 tissues. Sci Rep 2021; 11:9463. [PMID: 33911173 PMCID: PMC8080728 DOI: 10.1038/s41598-021-89158-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022] Open
|
7
|
A limited set of transcriptional programs define major cell types. Genome Res 2020; 30:1047-1059. [PMID: 32759341 PMCID: PMC7397875 DOI: 10.1101/gr.263186.120] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 03/10/2020] [Accepted: 04/29/2020] [Indexed: 12/12/2022]
Abstract
We have produced RNA sequencing data for 53 primary cells from different locations in the human body. The clustering of these primary cells reveals that most cells in the human body share a few broad transcriptional programs, which define five major cell types: epithelial, endothelial, mesenchymal, neural, and blood cells. These act as basic components of many tissues and organs. Based on gene expression, these cell types redefine the basic histological types by which tissues have been traditionally classified. We identified genes whose expression is specific to these cell types, and from these genes, we estimated the contribution of the major cell types to the composition of human tissues. We found this cellular composition to be a characteristic signature of tissues and to reflect tissue morphological heterogeneity and histology. We identified changes in cellular composition in different tissues associated with age and sex, and found that departures from the normal cellular composition correlate with histological phenotypes associated with disease.
|
8
|
Multi-species annotation of transcriptome and chromatin structure in domesticated animals. BMC Biol 2019; 17:108. [PMID: 31884969 PMCID: PMC6936065 DOI: 10.1186/s12915-019-0726-5] [Citation(s) in RCA: 58] [Impact Index Per Article: 11.6] [Received: 10/21/2019] [Accepted: 11/19/2019] [Indexed: 01/10/2023] Open
Abstract
BACKGROUND Comparative genomics studies are central in identifying the coding and non-coding elements associated with complex traits, and the functional annotation of genomes is a critical step to decipher the genotype-to-phenotype relationships in livestock animals. As part of the Functional Annotation of Animal Genomes (FAANG) action, the FR-AgENCODE project aimed to create reference functional maps of domesticated animals by profiling the landscape of transcription (RNA-seq), chromatin accessibility (ATAC-seq) and conformation (Hi-C) in species representing ruminants (cattle, goat), monogastrics (pig) and birds (chicken), using three target samples related to metabolism (liver) and immunity (CD4+ and CD8+ T cells). RESULTS RNA-seq assays considerably extended the available catalog of annotated transcripts and identified differentially expressed genes with unknown function, including new syntenic lncRNAs. ATAC-seq highlighted an enrichment for transcription factor binding sites in differentially accessible regions of the chromatin. Comparative analyses revealed a core set of conserved regulatory regions across species. Topologically associating domains (TADs) and epigenetic A/B compartments annotated from Hi-C data were consistent with RNA-seq and ATAC-seq data. Multi-species comparisons showed that conserved TAD boundaries had stronger insulation properties than species-specific ones and that the genomic distribution of orthologous genes in A/B compartments was significantly conserved across species. CONCLUSIONS We report the first multi-species and multi-assay genome annotation results obtained by a FAANG project. Beyond the generation of reference annotations and the confirmation of previous findings on model animals, the integrative analysis of data from multiple assays and species sheds a new light on the multi-scale selective pressure shaping genome organization from birds to mammals. 
Overall, these results emphasize the value of FAANG for research on domesticated animals and reinforce the importance of future meta-analyses of the reference datasets being generated by this community on different species.
|
9
|
32 Functional annotation of livestock genomes: chromatin structure and regulation of gene expression. J Anim Sci 2019. [DOI: 10.1093/jas/skz122.028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/14/2022] Open
Abstract
Improving the functional annotation of animal genomes is a key challenge in bridging the gap between genotype and phenotype, thus enabling predictive biology. Regarding livestock production, major outcomes are expected from a better understanding of the genetic architecture underlying quantitative traits. As part of the Functional Annotation of ANimal Genomes action (FAANG: www.faang.org), the FR-AgENCODE project generated omics data to improve the reference annotation of the cattle, pig, goat and chicken genomes. High-throughput molecular assays have been performed on tissues/cells relevant to immune and metabolic traits. From two males and two females per species (pig, cattle, goat, chicken), strand-oriented RNA-seq gene expression and ATAC-seq chromatin accessibility assays were performed on liver and two PBMC-sorted T-cell types (CD4+ and CD8+). Chromosome Conformation Capture (in situ Hi-C) was also carried out on liver samples. About 4,000 samples have been collected at the INRA biorepository and registered at the EBI BioSamples registry. More than 80% of the planned experiments could be completed, generating ~11.5 billion sequencing reads over the 3 assays. While most (50–80%) RNA-seq reads mapped to annotated exons, thousands of novel transcripts were found, with ~60K mRNAs and ~22K lncRNAs in cattle. Differentially expressed genes between cell types were enriched for immunity- or metabolism-related terms, and differentially accessible chromatin regions were identified as potential regulatory sites. Interestingly, correlations between gene expression and promoter accessibility across samples were skewed towards both positive and negative values, suggesting distinct regulatory mechanisms of gene expression. These patterns have been further investigated using human data from the Epigenome Roadmap Mapping Consortium. 
Altogether, this study illustrates the interest of a coordinated effort to tackle the genome-to-phenome challenge and provides a useful resource to the community. Availability: www.fragencode.org.
|
10
|
Analysis of pig transcriptomes suggests a global regulation mechanism enabling temporary bursts of circular RNAs. RNA Biol 2019; 16:1190-1204. [PMID: 31120323 DOI: 10.1080/15476286.2019.1621621] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 10/26/2022] Open
Abstract
To investigate the dynamics of circRNA expression in pig testes, we designed specific strategies to individually study circRNA production from intron lariats and circRNAs originating from back-splicing of two exons. By applying these methods on seven Total-RNA-seq datasets sampled during the testicular puberty, we detected 126 introns in 114 genes able to produce circRNAs and 5,236 exonic circRNAs produced by 2,516 genes. Comparing our RNA-seq datasets to datasets from the literature (embryonic cortex and postnatal muscle stages) revealed highly abundant intronic and exonic circRNAs in one sample each in pubertal testis and embryonic cortex, respectively. This abundance was due to higher production of circRNA by the same genes in comparison to other testis samples, rather than to the recruitment of new genes. No global relationship between circRNA and mRNA production was found. We propose ExoCirc-9244 (SMARCA5) as a marker of a particular stage in testis, which is characterized by a very low plasma estradiol level and a high abundance of circRNA in testis. We hypothesize that the abundance of testicular circRNA is associated with an abrupt switch of the cellular process to overcome a particular challenge that may have arisen in the early stages of steroid production. We also hypothesize that, in certain circumstances, isoforms and circular transcripts from different genes share functions and that a global regulation of circRNA production is established. Our data indicate that this massive production of circRNAs is much more related to the structure of the genes generating circRNAs than to their function. 
Abbreviations: PE: Paired Ends; CR: chimeric Read; SR: Split Read; circRNA: circular RNA; NC: non conventional; ExoCirc-RNA: exonic circular RNA; IntroLCirc-: name of a porcine intronic lariat circRNA; ExoCirc-: name of a porcine exonic circRNA; IntronCircle-: name of a porcine intron circle; sisRNA: stable intronic sequence RNA; P: porcine breed Pietrain; LW: porcine breed Large White; RT: reverse transcription/reverse transcriptase; Total-RNA-seq: RNA-seq obtained from total RNA after ribosomal depletion; mRNA-seq: RNA-seq of poly(A) transcripts; TPM: transcripts per million; CR-PM: chimeric reads per million; RBP: RNA binding protein; miRNA: micro RNA; E2: estradiol; DHT: dihydrotestosterone.
|
11
|
Long noncoding RNA repertoire in chicken liver and adipose tissue. Genet Sel Evol 2017; 49:6. [PMID: 28073357 PMCID: PMC5225574 DOI: 10.1186/s12711-016-0275-0] [Citation(s) in RCA: 51] [Impact Index Per Article: 7.3] [Received: 07/03/2016] [Accepted: 11/29/2016] [Indexed: 01/05/2023] Open
Abstract
BACKGROUND Improving functional annotation of the chicken genome is a key challenge in bridging the gap between genotype and phenotype. Among all transcribed regions, long noncoding RNAs (lncRNAs) are a major component of the transcriptome and its regulation, and whole-transcriptome sequencing (RNA-Seq) has greatly improved their identification and characterization. We performed an extensive profiling of the lncRNA transcriptome in the chicken liver and adipose tissue by RNA-Seq. We focused on these two tissues because of their importance in various economical traits for which energy storage and mobilization play key roles and also because of their high cell homogeneity. To predict lncRNAs, we used a recently developed tool called FEELnc, which also classifies them with respect to their distance and strand orientation to the closest protein-coding genes. Moreover, to confidently identify the genes/transcripts expressed in each tissue (a complex task for weakly expressed molecules such as lncRNAs), we probed a particularly large number of biological replicates (16 per tissue) compared to common multi-tissue studies with a larger set of tissues but less sampling. RESULTS We predicted 2193 lncRNA genes, among which 1670 were robustly expressed across replicates in the liver and/or adipose tissue and which were classified into 1493 intergenic and 177 intragenic lncRNAs located between and within protein-coding genes, respectively. We observed similar structural features between chickens and mammals, with strong synteny conservation but without sequence conservation. As previously reported, we confirm that lncRNAs have a lower and more tissue-specific expression than mRNAs. Finally, we showed that adjacent lncRNA-mRNA genes in divergent orientation have a higher co-expression level when separated by less than 1 kb compared to more distant divergent pairs. 
Among these, we highlighted for the first time a novel lncRNA candidate involved in lipid metabolism, lnc_DHCR24, which is highly correlated with the DHCR24 gene that encodes a key enzyme of cholesterol biosynthesis. CONCLUSIONS We provide a comprehensive lncRNA repertoire in the chicken liver and adipose tissue, which shows interesting patterns of co-expression between mRNAs and lncRNAs. It contributes to improving the structural and functional annotation of the chicken genome and provides a basis for further studies on energy storage and mobilization traits in the chicken.
|
12
|
ChimPipe: accurate detection of fusion genes and transcription-induced chimeras from RNA-seq data. BMC Genomics 2017; 18:7. [PMID: 28049418 PMCID: PMC5209911 DOI: 10.1186/s12864-016-3404-9] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 08/18/2016] [Accepted: 12/09/2016] [Indexed: 11/28/2022] Open
Abstract
Background Chimeric transcripts are commonly defined as transcripts linking two or more different genes in the genome, and can be explained by various biological mechanisms such as genomic rearrangement, read-through or trans-splicing, but also by technical or biological artefacts. Several studies have shown their importance in cancer, cell pluripotency and motility. Many programs have recently been developed to identify chimeras from Illumina RNA-seq data (mostly fusion genes in cancer). However, outputs of different programs on the same dataset can be widely inconsistent, and tend to include many false positives. Other issues relate to simulated datasets restricted to fusion genes, real datasets with limited numbers of validated cases, result inconsistencies between simulated and real datasets, and gene-level rather than junction-level assessment. Results Here we present ChimPipe, a modular and easy-to-use method to reliably identify fusion genes and transcription-induced chimeras from paired-end Illumina RNA-seq data. We have also produced realistic simulated datasets for three different read lengths, and enhanced two gold-standard cancer datasets by associating exact junction points to validated gene fusions. Benchmarking ChimPipe together with four other state-of-the-art tools on these data showed ChimPipe to be the top program at identifying exact junction coordinates for both kinds of datasets, and the one showing the best trade-off between sensitivity and precision. Applied to 106 ENCODE human RNA-seq datasets, ChimPipe identified 137 high-confidence chimeras connecting the protein coding sequence of their parent genes. In subsequent experiments, three out of four predicted chimeras, two of which were recurrently expressed in a large majority of the samples, could be validated. Cloning and sequencing of the three cases revealed several new chimeric transcript structures, three of which had the potential to encode a chimeric protein for which we hypothesized a new role. Applying ChimPipe to human and mouse ENCODE RNA-seq data led to the identification of 131 recurrent chimeras common to both species, and therefore potentially conserved. Conclusions ChimPipe combines discordant paired-end reads and split-reads to detect any kind of chimera, including those originating from polymerase read-through, and shows an excellent trade-off between sensitivity and precision. The chimeras found by ChimPipe can be validated in vitro with high accuracy.
|
13
|
Erratum to: Bioinformatics Pipeline for Transcriptome Sequencing Analysis. Methods Mol Biol 2017; 1468:E1. [PMID: 27943167 DOI: 10.1007/978-1-4939-4035-6_17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/06/2023]
|
14
|
Abstract
The development of High Throughput Sequencing (HTS) for RNA profiling (RNA-seq) has shed light on the diversity of transcriptomes. While RNA-seq is becoming a de facto standard for monitoring the population of expressed transcripts in a given condition at a specific time, processing the huge amount of data it generates requires dedicated bioinformatics programs. Here, we describe a standard bioinformatics protocol using state-of-the-art tools, the STAR mapper to align reads onto a reference genome, Cufflinks to reconstruct the transcriptome, and RSEM to quantify expression levels of genes and transcripts. We present the workflow using human transcriptome sequencing data from two biological replicates of the K562 cell line produced as part of the ENCODE3 project.
|
15
|
0414 Recurrent chimeric transcripts in human and mouse. J Anim Sci 2016. [DOI: 10.2527/jam2016-0414] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] Open
|
16
|
Erratum to: A benchmark for RNA-seq quantification pipelines. Genome Biol 2016; 17:203. [PMID: 27716375 PMCID: PMC5045616 DOI: 10.1186/s13059-016-1060-7] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 09/12/2016] [Accepted: 09/12/2016] [Indexed: 11/30/2022] Open
|
17
|
P3038 Chromatin accessibility in the liver and circulating immune cells of pigs, goats and chickens. J Anim Sci 2016. [DOI: 10.2527/jas2016.94supplement471a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] Open
|
18
|
S0103 Recurrent chimeric transcripts in human and mouse. J Anim Sci 2016. [DOI: 10.2527/jas2016.94supplement43x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] Open
|
19
|
Abstract
BACKGROUND A comparison of transcriptional profiles derived from different tissues in a given species or among different species assumes that commonalities reflect evolutionarily conserved programs and that differences reflect species or tissue responses to environmental conditions or developmental program staging. Apparently conflicting results have been published regarding whether organ-specific transcriptional patterns dominate over species-specific patterns, or vice versa, making it unclear to what extent the biology of a given organism can be extrapolated to another. These studies have in common that they treat the transcriptomes monolithically, implicitly ignoring that each gene is likely to have a specific pattern of transcriptional variation across organs and species. RESULTS We use linear models to quantify this pattern. We find a continuum in the spectrum of expression variation: the expression of some genes varies considerably across species and little across organs, and simply reflects evolutionary distance. At the other extreme are genes whose expression varies considerably across organs and little across species; these genes are much more likely to be associated with diseases than are genes whose expression varies predominantly across species. CONCLUSIONS Whether transcriptomes, when considered globally, cluster preferentially according to one component or the other may not be a property of the transcriptomes, but rather a consequence of the dominant behavior of a subset of genes. Therefore, the values of the components of the variance of expression for each gene could become a useful resource when planning, interpreting, and extrapolating experimental data from mouse to humans.
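As an illustration of what a per-gene variance decomposition could look like, the sketch below splits the variance of one gene's expression into an organ component, a species component and a residual. The abstract does not specify the linear model used, so this is an assumption-laden stand-in: it treats the data as a balanced organ-by-species table with one value per cell and applies a classical two-way sum-of-squares decomposition. The function name `variance_fractions` and the toy values are hypothetical.

```python
def variance_fractions(expr):
    """Split the variance of one gene into organ, species and residual fractions.

    `expr` is a list of rows, one row per organ and one column per species
    (balanced design, one expression value per organ/species pair).
    """
    n_organs, n_species = len(expr), len(expr[0])
    values = [v for row in expr for v in row]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    if ss_total == 0:
        return {"organ": 0.0, "species": 0.0, "residual": 0.0}
    organ_means = [sum(row) / n_species for row in expr]
    species_means = [sum(row[j] for row in expr) / n_organs for j in range(n_species)]
    # Sum of squares explained by each factor in the additive model.
    ss_organ = n_species * sum((m - grand_mean) ** 2 for m in organ_means)
    ss_species = n_organs * sum((m - grand_mean) ** 2 for m in species_means)
    ss_residual = max(ss_total - ss_organ - ss_species, 0.0)
    return {
        "organ": ss_organ / ss_total,
        "species": ss_species / ss_total,
        "residual": ss_residual / ss_total,
    }


# Hypothetical gene whose expression differs between organs but not between species.
organ_driven = [[1.0, 1.0], [5.0, 5.0]]
print(variance_fractions(organ_driven))
```

A gene at the other end of the continuum described above would show the opposite pattern, with most of its variance attributed to the species component.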
|
20
|
Erratum to: A benchmark for RNA-seq quantification pipelines. Genome Biol 2016; 17:107. [PMID: 27215799 PMCID: PMC4877800 DOI: 10.1186/s13059-016-0986-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 05/13/2016] [Accepted: 05/13/2016] [Indexed: 11/10/2022] Open
|
21
|
Abstract
Obtaining RNA-seq measurements involves a complex data analytical process with a large number of competing algorithms as options. There is much debate about which of these methods provides the best approach. Unfortunately, it is currently difficult to evaluate their performance, due in part to a lack of sensitive assessment metrics. We present a series of statistical summaries and plots to evaluate the performance in terms of specificity and sensitivity, available as an R/Bioconductor package (http://bioconductor.org/packages/rnaseqcomp). Using two independent datasets, we assessed seven competing pipelines. Performance was generally poor, with two methods clearly underperforming and RSEM slightly outperforming the rest.
|
22
|
Abstract
Transcriptional regulation and posttranscriptional processing underlie many cellular and organismal phenotypes. We used RNA sequence data generated by the Genotype-Tissue Expression (GTEx) project to investigate the patterns of transcriptome variation across individuals and tissues. Tissues exhibit characteristic transcriptional signatures that show stability in postmortem samples. These signatures are dominated by a relatively small number of genes (most clearly seen in blood), though few are exclusive to a particular tissue, and vary more across tissues than across individuals. Genes exhibiting high interindividual expression variation include disease candidates associated with sex, ethnicity, and age. Primary transcription is the major driver of cellular specificity, with splicing playing mostly a complementary role, except in the brain, which exhibits a more divergent splicing program. In contrast, variation in splicing, despite its stochasticity, may play a comparatively greater role in defining individual phenotypes.
|
23
|
Predicting pathogen-specific CD8 T cell immune responses from a modeling approach. J Theor Biol 2015; 374:66-82. [PMID: 25846273 DOI: 10.1016/j.jtbi.2015.03.033] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 07/30/2014] [Revised: 03/04/2015] [Accepted: 03/09/2015] [Indexed: 12/21/2022]
Abstract
The primary CD8 T cell immune response constitutes a major mechanism to fight an infection by intra-cellular pathogens. We aim to assess whether pathogen-specific dynamical parameters of the CD8 T cell response can be identified, based on measurements of CD8 T cell counts, using a modeling approach. We generated experimental data consisting of CD8 T cell count kinetics during the response to three different live intra-cellular pathogens: two viruses (influenza, vaccinia) injected intranasally, and one bacterium (Listeria monocytogenes) injected intravenously. All pathogens harbor the same antigen (NP68), but differ in their interaction with the host. In parallel, we developed a mathematical model describing the evolution of CD8 T cell counts and pathogen amount during an immune response. This model is characterized by 9 parameters and includes relevant feedback controls. The model outputs were compared with the three data series and an exhaustive estimation of the parameter values was performed. By focusing on the ability of the model to fit experimental data and to produce a CD8 T cell population mainly composed of memory cells at the end of the response, critical parameters were identified. We show that a small number of parameters (2-4) define the main features of the CD8 T cell immune response and are characteristic of a given pathogen. Among these parameters, two are related to the effector CD8 T cell mediated control of cell and pathogen death. The parameter associated with memory cell death is shown to play no relevant role during the main phases of the CD8 T cell response, yet it becomes essential when looking at the predictions of the model several months after the infection.
|
24
|
A comparative encyclopedia of DNA elements in the mouse genome. Nature 2015; 515:355-64. [PMID: 25409824 PMCID: PMC4266106 DOI: 10.1038/nature13992] [Citation(s) in RCA: 1135] [Impact Index Per Article: 126.1] [Received: 02/03/2014] [Accepted: 10/24/2014] [Indexed: 12/11/2022]
Abstract
The laboratory mouse shares the majority of its protein-coding genes with humans, making it the premier model organism in biomedical research, yet the two mammals differ in significant ways. To gain greater insights into both shared and species-specific transcriptional and cellular regulatory programs in the mouse, the Mouse ENCODE Consortium has mapped transcription, DNase I hypersensitivity, transcription factor binding, chromatin modifications and replication domains throughout the mouse genome in diverse cell and tissue types. By comparing with the human genome, we not only confirm substantial conservation in the newly annotated potential functional sequences, but also find a large degree of divergence of sequences involved in transcriptional regulation, chromatin state and higher order chromatin organization. Our results illuminate the wide range of evolutionary forces acting on genes and their regulatory regions, and provide a general resource for research into mammalian biology and mechanisms of human diseases.
|
25
|
Transcriptional Signatures of Viral Control in HIV-1 Infected South African Women. AIDS Res Hum Retroviruses 2014. [DOI: 10.1089/aid.2014.5116a.abstract] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]
|
26
|
Abstract
The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features. Specifically, we discover co-expression modules shared across animals, many of which are enriched in developmental genes. Moreover, we use expression patterns to align the stages in worm and fly development and find a novel pairing between worm embryo and fly pupae, in addition to the embryo-to-embryo and larvae-to-larvae pairings. Furthermore, we find that the extent of non-canonical, non-coding transcription is similar in each organism, per base pair. Finally, we find in all three organisms that the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a 'universal model' based on a single set of organism-independent parameters.
|
27
|
Transcriptome characterization by RNA sequencing identifies a major molecular and clinical subdivision in chronic lymphocytic leukemia. Genome Res 2013; 24:212-26. [PMID: 24265505 PMCID: PMC3912412 DOI: 10.1101/gr.152132.112] [Citation(s) in RCA: 153] [Impact Index Per Article: 13.9] [Indexed: 12/12/2022]
Abstract
Chronic lymphocytic leukemia (CLL) has heterogeneous clinical and biological behavior. Whole-genome and -exome sequencing has contributed to the characterization of the mutational spectrum of the disease, but the underlying transcriptional profile is still poorly understood. We have performed deep RNA sequencing in different subpopulations of normal B-lymphocytes and CLL cells from a cohort of 98 patients, and characterized the CLL transcriptional landscape with unprecedented resolution. We detected thousands of transcriptional elements differentially expressed between the CLL and normal B cells, including protein-coding genes, noncoding RNAs, and pseudogenes. Transposable elements are globally derepressed in CLL cells. In addition, two thousand genes—most of which are not differentially expressed—exhibit CLL-specific splicing patterns. Genes involved in metabolic pathways showed higher expression in CLL, while genes related to spliceosome, proteasome, and ribosome were among the most down-regulated in CLL. Clustering of the CLL samples according to RNA-seq derived gene expression levels unveiled two robust molecular subgroups, C1 and C2. C1/C2 subgroups and the mutational status of the immunoglobulin heavy variable (IGHV) region were the only independent variables in predicting time to treatment in a multivariate analysis with main clinico-biological features. This subdivision was validated in an independent cohort of patients monitored through DNA microarrays. Further analysis shows that B-cell receptor (BCR) activation in the microenvironment of the lymph node may be at the origin of the C1/C2 differences.
|
28
|
Unravelling the hidden DNA structural/physical code provides novel insights on promoter location. Nucleic Acids Res 2013; 41:7220-30. [PMID: 23761436 PMCID: PMC3753636 DOI: 10.1093/nar/gkt511] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Indexed: 11/23/2022]
Abstract
Although protein recognition of DNA motifs in promoter regions has traditionally been considered a critical regulatory element in transcription, locating promoters, and in particular transcription start sites (TSSs), remains a challenge. Here we perform a comprehensive analysis of putative core promoter sequences relative to non-annotated predicted TSSs along the human genome, which were defined by distinct DNA physical properties implemented in our ProStar computational algorithm. A representative sampling of predicted regions was subjected to extensive experimental validation and analyses. Interestingly, the vast majority proved to be transcriptionally active despite the lack of specific sequence motifs, indicating that physical signaling is indeed able to detect promoter activity beyond conventional TSS prediction methods. Furthermore, highly active regions displayed typical chromatin features associated with promoters of housekeeping genes. Our results make it possible to redefine the promoter signatures and to analyze the diversity, evolutionary conservation and dynamic regulation of human core promoters at large scale. Moreover, the present study strongly supports the hypothesis of an ancient regulatory mechanism encoded by the intrinsic physical properties of the DNA that may contribute to the complexity of transcription regulation in the human genome.
|
29
|
The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome Res 2013; 22:1775-89. [PMID: 22955988 PMCID: PMC3431493 DOI: 10.1101/gr.132159.111] [Citation(s) in RCA: 3740] [Impact Index Per Article: 340.0] [Indexed: 11/24/2022]
Abstract
The human genome contains many thousands of long noncoding RNAs (lncRNAs). While several studies have demonstrated compelling biological and disease roles for individual examples, analytical and experimental approaches to investigate these genes have been hampered by the lack of comprehensive lncRNA annotation. Here, we present and analyze the most complete human lncRNA annotation to date, produced by the GENCODE consortium within the framework of the ENCODE project and comprising 9277 manually annotated genes producing 14,880 transcripts. Our analyses indicate that lncRNAs are generated through pathways similar to those of protein-coding genes, with similar histone-modification profiles, splicing signals, and exon/intron lengths. In contrast to protein-coding genes, however, lncRNAs display a striking bias toward two-exon transcripts, they are predominantly localized in the chromatin and nucleus, and a fraction appear to be preferentially processed into small RNAs. They are under stronger selective pressure than neutrally evolving sequences, particularly in their promoter regions, which display levels of selection comparable to protein-coding genes. Importantly, about one-third seem to have arisen within the primate lineage. Comprehensive analysis of their expression in multiple human organs and brain regions shows that lncRNAs are generally expressed at lower levels than protein-coding genes, and display more tissue-specific expression patterns, with a large fraction of tissue-specific lncRNAs expressed in the brain. Expression correlation analysis indicates that lncRNAs show particularly striking positive correlation with the expression of antisense coding genes. This GENCODE annotation represents a valuable resource for future studies of lncRNAs.
|
30
|
Understanding transcriptional regulation by integrative analysis of transcription factor binding data. Genome Res 2013; 22:1658-67. [PMID: 22955978 PMCID: PMC3431483 DOI: 10.1101/gr.136838.111] [Citation(s) in RCA: 138] [Impact Index Per Article: 12.5] [Indexed: 12/31/2022]
Abstract
Statistical models have been used to quantify the relationship between gene expression and transcription factor (TF) binding signals. Here we apply the models to the large-scale data generated by the ENCODE project to study transcriptional regulation by TFs. Our results reveal a notable difference in the prediction accuracy of expression levels of transcription start sites (TSSs) captured by different technologies and RNA extraction protocols. In general, the expression levels of TSSs with high CpG content are more predictable than those with low CpG content. For genes with alternative TSSs, the expression levels of downstream TSSs are more predictable than those of the upstream ones. Different TF categories and specific TFs vary substantially in their contributions to predicting expression. Between two cell lines, the differential expression of TSS can be precisely reflected by the difference of TF-binding signals in a quantitative manner, arguing against the conventional on-and-off model of TF binding. Finally, we explore the relationships between TF-binding signals and other chromatin features such as histone modifications and DNase hypersensitivity for determining expression. The models imply that these features regulate transcription in a highly coordinated manner.
|
31
|
Abstract
Eukaryotic cells make many types of primary and processed RNAs that are found either in specific subcellular compartments or throughout the cells. A complete catalogue of these RNAs is not yet available and their characteristic subcellular localizations are also poorly understood. Because RNA represents the direct output of the genetic information encoded by genomes and a significant proportion of a cell's regulatory capabilities are focused on its synthesis, processing, transport, modification and translation, the generation of such a catalogue is crucial for understanding genome function. Here we report evidence that three-quarters of the human genome is capable of being transcribed, as well as observations about the range and levels of expression, localization, processing fates, regulatory regions and modifications of almost all currently annotated and thousands of previously unannotated RNAs. These observations, taken together, prompt a redefinition of the concept of a gene.
|
32
|
Abstract
To complement the human Encyclopedia of DNA Elements (ENCODE) project and to enable a broad range of mouse genomics efforts, the Mouse ENCODE Consortium is applying the same experimental pipelines developed for human ENCODE to annotate the mouse genome.
|
33
|
Modeling gene expression using chromatin features in various cellular contexts. Genome Biol 2012; 13:R53. [PMID: 22950368 PMCID: PMC3491397 DOI: 10.1186/gb-2012-13-9-r53] [Citation(s) in RCA: 175] [Impact Index Per Article: 14.6] [Received: 03/29/2012] [Revised: 06/13/2012] [Accepted: 06/19/2012] [Indexed: 01/01/2023]
Abstract
BACKGROUND Previous work has demonstrated that chromatin feature levels correlate with gene expression. The ENCODE project enables us to further explore this relationship using an unprecedented volume of data. Expression levels from more than 100,000 promoters were measured using a variety of high-throughput techniques applied to RNA extracted by different protocols from different cellular compartments of several human cell lines. ENCODE also generated the genome-wide mapping of eleven histone marks, one histone variant, and DNase I hypersensitivity sites in seven cell lines. RESULTS We built a novel quantitative model to study the relationship between chromatin features and expression levels. Our study not only confirms that the general relationships found in previous studies hold across various cell lines, but also makes new suggestions about the relationship between chromatin features and gene expression levels. We found that expression status and expression levels can be predicted by different groups of chromatin features, both with high accuracy. We also found that expression levels measured by CAGE are better predicted than by RNA-PET or RNA-Seq, and different categories of chromatin features are the most predictive of expression for different RNA measurement methods. Additionally, PolyA+ RNA is overall more predictable than PolyA- RNA among different cell compartments, and PolyA+ cytosolic RNA measured with RNA-Seq is more predictable than PolyA+ nuclear RNA, while the opposite is true for PolyA- RNA. CONCLUSIONS Our study provides new insights into transcriptional regulation by analyzing chromatin features in different cellular contexts.
|
34
|
Evidence for transcript networks composed of chimeric RNAs in human cells. PLoS One 2012; 7:e28213. [PMID: 22238572 PMCID: PMC3251577 DOI: 10.1371/journal.pone.0028213] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.4] [Received: 07/06/2011] [Accepted: 11/03/2011] [Indexed: 12/03/2022]
Abstract
The classic organization of a gene structure has followed the Jacob and Monod bacterial gene model proposed more than 50 years ago. Since then, empirical determinations of the complexity of the transcriptomes found in organisms from yeast to human have blurred the definition and physical boundaries of genes. Using multiple analysis approaches we have characterized individual gene boundaries mapping on human chromosomes 21 and 22. Analyses of the locations of the 5′ and 3′ transcriptional termini of 492 protein coding genes revealed that for 85% of these genes the boundaries extend beyond the current annotated termini, most often connecting with exons of transcripts from other well annotated genes. The biological and evolutionary importance of these chimeric transcripts is underscored by (1) the non-random interconnections of genes involved, (2) the greater phylogenetic depth of the genes involved in many chimeric interactions, (3) the coordination of the expression of connected genes and (4) the close in vivo and three-dimensional proximity of the genomic regions being transcribed and contributing to parts of the chimeric RNAs. The non-random nature of the connection of the genes involved suggests that chimeric transcripts should not be studied in isolation, but together, as an RNA network.
|
35
|
Efficient targeted transcript discovery via array-based normalization of RACE libraries. Nat Methods 2008; 5:629-35. [PMID: 18500348 PMCID: PMC2713501 DOI: 10.1038/nmeth.1216] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.3] [Received: 03/12/2008] [Accepted: 04/24/2008] [Indexed: 11/09/2022]
Abstract
RACE (Rapid Amplification of cDNA Ends) is a widely used approach for transcript identification. Random clone selection from the RACE mixture, however, is an ineffective sampling strategy if the dynamic range of transcript abundances is large. Here, we describe a strategy that uses array hybridization to improve sampling efficiency of human transcripts. The products of the RACE reaction are hybridized onto tiling arrays, and the exons detected are used to delineate a series of RT-PCR reactions, through which the original RACE mixture is segregated into simpler RT-PCR reactions. These are independently cloned, and randomly selected clones are sequenced. This approach is superior to direct cloning and sequencing of RACE products: it specifically targets novel transcripts, and often results in overall normalization of transcript abundances. We show theoretically and experimentally that this strategy leads indeed to efficient sampling of novel transcripts, and we investigate multiplexing it by pooling RACE reactions from multiple interrogated loci prior to hybridization.
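The sampling argument in this abstract can be illustrated with a short calculation. It is a sketch under simplifying assumptions, not the paper's analysis: clones are picked independently, and each pick yields a given transcript with probability equal to that transcript's fraction of the mixture. The function name and the example numbers are hypothetical.

```python
def detection_probability(fraction, n_clones):
    """Probability of seeing a transcript at least once among n_clones picks.

    fraction: share of the mixture made up by this transcript (0..1).
    Assumes independent picks, i.e. 1 - (1 - fraction) ** n_clones.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie between 0 and 1")
    return 1.0 - (1.0 - fraction) ** n_clones


# Hypothetical scenario: a rare transcript at 0.1% of the raw mixture,
# versus the same transcript at 10% of a simpler, normalized sub-mixture.
raw = detection_probability(0.001, 96)
normalized = detection_probability(0.10, 96)
print(f"raw mixture: {raw:.3f}, normalized sub-mixture: {normalized:.5f}")
```

With a large dynamic range, most randomly picked clones come from abundant transcripts. This is why evening out abundances before cloning raises the chance of recovering rare, novel transcripts for the same sequencing effort.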
|
36
|
Abstract
Background Accurate and automatic gene identification in eukaryotic genomic DNA is more than ever of crucial importance to efficiently exploit the large volume of assembled genome sequences available to the community. Automatic methods have always been considered less reliable than human expertise. This is illustrated in the EGASP project, where reference annotations against which all automatic methods are measured are generated by human annotators and experimentally verified. We hypothesized that replicating the accuracy of human annotators in an automatic method could be achieved by formalizing the rules and decisions that they use, in a mathematical formalism. Results We have developed Exogean, a flexible framework based on directed acyclic colored multigraphs (DACMs) that can represent biological objects (for example, mRNA, ESTs, protein alignments, exons) and relationships between them. Graphs are analyzed to process the information according to rules that replicate those used by human annotators. Simple individual starting objects given as input to Exogean are thus combined and synthesized into complex objects such as protein coding transcripts. Conclusion We show here, in the context of the EGASP project, that Exogean is currently the method that best reproduces protein coding gene annotations from human experts, in terms of identifying at least one exact coding sequence per gene. We discuss current limitations of the method and several avenues for improvement.
|