1
Kallenborn F, Cascitti J, Schmidt B. CARE 2.0: reducing false-positive sequencing error corrections using machine learning. BMC Bioinformatics 2022; 23:227. [PMID: 35698033 PMCID: PMC9195321 DOI: 10.1186/s12859-022-04754-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 02/03/2022] [Accepted: 05/30/2022] [Indexed: 11/10/2022]
Abstract
BACKGROUND Next-generation sequencing pipelines often perform error correction as a preprocessing step to obtain cleaned input data. State-of-the-art error correction programs are able to reliably detect and correct the majority of sequencing errors. However, they also introduce new errors by making false-positive corrections. These correction mistakes can have negative impact on downstream analysis, such as k-mer statistics, de-novo assembly, and variant calling. This motivates the need for more precise error correction tools. RESULTS We present CARE 2.0, a context-aware read error correction tool based on multiple sequence alignment targeting Illumina datasets. In addition to a number of newly introduced optimizations its most significant change is the replacement of CARE 1.0's hand-crafted correction conditions with a novel classifier based on random decision forests trained on Illumina data. This results in up to two orders-of-magnitude fewer false-positive corrections compared to other state-of-the-art error correction software. At the same time, CARE 2.0 is able to achieve high numbers of true-positive corrections comparable to its competitors. On a simulated full human dataset with 914M reads CARE 2.0 generates only 1.2M false positives (FPs) (and 801.4M true positives (TPs)) at a highly competitive runtime while the best corrections achieved by other state-of-the-art tools contain at least 3.9M FPs and at most 814.5M TPs. Better de-novo assembly and improved k-mer analysis show the applicability of CARE 2.0 to real-world data. CONCLUSION False-positive corrections can negatively influence down-stream analysis. The precision of CARE 2.0 greatly reduces the number of those corrections compared to other state-of-the-art programs including BFC, Karect, Musket, Bcool, SGA, and Lighter. 
Thus, higher-quality datasets are produced which improve k-mer analysis and de-novo assembly in real-world datasets which demonstrates the applicability of machine learning techniques in the context of sequencing read error correction. CARE 2.0 is written in C++/CUDA for Linux systems and can be run on the CPU as well as on CUDA-enabled GPUs. It is available at https://github.com/fkallen/CARE .
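The true-positive and false-positive counts quoted in this abstract can be turned into a precision figure. The sketch below assumes the standard definition precision = TP / (TP + FP), which the abstract does not spell out; the function name is illustrative only. Because the abstract gives only bounds for the other tools (at least 3.9M FPs, at most 814.5M TPs), the second value is an upper bound on their precision rather than a measured result.

```python
def precision(true_positives: float, false_positives: float) -> float:
    """Fraction of all corrections made that were correct (assumed definition)."""
    return true_positives / (true_positives + false_positives)


# Counts in millions, taken from the abstract above.
care2 = precision(801.4, 1.2)

# Precision rises with TP and falls with FP, so combining the largest TP count
# (814.5M) with the smallest FP count (3.9M) bounds the other tools from above.
best_other_bound = precision(814.5, 3.9)

print(f"CARE 2.0 precision:          {care2:.4f}")
print(f"Upper bound for other tools: {best_other_bound:.4f}")
```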
Affiliation(s)
- Felix Kallenborn
  - Department of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany
- Julian Cascitti
  - Department of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany
- Bertil Schmidt
  - Department of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany
2
Onuchic V, Hartmaier RJ, Boone DN, Samuels ML, Patel RY, White WM, Garovic VD, Oesterreich S, Roth ME, Lee AV, Milosavljevic A. Epigenomic Deconvolution of Breast Tumors Reveals Metabolic Coupling between Constituent Cell Types. Cell Rep 2017; 17:2075-2086. [PMID: 27851969 DOI: 10.1016/j.celrep.2016.10.057] [Citation(s) in RCA: 63] [Impact Index Per Article: 9.0] [Received: 01/26/2016] [Revised: 07/28/2016] [Accepted: 09/26/2016] [Indexed: 12/13/2022]
Abstract
Cancer progression depends on both cell-intrinsic processes and interactions between different cell types. However, large-scale assessment of cell type composition and molecular profiles of individual cell types within tumors remains challenging. To address this, we developed epigenomic deconvolution (EDec), an in silico method that infers cell type composition of complex tissues as well as DNA methylation and gene transcription profiles of constituent cell types. By applying EDec to The Cancer Genome Atlas (TCGA) breast tumors, we detect changes in immune cell infiltration related to patient prognosis, and a striking change in stromal fibroblast-to-adipocyte ratio across breast cancer subtypes. Furthermore, we show that a less adipose stroma tends to display lower levels of mitochondrial activity and to be associated with cancerous cells with higher levels of oxidative metabolism. These findings highlight the role of stromal composition in the metabolic coupling between distinct cell types within tumors.
Affiliation(s)
- Vitor Onuchic
  - Molecular and Human Genetics Department, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX 77030, USA; Program in Structural and Computational Biology and Molecular Biophysics, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX 77030, USA
- Ryan J Hartmaier
  - Department of Pharmacology and Chemical Biology, Magee Womens Research Institute, University of Pittsburgh Cancer Institute, 204 Craft Avenue, B705, Pittsburgh, PA 15213, USA
- David N Boone
  - Department of Pharmacology and Chemical Biology, Magee Womens Research Institute, University of Pittsburgh Cancer Institute, 204 Craft Avenue, B705, Pittsburgh, PA 15213, USA
- Michael L Samuels
  - RainDance Technologies, Inc., 749 Middlesex Turnpike, Billerica, MA 01821, USA
- Ronak Y Patel
  - Molecular and Human Genetics Department, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX 77030, USA
- Wendy M White
  - Department of Obstetrics and Gynecology, Mayo Clinic College of Medicine, 200 1st Street SW, Rochester, MN 55905, USA
- Vesna D Garovic
  - Division of Nephrology and Hypertension, Mayo Clinic, 200 1st Street SW, Rochester, MN 55905, USA
- Steffi Oesterreich
  - Department of Pharmacology and Chemical Biology, Magee Womens Research Institute, University of Pittsburgh Cancer Institute, 204 Craft Avenue, B705, Pittsburgh, PA 15213, USA
- Matt E Roth
  - Molecular and Human Genetics Department, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX 77030, USA
- Adrian V Lee
  - Department of Pharmacology and Chemical Biology, Magee Womens Research Institute, University of Pittsburgh Cancer Institute, 204 Craft Avenue, B705, Pittsburgh, PA 15213, USA
- Aleksandar Milosavljevic
  - Molecular and Human Genetics Department, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX 77030, USA; Program in Structural and Computational Biology and Molecular Biophysics, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX 77030, USA
3
In silico prediction of the effects of mutations in the human triose phosphate isomerase gene: Towards a predictive framework for TPI deficiency. Eur J Med Genet 2017; 60:289-298. [PMID: 28341520 DOI: 10.1016/j.ejmg.2017.03.008] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 08/02/2016] [Revised: 02/27/2017] [Accepted: 03/20/2017] [Indexed: 01/24/2023]
Abstract
Triose phosphate isomerase (TPI) deficiency is a rare, but highly debilitating, inherited metabolic disease. Almost all patients suffer severe neurological effects and the most severely affected are unlikely to live beyond early childhood. Here, we describe an in silico study into well-characterised variants which are associated with the disease alongside an investigation into 79 currently uncharacterised TPI variants which are known to occur in the human population. The majority of the disease-associated mutations affected amino acid residues close to the dimer interface or the active site. However, the location of the altered amino acid residue did not predict the severity of the resulting disease. Prediction of the effect on protein stability using a range of different programs suggested a relationship between the degree of instability caused by the sequence variation and the severity of the resulting disease. Disease-associated variations tended to affect well-conserved residues in the protein's sequence. However, the degree of conservation of the residue was not predictive of disease severity. The majority of the 79 uncharacterised variants are potentially associated with disease since they were predicted to destabilise the protein and often occur in well-conserved residues. We predict that individuals homozygous for the corresponding mutations would be likely to suffer from TPI deficiency.
4
Schleicher J, Conrad T, Gustafsson M, Cedersund G, Guthke R, Linde J. Facing the challenges of multiscale modelling of bacterial and fungal pathogen-host interactions. Brief Funct Genomics 2017; 16:57-69. [PMID: 26857943 PMCID: PMC5439285 DOI: 10.1093/bfgp/elv064] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 12/17/2022]
Abstract
Recent and rapidly evolving progress on high-throughput measurement techniques and computational performance has led to the emergence of new disciplines, such as systems medicine and translational systems biology. At the core of these disciplines lies the desire to produce multiscale models: mathematical models that integrate multiple scales of biological organization, ranging from molecular, cellular and tissue models to organ, whole-organism and population scale models. Using such models, hypotheses can systematically be tested. In this review, we present state-of-the-art multiscale modelling of bacterial and fungal infections, considering both the pathogen and host as well as their interaction. Multiscale modelling of the interactions of bacteria, especially Mycobacterium tuberculosis, with the human host is quite advanced. In contrast, models for fungal infections are still in their infancy, in particular regarding infections with the most important human pathogenic fungi, Candida albicans and Aspergillus fumigatus. We reflect on the current availability of computational approaches for multiscale modelling of host-pathogen interactions and point out current challenges. Finally, we provide an outlook for future requirements of multiscale modelling.
Affiliation(s)
- Jörg Linde
  - Corresponding author: Jörg Linde, Leibniz Institute for Natural Product Research and Infection Biology - Hans Knöll Institute, Jena, Germany. Tel.: +49-3641-532-1290; E-mail:
5
Montgomery SH, Mank JE. Inferring regulatory change from gene expression: the confounding effects of tissue scaling. Mol Ecol 2016; 25:5114-5128. [PMID: 27564408 DOI: 10.1111/mec.13824] [Citation(s) in RCA: 58] [Impact Index Per Article: 7.3] [Received: 04/28/2016] [Revised: 08/10/2016] [Accepted: 08/12/2016] [Indexed: 01/18/2023]
Abstract
Comparative studies of gene expression are often designed with the aim of identifying regulatory changes associated with phenotypic variation. In recent years, large-scale transcriptome sequencing methods have increasingly been applied to nonmodel organisms to ask important ecological or evolutionary questions. Although experimental design varies, many of these studies have been based on RNA libraries obtained from heterogeneous tissue samples, for example homogenized whole bodies. Comparisons between groups of samples that vary in tissue composition can introduce sufficient variation in RNA abundance to produce patterns of differential expression that are mistakenly interpreted as evidence of regulatory differences. Here, we present a simple model that demonstrates this effect. The model describes the relationship between transcript abundance and tissue composition in a two-tissue system, and how this relationship varies under different scaling relationships. Using a range of biologically realistic variables, including real biological examples, to parameterize the model we highlight the potentially severe influence of tissue scaling on relative transcript abundance. We use these results to identify key aspects of experimental design and analysis that can help to limit the influence of tissue scaling on the inference of regulatory difference from comparative studies of gene expression.
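The confounding effect described in this abstract can be illustrated with a toy calculation. The sketch below is not the authors' model: it assumes a plain linear mixture in which the measured abundance in a homogenate is the tissue-fraction-weighted average of two per-tissue expression levels, and all names and numbers are hypothetical. Per-tissue expression is held constant, so any apparent fold change comes from tissue composition alone.

```python
def bulk_expression(fraction_a: float, expr_a: float, expr_b: float) -> float:
    """Abundance measured in a homogenate of tissues A and B (linear mixing assumed)."""
    return fraction_a * expr_a + (1.0 - fraction_a) * expr_b


# Hypothetical per-tissue expression levels, identical in both groups,
# i.e. there is no regulatory difference between the groups.
expr_a, expr_b = 100.0, 10.0

# The two groups differ only in how much of the sample is tissue A.
group1 = bulk_expression(0.2, expr_a, expr_b)
group2 = bulk_expression(0.4, expr_a, expr_b)

fold_change = group2 / group1
print(f"group1 = {group1:.1f}, group2 = {group2:.1f}, fold change = {fold_change:.2f}")
```

In this toy setting the gene appears more than 1.6-fold "up-regulated" in the second group even though its expression within each tissue is unchanged.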
Affiliation(s)
- Stephen H Montgomery
  - Department of Genetics, Evolution and Environment, University College London, London, WC1E 6BT, UK
- Judith E Mank
  - Department of Genetics, Evolution and Environment, University College London, London, WC1E 6BT, UK
6
Ackerman WE, Buhimschi IA, Eidem HR, Rinker DC, Rokas A, Rood K, Zhao G, Summerfield TL, Landon MB, Buhimschi CS. Comprehensive RNA profiling of villous trophoblast and decidua basalis in pregnancies complicated by preterm birth following intra-amniotic infection. Placenta 2016; 44:23-33. [PMID: 27452435 DOI: 10.1016/j.placenta.2016.05.010] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Received: 10/22/2015] [Revised: 04/11/2016] [Accepted: 05/23/2016] [Indexed: 12/20/2022]
Abstract
INTRODUCTION We performed RNA sequencing with the primary goal of discovering key placental villous trophoblast (VT) and decidua basalis (DB) transcripts differentially expressed in intra-amniotic infection (IAI)-induced preterm birth (PTB). METHODS RNA was extracted from 15 paired VT and DB specimens delivered of women with: 1) spontaneous PTB in the setting of amniocentesis-proven IAI and histological chorioamnionitis (n = 5); 2) spontaneous idiopathic PTB (iPTB, n = 5); and 3) physiologic term pregnancy (n = 5). RNA sequencing was performed using the Illumina HiSeq 2500 platform, and a spectrum of computational tools was used for gene prioritization and pathway analyses. RESULTS In the VT specimens, 128 unique long transcripts and 7 mature microRNAs differed significantly between pregnancies complicated by IAI relative to iPTB (FDR<0.1). The up-regulated transcripts included many characteristic of myeloblast-derived cells, and bioinformatic analyses revealed enrichment for multiple pathways associated with acute inflammation. In an expanded cohort including additional IAI and iPTB specimens, the expression of three proteins (cathepsin S, lysozyme, and hexokinase 3) and two microRNAs (miR-133a and miR-223) was validated using immunohistochemistry and quantitative PCR, respectively. In the DB specimens, only 11 long transcripts and no microRNAs differed significantly between IAI cases and iPTB controls (FDR<0.1). Comparison of the VT and DB specimens in each clinical scenario revealed signatures distinguishing these placental regions. DISCUSSION IAI is associated with a transcriptional signature consistent with acute inflammation in the villous trophoblast. The present findings illuminate novel signaling pathways involved in IAI, and suggest putative therapeutic targets and potential biomarkers associated with this condition.
Affiliation(s)
- William E Ackerman
  - Department of Obstetrics and Gynecology, The Ohio State College of Medicine, Columbus, OH, USA
- Irina A Buhimschi
  - Center for Perinatal Research, Nationwide Children's Hospital, Columbus, OH, USA
- Haley R Eidem
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN, USA
- David C Rinker
  - Program in Human Genetics, Vanderbilt University Medical Center, Nashville, TN, USA
- Antonis Rokas
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN, USA; Program in Human Genetics, Vanderbilt University Medical Center, Nashville, TN, USA
- Kara Rood
  - Department of Obstetrics and Gynecology, The Ohio State College of Medicine, Columbus, OH, USA
- Guomao Zhao
  - Center for Perinatal Research, Nationwide Children's Hospital, Columbus, OH, USA
- Taryn L Summerfield
  - Department of Obstetrics and Gynecology, The Ohio State College of Medicine, Columbus, OH, USA
- Mark B Landon
  - Department of Obstetrics and Gynecology, The Ohio State College of Medicine, Columbus, OH, USA
- Catalin S Buhimschi
  - Department of Obstetrics and Gynecology, The Ohio State College of Medicine, Columbus, OH, USA
7
Reinartz S, Finkernagel F, Adhikary T, Rohnalter V, Schumann T, Schober Y, Nockher WA, Nist A, Stiewe T, Jansen JM, Wagner U, Müller-Brüsselbach S, Müller R. A transcriptome-based global map of signaling pathways in the ovarian cancer microenvironment associated with clinical outcome. Genome Biol 2016; 17:108. [PMID: 27215396 PMCID: PMC4877997 DOI: 10.1186/s13059-016-0956-6] [Citation(s) in RCA: 80] [Impact Index Per Article: 10.0] [Received: 12/17/2015] [Accepted: 04/15/2016] [Indexed: 01/05/2023]
Abstract
BACKGROUND Soluble protein and lipid mediators play essential roles in the tumor environment, but their cellular origins, targets, and clinical relevance are only partially known. We have addressed this question for the most abundant cell types in human ovarian carcinoma ascites, namely tumor cells and tumor-associated macrophages. RESULTS Transcriptome-derived datasets were adjusted for errors caused by contaminating cell types by an algorithm using expression data derived from pure cell types as references. These data were utilized to construct a network of autocrine and paracrine signaling pathways comprising 358 common and 58 patient-specific signaling mediators and their receptors. RNA sequencing based predictions were confirmed for several proteins and lipid mediators. Published expression microarray results for 1018 patients were used to establish clinical correlations for a number of components with distinct cellular origins and target cells. Clear associations with early relapse were found for STAT3-inducing cytokines, specific components of WNT and fibroblast growth factor signaling, ephrin and semaphorin axon guidance molecules, and TGFβ/BMP-triggered pathways. An association with early relapse was also observed for secretory macrophage-derived phospholipase PLA2G7, its product arachidonic acid (AA) and signaling pathways controlled by the AA metabolites PGE2, PGI2, and LTB4. By contrast, the genes encoding norrin and its receptor frizzled 4, both selectively expressed by cancer cells and previously not linked to tumor suppression, show a striking association with a favorable clinical course. CONCLUSIONS We have established a signaling network operating in the ovarian cancer microenvironment with previously unidentified pathways and have defined clinically relevant components within this network.
Affiliation(s)
- Silke Reinartz
  - Clinic for Gynecology, Gynecological Oncology and Gynecological Endocrinology, Center for Tumor Biology and Immunology (ZTI), Philipps University, Marburg, Germany
- Florian Finkernagel
  - Institute of Molecular Biology and Tumor Research (IMT), Center for Tumor Biology and Immunology (ZTI), Philipps University, Hans-Meerwein-Str. 3, Marburg, 35043, Germany
- Till Adhikary
  - Institute of Molecular Biology and Tumor Research (IMT), Center for Tumor Biology and Immunology (ZTI), Philipps University, Hans-Meerwein-Str. 3, Marburg, 35043, Germany
- Verena Rohnalter
  - Institute of Molecular Biology and Tumor Research (IMT), Center for Tumor Biology and Immunology (ZTI), Philipps University, Hans-Meerwein-Str. 3, Marburg, 35043, Germany
- Tim Schumann
  - Institute of Molecular Biology and Tumor Research (IMT), Center for Tumor Biology and Immunology (ZTI), Philipps University, Hans-Meerwein-Str. 3, Marburg, 35043, Germany
- Yvonne Schober
  - Metabolomics Core Facility and Institute of Laboratory Medicine and Pathobiochemistry, Center for Tumor Biology and Immunology (ZTI), Philipps University, Marburg, Germany
- W Andreas Nockher
  - Metabolomics Core Facility and Institute of Laboratory Medicine and Pathobiochemistry, Center for Tumor Biology and Immunology (ZTI), Philipps University, Marburg, Germany
- Andrea Nist
  - Genomics Core Facility, Center for Tumor Biology and Immunology (ZTI), Philipps University, Marburg, Germany
- Thorsten Stiewe
  - Genomics Core Facility, Center for Tumor Biology and Immunology (ZTI), Philipps University, Marburg, Germany
- Julia M Jansen
  - Clinic for Gynecology, Gynecological Oncology and Gynecological Endocrinology, Center for Tumor Biology and Immunology (ZTI), Philipps University, Marburg, Germany
- Uwe Wagner
  - Clinic for Gynecology, Gynecological Oncology and Gynecological Endocrinology, Center for Tumor Biology and Immunology (ZTI), Philipps University, Marburg, Germany
- Sabine Müller-Brüsselbach
  - Institute of Molecular Biology and Tumor Research (IMT), Center for Tumor Biology and Immunology (ZTI), Philipps University, Hans-Meerwein-Str. 3, Marburg, 35043, Germany
- Rolf Müller
  - Institute of Molecular Biology and Tumor Research (IMT), Center for Tumor Biology and Immunology (ZTI), Philipps University, Hans-Meerwein-Str. 3, Marburg, 35043, Germany
8
Anjum A, Jaggi S, Varghese E, Lall S, Bhowmik A, Rai A. Identification of Differentially Expressed Genes in RNA-seq Data of Arabidopsis thaliana: A Compound Distribution Approach. J Comput Biol 2016; 23:239-47. [PMID: 26949988 PMCID: PMC4827276 DOI: 10.1089/cmb.2015.0205] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.5] [Indexed: 01/08/2023]
Abstract
Gene expression is the process by which information from a gene is used to synthesize a functional gene product, such as a protein. A gene is declared differentially expressed if the observed difference in read counts or expression levels between two experimental conditions is statistically significant. To identify differentially expressed genes between two conditions, it is important to determine the statistical distribution of the data in order to approximate the behavior of differentially expressed genes. The present study investigates differential gene expression analysis for sequence data based on a compound distribution model. The approach was applied to RNA-seq count data from Arabidopsis thaliana, and the compound Poisson distribution was found to capture the variability better than the Poisson distribution. Thus, fitting an appropriate distribution to gene expression data provides statistically sound cutoff values for identifying differentially expressed genes.
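One common way to see why a plain Poisson model may fit count data poorly is the dispersion index (variance divided by mean), which is close to 1 for Poisson-distributed counts. The sketch below is only an illustration of that diagnostic, not the fitting procedure used in the paper; the function name and both count vectors are hypothetical.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by sample mean; near 1 for Poisson-like counts."""
    return variance(counts) / mean(counts)


# Hypothetical read counts for two genes, each with a mean of 10 across 8 samples.
poisson_like = [6, 14, 9, 14, 6, 10, 12, 9]
overdispersed = [0, 25, 3, 40, 1, 2, 9, 0]

print(f"Poisson-like gene:  {dispersion_index(poisson_like):.2f}")
print(f"Overdispersed gene: {dispersion_index(overdispersed):.2f}")
```

A value far above 1, as for the second gene, indicates more variability than a Poisson model allows, which is the situation a compound distribution is meant to accommodate.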
Affiliation(s)
- Arfa Anjum
  - ICAR-Indian Agricultural Statistics Research Institute, Indian Council of Agricultural Research, New Delhi, India
- Seema Jaggi
  - ICAR-Indian Agricultural Statistics Research Institute, Indian Council of Agricultural Research, New Delhi, India
- Eldho Varghese
  - ICAR-Indian Agricultural Statistics Research Institute, Indian Council of Agricultural Research, New Delhi, India
- Shwetank Lall
  - ICAR-Indian Agricultural Statistics Research Institute, Indian Council of Agricultural Research, New Delhi, India
- Arpan Bhowmik
  - ICAR-Indian Agricultural Statistics Research Institute, Indian Council of Agricultural Research, New Delhi, India
- Anil Rai
  - ICAR-Indian Agricultural Statistics Research Institute, Indian Council of Agricultural Research, New Delhi, India
9
Rautio S, Lähdesmäki H. MixChIP: a probabilistic method for cell type specific protein-DNA binding analysis. BMC Bioinformatics 2015; 16:413. [PMID: 26703974 PMCID: PMC4690251 DOI: 10.1186/s12859-015-0834-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/24/2015] [Accepted: 11/24/2015] [Indexed: 08/30/2023]
Abstract
Background Transcription factors (TFs) are proteins that bind to DNA and regulate gene expression. To understand details of gene regulation, characterizing TF binding sites in different cell types, diseases and among individuals is essential. However, sometimes TF binding can only be measured from biological samples that contain multiple cell or tissue types. Sample heterogeneity can have a considerable effect on TF binding site detection. While manual separation techniques can be used to isolate a cell type of interest from heterogeneous samples, such techniques are challenging and can change intra-cellular interactions, including protein-DNA binding. Computational deconvolution methods have emerged as an alternative strategy to study heterogeneous samples and numerous methods have been proposed to analyze gene expression. However, no computational method exists to deconvolve cell type specific TF binding from heterogeneous samples. Results We present a probabilistic method, MixChIP, to identify cell type specific TF binding sites from heterogeneous chromatin immunoprecipitation sequencing (ChIP-seq) data. Our method simultaneously estimates the binding strength in different cell types as well as the proportions of different cell types in each sample when only partial prior information about cell type composition is available. We demonstrate the utility of MixChIP by analyzing ChIP-seq data from two cell lines which we artificially mix to generate (simulated) heterogeneous samples and by analyzing ChIP-seq data from breast cancer patients measuring oestrogen receptor (ER) binding in primary breast cancer tissues. We show that MixChIP is more accurate in detecting TF binding sites from multiple heterogeneous ChIP-seq samples than the standard methods which do not account for sample heterogeneity. Conclusions Our results show that MixChIP can estimate cell-type proportions and identify cell type specific TF binding sites from heterogeneous ChIP-seq samples. 
Thus, MixChIP can be an invaluable tool in analyzing heterogeneous ChIP-seq samples, such as those originating from cancer studies. R implementation is available at http://research.ics.aalto.fi/csb/software/mixchip/.
Affiliation(s)
- Sini Rautio
  - Department of Computer Science, Aalto University, Aalto, FI-00076, Finland
- Harri Lähdesmäki
  - Department of Computer Science, Aalto University, Aalto, FI-00076, Finland
10
Shen Q, Hu J, Jiang N, Hu X, Luo Z, Zhang H. contamDE: differential expression analysis of RNA-seq data for contaminated tumor samples. Bioinformatics 2015; 32:705-12. [DOI: 10.1093/bioinformatics/btv657] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 06/11/2015] [Accepted: 11/03/2015] [Indexed: 11/14/2022]
11
Parsons J, Munro S, Pine PS, McDaniel J, Mehaffey M, Salit M. Using mixtures of biological samples as process controls for RNA-sequencing experiments. BMC Genomics 2015; 16:708. [PMID: 26383878 PMCID: PMC4574543 DOI: 10.1186/s12864-015-1912-7] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 04/20/2015] [Accepted: 09/09/2015] [Indexed: 12/02/2022]
Abstract
Background Genome-scale “-omics” measurements are challenging to benchmark due to the enormous variety of unique biological molecules involved. Mixtures of previously-characterized samples can be used to benchmark repeatability and reproducibility using component proportions as truth for the measurement. We describe and evaluate experiments characterizing the performance of RNA-sequencing (RNA-Seq) measurements, and discuss cases where mixtures can serve as effective process controls. Results We apply a linear model to total RNA mixture samples in RNA-seq experiments. This model provides a context for performance benchmarking. The parameters of the model fit to experimental results can be evaluated to assess bias and variability of the measurement of a mixture. A linear model describes the behavior of mixture expression measures and provides a context for performance benchmarking. Residuals from fitting the model to experimental data can be used as a metric for evaluating the effect that an individual step in an experimental process has on the linear response function and precision of the underlying measurement while identifying signals affected by interference from other sources. Effective benchmarking requires well-defined mixtures, which for RNA-Seq requires knowledge of the post-enrichment ‘target RNA’ content of the individual total RNA components. We demonstrate and evaluate an experimental method suitable for use in genome-scale process control and lay out a method utilizing spike-in controls to determine enriched RNA content of total RNA in samples. Conclusions Genome-scale process controls can be derived from mixtures. These controls relate prior knowledge of individual components to a complex mixture, allowing assessment of measurement performance. The target RNA fraction accounts for differential selection of RNA out of variable total RNA samples. 
Spike-in controls can be utilized to measure this relationship between target RNA content and input total RNA. Our mixture analysis method also enables estimation of the proportions of an unknown mixture, even when component-specific markers are not previously known, whenever pure components are measured alongside the mixture.
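The idea of estimating mixture proportions from pure components, and of using residuals as a quality metric, can be sketched for the two-component case. This is a minimal illustration under stated assumptions, not the authors' method: it assumes the mixture signal is exactly p·A + (1 − p)·B per gene, ignores the target RNA fraction correction mentioned in the abstract, and uses hypothetical expression values and function names. Under that assumption the least-squares estimate of p has a closed form.

```python
def estimate_proportion(mixture, comp_a, comp_b):
    """Least-squares estimate of p in mixture = p * comp_a + (1 - p) * comp_b."""
    numerator = sum((m - b) * (a - b) for m, a, b in zip(mixture, comp_a, comp_b))
    denominator = sum((a - b) ** 2 for a, b in zip(comp_a, comp_b))
    return numerator / denominator


# Hypothetical per-gene expression in two pure samples measured alongside the mixture.
comp_a = [100.0, 20.0, 5.0, 60.0]
comp_b = [10.0, 80.0, 5.0, 30.0]

# Build a noise-free mixture with a known proportion so the estimate can be checked.
true_p = 0.25
mixture = [true_p * a + (1.0 - true_p) * b for a, b in zip(comp_a, comp_b)]

p_hat = estimate_proportion(mixture, comp_a, comp_b)

# Residuals from the fitted linear model; large residuals would flag genes
# (or processing steps) that deviate from the expected linear response.
residuals = [m - (p_hat * a + (1.0 - p_hat) * b)
             for m, a, b in zip(mixture, comp_a, comp_b)]

print(f"estimated proportion of A: {p_hat:.3f}")
print("largest absolute residual:", max(abs(r) for r in residuals))
```

With real data the residuals would not vanish; their size and pattern are what carry the benchmarking information.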
Affiliation(s)
- Jerod Parsons
  - Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, MD, 20899, USA; Department of Bioengineering, Stanford University, 443 Via Ortega, Stanford, CA, 94305, USA
- Sarah Munro
  - Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, MD, 20899, USA; Department of Bioengineering, Stanford University, 443 Via Ortega, Stanford, CA, 94305, USA
- P Scott Pine
  - Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, MD, 20899, USA; Department of Bioengineering, Stanford University, 443 Via Ortega, Stanford, CA, 94305, USA
- Jennifer McDaniel
  - Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, MD, 20899, USA
- Michele Mehaffey
  - Leidos Biomedical Research Inc., P.O. Box B Bldg 428, Frederick, MD, 21702, USA
- Marc Salit
  - Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, MD, 20899, USA; Department of Bioengineering, Stanford University, 443 Via Ortega, Stanford, CA, 94305, USA
12
Milano T, Di Salvo ML, Angelaccio S, Pascarella S. Conserved water molecules in bacterial serine hydroxymethyltransferases. Protein Eng Des Sel 2015; 28:415-26. [DOI: 10.1093/protein/gzv026] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 09/25/2014] [Accepted: 04/17/2015] [Indexed: 12/27/2022]