1
Ai D, Chen L, Xie J, Cheng L, Zhang F, Luan Y, Li Y, Hou S, Sun F, Xia LC. Identifying local associations in biological time series: algorithms, statistical significance, and applications. Brief Bioinform 2023; 24:bbad390. [PMID: 37930023 DOI: 10.1093/bib/bbad390] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2023] [Revised: 08/21/2023] [Accepted: 09/14/2023] [Indexed: 11/07/2023] Open
Abstract
Local associations refer to spatial-temporal correlations that emerge from the biological realm, such as time-dependent gene co-expression or seasonal interactions between microbes. One can reveal the intricate dynamics and inherent interactions of biological systems by examining the biological time series data for these associations. To accomplish this goal, local similarity analysis algorithms and statistical methods that facilitate the local alignment of time series and assess the significance of the resulting alignments have been developed. Although these algorithms were initially devised for gene expression analysis from microarrays, they have been adapted and accelerated for multi-omics next generation sequencing datasets, achieving high scientific impact. In this review, we present an overview of the historical developments and recent advances for local similarity analysis algorithms, their statistical properties, and real applications in analyzing biological time series data. The benchmark data and analysis scripts used in this review are freely available at http://github.com/labxscut/lsareview.
Affiliation(s)
- Dongmei Ai
- School of Mathematics and Physics, University of Science and Technology Beijing, Beijing 100083, China
- Lulu Chen
- School of Mathematics and Physics, University of Science and Technology Beijing, Beijing 100083, China
- Jiemin Xie
- Department of Statistics and Financial Mathematics, School of Mathematics, South China University of Technology, Guangzhou 510641, China
- Longwei Cheng
- School of Mathematics and Physics, University of Science and Technology Beijing, Beijing 100083, China
- Fang Zhang
- Shenwan Hongyuan Securities Co. Ltd., Shanghai 200031, China
- Yihui Luan
- School of Mathematics, Shandong University, Jinan 250100, China
- Yang Li
- Department of Statistics and Financial Mathematics, School of Mathematics, South China University of Technology, Guangzhou 510641, China
- Shengwei Hou
- Department of Ocean Science and Engineering, Southern University of Science and Technology, Shenzhen, 518055, China
- Fengzhu Sun
- Department of Quantitative and Computational Biology, University of Southern California, California, 90007, USA
- Li Charlie Xia
- Department of Statistics and Financial Mathematics, School of Mathematics, South China University of Technology, Guangzhou 510641, China
2
Ye W, Long Y, Ji G, Su Y, Ye P, Fu H, Wu X. Cluster analysis of replicated alternative polyadenylation data using canonical correlation analysis. BMC Genomics 2019; 20:75. [PMID: 30669970 PMCID: PMC6343338 DOI: 10.1186/s12864-019-5433-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2018] [Accepted: 01/03/2019] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Alternative polyadenylation (APA) has emerged as a pervasive mechanism that contributes to the transcriptome complexity and dynamics of gene regulation. The current tsunami of whole genome poly(A) site data from various conditions generated by 3' end sequencing provides a valuable data source for the study of APA-related gene expression. Cluster analysis is a powerful technique for investigating the association structure among genes; however, conventional gene clustering methods are not suitable for APA-related data as they fail to consider the information of poly(A) sites (e.g., location, abundance, number, etc.) within each gene or measure the association among poly(A) sites between two genes. RESULTS Here we proposed a computational framework, named PASCCA, for clustering genes from replicated or unreplicated poly(A) site data using canonical correlation analysis (CCA). PASCCA incorporates multiple layers of gene expression data from both the poly(A) site level and gene level and takes into account the number of replicates and the variability within each experimental group. Moreover, PASCCA characterizes poly(A) sites in various ways including the abundance and relative usage, which can exploit the advantages of 3' end deep sequencing in quantifying APA sites. Using both real and synthetic poly(A) site data sets, the cluster analysis demonstrates that PASCCA outperforms other widely-used distance measures under five performance metrics including connectivity, the Dunn index, average distance, average distance between means, and the biological homogeneity index. We also used PASCCA to infer APA-specific gene modules from recently published poly(A) site data of rice and discovered some distinct functional gene modules. We have made PASCCA an easy-to-use R package for APA-related gene expression analyses, including the characterization of poly(A) sites, quantification of association between genes, and clustering of genes.
CONCLUSIONS By providing a better treatment of the noise inherent in repeated measurements and taking into account multiple layers of poly(A) site data, PASCCA could be a general tool for clustering and analyzing APA-specific gene expression data. PASCCA could be used to elucidate the dynamic interplay of genes and their APA sites among various biological conditions from emerging 3' end sequencing data to address the complex biological phenomenon.
Affiliation(s)
- Wenbin Ye
- Department of Automation, Xiamen University, Xiamen, 361005, China; Innovation Center for Cell Biology, Xiamen University, Xiamen, 361005, China
- Yuqi Long
- Department of Automation, Xiamen University, Xiamen, 361005, China; Software Quality Testing Engineering Research Center, China Electronic Product Reliability and Environmental Testing Research Institute, Guangzhou, 510610, China
- Guoli Ji
- Department of Automation, Xiamen University, Xiamen, 361005, China; Innovation Center for Cell Biology, Xiamen University, Xiamen, 361005, China
- Yaru Su
- College of Mathematics and Computer Science, Fuzhou University, Fuzhou, 350116, China
- Pengchao Ye
- Department of Automation, Xiamen University, Xiamen, 361005, China
- Hongjuan Fu
- Department of Automation, Xiamen University, Xiamen, 361005, China
- Xiaohui Wu
- Department of Automation, Xiamen University, Xiamen, 361005, China; Innovation Center for Cell Biology, Xiamen University, Xiamen, 361005, China
3
Moghieb A, Clair G, Mitchell HD, Kitzmiller J, Zink EM, Kim YM, Petyuk V, Shukla A, Moore RJ, Metz TO, Carson J, McDermott JE, Corley RA, Whitsett JA, Ansong C. Time-resolved proteome profiling of normal lung development. Am J Physiol Lung Cell Mol Physiol 2018; 315:L11-L24. [PMID: 29516783 PMCID: PMC6087896 DOI: 10.1152/ajplung.00316.2017] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Received: 07/18/2017] [Revised: 01/31/2018] [Accepted: 03/01/2018] [Indexed: 12/20/2022] Open
Abstract
Biochemical networks mediating normal lung morphogenesis and function have important implications for ameliorating morbidity and mortality in premature infants. Although several transcript-level studies have examined normal lung development, corresponding protein-level analyses are lacking. Here we performed proteomics analysis of murine lungs from embryonic to early adult ages to identify the molecular networks mediating normal lung development. We identified 8,932 proteins, providing a deep and comprehensive view of the lung proteome. Analysis of the proteomics data revealed discrete modules and the underlying regulatory and signaling network modulating their expression during development. Our data support the cell proliferation that characterizes early lung development and highlight responses of the lung to exposure to a nonsterile oxygen-rich ambient environment and the important role of lipid (surfactant) metabolism in lung development. Comparison of dynamic regulation of proteomic and recent transcriptomic analyses identified biological processes under posttranscriptional control. Our study provides a unique proteomic resource for understanding normal lung formation and function and can be freely accessed at Lungmap.net.
Affiliation(s)
- Ahmed Moghieb
- Biological Science Division, Pacific Northwest National Laboratory, Richland, Washington
- Geremy Clair
- Biological Science Division, Pacific Northwest National Laboratory, Richland, Washington
- Hugh D Mitchell
- Biological Science Division, Pacific Northwest National Laboratory, Richland, Washington
- Joseph Kitzmiller
- Division of Pulmonary Biology, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio
- Erika M Zink
- Biological Science Division, Pacific Northwest National Laboratory, Richland, Washington
- Young-Mo Kim
- Biological Science Division, Pacific Northwest National Laboratory, Richland, Washington
- Vladislav Petyuk
- Biological Science Division, Pacific Northwest National Laboratory, Richland, Washington
- Anil Shukla
- Biological Science Division, Pacific Northwest National Laboratory, Richland, Washington
- Ronald J Moore
- Biological Science Division, Pacific Northwest National Laboratory, Richland, Washington
- Thomas O Metz
- Biological Science Division, Pacific Northwest National Laboratory, Richland, Washington
- James Carson
- Texas Advanced Computing Center, University of Texas at Austin, Austin, Texas
- Jason E McDermott
- Biological Science Division, Pacific Northwest National Laboratory, Richland, Washington
- Richard A Corley
- Biological Science Division, Pacific Northwest National Laboratory, Richland, Washington
- Jeffrey A Whitsett
- Division of Pulmonary Biology, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio
- Charles Ansong
- Biological Science Division, Pacific Northwest National Laboratory, Richland, Washington
4
Uncovering robust patterns of microRNA co-expression across cancers using Bayesian Relevance Networks. PLoS One 2017; 12:e0183103. [PMID: 28817636 PMCID: PMC5560700 DOI: 10.1371/journal.pone.0183103] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 04/15/2017] [Accepted: 07/19/2017] [Indexed: 01/17/2023] Open
Abstract
Co-expression networks have long been used as a tool for investigating the molecular circuitry governing biological systems. However, most algorithms for constructing co-expression networks were developed in the microarray era, before high-throughput sequencing, with its unique statistical properties, became the norm for expression measurement. Here we develop Bayesian Relevance Networks, an algorithm that uses Bayesian reasoning about expression levels to account for the differing levels of uncertainty in expression measurements between highly- and lowly-expressed entities, and between samples with different sequencing depths. It combines data from groups of samples (e.g., replicates) to estimate group expression levels and confidence ranges. It then computes uncertainty-moderated estimates of cross-group correlations between entities, and uses permutation testing to assess their statistical significance. Using large scale miRNA data from The Cancer Genome Atlas, we show that our Bayesian update of the classical Relevance Networks algorithm provides improved reproducibility in co-expression estimates and lower false discovery rates in the resulting co-expression networks. Software is available at www.perkinslab.ca.
5
Zhu D, Deng N, Bai C. A generalized dSpliceType framework to detect differential splicing and differential expression events using RNA-Seq. IEEE Trans Nanobioscience 2015; 14:192-202. [PMID: 25680210 DOI: 10.1109/tnb.2015.2388593] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Indexed: 11/10/2022]
Abstract
Transcriptomes are routinely compared in terms of a list of differentially expressed genes followed by functional enrichment analysis. Due to the technology limitations of microarray, the molecular mechanisms of differential expression are poorly understood. Using RNA-seq data, we propose a generalized dSpliceType framework to systematically investigate the synergistic and antagonistic effects of differential splicing and differential expression. We applied the method to two public RNA-seq data sets and compared the transcriptomes between treatment and control conditions. The generalized dSpliceType detects and prioritizes a list of genes that are differentially expressed and/or spliced. In particular, the multivariate dSpliceType is among the first to utilize sequential dependency of normalized base-wise read coverage signals and capture biological variability among replicates using a multivariate statistical model. We compared dSpliceType with two other methods in terms of the five most common types of differential splicing events between two conditions using RNA-Seq. dSpliceType is free, available from http://dsplicetype.sourceforge.net/.
6
Wang HQ, Tsai CJ. CorSig: a general framework for estimating statistical significance of correlation and its application to gene co-expression analysis. PLoS One 2013; 8:e77429. [PMID: 24194884 PMCID: PMC3806744 DOI: 10.1371/journal.pone.0077429] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 05/18/2013] [Accepted: 09/02/2013] [Indexed: 11/19/2022] Open
Abstract
With the rapid increase of omics data, correlation analysis has become an indispensable tool for inferring meaningful associations from a large number of observations. Pearson correlation coefficient (PCC) and its variants are widely used for such purposes. However, it remains challenging to test whether an observed association is reliable both statistically and biologically. We present here a new method, CorSig, for statistical inference of correlation significance. CorSig is based on a biology-informed null hypothesis, i.e., testing whether the true PCC (ρ) between two variables is statistically larger than a user-specified PCC cutoff (τ), as opposed to the simple null hypothesis of ρ = 0 in existing methods, i.e., testing whether an association can be declared without a threshold. CorSig incorporates Fisher's Z transformation of the observed PCC (r), which facilitates use of standard techniques for p-value computation and multiple testing corrections. We compared CorSig against two methods: one uses a minimum PCC cutoff while the other (Zhu's procedure) controls correlation strength and statistical significance in two discrete steps. CorSig consistently outperformed these methods in various simulation data scenarios by balancing between false positives and false negatives. When tested on real-world Populus microarray data, CorSig effectively identified co-expressed genes in the flavonoid pathway, and discriminated between closely related gene family members for their differential association with flavonoid and lignin pathways. The p-values obtained by CorSig can be used as a stand-alone parameter for stratification of co-expressed genes according to their correlation strength in lieu of an arbitrary cutoff. CorSig requires one single tunable parameter, and can be readily extended to other correlation measures. Thus, CorSig should be useful for a wide range of applications, particularly for network analysis of high-dimensional genomic data.
SOFTWARE AVAILABILITY A web server for CorSig is provided at http://202.127.200.1:8080/probeWeb. R code for CorSig is freely available for non-commercial use at http://aspendb.uga.edu/downloads.
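The idea of testing an observed correlation against a nonzero cutoff can be sketched with the textbook Fisher Z test. This is a minimal illustration, not the CorSig implementation: the function name `fisher_z_pvalue` is hypothetical, and the normal approximation with standard error 1/sqrt(n - 3) is the standard result for Fisher's transformation, assumed here rather than taken from the paper.

```python
import math
from statistics import NormalDist


def fisher_z_pvalue(r, n, tau=0.0):
    """One-sided p-value for H0: rho <= tau versus H1: rho > tau.

    r   -- observed Pearson correlation
    n   -- number of paired observations
    tau -- user-specified correlation cutoff (tau = 0 gives the usual test)
    """
    if n <= 3:
        raise ValueError("need more than 3 observations")
    if not (-1.0 < r < 1.0 and -1.0 < tau < 1.0):
        raise ValueError("r and tau must lie strictly between -1 and 1")
    # Fisher's Z transformation: atanh(r) is approximately normal with
    # mean atanh(rho) and standard deviation 1 / sqrt(n - 3).
    z_stat = (math.atanh(r) - math.atanh(tau)) * math.sqrt(n - 3)
    return 1.0 - NormalDist().cdf(z_stat)


if __name__ == "__main__":
    # The same r = 0.6 from 30 samples is judged against two null hypotheses.
    print(fisher_z_pvalue(0.6, 30, tau=0.0))  # tested against rho = 0
    print(fisher_z_pvalue(0.6, 30, tau=0.5))  # tested against the cutoff 0.5
```

Under these assumptions, a correlation that is clearly significant against ρ = 0 may no longer be significant once the null hypothesis is moved to the cutoff τ, which is the distinction the abstract draws.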
Affiliation(s)
- Hong-Qiang Wang
- Intelligent Computing Lab, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei, China
- Chung-Jui Tsai
- Department of Genetics, University of Georgia, Athens, Georgia, United States of America
- Warnell School of Forestry and Natural Resources, University of Georgia, Athens, Georgia, United States of America
7
Mezhoud K. Graphical identification of cancer-associated gene subnetworks based on small proteomics data sets. OMICS: A Journal of Integrative Biology 2013; 17:393-397. [PMID: 23642253 DOI: 10.1089/omi.2012.0084] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/02/2023]
Abstract
Proteomics is a rapidly emerging frontier in post-genomics medicine and biology, but the quantitative analysis and validation of proteomic data are in need of further improvements. Before selecting potential candidate proteomic biomarkers, it is important to understand the broader context of how biological processes are regulated under different conditions or in different phenotypes. The enrichment of proteomic data consists of extracting as much biological meaning as possible from curated, pathway-based, functional protein interaction networks. Currently, most of the enrichment tools are intended for microarray data and require parametric data, whereas proteomic data are often nonparametric. In this study, we aimed to select a suite of interactive tools that can enrich proteomic results with a graphical overview. This facilitated diagnosis and interpretation prior to further analysis. From a list of proteins, a network was constructed using a map of the most severely disrupted biological process, and the disease entity was then identified on the basis of clinical data. Taken together, this graphical and interactive method ranks potential proteins via functional analysis in order to improve the choice of biomarkers for validation with the following advantages: 1) It adds neighbor proteins that are not selected by mass spectrometry analysis, but could in fact be key proteins; 2) pinpoints the biological process most often involved; and 3) predicts the most likely disease on the basis of clinical data.
Affiliation(s)
- Karim Mezhoud
- UR04CNSTN01-Bio-computing Unit, Life Science Department, National Center for Nuclear Sciences and Technologies, Ariana, Tunisia.
8
Assessing numerical dependence in gene expression summaries with the jackknife expression difference. PLoS One 2012; 7:e39570. [PMID: 22876276 PMCID: PMC3411624 DOI: 10.1371/journal.pone.0039570] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 10/11/2011] [Accepted: 05/27/2012] [Indexed: 11/19/2022] Open
Abstract
Statistical methods to test for differential expression traditionally assume that each gene's expression summaries are independent across arrays. When certain preprocessing methods are used to obtain those summaries, this assumption is not necessarily true. In general, the erroneous assumption of dependence results in a loss of statistical power. We introduce a diagnostic measure of numerical dependence for gene expression summaries from any preprocessing method and discuss the relative performance of several common preprocessing methods with respect to this measure. Some common preprocessing methods introduce non-trivial levels of numerical dependence. The issue of (between-array) dependence has received little if any attention in the literature, and researchers working with gene expression data should not take such properties for granted, or they risk unnecessarily losing statistical power.
9
Xia LC, Steele JA, Cram JA, Cardon ZG, Simmons SL, Vallino JJ, Fuhrman JA, Sun F. Extended local similarity analysis (eLSA) of microbial community and other time series data with replicates. BMC Systems Biology 2011; 5 Suppl 2:S15. [PMID: 22784572 PMCID: PMC3287481 DOI: 10.1186/1752-0509-5-s2-s15] [Citation(s) in RCA: 146] [Impact Index Per Article: 11.2] [Indexed: 11/10/2022]
Abstract
Background The increasing availability of time series microbial community data from metagenomics and other molecular biological studies has enabled the analysis of large-scale microbial co-occurrence and association networks. Among the many analytical techniques available, the Local Similarity Analysis (LSA) method is unique in that it captures local and potentially time-delayed co-occurrence and association patterns in time series data that cannot otherwise be identified by ordinary correlation analysis. However LSA, as originally developed, does not consider time series data with replicates, which hinders the full exploitation of available information. With replicates, it is possible to understand the variability of local similarity (LS) score and to obtain its confidence interval. Results We extended our LSA technique to time series data with replicates and termed it extended LSA, or eLSA. Simulations showed the capability of eLSA to capture subinterval and time-delayed associations. We implemented the eLSA technique into an easy-to-use analytic software package. The software pipeline integrates data normalization, statistical correlation calculation, statistical significance evaluation, and association network construction steps. We applied the eLSA technique to microbial community and gene expression datasets, where unique time-dependent associations were identified. Conclusions The extended LSA analysis technique was demonstrated to reveal statistically significant local and potentially time-delayed association patterns in replicated time series data beyond that of ordinary correlation analysis. These statistically significant associations can provide insights to the real dynamics of biological systems. The newly designed eLSA software efficiently streamlines the analysis and is freely available from the eLSA homepage, which can be accessed at http://meta.usc.edu/softs/lsa.
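How a local, possibly time-delayed association differs from an ordinary correlation can be sketched with a small dynamic program. This is a simplified illustration and not the eLSA software: the function names are hypothetical, plain z-scoring stands in for whatever normalization the package applies, and replicates and significance testing are omitted.

```python
from statistics import mean, pstdev


def zscore(series):
    """Standardize a series to mean 0 and unit (population) variance."""
    m, s = mean(series), pstdev(series)
    if s == 0:
        raise ValueError("constant series cannot be standardized")
    return [(v - m) / s for v in series]


def local_similarity(x, y, max_delay=1):
    """Signed local similarity score between two equal-length series.

    Finds the best-scoring contiguous run of aligned time points in which
    the two series are offset by at most `max_delay` steps.
    """
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    n = len(x)
    x, y = zscore(x), zscore(y)
    # pos[i][j] / neg[i][j]: best positive / negative association score of a
    # run ending at x[i-1] aligned with y[j-1]; row and column 0 stay zero.
    pos = [[0.0] * (n + 1) for _ in range(n + 1)]
    neg = [[0.0] * (n + 1) for _ in range(n + 1)]
    best_pos = best_neg = 0.0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if abs(i - j) > max_delay:
                continue  # alignment would exceed the allowed delay
            prod = x[i - 1] * y[j - 1]
            pos[i][j] = max(0.0, pos[i - 1][j - 1] + prod)
            neg[i][j] = max(0.0, neg[i - 1][j - 1] - prod)
            best_pos = max(best_pos, pos[i][j])
            best_neg = max(best_neg, neg[i][j])
    sign = 1.0 if best_pos >= best_neg else -1.0
    return sign * max(best_pos, best_neg) / n


if __name__ == "__main__":
    x = [0, 0, 1, 3, 1, 0, 0, 0]
    y = [0, 0, 0, 1, 3, 1, 0, 0]  # same pulse, one time step later
    print(local_similarity(x, y, max_delay=0))
    print(local_similarity(x, y, max_delay=1))
```

In this toy case the score with `max_delay=1` is higher than with `max_delay=0`, because the pulse in `y` trails the pulse in `x` by one step and only the delayed alignment matches them up.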
Affiliation(s)
- Li C Xia
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089-2910, USA
10
Leichtle AB, Helmschrodt C, Ceglarek U, Shai I, Henkin Y, Schwarzfuchs D, Golan R, Gepner Y, Stampfer MJ, Blüher M, Stumvoll M, Thiery J, Fiedler GM. Effects of a 2-y dietary weight-loss intervention on cholesterol metabolism in moderately obese men. Am J Clin Nutr 2011; 94:1189-95. [PMID: 21940598 DOI: 10.3945/ajcn.111.018119] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 11/14/2022] Open
Abstract
BACKGROUND Long-term dietary weight loss results in complex metabolic changes. However, its effect on cholesterol metabolism in obese subjects is still unclear. OBJECTIVE We assessed the effects of 2 y of weight loss achieved with various diet regimens on phytosterols (markers of intestinal cholesterol absorption), lanosterol (marker of de novo cholesterol synthesis), and changes in apolipoprotein concentrations. DESIGN We conducted the 2-y Dietary Intervention Randomized Controlled Trial (DIRECT, a study of low-fat, Mediterranean, and low-carbohydrate diets). We assessed circulating phytosterol and lanosterol concentrations and their ratios to cholesterol and apolipoproteins A-I and B-100 in 90 DIRECT participants at 0, 6, and 24 mo. RESULTS We observed a significant upregulation of the markers of cholesterol absorption (campesterol: +16.8%, P < 0.001) and a downregulation of the markers of cholesterol synthesis (lanosterol: -16.5%, P = 0.008) during the active weight-loss phase (first 6 mo, weight loss of 5%, 6%, and 10% in the 3 diet groups, respectively), followed by a rebound (campesterol: -6.2%, P = 0.045; lanosterol: +43.7%, P < 0.001) during the next 18 mo (weight gain of 1%, 1%, and 2% in the 3 diet groups, respectively). HDL cholesterol continuously increased during the study (17.0%, P < 0.001), whereas LDL cholesterol remained constant. At the end of the 24-mo follow-up period, campesterol (P < 0.001) and lanosterol (P = 0.016) amounts were significantly higher than baseline values. The concentration of apolipoprotein B-100 correlated with cholesterol metabolism (ρ = 0.299 and P = 0.020 for lanosterol; ρ = -0.105 and NS for campesterol), and the homeostasis model assessment of insulin resistance correlated with lanosterol (ρ = 0.09, P = 0.001). CONCLUSIONS Long-term weight loss is related to a characteristic response suggestive of altered cholesterol and apolipoprotein metabolism. Various diets have a similar effect on these changes.
DIRECT is registered at clinicaltrials.gov as NCT00160108.
Affiliation(s)
- Alexander B Leichtle
- University Institute of Clinical Chemistry, Inselspital - Bern University Hospital, Switzerland.
11
Zhu D, Acharya L, Zhang H. A generalized multivariate approach to pattern discovery from replicated and incomplete genome-wide measurements. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2011; 8:1153-1169. [PMID: 21778521 DOI: 10.1109/tcbb.2010.102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/31/2023]
Abstract
Estimation of pairwise correlation from incomplete and replicated molecular profiling data is a ubiquitous problem in pattern discovery analysis, such as clustering and networking. However, existing methods solve this problem by ad hoc data imputation, followed by aveGation coefficient type approaches, which might annihilate important patterns present in the molecular profiling data. Moreover, these approaches do not consider and exploit the underlying experimental design information that specifies the replication mechanisms. We develop an Expectation-Maximization (EM) type algorithm to estimate the correlation structure using incomplete and replicated molecular profiling data with a priori known replication mechanism. The approach is sufficiently generalized to be applicable to any known replication mechanism. In case of unknown replication mechanism, it is reduced to the parsimonious model introduced previously. The efficacy of our approach was first evaluated by comprehensively comparing various bivariate and multivariate imputation approaches using simulation studies. Results from real-world data analysis further confirmed the superior performance of the proposed approach to the commonly used approaches, where we assessed the robustness of the method using data sets with up to 30 percent missing values.
Affiliation(s)
- Dongxiao Zhu
- Department of Computer Science, University of New Orleans, New Orleans, Children's Hospital, New Orleans, LA, USA.
12
Yauk CL, Rowan-Carroll A, Stead JD, Williams A. Cross-platform analysis of global microRNA expression technologies. BMC Genomics 2010; 11:330. [PMID: 20504329 PMCID: PMC2890562 DOI: 10.1186/1471-2164-11-330] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.9] [Received: 01/06/2010] [Accepted: 05/26/2010] [Indexed: 02/07/2023] Open
Abstract
Background Although analysis of microRNAs (miRNAs) by DNA microarrays is gaining in popularity, these new technologies have not been adequately validated. We examined within and between platform reproducibility of four miRNA array technologies alongside TaqMan PCR arrays. Results Two distinct pools of reference materials were selected in order to maximize differences in miRNA content. Filtering for miRNA that yielded signal above background revealed 54 miRNA probes (matched by sequence) across all platforms. Using this probeset as well as all probes that were present on an individual platform, within-platform analyses revealed Spearman correlations of >0.9 for most platforms. Comparing between platforms, rank analysis of the log ratios of the two reference pools also revealed high correlation (range 0.663-0.949). Spearman rank correlation and concordance correlation coefficients for miRNA arrays against TaqMan qRT-PCR arrays were similar for all of the technologies. Platform performances were similar to those of previous cross-platform exercises on mRNA and miRNA microarray technologies. Conclusions These data indicate that miRNA microarray platforms generated highly reproducible data and can be recommended for the study of changes in miRNA expression.
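The concordance correlation coefficient mentioned above differs from an ordinary correlation in that it also penalizes systematic offsets between two platforms. A minimal sketch using Lin's standard definition follows; the function name `concordance_ccc` is hypothetical and the use of population (1/n) moments is an assumption, not a detail taken from the study.

```python
def concordance_ccc(x, y):
    """Lin's concordance correlation coefficient for paired measurements.

    Equals 1 only when the points fall on the identity line y = x, so a
    constant shift or a scale difference between platforms lowers the value.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two paired series with at least 2 points")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Population (1/n) variances and covariance.
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    denom = sxx + syy + (mx - my) ** 2
    if denom == 0:
        raise ValueError("both series are constant and equal")
    return 2 * sxy / denom


if __name__ == "__main__":
    platform_a = [1.0, 2.0, 3.0, 4.0, 5.0]
    platform_b = [v + 1.0 for v in platform_a]  # perfectly correlated, shifted
    print(concordance_ccc(platform_a, platform_a))
    print(concordance_ccc(platform_a, platform_b))
```

In the usage example the second pair is perfectly correlated in the Pearson or Spearman sense, yet its concordance is below 1 because every value on the second platform is shifted by a constant.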
Affiliation(s)
- Carole L Yauk
- Environmental Health Sciences and Research Bureau, Health Canada, Ottawa, ON, Canada.
13
Williams A, Thomson EM. Effects of scanning sensitivity and multiple scan algorithms on microarray data quality. BMC Bioinformatics 2010; 11:127. [PMID: 20226031 PMCID: PMC2846908 DOI: 10.1186/1471-2105-11-127] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 12/21/2009] [Accepted: 03/12/2010] [Indexed: 11/10/2022] Open
Abstract
Background Maximizing the utility of DNA microarray data requires optimization of data acquisition through selection of an appropriate scanner setting. To increase the amount of useable data, several approaches have been proposed that incorporate multiple scans at different sensitivities to reduce the quantification error and to minimize effects of saturation. However, no direct comparison of their efficacy has been made. In the present study we compared individual scans at low, medium and high sensitivity with three methods for combining data from multiple scans (either 2-scan or 3-scan cases) using an actual dataset comprising 40 technical replicates of a reference RNA standard. Results Of the individual scans, the low scan exhibited the lowest background signal, the highest signal-to-noise ratio, and equivalent reproducibility to the medium and high scans. Most multiple scan approaches increased the range of probe intensities compared to the individual scans, but did not increase the dynamic range (the proportion of useable data). Approaches displayed striking differences in the background signal and signal-to-noise ratio. However, increased probe intensity range and improved signal-to-noise ratios did not necessarily correlate with improved reproducibility. Importantly, for one multiple scan method that combined 3 scans, reproducibility was significantly improved relative to individual scans and all other multiple scan approaches. The same method using 2 scans yielded significantly lower reproducibility, attributable to a lack-of-fit of the statistical model. Conclusions Our data indicate that implementation of a suitable multiple scan approach can improve reproducibility, but that model validation is critical to ensure accurate estimates of probe intensity.
Affiliation(s)
- Andrew Williams
- Population Health Studies Division, Environmental Health Science and Research Bureau, Health Canada, Ottawa, K1A 0K9, Canada.
14
Li H, Zhu D, Cook M. A statistical framework for consolidating "sibling" probe sets for Affymetrix GeneChip data. BMC Genomics 2008; 9:188. [PMID: 18435860 PMCID: PMC2397416 DOI: 10.1186/1471-2164-9-188] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Received: 09/05/2007] [Accepted: 04/24/2008] [Indexed: 12/02/2022] Open
Abstract
Background Affymetrix GeneChip typically contains multiple probe sets per gene, defined as sibling probe sets in this study. These probe sets may or may not behave similarly across treatments. The most appropriate way of consolidating sibling probe sets suitable for analysis is an open problem. We propose the Analysis of Variance (ANOVA) framework to decide which sibling probe sets can be consolidated. Results The ANOVA model allows us to separate the sibling probe sets into two types: those that behave similarly across treatments and those that behave differently across treatments. We found that consolidation of sibling probe sets of the former type results in a large increase in the number of differentially expressed genes under various statistical criteria. The approach to selecting sibling probe sets suitable for consolidating is implemented in the R language and freely available from . Conclusion Our ANOVA analysis of sibling probe sets provides a statistical framework for selecting sibling probe sets for consolidation. Consolidating sibling probe sets by pooling data from each greatly improves the estimates of a gene expression level and results in identification of more biologically relevant genes. Sibling probe sets that do not qualify for consolidation may represent annotation errors or other artifacts, or may correspond to differentially processed transcripts of the same gene that require further analysis.
Affiliation(s)
- Hua Li
- Bioinformatics Center, Stowers Institute for Medical Research, 1000 E 50th St, Kansas City, MO 64110, USA.