1
Jiao Z, Lai Y, Kang J, Gong W, Ma L, Jia T, Xie C, Xiang S, Cheng W, Heinz A, Desrivières S, Schumann G, Sun F, Feng J. A model-based approach to assess reproducibility for large-scale high-throughput MRI-based studies. Neuroimage 2022; 255:119166. [PMID: 35398282 DOI: 10.1016/j.neuroimage.2022.119166] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/11/2021] [Revised: 03/26/2022] [Accepted: 03/30/2022] [Indexed: 12/21/2022]
Abstract
Magnetic Resonance Imaging (MRI) technology has been increasingly used in neuroscience studies. Reproducibility of statistically significant findings generated by MRI-based studies, especially association studies (phenotype vs. MRI metric) and task-induced brain activation, has recently been heavily debated. However, most currently available reproducibility measures depend on thresholds for the test statistics and cannot be used to evaluate overall study reproducibility. It is also crucial to elucidate the relationship between overall study reproducibility and sample size in an experimental design. In this study, we proposed a model-based reproducibility index to quantify reproducibility which could be used in large-scale high-throughput MRI-based studies including both association studies and task-induced brain activation. We performed the model-based reproducibility assessments for a few association studies and task-induced brain activation by using several recent large sMRI/fMRI databases. For large sample size association studies between brain structure/function features and some basic physiological phenotypes (i.e. Sex, BMI), we demonstrated that the model-based reproducibility of these studies is more than 0.99. For MID task activation, similar results could be observed. Furthermore, we proposed a model-based analytical tool to evaluate minimal sample size for the purpose of achieving a desirable model-based reproducibility. Additionally, we evaluated the model-based reproducibility of gray matter volume (GMV) changes for UK Biobank (UKB) vs. Parkinson Progression Marker Initiative (PPMI) and UK Biobank (UKB) vs. Human Connectome Project (HCP). We demonstrated that both sample size and study-specific experimental factors play important roles in the model-based reproducibility assessments for different experiments. In summary, a systematic assessment of reproducibility is fundamental and important in the current large-scale high-throughput MRI-based studies.
Affiliation(s)
- Zeyu Jiao
- Shanghai Center for Mathematical Sciences, Fudan University, 220 Handan Road, Shanghai, China; Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China; MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China; Zhangjiang Fudan International Innovation Center, China
- Yinglei Lai
- School of Mathematical Sciences, University of Science and Technology of China, 96 Jinzhai Road, Hefei, Anhui 230026, China
- Jujiao Kang
- Shanghai Center for Mathematical Sciences, Fudan University, 220 Handan Road, Shanghai, China; Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China; MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China; Zhangjiang Fudan International Innovation Center, China
- Weikang Gong
- Center for Functional MRI of the Brain (FMRIB), Nuffield Department of Clinical Neurosciences, Wellcome Center for Integrative Neuroimaging, University of Oxford, Oxford, United Kingdom
- Liang Ma
- Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- Tianye Jia
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China; MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China; Zhangjiang Fudan International Innovation Center, China; Center for Population Neuroscience and Precision Medicine (PONS), Institute of Psychiatry, Psychology and Neuroscience, SGDP Center, King's College London, United Kingdom
- Chao Xie
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China; MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China; Zhangjiang Fudan International Innovation Center, China
- Shitong Xiang
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China; MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China; Zhangjiang Fudan International Innovation Center, China
- Wei Cheng
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China; MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China; Zhangjiang Fudan International Innovation Center, China
- Andreas Heinz
- Department of Psychiatry and Psychotherapy CCM, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Berlin, Germany
- Sylvane Desrivières
- Center for Population Neuroscience and Precision Medicine (PONS), Institute of Psychiatry, Psychology and Neuroscience, SGDP Center, King's College London, United Kingdom
- Gunter Schumann
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China; Center for Population Neuroscience and Precision Medicine (PONS), Institute of Psychiatry, Psychology and Neuroscience, SGDP Center, King's College London, United Kingdom; PONS Research Group, Department of Psychiatry and Psychotherapy, Campus Charite Mitte, Humboldt University, Berlin, Germany
- Fengzhu Sun
- Quantitative and Computational Biology Department, University of Southern California, 1050 Childs Way, Los Angeles, CA 90089, United States
- Jianfeng Feng
- Shanghai Center for Mathematical Sciences, Fudan University, 220 Handan Road, Shanghai, China; Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China; MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China; Zhangjiang Fudan International Innovation Center, China; Department of Computer Science, University of Warwick, Coventry CV4 7AL, United Kingdom; School of Life Science and the Collaborative Innovation Center for Brain Science, Fudan University, Shanghai, China.
2
Milhaud X, Pommeret D, Salhi Y, Vandekerkhove P. Semiparametric two-sample admixture components comparison test: The symmetric case. J Stat Plan Inference 2022. [DOI: 10.1016/j.jspi.2021.05.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
3
Lai Y, Zhang F, Nayak TK, Modarres R, Lee NH, McCaffrey TA. An efficient concordant integrative analysis of multiple large-scale two-sample expression data sets. Bioinformatics 2018; 33:3852-3860. [PMID: 28174897 DOI: 10.1093/bioinformatics/btx061] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 09/02/2016] [Accepted: 01/31/2017] [Indexed: 11/13/2022]
Abstract
Motivation We have proposed a mixture model based approach to the concordant integrative analysis of multiple large-scale two-sample expression datasets. Since the mixture model is based on the transformed differential expression test P-values (z-scores), it is generally applicable to the expression data generated by either microarray or RNA-seq platforms. The mixture model is simple with three normal distribution components for each dataset to represent down-regulation, up-regulation and no differential expression. However, when the number of datasets increases, the model parameter space increases exponentially due to the component combination from different datasets. Results In this study, motivated by the well-known generalized estimating equations (GEEs) for longitudinal data analysis, we focus on the concordant components and assume that the proportions of non-concordant components follow a special structure. We discuss the exchangeable, multiset coefficient and autoregressive structures for model reduction, and their related expectation-maximization (EM) algorithms. Then, the parameter space is linear with the number of datasets. In our previous study, we have applied the general mixture model to three microarray datasets for lung cancer studies. We show that more gene sets (or pathways) can be detected by the reduced mixture model with the exchangeable structure. Furthermore, we show that more genes can also be detected by the reduced model. The Cancer Genome Atlas (TCGA) data have been increasingly collected. The advantage of incorporating the concordance feature has also been clearly demonstrated based on TCGA RNA sequencing data for studying two closely related types of cancer. Availability and Implementation Additional results are included in a supplemental file. Computer program R-functions are freely available at http://home.gwu.edu/~ylai/research/Concordance. Contact ylai@gwu.edu.
Supplementary information Supplementary data are available at Bioinformatics online.
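The single-dataset building block described in this abstract (P-values transformed to z-scores, then a mixture of three normal components for down-regulation, no differential expression and up-regulation) can be sketched as follows. This is a minimal Python illustration, not the authors' R functions: the names `p_to_z` and `fit_three_component_mixture`, the starting values, and the use of a direction sign supplied alongside each two-sided P-value are all assumptions made for the example, and the multi-dataset concordance structures are not covered.

```python
import numpy as np
from statistics import NormalDist


def p_to_z(p_values, signs):
    """Map two-sided P-values (0 < p <= 1) to signed z-scores.

    `signs` (+1 / -1) encodes the direction of change; this encoding is
    an assumption of the sketch, not something stated in the abstract.
    """
    nd = NormalDist()
    return np.array([s * nd.inv_cdf(1.0 - p / 2.0)
                     for p, s in zip(p_values, signs)])


def fit_three_component_mixture(z, n_iter=200):
    """Plain EM for a 1-D mixture of three normals (down, null, up)."""
    z = np.asarray(z, dtype=float)
    mu = np.array([-2.0, 0.0, 2.0])   # hypothetical starting means
    sigma = np.ones(3)
    pi = np.full(3, 1.0 / 3.0)
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each z-score
        dens = np.exp(-0.5 * ((z[:, None] - mu) / sigma) ** 2)
        dens /= sigma * np.sqrt(2.0 * np.pi)
        resp = pi * dens + 1e-300      # guard against all-zero rows
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update proportions, means and standard deviations
        nk = resp.sum(axis=0)
        pi = nk / z.size
        mu = (resp * z[:, None]).sum(axis=0) / nk
        var = (resp * (z[:, None] - mu) ** 2).sum(axis=0) / nk
        sigma = np.maximum(np.sqrt(var), 1e-3)
    return pi, mu, sigma, resp
```

The returned `resp` matrix holds, for each gene, the posterior probabilities of the three states; a concordance analysis across datasets would combine such state probabilities, which is where the reduced structures discussed in the abstract come in.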
Affiliation(s)
- Yinglei Lai
- Department of Statistics, The George Washington University, Washington, DC 20052, USA
- Fanni Zhang
- Department of Statistics, The George Washington University, Washington, DC 20052, USA
- Tapan K Nayak
- Department of Statistics, The George Washington University, Washington, DC 20052, USA
- Reza Modarres
- Department of Statistics, The George Washington University, Washington, DC 20052, USA
- Timothy A McCaffrey
- Division of Genomic Medicine, Department of Medicine, The George Washington University Medical Center, Washington, DC 20037, USA
4
Lai Y, Zhang F, Nayak TK, Modarres R, Lee NH, McCaffrey TA. Detecting discordance enrichment among a series of two-sample genome-wide expression data sets. BMC Genomics 2017; 18:1050. [PMID: 28198679 PMCID: PMC5310286 DOI: 10.1186/s12864-016-3265-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 12/20/2022]
Abstract
BACKGROUND With the current microarray and RNA-seq technologies, two-sample genome-wide expression data have been widely collected in biological and medical studies. The related differential expression analysis and gene set enrichment analysis have been frequently conducted. Integrative analysis can be conducted when multiple data sets are available. In practice, discordant molecular behaviors among a series of data sets can be of biological and clinical interest. METHODS In this study, a statistical method is proposed for detecting discordance gene set enrichment. Our method is based on a two-level multivariate normal mixture model. It is statistically efficient with linearly increased parameter space when the number of data sets is increased. The model-based probability of discordance enrichment can be calculated for gene set detection. RESULTS We apply our method to a microarray expression data set collected from forty-five matched tumor/non-tumor pairs of tissues for studying pancreatic cancer. We divided the data set into a series of non-overlapping subsets according to the tumor/non-tumor paired expression ratio of gene PNLIP (pancreatic lipase, recently shown to be associated with pancreatic cancer). The log-ratio ranges from a negative value (i.e. more expressed in non-tumor tissue) to a positive value (i.e. more expressed in tumor tissue). Our purpose is to understand whether any gene sets are enriched in discordant behaviors among these subsets (when the log-ratio is increased from negative to positive). We focus on KEGG pathways. The detected pathways will be useful for our further understanding of the role of gene PNLIP in pancreatic cancer research. Among the top list of detected pathways, the neuroactive ligand receptor interaction and olfactory transduction pathways are the two most significant. Then, we consider gene TP53, which is well known for its role as a tumor suppressor in cancer research. The log-ratio also ranges from a negative value (i.e. more expressed in non-tumor tissue) to a positive value (i.e. more expressed in tumor tissue). We divided the microarray data set again according to the expression ratio of gene TP53. After the discordance enrichment analysis, we observed overall similar results and the above two pathways are still the most significant detections. More interestingly, only these two pathways have been identified for their association with pancreatic cancer in a pathway analysis of genome-wide association study (GWAS) data. CONCLUSIONS This study illustrates that some disease-related pathways can be enriched in discordant molecular behaviors when an important disease-related gene changes its expression. Our proposed statistical method is useful in the detection of these pathways. Furthermore, our method can also be applied to genome-wide expression data collected by the recent RNA-seq technology.
Affiliation(s)
- Yinglei Lai
- Department of Statistics, The George Washington University, 801 22nd St. N.W., Rome Hall, 7th Floor, Washington, 20052, D.C., USA.
- Fanni Zhang
- Department of Statistics, The George Washington University, 801 22nd St. N.W., Rome Hall, 7th Floor, Washington, 20052, D.C., USA
- Tapan K Nayak
- Department of Statistics, The George Washington University, 801 22nd St. N.W., Rome Hall, 7th Floor, Washington, 20052, D.C., USA
- Reza Modarres
- Department of Statistics, The George Washington University, 801 22nd St. N.W., Rome Hall, 7th Floor, Washington, 20052, D.C., USA
- Norman H Lee
- Department of Pharmacology and Physiology, The George Washington University Medical Center, Washington, 20037, D.C., USA
- Timothy A McCaffrey
- Department of Medicine, Division of Genomic Medicine, The George Washington University Medical Center, Washington, 20037, D.C., USA
5
Siska C, Kechris K. Differential correlation for sequencing data. BMC Res Notes 2017; 10:54. [PMID: 28103954 PMCID: PMC5244536 DOI: 10.1186/s13104-016-2331-9] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Received: 10/19/2016] [Accepted: 12/10/2016] [Indexed: 01/15/2023]
Abstract
BACKGROUND Several methods have been developed to identify differential correlation (DC) between pairs of molecular features from -omics studies. Most DC methods have only been tested with microarrays and other platforms producing continuous and Gaussian-like data. Sequencing data is in the form of counts, often modeled with a negative binomial distribution making it difficult to apply standard correlation metrics. We have developed an R package for identifying DC called Discordant which uses mixture models for correlations between features and the Expectation Maximization (EM) algorithm for fitting parameters of the mixture model. Several correlation metrics for sequencing data are provided and tested using simulations. Other extensions in the Discordant package include additional modeling for different types of differential correlation, and faster implementation, using a subsampling routine to reduce run-time and address the assumption of independence between molecular feature pairs. RESULTS With simulations and breast cancer miRNA-Seq and RNA-Seq data, we find that Spearman's correlation has the best performance among the tested correlation methods for identifying differential correlation. Application of Spearman's correlation in the Discordant method demonstrated the most power in ROC curves and sensitivity/specificity plots, and improved ability to identify experimentally validated breast cancer miRNA. We also considered including additional types of differential correlation, which showed a slight reduction in power due to the additional parameters that need to be estimated, but more versatility in applications. Finally, subsampling within the EM algorithm considerably decreased run-time with negligible effect on performance. CONCLUSIONS A new method and R package called Discordant is presented for identifying differential correlation with sequencing data. 
Based on comparisons with different correlation metrics, this study suggests Spearman's correlation is appropriate for sequencing data, but other correlation metrics are available to the user depending on the application and data type. The Discordant method can also be extended to investigate additional DC types and subsampling with the EM algorithm is now available for reduced run-time. These extensions to the R package make Discordant more robust and versatile for multiple -omics studies.
Affiliation(s)
- Charlotte Siska
- Computational Bioscience Program, Department of Pharmacology, School of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.
- Katerina Kechris
- Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
6
McKenzie AT, Katsyv I, Song WM, Wang M, Zhang B. DGCA: A comprehensive R package for Differential Gene Correlation Analysis. BMC Systems Biology 2016; 10:106. [PMID: 27846853 PMCID: PMC5111277 DOI: 10.1186/s12918-016-0349-1] [Citation(s) in RCA: 140] [Impact Index Per Article: 17.5] [Received: 05/23/2016] [Accepted: 11/03/2016] [Indexed: 12/31/2022]
Abstract
BACKGROUND Dissecting the regulatory relationships between genes is a critical step towards building accurate predictive models of biological systems. A powerful approach towards this end is to systematically study the differences in correlation between gene pairs in more than one distinct condition. RESULTS In this study we develop an R package, DGCA (for Differential Gene Correlation Analysis), which offers a suite of tools for computing and analyzing differential correlations between gene pairs across multiple conditions. To minimize parametric assumptions, DGCA computes empirical p-values via permutation testing. To understand differential correlations at a systems level, DGCA performs higher-order analyses such as measuring the average difference in correlation and multiscale clustering analysis of differential correlation networks. Through a simulation study, we show that the straightforward z-score based method that DGCA employs significantly outperforms the existing alternative methods for calculating differential correlation. Application of DGCA to the TCGA RNA-seq data in breast cancer not only identifies key changes in the regulatory relationships between TP53 and PTEN and their target genes in the presence of inactivating mutations, but also reveals an immune-related differential correlation module that is specific to triple negative breast cancer (TNBC). CONCLUSIONS DGCA is an R package for systematically assessing the difference in gene-gene regulatory relationships under different conditions. This user-friendly, effective, and comprehensive software tool will greatly facilitate the application of differential correlation analysis in many biological studies and thus will help identification of novel signaling pathways, biomarkers, and targets in complex biological systems and diseases.
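To make the two ideas named in this abstract concrete (a z-score for the difference in correlation between two conditions, and empirical p-values from permutation testing), here is a minimal Python sketch. It is not DGCA's R interface: the function names are hypothetical, and the z-score is assumed to be the standard difference of Fisher-transformed Pearson correlations, which the abstract does not spell out.

```python
import numpy as np


def fisher_z_diff(x1, y1, x2, y2):
    """z-score for the difference in Pearson correlation between two
    conditions, using the Fisher transformation (assumed form)."""
    r1 = np.corrcoef(x1, y1)[0, 1]
    r2 = np.corrcoef(x2, y2)[0, 1]
    # Clip so that arctanh stays finite for (near-)perfect correlations.
    r1, r2 = np.clip([r1, r2], -0.999999, 0.999999)
    se = np.sqrt(1.0 / (len(x1) - 3) + 1.0 / (len(x2) - 3))
    return (np.arctanh(r1) - np.arctanh(r2)) / se


def permutation_p_value(x1, y1, x2, y2, n_perm=1000, seed=0):
    """Empirical two-sided p-value obtained by shuffling condition labels
    while keeping each sample's (x, y) pair together."""
    rng = np.random.default_rng(seed)
    observed = abs(fisher_z_diff(x1, y1, x2, y2))
    x = np.concatenate([x1, x2])
    y = np.concatenate([y1, y2])
    n1 = len(x1)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(x))
        a, b = idx[:n1], idx[n1:]
        if abs(fisher_z_diff(x[a], y[a], x[b], y[b])) >= observed:
            count += 1
    # Add-one correction keeps the estimate strictly positive.
    return (count + 1) / (n_perm + 1)
```

Each condition needs more than three samples for the standard error to be defined. In a genome-wide setting the same permutations would be reused across all gene pairs rather than redrawn per pair, which this sketch does not attempt.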
Affiliation(s)
- Andrew T. McKenzie
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029 USA
- Icahn Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029 USA
- Medical Scientist Training Program, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029 USA
- Igor Katsyv
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029 USA
- Icahn Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029 USA
- Medical Scientist Training Program, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029 USA
- Won-Min Song
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029 USA
- Icahn Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029 USA
- Minghui Wang
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029 USA
- Icahn Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029 USA
- Bin Zhang
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029 USA
- Icahn Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029 USA
- Department of Genetics & Genomic Sciences, Icahn School of Medicine at Mount Sinai, 1470 Madison Avenue, Room S8-111, New York, NY 10029 USA
7
Siska C, Bowler R, Kechris K. The discordant method: a novel approach for differential correlation. Bioinformatics 2015; 32:690-6. [PMID: 26520855 DOI: 10.1093/bioinformatics/btv633] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 05/12/2015] [Accepted: 10/24/2015] [Indexed: 11/12/2022]
Abstract
MOTIVATION Current differential correlation methods are designed to determine molecular feature pairs that have the largest magnitude of difference between correlation coefficients. These methods do not easily capture molecular feature pairs that experience no correlation in one group but correlation in another, which may reflect certain types of biological interactions. We have developed a tool, the Discordant method, which categorizes the correlation types for each group to make this possible. RESULTS We compare the Discordant method to existing approaches using simulations and two biological datasets with different types of -omics data. In contrast to other methods, Discordant identifies phenotype-related features at a similar or higher rate while maintaining reasonable computational tractability and usability. AVAILABILITY AND IMPLEMENTATION R code and sample data are available at https://github.com/siskac/discordant. CONTACT katerina.kechris@ucdenver.edu. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Charlotte Siska
- Computational Bioscience Program, Department of Pharmacology, University of Colorado Denver
- Katerina Kechris
- Department of Biostatistics and Informatics, University of Colorado Denver, Denver, CO, USA
8
Abou-Elwafa Abdallah M. Advances in Instrumental Analysis of Brominated Flame Retardants: Current Status and Future Perspectives. International Scholarly Research Notices 2014; 2014:651834. [PMID: 27433482 PMCID: PMC4897317 DOI: 10.1155/2014/651834] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 04/27/2014] [Accepted: 07/14/2014] [Indexed: 11/18/2022]
Abstract
This review aims to highlight the recent advances and methodological improvements in instrumental techniques applied for the analysis of different brominated flame retardants (BFRs). The literature search strategy was based on the recent analytical reviews published on BFRs. The main selection criteria involved the successful development and application of analytical methods for determination of the target compounds in various environmental matrices. Different factors affecting chromatographic separation and mass spectrometric detection of brominated analytes were evaluated and discussed. Techniques using advanced instrumentation to achieve outstanding results in quantification of different BFRs and their metabolites/degradation products were highlighted. Finally, research gaps in the field of BFR analysis were identified and recommendations for future research were proposed.
Affiliation(s)
- Mohamed Abou-Elwafa Abdallah
- Division of Environmental Health and Risk Management, School of Geography, Earth and Environmental Sciences, University of Birmingham, Birmingham B15 2TT, UK
- Department of Analytical Chemistry, Faculty of Pharmacy, Assiut University, Assiut 71526, Egypt
9
Lai Y, Zhang F, Nayak TK, Modarres R, Lee NH, McCaffrey TA. Concordant integrative gene set enrichment analysis of multiple large-scale two-sample expression data sets. BMC Genomics 2014; 15 Suppl 1:S6. [PMID: 24564564 PMCID: PMC4046697 DOI: 10.1186/1471-2164-15-s1-s6] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Indexed: 11/12/2022]
Abstract
Background Gene set enrichment analysis (GSEA) is an important approach to the analysis of coordinate expression changes at a pathway level. Although many statistical and computational methods have been proposed for GSEA, the issue of a concordant integrative GSEA of multiple expression data sets has not been well addressed. Among different related data sets collected for the same or similar study purposes, it is important to identify pathways or gene sets with concordant enrichment. Methods We categorize the underlying true states of differential expression into three representative categories: no change, positive change and negative change. Due to data noise, what we observe from experiments may not indicate the underlying truth. Although these categories are not observed in practice, they can be considered in a mixture model framework. Then, we define the mathematical concept of concordant gene set enrichment and calculate its related probability based on a three-component multivariate normal mixture model. The related false discovery rate can be calculated and used to rank different gene sets. Results We used three published lung cancer microarray gene expression data sets to illustrate our proposed method. One analysis based on the first two data sets was conducted to compare our result with a previous published result based on a GSEA conducted separately for each individual data set. This comparison illustrates the advantage of our proposed concordant integrative gene set enrichment analysis. Then, with a relatively new and larger pathway collection, we used our method to conduct an integrative analysis of the first two data sets and also all three data sets. Both results showed that many gene sets could be identified with low false discovery rates. A consistency between both results was also observed. A further exploration based on the KEGG cancer pathway collection showed that a majority of these pathways could be identified by our proposed method. 
Conclusions This study illustrates that we can improve detection power and discovery consistency through a concordant integrative analysis of multiple large-scale two-sample gene expression data sets.
10
Li Y, Ghosh D. Assumption weighting for incorporating heterogeneity into meta-analysis of genomic data. Bioinformatics 2012; 28:807-14. [PMID: 22285559 DOI: 10.1093/bioinformatics/bts037] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 01/27/2023]
Abstract
MOTIVATION There is now a large literature on statistical methods for the meta-analysis of genomic data from multiple studies. However, a crucial assumption for performing many of these analyses is that the data exhibit small between-study variation or that this heterogeneity can be sufficiently modelled probabilistically. RESULTS In this article, we propose 'assumption weighting', which exploits a weighted hypothesis testing framework proposed by Genovese et al. to incorporate tests of between-study variation into the meta-analysis context. This methodology is fast and computationally simple to implement. Several weighting schemes are considered and compared using simulation studies. In addition, we illustrate application of the proposed methodology using data from several high-profile stem cell gene expression datasets.
Affiliation(s)
- Yihan Li
- Department of Statistics, Penn State University, University Park, PA 16802, USA
11
Differentially expressed RNA from public microarray data identifies serum protein biomarkers for cross-organ transplant rejection and other conditions. PLoS Comput Biol 2010; 6. [PMID: 20885780 PMCID: PMC2944782 DOI: 10.1371/journal.pcbi.1000940] [Citation(s) in RCA: 58] [Impact Index Per Article: 4.1] [Received: 05/25/2010] [Accepted: 08/23/2010] [Indexed: 02/03/2023]
Abstract
Serum proteins are routinely used to diagnose diseases, but are hard to find due to low sensitivity in screening the serum proteome. Public repositories of microarray data, such as the Gene Expression Omnibus (GEO), contain RNA expression profiles for more than 16,000 biological conditions, covering more than 30% of United States mortality. We hypothesized that genes coding for serum- and urine-detectable proteins, and showing differential expression of RNA in disease-damaged tissues would make ideal diagnostic protein biomarkers for those diseases. We showed that predicted protein biomarkers are significantly enriched for known diagnostic protein biomarkers in 22 diseases, with enrichment significantly higher in diseases for which at least three datasets are available. We then used this strategy to search for new biomarkers indicating acute rejection (AR) across different types of transplanted solid organs. We integrated three biopsy-based microarray studies of AR from pediatric renal, adult renal and adult cardiac transplantation and identified 45 genes upregulated in all three. From this set, we chose 10 proteins for serum ELISA assays in 39 renal transplant patients, and discovered three that were significantly higher in AR. Interestingly, all three proteins were also significantly higher during AR in the 63 cardiac transplant recipients studied. Our best marker, serum PECAM1, identified renal AR with 89% sensitivity and 75% specificity, and also showed increased expression in AR by immunohistochemistry in renal, hepatic and cardiac transplant biopsies. Our results demonstrate that integrating gene expression microarray measurements from disease samples and even publicly available data sets can be a powerful, fast, and cost-effective strategy for the discovery of new diagnostic serum protein biomarkers.
Protein biomarkers in the blood are urgently needed for the diagnosis of a wide variety of diseases to improve health care. We aim to find a fast and cost-effective strategy to discover diagnostic protein biomarkers. Hundreds of diseases have already been investigated using microarray technology, measuring the mRNA expression of all genes in the disease-damaged tissues. We analyzed biopsy-based microarray data for 41 diseases in the public repository, identified genes with dysregulated mRNA expression and detectable protein abundance in the blood, and predicted them as candidate diagnostic protein biomarkers. We found that clinically and preclinically validated diagnostic protein biomarkers were significantly enriched in our predicted protein candidates for 22 diseases. We then measured the concentrations of ten predicted protein biomarkers in the serum samples from 39 renal transplant patients. Three of them were confirmed to be diagnostic of acute rejection after renal transplantation. All three proteins were further confirmed to be diagnostic of acute rejection in 63 cardiac transplant recipients. Our results show that publicly available genome-wide gene expression data on disease-damaged tissues can be effectively translated into diagnostic protein biomarkers.
|
12
|
Microarray Data Classifier Consisting of k-Top-Scoring Rank-Comparison Decision Rules With a Variable Number of Genes. IEEE Trans Syst Man Cybern C Appl Rev 2010. [DOI: 10.1109/tsmcc.2009.2036594] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/06/2022]
|
13
|
You J, Cozzi P, Walsh B, Willcox M, Kearsley J, Russell P, Li Y. Innovative biomarkers for prostate cancer early diagnosis and progression. Crit Rev Oncol Hematol 2010; 73:10-22. [DOI: 10.1016/j.critrevonc.2009.02.007] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.9] [Received: 09/04/2008] [Revised: 02/05/2009] [Accepted: 02/25/2009] [Indexed: 02/07/2023]
|
14
|
Howard BE, Sick B, Heber S. Unsupervised assessment of microarray data quality using a Gaussian mixture model. BMC Bioinformatics 2009; 10:191. [PMID: 19545436 PMCID: PMC2717951 DOI: 10.1186/1471-2105-10-191] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 12/02/2008] [Accepted: 06/22/2009] [Indexed: 12/22/2022]
Abstract
BACKGROUND Quality assessment of microarray data is an important and often challenging aspect of gene expression analysis. This task frequently involves the examination of a variety of summary statistics and diagnostic plots. The interpretation of these diagnostics is often subjective, and generally requires careful expert scrutiny. RESULTS We show how an unsupervised classification technique based on the Expectation-Maximization (EM) algorithm and the naïve Bayes model can be used to automate microarray quality assessment. The method is flexible and can be easily adapted to accommodate alternate quality statistics and platforms. We evaluate our approach using Affymetrix 3' gene expression and exon arrays and compare the performance of this method to a similar supervised approach. CONCLUSION This research illustrates the efficacy of an unsupervised classification approach for the purpose of automated microarray data quality assessment. Since our approach requires only unannotated training data, it is easy to customize and to keep up-to-date as technology evolves. In contrast to other "black box" classification systems, this method also allows for intuitive explanations.
Affiliation(s)
- Brian E Howard
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC, USA.
|
15
|
Muir WM, Rosa GJM, Pittendrigh BR, Xu S, Rider SD, Fountain M, Ogas J. A mixture model approach for the analysis of small exploratory microarray experiments. Comput Stat Data Anal 2009; 53:1566-1576. [PMID: 20160862 DOI: 10.1016/j.csda.2008.06.011] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 10/21/2022]
Abstract
The microarray is an important and powerful tool for prescreening of genes for further research. However, alternative solutions are needed to increase power in small microarray experiments. Traditional parametric and even non-parametric tests lack power and have distributional problems in such small experiments. A mixture model is described that operates directly on expression differences, assuming that genes in alternative treatments are expressed or not in all combinations: (i) not expressed in either condition, (ii) expressed only under the first condition, (iii) expressed only under the second condition, and (iv) expressed under both conditions, giving rise to 4 possible clusters with two treatments. The approach is termed a Mean-Difference-Mixture-Model (MD-MM) method. Accuracy and power of the MD-MM were compared to other commonly used methods, using simulations, microarray data, and quantitative real time PCR (qRT-PCR). The MD-MM was found to be generally superior to other methods in most situations. The advantage was greatest in situations where there were few replicates, poor signal-to-noise ratios, or non-homogeneous variances.
Affiliation(s)
- W M Muir
- Dept. Animal Sciences, Purdue University, W. Lafayette IN 47907
|
16
|
Lai Y, Eckenrode SE, She JX. A statistical framework for integrating two microarray data sets in differential expression analysis. BMC Bioinformatics 2009; 10 Suppl 1:S23. [PMID: 19208123 PMCID: PMC2648727 DOI: 10.1186/1471-2105-10-s1-s23] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/10/2022]
Abstract
BACKGROUND Different microarray data sets can be collected for studying the same or similar diseases. We expect to achieve a more efficient analysis of differential expression if an efficient statistical method can be developed for integrating different microarray data sets. Although many statistical methods have been proposed for data integration, the genome-wide concordance of different data sets has not been well considered in the analysis. RESULTS Before considering data integration, it is necessary to evaluate the genome-wide concordance so that misleading results can be avoided. Based on the test results, different subsequent actions are suggested. The evaluation of genome-wide concordance and the data integration can be achieved based on the normal distribution based mixture models. CONCLUSION The results from our simulation study suggest that misleading results can be generated if the genome-wide concordance issue is not appropriately considered. Our method provides a rigorous parametric solution. The results also show that our method is robust to certain model misspecification and is practically useful for the integrative analysis of differential expression.
Affiliation(s)
- Yinglei Lai
- Department of Statistics and Biostatistics Center, The George Washington University, 2140 Pennsylvania Avenue, N.W., Washington, D.C. 20052, USA
- Sarah E Eckenrode
- Center for Biotechnology and Genomic Medicine, Medical College of Georgia, 1120 15th street, CA4098, GA 30912, USA
- Jin-Xiong She
- Center for Biotechnology and Genomic Medicine, Medical College of Georgia, 1120 15th street, CA4098, GA 30912, USA
|