1
Shahrjooihaghighi A, Frigui H, Zhang X, Wei X, Shi B, Trabelsi A. An Ensemble Feature Selection Method for Biomarker Discovery. PROCEEDINGS OF THE ... IEEE INTERNATIONAL SYMPOSIUM ON SIGNAL PROCESSING AND INFORMATION TECHNOLOGY. IEEE INTERNATIONAL SYMPOSIUM ON SIGNAL PROCESSING AND INFORMATION TECHNOLOGY 2017; 2017:416-421. [PMID: 30887013 PMCID: PMC6420823 DOI: 10.1109/isspit.2017.8388679] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 04/12/2023]
Abstract
Feature selection in Liquid Chromatography-Mass Spectrometry (LC-MS)-based metabolomics data (biomarker discovery) has become an important topic for machine learning researchers. The high dimensionality and small sample size of LC-MS data make feature selection a challenging task. The goal of biomarker discovery is to select the few most discriminative features among a large number of irrelevant ones. To improve the reliability of the discovered biomarkers, we use an ensemble-based approach. Ensemble learning can improve the accuracy of feature selection by combining multiple algorithms that have complementary information. In this paper, we propose an ensemble approach to combine the results of filter-based feature selection methods. To evaluate the proposed approach, we compared it to two commonly used methods, t-test and PLS-DA, using a real data set.
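The idea of combining several filter-based rankings can be illustrated with a small sketch. The abstract does not say which filters are combined or how, so the two scorers below (an absolute Welch t statistic and an absolute mean difference) and the rank-sum aggregation are assumptions chosen for illustration; all function names are hypothetical.

```python
from statistics import mean, stdev


def t_score(a, b):
    """Absolute Welch t statistic for one feature measured in two groups."""
    va = stdev(a) ** 2 / len(a)
    vb = stdev(b) ** 2 / len(b)
    denom = (va + vb) ** 0.5 or 1e-12  # guard against zero variance
    return abs(mean(a) - mean(b)) / denom


def mean_diff_score(a, b):
    """Absolute difference of group means (a second, simpler filter)."""
    return abs(mean(a) - mean(b))


def ranks(scores):
    """Convert scores to ranks, where rank 1 is the highest score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def ensemble_rank(group_a, group_b, scorers):
    """Sum the ranks from each filter and return feature indices, best first.

    group_a and group_b are lists of samples; each sample is a list of
    feature values. Rank-sum aggregation is an assumption of this sketch.
    """
    n_features = len(group_a[0])
    total = [0] * n_features
    for scorer in scorers:
        scores = [
            scorer([s[j] for s in group_a], [s[j] for s in group_b])
            for j in range(n_features)
        ]
        for j, r in enumerate(ranks(scores)):
            total[j] += r
    return sorted(range(n_features), key=lambda j: total[j])
```

Summing ranks rather than raw scores keeps filters with different scales comparable, which is one common reason to aggregate at the rank level.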
Affiliation(s)
- Aliasghar Shahrjooihaghighi
  - Department of Computer Engineering and Computer Science, University of Louisville, Louisville, KY 40292, USA
- Hichem Frigui
  - Department of Computer Engineering and Computer Science, University of Louisville, Louisville, KY 40292, USA
- Xiang Zhang
  - Department of Chemistry, University of Louisville, Louisville, KY 40292, USA
- Xiaoli Wei
  - Department of Chemistry, University of Louisville, Louisville, KY 40292, USA
- Biyun Shi
  - Department of Chemistry, University of Louisville, Louisville, KY 40292, USA
- Ameni Trabelsi
  - Department of Computer Engineering and Computer Science, University of Louisville, Louisville, KY 40292, USA
2
Abstract
Background: The Receiver Operator Characteristic (ROC) curve is well known for evaluating classification performance in the biomedical field. Owing to its superiority in dealing with imbalanced and cost-sensitive data, the ROC curve has been exploited as a popular metric to evaluate and identify disease-related genes (features). The existing ROC-based feature selection approaches are simple and effective in evaluating individual features. However, these approaches may fail to find the real target feature subset because they lack an effective means to reduce the redundancy between features, which is essential in machine learning. Results: In this paper, we propose to assess feature complementarity by measuring the distances between the misclassified instances and their nearest misses on the dimensions of pairwise features. If a misclassified instance and its nearest miss on one feature dimension are far apart on another feature dimension, the two features are regarded as complementary to each other. Subsequently, we propose a novel filter feature selection approach on the basis of the ROC analysis. The new approach employs an efficient heuristic search strategy to select optimal features with the highest complementarity. The experimental results on a broad range of microarray data sets validate that the classifiers built on the feature subset selected by our approach achieve the minimal balanced error rate with a small number of significant features. Conclusions: Compared with other ROC-based feature selection approaches, our new approach can select fewer features and effectively improve the classification performance.
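As a point of reference for the individual-feature evaluation that the abstract attributes to existing ROC-based approaches, the sketch below scores each feature by its area under the ROC curve, computed from pairwise comparisons (the Mann-Whitney formulation). It does not implement the complementarity measure or the heuristic search proposed in the paper; the function names and the 0/1 label convention are assumptions.

```python
def feature_auc(pos, neg):
    """AUC of a single feature: fraction of (positive, negative) pairs
    in which the positive sample has the larger value; ties count half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rank_features_by_auc(X, y):
    """Rank feature indices by how well each one separates the classes.

    X is a list of samples (lists of feature values) and y holds labels
    1 (positive) or 0 (negative). A feature with AUC below 0.5 is still
    discriminative with the direction reversed, so max(auc, 1 - auc)
    is used as the score.
    """
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        auc = feature_auc(pos, neg)
        scores.append(max(auc, 1.0 - auc))
    order = sorted(range(n_features), key=lambda j: -scores[j])
    return order, scores
```

Because each feature is scored in isolation, two highly correlated features receive similar scores and both rank near the top, which is the redundancy problem the abstract describes.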
3
Frades I, Andreasson E, Mato JM, Alexandersson E, Matthiesen R, Martínez-Chantar ML. Integrative genomic signatures of hepatocellular carcinoma derived from nonalcoholic Fatty liver disease. PLoS One 2015; 10:e0124544. [PMID: 25993042 PMCID: PMC4439034 DOI: 10.1371/journal.pone.0124544] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.0] [Received: 01/02/2015] [Accepted: 03/05/2015] [Indexed: 12/11/2022]
Abstract
Nonalcoholic fatty liver disease (NAFLD) is a risk factor for hepatocellular carcinoma (HCC), but the transition from NAFLD to HCC is poorly understood. Feature selection algorithms were applied to NAFLD and HCC microarray data from humans and genetically modified mice to generate signatures of NAFLD progression and HCC differential survival. These signatures were used to study the pathogenesis of NAFLD-derived HCC and to explore which subtypes of cancer can be investigated using mouse models. Our findings show that: (I) HNF4 is a common potential transcription factor mediating the transcription of NAFLD progression genes; (II) mouse HCC derived from NAFLD co-clusters with a less aggressive human HCC subtype of differential prognosis and mixed etiology; (III) the HCC survival signature is able to correctly classify 95% of the samples and gives Fgf20 and Tgfb1i1 as the most robust genes for prediction; (IV) the expression values of the genes composing the signature in an independent human HCC dataset revealed different HCC subtypes showing differences in survival time by a log-rank test. In summary, we present marker signatures for the molecular pathogenesis of NAFLD-derived HCC at both the gene and pathway level.
Affiliation(s)
- Itziar Frades
  - Metabolomics Unit, CIC bioGUNE, Centro de Investigación Cooperativa en Biociencias, Bizkaia Technology Park, Derio, Bizkaia, Spain
  - Department of Plant Protection Biology, Swedish University of Agricultural Sciences, Alnarp, Sweden
- Erik Andreasson
  - Department of Plant Protection Biology, Swedish University of Agricultural Sciences, Alnarp, Sweden
- Jose Maria Mato
  - Metabolomics Unit, CIC bioGUNE, Centro de Investigación Cooperativa en Biociencias, Bizkaia Technology Park, Derio, Bizkaia, Spain
- Erik Alexandersson
  - Department of Plant Protection Biology, Swedish University of Agricultural Sciences, Alnarp, Sweden
- Rune Matthiesen
  - Department of Human genetics, National Health Institute Doutor Ricardo Jorge, Lisboa, Portugal
- Mª Luz Martínez-Chantar
  - Metabolomics Unit, CIC bioGUNE, Centro de Investigación Cooperativa en Biociencias, Bizkaia Technology Park, Derio, Bizkaia, Spain
4
Ulfenborg B, Klinga-Levan K, Olsson B. Classification of tumor samples from expression data using decision trunks. Cancer Inform 2013; 12:53-66. [PMID: 23467331 PMCID: PMC3579425 DOI: 10.4137/cin.s10356] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/08/2022]
Abstract
We present a novel machine learning approach for the classification of cancer samples using expression data. We refer to the method as “decision trunks,” since it is loosely based on decision trees, but contains several modifications designed to achieve an algorithm that: (1) produces smaller and more easily interpretable classifiers than decision trees; (2) is more robust in varying application scenarios; and (3) achieves higher classification accuracy. The decision trunk algorithm has been implemented and tested on 26 classification tasks, covering a wide range of cancer forms, experimental methods, and classification scenarios. This comprehensive evaluation indicates that the proposed algorithm performs at least as well as the current state of the art algorithms in terms of accuracy, while producing classifiers that include on average only 2–3 markers. We suggest that the resulting decision trunks have clear advantages over other classifiers due to their transparency, interpretability, and their correspondence with human decision-making and clinical testing practices.
Affiliation(s)
- Benjamin Ulfenborg
  - Systems Biology Research Centre, School of Life Sciences, University of Skövde, Skövde, Sweden
5
A robust hybrid approach based on estimation of distribution algorithm and support vector machine for hunting candidate disease genes. ScientificWorldJournal 2013; 2013:393570. [PMID: 23476131 PMCID: PMC3582165 DOI: 10.1155/2013/393570] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 10/23/2012] [Accepted: 11/25/2012] [Indexed: 11/27/2022]
Abstract
Microarray data are high-dimensional, with a high noise ratio and a relatively small sample size, which makes it challenging to use them to identify candidate disease genes. Here, we present a hybrid method that combines an estimation of distribution algorithm with a support vector machine for the selection of key feature genes. We benchmarked the method using microarray data from both diffuse B cell lymphoma and colon cancer to demonstrate its performance in identifying key features from high-dimensional gene expression profile data. The method was compared with a probabilistic model based on a genetic algorithm and with another hybrid method based on both a genetic algorithm and a support vector machine. The results showed that the proposed method provides a new computational strategy for hunting candidate disease genes from disease gene expression profile data. The selected candidate disease genes may help to improve the diagnosis and treatment of diseases.
6
Chen IBD, Rathi VK, DeAndrade DS, Jay PY. Association of genes with physiological functions by comparative analysis of pooled expression microarray data. Physiol Genomics 2012; 45:69-78. [PMID: 23170034 DOI: 10.1152/physiolgenomics.00116.2012] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 12/15/2022]
Abstract
The physiological functions of a tissue in the body are carried out by its complement of expressed genes. Genes that execute a particular function should be more specifically expressed in tissues that perform the function. Given this premise, we mined public microarray expression data to build a database of genes ranked by their specificity of expression in multiple organs. The database permitted the accurate identification of genes and functions known to be specific to individual organs. Next, we used the database to predict transcriptional regulators of brown adipose tissue (BAT) and validated two candidate genes. Based upon hypotheses regarding pathways shared between combinations of BAT or white adipose tissue (WAT) and other organs, we identified genes that met threshold criteria for specific or counterspecific expression in each tissue. By contrasting WAT to the heart and BAT, the two most mitochondria-rich tissues in the body, we discovered a novel function for the transcription factor ESRRG in the induction of BAT genes in white adipocytes. Because the heart and other estrogen-related receptor gamma (ESRRG)-rich tissues do not express BAT markers, we hypothesized that an adipocyte co-regulator acts with ESRRG. By comparing WAT and BAT to the heart, brain, kidney and skeletal muscle, we discovered that an isoform of the transcription factor sterol regulatory element binding transcription factor 1 (SREBF1) induces BAT markers in C2C12 myocytes in the presence of ESRRG. The results demonstrate a straightforward bioinformatic strategy to associate genes with functions. The database upon which the strategy is based is provided so that investigators can perform their own screens.
Affiliation(s)
- Iuan-bor D Chen
  - Department of Pediatrics, Washington University School of Medicine, St. Louis, Missouri, USA
7
Mohammadi A, Saraee MH, Salehi M. Identification of disease-causing genes using microarray data mining and Gene Ontology. BMC Med Genomics 2011; 4:12. [PMID: 21269461 PMCID: PMC3037837 DOI: 10.1186/1755-8794-4-12] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.4] [Received: 07/10/2010] [Accepted: 01/26/2011] [Indexed: 01/25/2023]
Abstract
BACKGROUND One of the best and most accurate methods for identifying disease-causing genes is monitoring gene expression values in different samples using microarray technology. One of the shortcomings of microarray data is that they provide a small quantity of samples with respect to the number of genes. This problem reduces the classification accuracy of the methods, so gene selection is essential to improve the predictive accuracy and to identify potential marker genes for a disease. Among numerous existing methods for gene selection, support vector machine-based recursive feature elimination (SVMRFE) has become one of the leading methods, but its performance can be reduced because of the small sample size, noisy data and the fact that the method does not remove redundant genes. METHODS We propose a novel framework for gene selection which uses the advantageous features of conventional methods and addresses their weaknesses. In fact, we have combined the Fisher method and SVMRFE to utilize the advantages of a filtering method as well as an embedded method. Furthermore, we have added a redundancy reduction stage to address the weakness of the Fisher method and SVMRFE. In addition to gene expression values, the proposed method uses Gene Ontology which is a reliable source of information on genes. The use of Gene Ontology can compensate, in part, for the limitations of microarrays, such as having a small number of samples and erroneous measurement results. RESULTS The proposed method has been applied to colon, Diffuse Large B-Cell Lymphoma (DLBCL) and prostate cancer datasets. The empirical results show that our method has improved classification performance in terms of accuracy, sensitivity and specificity. In addition, the study of the molecular function of selected genes strengthened the hypothesis that these genes are involved in the process of cancer growth. 
CONCLUSIONS The proposed method addresses the weakness of conventional methods by adding a redundancy reduction stage and utilizing Gene Ontology information. It predicts marker genes for colon, DLBCL and prostate cancer with a high accuracy. The predictions made in this study can serve as a list of candidates for subsequent wet-lab verification and might help in the search for a cure for cancers.
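The filtering stage mentioned in the abstract can be sketched with a Fisher-style score. The abstract does not give the formula, so the two-class form below (squared difference of class means divided by the sum of class variances) is an assumption, and the function names are hypothetical. The SVMRFE stage, the redundancy reduction stage, and the use of Gene Ontology are not shown.

```python
from statistics import mean, pvariance


def fisher_score(class_a, class_b):
    """Two-class Fisher-style score for one gene (assumed form):
    (mean_a - mean_b)^2 / (var_a + var_b)."""
    numerator = (mean(class_a) - mean(class_b)) ** 2
    denominator = pvariance(class_a) + pvariance(class_b)
    if denominator == 0:
        # Both classes are constant: perfectly separated or identical.
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator


def rank_genes(expr_a, expr_b):
    """Rank gene names by descending score.

    expr_a and expr_b map each gene name to its list of expression
    values in class A and class B respectively.
    """
    scores = {g: fisher_score(expr_a[g], expr_b[g]) for g in expr_a}
    return sorted(scores, key=lambda g: -scores[g])
```

A filter like this is cheap to compute per gene, which is why it is a natural first pass before a more expensive embedded method runs on the reduced gene set.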
Affiliation(s)
- Azadeh Mohammadi
  - Intelligent Databases, Data mining and Bioinformatics Laboratory, Isfahan University of Technology, Isfahan, Iran
- Mohammad H Saraee
  - Intelligent Databases, Data mining and Bioinformatics Laboratory, Isfahan University of Technology, Isfahan, Iran
- Mansoor Salehi
  - Dept. of Genetics, Medical School, Isfahan University of Medical Sciences, Isfahan, Iran
8
Stability of ranked gene lists in large microarray analysis studies. J Biomed Biotechnol 2010; 2010:616358. [PMID: 20625502 PMCID: PMC2896709 DOI: 10.1155/2010/616358] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.1] [Received: 01/28/2010] [Accepted: 05/17/2010] [Indexed: 11/29/2022]
Abstract
This paper presents an empirical study that aims to explain the relationship between the number of samples and stability of different gene selection techniques for microarray datasets. Unlike other similar studies where number of genes in a ranked gene list is variable, this study uses an alternative approach where stability is observed at different number of samples that are used for gene selection. Three different metrics of stability, including a novel metric in bioinformatics, were used to estimate the stability of the ranked gene lists. Results of this study demonstrate that the univariate selection methods produce significantly more stable ranked gene lists than the multivariate selection methods used in this study. More specifically, thousands of samples are needed for these multivariate selection methods to achieve the same level of stability any given univariate selection method can achieve with only hundreds.
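To make the notion of ranked-list stability concrete, the sketch below measures how much the top-k genes overlap between rankings obtained from different subsamples. The abstract does not name its three stability metrics, so the Jaccard index used here is an assumption and is not claimed to be one of them; the function names are hypothetical.

```python
from itertools import combinations


def top_k_jaccard(rank_a, rank_b, k):
    """Jaccard overlap of the top-k genes of two ranked lists (k >= 1)."""
    a, b = set(rank_a[:k]), set(rank_b[:k])
    return len(a & b) / len(a | b)


def mean_pairwise_stability(rankings, k):
    """Average top-k Jaccard overlap over all pairs of ranked lists.

    Each element of rankings is a ranked gene list produced from a
    different subsample of the data; at least two lists are required.
    """
    pairs = list(combinations(rankings, 2))
    return sum(top_k_jaccard(a, b, k) for a, b in pairs) / len(pairs)
```

Repeating this for subsamples of increasing size gives a stability-versus-sample-size curve of the kind the study compares across selection methods.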
9
Chen AH, Tsau YW, Lin CH. Novel methods to identify biologically relevant genes for leukemia and prostate cancer from gene expression profiles. BMC Genomics 2010; 11:274. [PMID: 20433712 PMCID: PMC2873479 DOI: 10.1186/1471-2164-11-274] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Received: 11/26/2009] [Accepted: 04/30/2010] [Indexed: 11/24/2022]
Abstract
Background High-throughput microarray experiments now permit researchers to screen thousands of genes simultaneously and determine the different expression levels of genes in normal or cancerous tissues. In this paper, we address the challenge of selecting a relevant and manageable subset of genes from a large microarray dataset. Currently, most gene selection methods focus on identifying a set of genes that can further improve classification accuracy. Few or none of these small sets of genes, however, are biologically relevant (i.e. supported by medical evidence). To deal with this critical issue, we propose two novel methods that can identify biologically relevant genes concerning cancers. Results In this paper, we propose two novel techniques, entitled random forest gene selection (RFGS) and support vector sampling technique (SVST). Compared with results from six other methods developed in this paper, we demonstrate experimentally that RFGS and SVST can identify more biologically relevant genes in patients with leukemia or prostate cancer. Among the top 25 genes selected using SVST method, 15 genes were biologically relevant genes in patients with leukemia and 13 genes were biologically relevant genes in patients with prostate cancer. Meanwhile, the RFGS method, while less effective than SVST, still identified an average of 9 biologically relevant genes in both leukemia and prostate cancers. In contrast to traditional statistical methods, which only identify less than 8 genes in patients with leukemia and less than 8 genes in patients with prostate cancer, our methods yield significantly better results. Conclusions Our proposed SVST and RFGS methods are novel approaches that can identify a greater number of biologically relevant genes. These methods have been successfully applied to both leukemia and prostate cancers. 
Research in the fields of biology and medicine should benefit from the identification of biologically relevant genes by confirming recent discoveries in cancer research or suggesting new avenues for exploration.
Affiliation(s)
- Austin H Chen
  - Department of Medical Informatics, Tzu Chi University, Hualien City, Hualien County, Taiwan.
10
Yao B, Li S. ANMM4CBR: a case-based reasoning method for gene expression data classification. Algorithms Mol Biol 2010; 5:14. [PMID: 20051140 PMCID: PMC2843690 DOI: 10.1186/1748-7188-5-14] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Received: 08/04/2009] [Accepted: 01/06/2010] [Indexed: 11/20/2022]
Abstract
BACKGROUND Accurate classification of microarray data is critical for successful clinical diagnosis and treatment. The "curse of dimensionality" problem and noise in the data, however, undermine the performance of many algorithms. METHOD In order to obtain a robust classifier, a novel Additive Nonparametric Margin Maximum for Case-Based Reasoning (ANMM4CBR) method is proposed in this article. ANMM4CBR employs a case-based reasoning (CBR) method for classification. CBR is a suitable paradigm for microarray analysis, where the rules that define the domain knowledge are difficult to obtain because usually only a small number of training samples are available. Moreover, in order to select the most informative genes, we propose to perform feature selection via additively optimizing a nonparametric margin maximum criterion, which is defined based on gene pre-selection and sample clustering. Our feature selection method is very robust to noise in the data. RESULTS The effectiveness of our method is demonstrated on both simulated and real data sets. We show that the ANMM4CBR method performs better than some state-of-the-art methods such as support vector machine (SVM) and k nearest neighbor (kNN), especially when the data contain a high level of noise. AVAILABILITY The source code is attached as an additional file of this paper.
Affiliation(s)
- Bangpeng Yao
  - MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, Beijing 100084, PR China
- Shao Li
  - MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, Beijing 100084, PR China
11
Huerta EB, Duval B, Hao JK. Fuzzy logic for elimination of redundant information of microarray data. GENOMICS PROTEOMICS & BIOINFORMATICS 2009; 6:61-73. [PMID: 18973862 PMCID: PMC5054105 DOI: 10.1016/s1672-0229(08)60021-2] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 12/17/2022]
Abstract
Gene subset selection is essential for classification and analysis of microarray data. However, gene selection is known to be a very difficult task since gene expression data not only have high dimensionalities, but also contain redundant information and noises. To cope with these difficulties, this paper introduces a fuzzy logic based pre-processing approach composed of two main steps. First, we use fuzzy inference rules to transform the gene expression levels of a given dataset into fuzzy values. Then we apply a similarity relation to these fuzzy values to define fuzzy equivalence groups, each group containing strongly similar genes. Dimension reduction is achieved by considering for each group of similar genes a single representative based on mutual information. To assess the usefulness of this approach, extensive experimentations were carried out on three well-known public datasets with a combined classification model using three statistic filters and three classifiers.
12
Kim KY, Ki DH, Jeung HC, Chung HC, Rha SY. Improving the prediction accuracy in classification using the combined data sets by ranks of gene expressions. BMC Bioinformatics 2008; 9:283. [PMID: 18554423 PMCID: PMC2442106 DOI: 10.1186/1471-2105-9-283] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Received: 01/14/2008] [Accepted: 06/16/2008] [Indexed: 11/10/2022]
Abstract
Background: Data sets generated under different experimental conditions may yield inconsistent information even when the experiments share the same research objectives. Moreover, even when data sets are generated on the same platform, their agreement may be affected by technical variation among laboratories. In such cases, it is necessary to combine the data sets after adjusting for the differences between them in order to detect more reliable information. Results: The proposed method discretizes each data set based on the ranks of the gene expression ratios and then combines the discretized data sets; a statistical method is applied to the combined data set for predictive gene selection. The efficiency of the proposed method was evaluated using five colon cancer related data sets, which were generated using cDNA microarrays with different RNA sources, with one experiment using oligonucleotide arrays. NCI-60 cell line data sets were also used, generated on two different platforms: cDNA microarrays and Affymetrix HU6800 oligonucleotide arrays. The data set combined by the proposed method predicted the test data sets more accurately than the separate data sets did. Biologically significant genes that were missed in the separate data sets were detected in the combined data set. Conclusion: By transforming gene expression values using ranks, the proposed method is not influenced by systematic bias among chips or by the normalization method. The method may be especially useful for finding predictive genes from data sets with different scales of gene expression.
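The rank-based transformation can be sketched as follows. The abstract does not state how many discrete levels are used or where the cutoffs lie, so the three-level scheme (-1 for the lowest third of genes in a sample, 1 for the highest third, 0 otherwise) is an assumption, as are the function and parameter names.

```python
def discretize_by_rank(sample, tail_divisor=3):
    """Map one sample's expression ratios to -1, 0, or 1 by rank.

    The len(sample) // tail_divisor lowest-ranked genes become -1 and
    the same number of highest-ranked genes become 1 (assumed scheme).
    """
    n = len(sample)
    n_tail = n // tail_divisor
    order = sorted(range(n), key=lambda i: sample[i])  # ascending by value
    levels = [0] * n
    for position, index in enumerate(order):
        if position < n_tail:
            levels[index] = -1
        elif position >= n - n_tail:
            levels[index] = 1
    return levels


def combine_data_sets(data_sets):
    """Discretize every sample in every data set, then pool the samples.

    Each data set is a list of samples over the same ordered gene list.
    """
    return [discretize_by_rank(s) for data_set in data_sets for s in data_set]
```

Because only the within-sample ordering is used, multiplying a sample by a positive constant leaves its discretized values unchanged, which illustrates why a rank transform is insensitive to scale differences between platforms.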
Affiliation(s)
- Ki-Yeol Kim
  - Oral Cancer Research Institute, Yonsei University College of Dentistry, Seoul, 120-752, South Korea.
13
Dawes NL, Glassey J. Normalisation of multicondition cDNA macroarray data. Comp Funct Genomics 2007:90578. [PMID: 17538691 PMCID: PMC1872052 DOI: 10.1155/2007/90578] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 07/17/2006] [Revised: 12/21/2006] [Accepted: 02/28/2007] [Indexed: 11/17/2022]
Abstract
Background. Normalisation is a critical step in obtaining meaningful information from high-dimensional DNA array data. This is particularly important when complex biological hypotheses/questions, such as functional analysis and regulatory interactions within biological systems, are investigated. A nonparametric, intensity-dependent normalisation method based on global identification of a self-consistent set (SCS) of genes is proposed here for such systems.
Results. The SCS normalisation is introduced and its behaviour demonstrated for a range of user-defined parameters affecting its performance. It is compared to a standard global normalisation method in terms of noise reduction and signal retention. Conclusions. The SCS normalisation results using 16 macroarray data sets from a Bacillus subtilis experiment confirm that the method is capable of reducing undesirable experimental variation whilst retaining important biological information. The ease and speed of implementation mean that this method can be easily adapted to other multicondition time/strain series single colour array data.
Affiliation(s)
- Nicola L. Dawes
  - School of Chemical Engineering and Advanced Materials, Merz Court, University of Newcastle upon Tyne, Newcastle upon Tyne, NE1 7RU, UK
- Jarka Glassey
  - School of Chemical Engineering and Advanced Materials, Merz Court, University of Newcastle upon Tyne, Newcastle upon Tyne, NE1 7RU, UK
14
Tan YD, Fornage M, Fu YX. Ranking analysis of microarray data: a powerful method for identifying differentially expressed genes. Genomics 2006; 88:846-854. [PMID: 16979869 PMCID: PMC2584353 DOI: 10.1016/j.ygeno.2006.08.003] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.2] [Received: 03/24/2006] [Revised: 07/31/2006] [Accepted: 08/02/2006] [Indexed: 11/28/2022]
Abstract
Microarray technology provides a powerful tool for the expression profile of thousands of genes simultaneously, which makes it possible to explore the molecular and metabolic etiology of the development of a complex disease under study. However, classical statistical methods and technologies fail to be applicable to microarray data. Therefore, it is necessary and motivating to develop powerful methods for large-scale statistical analyses. In this paper, we described a novel method, called Ranking Analysis of Microarray Data (RAM). RAM, which is a large-scale two-sample t-test method, is based on comparisons between a set of ranked T statistics and a set of ranked Z values (a set of ranked estimated null scores) yielded by a "randomly splitting" approach instead of a "permutation" approach and a two-simulation strategy for estimating the proportion of genes identified by chance, i.e., the false discovery rate (FDR). The results obtained from the simulated and observed microarray data show that RAM is more efficient in identification of genes differentially expressed and estimation of FDR under undesirable conditions such as a large fudge factor, small sample size, or mixture distribution of noises than Significance Analysis of Microarrays.
Affiliation(s)
- Yuan-De Tan
  - Institute of Molecular Medicine, School of Public Health, University of Texas at Houston, Houston, TX 77030, USA
- Myriam Fornage
  - Institute of Molecular Medicine, School of Public Health, University of Texas at Houston, Houston, TX 77030, USA
- Yun-Xin Fu
  - Laboratory for Conservation and Utilization of Bioresources, Yunnan University, Kunming, Yunnan 650, China; Human Genetics Center, School of Public Health, University of Texas at Houston, Houston, TX 77030, USA.
15
Gao X, Song PXK. Nonparametric tests for differential gene expression and interaction effects in multi-factorial microarray experiments. BMC Bioinformatics 2005; 6:186. [PMID: 16042764 PMCID: PMC1199581 DOI: 10.1186/1471-2105-6-186] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.6] [Received: 04/19/2005] [Accepted: 07/21/2005] [Indexed: 11/10/2022]
Abstract
Background: Numerous nonparametric approaches have been proposed in the literature to detect differential gene expression in the setting of two user-defined groups. However, there is a lack of nonparametric procedures for analyzing microarray data in which multiple factors contribute to gene expression. Furthermore, incorporating interaction effects in the analysis of microarray data has long been of great interest to biological scientists, but little of this has been investigated in the nonparametric framework. Results: In this paper, we propose a set of nonparametric tests to detect treatment effects, clinical covariate effects, and interaction effects for multifactorial microarray data. When the distribution of expression data is skewed or heavy-tailed, the rank tests are substantially more powerful than the competing parametric F tests. On the other hand, in the case of light- or medium-tailed distributions, the rank tests appear to be marginally less powerful than the parametric competitors. Conclusion: The proposed rank tests enable us to detect differential gene expression and establish interaction effects for microarray data with various non-normally distributed expression measurements across the genome. In the presence of outliers, they are advantageous alternatives to the existing parametric F tests because of their robustness.
Affiliation(s)
- Xin Gao
  - Department of Mathematics and Statistics, York University, 4700 Keele Street, Toronto, ON M3J 1P3, Canada
- Peter XK Song
  - Department of Statistics and Actuarial Science, University of Waterloo, 200 University Ave. W., Waterloo, ON N2L 3G1, Canada
16
Lyons-Weiler J, Patel S, Becich MJ, Godfrey TE. Tests for finding complex patterns of differential expression in cancers: towards individualized medicine. BMC Bioinformatics 2004; 5:110. [PMID: 15307894 PMCID: PMC514539 DOI: 10.1186/1471-2105-5-110] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.8] [Received: 02/11/2004] [Accepted: 08/12/2004] [Indexed: 11/10/2022]
Abstract
Background Microarray studies in cancer compare expression levels between two or more sample groups on thousands of genes. Data analysis follows a population-level approach (e.g., comparison of sample means) to identify differentially expressed genes. This leads to the discovery of 'population-level' markers, i.e., genes with the expression patterns A > B and B > A. We introduce the PPST test that identifies genes where a significantly large subset of cases exhibits expression values beyond upper and lower thresholds observed in the control samples. Results Interestingly, the test identifies A > B and B < A pattern genes that are missed by population-level approaches, such as the t-test, and many genes that exhibit both significant overexpression and significant underexpression in statistically significantly large subsets of cancer patients (ABA pattern genes). These patterns tend to show distributions that are unique to individual genes, and are aptly visualized in a 'gene expression pattern grid'. The low degree of among-gene correlations in these genes suggests unique underlying genomic pathologies and a high degree of unique tumor-specific differential expression. We compare the PPST and the ABA test to the parametric and non-parametric t-test by analyzing two independently published data sets from studies of progression in astrocytoma. Conclusions The PPST test yielded findings similar to the nonparametric t-test with higher self-consistency. These tests and the gene expression pattern grid may be useful for the identification of therapeutic targets and diagnostic or prognostic markers that are present only in subsets of cancer patients, and provide a more complete portrait of differential expression in cancer.
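The idea of counting cases that fall beyond control-derived thresholds can be sketched as follows. This is only an illustration under stated assumptions, not the published PPST: the abstract does not specify how the thresholds or the significance assessment are defined, so the sketch assumes the thresholds are the control minimum and maximum and assesses significance by an exact relabeling of a small, invented data set. All function names are hypothetical.

```python
from itertools import combinations
from statistics import mean


def count_beyond(controls, cases):
    """Count cases above the control maximum and below the control minimum.

    Assumption: thresholds are the control extremes; the original test may
    define them differently.
    """
    upper, lower = max(controls), min(controls)
    over = sum(1 for x in cases if x > upper)
    under = sum(1 for x in cases if x < lower)
    return over, under


def total_beyond(controls, cases):
    """Total number of cases outside the control range (either direction)."""
    return sum(count_beyond(controls, cases))


def exact_permutation_pvalue(controls, cases, statistic):
    """Share of all control/case relabelings with statistic >= observed."""
    pooled = list(controls) + list(cases)
    observed = statistic(controls, cases)
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(controls)):
        chosen = set(idx)
        perm_controls = [pooled[i] for i in idx]
        perm_cases = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if statistic(perm_controls, perm_cases) >= observed:
            hits += 1
    return hits / total


# Invented values for one gene: some cases are strongly over-expressed and
# some strongly under-expressed, so the group means are nearly equal.
controls = [5.0, 5.2, 4.9, 5.1, 5.3, 5.0]
cases = [7.5, 7.9, 5.1, 2.0, 2.2, 5.0]

over, under = count_beyond(controls, cases)
p_value = exact_permutation_pvalue(controls, cases, total_beyond)
print("over:", over, "under:", under, "permutation p-value:", p_value)
print("difference in means:", mean(cases) - mean(controls))
```

The invented gene has cases on both sides of the control range, which is the situation the abstract calls an ABA pattern; a comparison of means sees almost no difference here, whereas the threshold counts do. Exhaustive relabeling is only practical for small samples; a random subset of relabelings would be the usual substitute for larger ones.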
Affiliation(s)
- James Lyons-Weiler
- Department of Pathology, Center for Biomedical Informatics, and Interdisciplinary Biomedical Graduate Program, University of Pittsburgh, PA 15232 USA
- Clinical Genomics Facility, Center for Pathology Informatics, Benedum Center for Oncology Informatics, University of Pittsburgh Cancer Institute, Pittsburgh, PA 15232 USA
- Satish Patel
- Department of Pathology, Center for Biomedical Informatics, and Interdisciplinary Biomedical Graduate Program, University of Pittsburgh, PA 15232 USA
- Clinical Genomics Facility, Center for Pathology Informatics, Benedum Center for Oncology Informatics, University of Pittsburgh Cancer Institute, Pittsburgh, PA 15232 USA
- Michael J Becich
- Department of Pathology, Center for Biomedical Informatics, and Interdisciplinary Biomedical Graduate Program, University of Pittsburgh, PA 15232 USA
- Clinical Genomics Facility, Center for Pathology Informatics, Benedum Center for Oncology Informatics, University of Pittsburgh Cancer Institute, Pittsburgh, PA 15232 USA
- Tony E Godfrey
- Departments of Surgery and Human Genetics, University of Pittsburgh Medical School, Pittsburgh, PA 15232 USA
- Mount Sinai School of Medicine, One Gustave Levy Place, Box 1668, East Building, Room 1070C, New York, NY 10029 USA
|
17
|
Tang C, Zhang A, Ramanathan M. ESPD: a pattern detection model underlying gene expression profiles. Bioinformatics 2004; 20:829-38. [PMID: 14751997 PMCID: PMC2573998 DOI: 10.1093/bioinformatics/btg486] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION DNA arrays permit rapid, large-scale screening for patterns of gene expression and simultaneously yield the expression levels of thousands of genes for samples. The number of samples is usually limited, and such datasets are very sparse in high-dimensional gene space. Furthermore, most of the genes collected may not necessarily be of interest, and uncertainty about which genes are relevant makes it difficult to construct an informative gene space. Unsupervised empirical sample pattern discovery and informative gene identification in such sparse high-dimensional datasets present interesting but challenging problems. RESULTS A new model called empirical sample pattern detection (ESPD) is proposed to delineate pattern quality with informative genes. By integrating statistical metrics, data mining and machine learning techniques, this model dynamically measures and manipulates the relationship between samples and genes while conducting an iterative detection of informative space and the empirical pattern. The performance of the proposed method with various array datasets is illustrated.
Affiliation(s)
- Chun Tang
- Department of Computer Science and Engineering, State University of New York at Buffalo, NY 14260, USA.
|
18
|
Dettling M, Bühlmann P. Supervised clustering of genes. Genome Biol 2002; 3:RESEARCH0069. [PMID: 12537558 PMCID: PMC151171 DOI: 10.1186/gb-2002-3-12-research0069] [Citation(s) in RCA: 75] [Impact Index Per Article: 3.4] [Received: 06/06/2002] [Revised: 08/30/2002] [Accepted: 10/02/2002] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND We focus on microarray data where experiments monitor gene expression in different tissues and where each experiment is equipped with an additional response variable such as a cancer type. Although the number of measured genes is in the thousands, it is assumed that only a few marker components of gene subsets determine the type of a tissue. Here we present a new method for finding such groups of genes by directly incorporating the response variables into the grouping process, yielding a supervised clustering algorithm for genes. RESULTS An empirical study on eight publicly available microarray datasets shows that our algorithm identifies gene clusters with excellent predictive potential, often superior to classification with state-of-the-art methods based on single genes. Permutation tests and bootstrapping provide evidence that the output is reasonably stable and more than a noise artifact. CONCLUSIONS In contrast to other methods such as hierarchical clustering, our algorithm identifies several gene clusters whose expression levels clearly distinguish the different tissue types. The identification of such gene clusters is potentially useful for medical diagnostics and may at the same time reveal insights into functional genomics.
Affiliation(s)
- Marcel Dettling
- Seminar für Statistik, Eidgenössische Technische Hochschule (ETH) Zürich, 8092 Zürich, Switzerland.
|
19
|
Li H, Hong F. Cluster-Rasch models for microarray gene expression data. Genome Biol 2001; 2:RESEARCH0031. [PMID: 11532215 PMCID: PMC55328 DOI: 10.1186/gb-2001-2-8-research0031] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.7] [Received: 02/26/2001] [Revised: 05/11/2001] [Accepted: 06/19/2001] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND We propose two different formulations of Rasch statistical models for the problem of relating gene expression profiles to phenotypes. One formulation allows us to investigate whether a cluster of genes with similar expression profiles is related to the observed phenotypes; this model can also be used for future prediction. The other formulation provides an alternative way of identifying genes that are over- or underexpressed from their expression levels in tissue or cell samples of a given tissue or cell type. RESULTS We illustrate the methods on available datasets of a classification of acute leukemias and of 60 cancer cell lines. For tumor classification, the results are comparable to those previously obtained. For the cancer cell lines dataset, we found four clusters of genes that are related to drug response for many of the 90 drugs that we considered. In addition, for each type of cell line, we identified genes that are over- or underexpressed relative to other genes. CONCLUSIONS The cluster-Rasch model provides a probabilistic model for describing gene expression patterns across samples and can be used to relate gene expression profiles to phenotypes.
Affiliation(s)
- H Li
- Rowe Program in Human Genetics, Departments of Medicine and Statistics, University of California, Davis, CA 95616, USA.
|