1
Protein Markers for the Identification of Cork Oak Plants Infected with Phytophthora cinnamomi by Applying an (α, β)-k-Feature Set Approach. FORESTS 2022. [DOI: 10.3390/f13060940] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
Abstract
Cork oak decline in Mediterranean forests is a complex phenomenon, observed with remarkable frequency in the southern part of the Iberian Peninsula, causing the weakening and death of these woody plants. The defoliation of the canopy, the presence of dry peripheral branches, and exudations on the trunk are visible symptoms used for the prognosis of decline, complemented by the presence of Phytophthora cinnamomi identified in the rhizosphere of the trees and adjacent soils. Recently, a large proteomic dataset obtained from the leaves of cork oak plants inoculated and non-inoculated with P. cinnamomi has become available. We explored it to search for an optimal set of proteins that act as markers of the biological pattern of interaction with the oomycete. Thus, using published data from the cork oak leaf proteome, we mathematically modelled the problem as an (α, β)-k-Feature Set Problem to select molecular markers. A set of proteins (features) that represent dominant effects on the host metabolism resulting from pathogen action on roots was found. These results contribute to an early diagnosis of biochemical changes occurring in cork oak associated with P. cinnamomi infection. We hypothesize that these markers may be decisive in identifying trees that go into decline due to interactions with the pathogen, assisting the management of cork oak forest ecosystems.
2
Abu Zaher A, Berretta R, Noman N, Moscato P. An adaptive memetic algorithm for feature selection using proximity graphs. Comput Intell 2018. [DOI: 10.1111/coin.12196] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 12/01/2022]
Affiliation(s)
- Amer Abu Zaher
  - School of Electrical Engineering and Computing, The University of Newcastle, Callaghan, Australia
- Regina Berretta
  - School of Electrical Engineering and Computing, The University of Newcastle, Callaghan, Australia
- Nasimul Noman
  - School of Electrical Engineering and Computing, The University of Newcastle, Callaghan, Australia
- Pablo Moscato
  - School of Electrical Engineering and Computing, The University of Newcastle, Callaghan, Australia
3
Mathieson L, Mendes A, Marsden J, Pond J, Moscato P. Computer-Aided Breast Cancer Diagnosis with Optimal Feature Sets: Reduction Rules and Optimization Techniques. Methods Mol Biol 2017; 1526:299-325. [PMID: 27896749 DOI: 10.1007/978-1-4939-6613-4_17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/06/2023]
Abstract
This chapter introduces a new method for knowledge extraction from databases for the purpose of finding a discriminative set of features that is also a robust set for within-class classification. Our method is generic and we introduce it here in the field of breast cancer diagnosis from digital mammography data. The mathematical formalism is based on a generalization of the k-Feature Set problem called (α, β)-k-Feature Set problem, introduced by Cotta and Moscato (J Comput Syst Sci 67(4):686-690, 2003). This method proceeds in two steps: first, an optimal (α, β)-k-feature set of minimum cardinality is identified and then, a set of classification rules using these features is obtained. We obtain the (α, β)-k-feature set in two phases; first a series of extremely powerful reduction techniques, which do not lose the optimal solution, are employed; and second, a metaheuristic search to identify the remaining features to be considered or disregarded. Two algorithms were tested with a public domain digital mammography dataset composed of 71 malignant and 75 benign cases. Based on the results provided by the algorithms, we obtain classification rules that employ only a subset of these features.
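The feasibility condition behind an (α, β)-k-feature set can be illustrated with a small check. The sketch below assumes binary features and the usual reading of the definition: every pair of samples from different classes must differ in at least α selected features, and every pair from the same class must agree in at least β selected features. The function name and the toy data are hypothetical and are not taken from the chapter; the reduction rules and metaheuristic search it describes are not reproduced here.

```python
from itertools import combinations


def is_alpha_beta_feature_set(samples, labels, selected, alpha, beta):
    """Check the (alpha, beta) coverage conditions for the chosen feature indices."""
    for i, j in combinations(range(len(samples)), 2):
        # Count selected features on which the two samples take different values.
        differing = sum(samples[i][f] != samples[j][f] for f in selected)
        agreeing = len(selected) - differing
        if labels[i] != labels[j] and differing < alpha:
            return False  # a between-class pair is not separated often enough
        if labels[i] == labels[j] and agreeing < beta:
            return False  # a within-class pair is not consistent enough
    return True


# Toy binary dataset (illustrative only): four samples, four features.
samples = [
    (0, 0, 1, 0),
    (0, 1, 1, 0),
    (1, 0, 0, 1),
    (1, 1, 0, 1),
]
labels = [0, 0, 1, 1]

print(is_alpha_beta_feature_set(samples, labels, [0, 2, 3], alpha=2, beta=2))
print(is_alpha_beta_feature_set(samples, labels, [1], alpha=1, beta=0))
```

The optimisation problem then asks for a set of at most k features (or of minimum cardinality) that passes this check; the check itself is only the constraint, not the search.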
Affiliation(s)
- Luke Mathieson
  - Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine (CIBM), Faculty of Engineering and Built Environment, The University of Newcastle, Callaghan, NSW, 2308, Australia
- Alexandre Mendes
  - Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine (CIBM), Faculty of Engineering and Built Environment, The University of Newcastle, Callaghan, NSW, 2308, Australia
- John Marsden
  - Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine (CIBM), Faculty of Engineering and Built Environment, The University of Newcastle, Callaghan, NSW, 2308, Australia
- Jeffrey Pond
  - Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine (CIBM), Faculty of Engineering and Built Environment, The University of Newcastle, Callaghan, NSW, 2308, Australia
- Pablo Moscato
  - Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine (CIBM), Faculty of Engineering and Built Environment, The University of Newcastle, Callaghan, NSW, 2308, Australia
4
Puthiyedth N, Riveros C, Berretta R, Moscato P. Identification of Differentially Expressed Genes through Integrated Study of Alzheimer's Disease Affected Brain Regions. PLoS One 2016; 11:e0152342. [PMID: 27050411 PMCID: PMC4822961 DOI: 10.1371/journal.pone.0152342] [Citation(s) in RCA: 63] [Impact Index Per Article: 7.9] [Received: 12/11/2015] [Accepted: 03/11/2016] [Indexed: 11/28/2022] Open
Abstract
Background Alzheimer’s disease (AD) is the most common form of dementia in older adults; it damages the brain and results in impaired memory, thinking and behaviour. The identification of differentially expressed genes and related pathways among affected brain regions can provide more information on the mechanisms of AD. In the past decade, several studies have reported many genes that are associated with AD. This wealth of information has become difficult to follow and interpret, as most of the results are conflicting. In that case, it is worth doing an integrated study of multiple datasets, which increases the total number of samples and the statistical power in detecting biomarkers. In this study, we present an integrated analysis of five different brain region datasets and introduce new genes that warrant further investigation. Methods The aim of our study is to apply a novel combinatorial optimisation based meta-analysis approach to identify differentially expressed genes that are associated with AD across brain regions. In this study, microarray gene expression data from 161 samples (74 non-demented controls, 87 AD) from the Entorhinal Cortex (EC), Hippocampus (HIP), Middle temporal gyrus (MTG), Posterior cingulate cortex (PC), Superior frontal gyrus (SFG) and visual cortex (VCX) brain regions were integrated and analysed using our method. The results are then compared to two popular meta-analysis methods, RankProd and GeneMeta, and to what can be obtained by analysing the individual datasets. Results We find genes related to AD that are consistent with existing studies, and new candidate genes not previously related to AD. Our study confirms the up-regulation of INFAR2 and PTMA along with the down-regulation of GPHN, RAB2A, PSMD14 and FGF. Novel genes PSMB2, WNK1, RPL15, SEMA4C, RWDD2A and LARGE are found to be differentially expressed across all brain regions. Further investigation of these genes may provide new insights into the development of AD.
In addition, we identified the presence of 23 non-coding features, including four miRNA precursors (miR-7, miR570, miR-1229 and miR-6821), dysregulated across the brain regions. Furthermore, we compared our results with two popular meta-analysis methods RankProd and GeneMeta to validate our findings and performed a sensitivity analysis by removing one dataset at a time to assess the robustness of our results. These new findings may provide new insights into the disease mechanisms and thus make a significant contribution in the near future towards understanding, prevention and cure of AD.
Affiliation(s)
- Nisha Puthiyedth
  - Information Based Medicine Program, Hunter Medical Research Institute, New Lambton Heights NSW, Australia
  - Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, School of Electrical Engineering and Computer Science, The University of Newcastle, Callaghan NSW, Australia
- Carlos Riveros
  - Clinical Research Design, Information Technology and Statistics Support Unit, Hunter Medical Research Institute, New Lambton Heights NSW, Australia
- Regina Berretta
  - Information Based Medicine Program, Hunter Medical Research Institute, New Lambton Heights NSW, Australia
  - Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, School of Electrical Engineering and Computer Science, The University of Newcastle, Callaghan NSW, Australia
- Pablo Moscato
  - Information Based Medicine Program, Hunter Medical Research Institute, New Lambton Heights NSW, Australia
  - Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, School of Electrical Engineering and Computer Science, The University of Newcastle, Callaghan NSW, Australia
5
Heterogeneous Ensemble Combination Search Using Genetic Algorithm for Class Imbalanced Data Classification. PLoS One 2016; 11:e0146116. [PMID: 26764911 PMCID: PMC4713117 DOI: 10.1371/journal.pone.0146116] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.8] [Received: 11/23/2014] [Accepted: 12/13/2015] [Indexed: 11/30/2022] Open
Abstract
Classification of datasets with imbalanced sample distributions has always been a challenge. In general, a popular approach for enhancing classification performance is the construction of an ensemble of classifiers. However, the performance of an ensemble is dependent on the choice of constituent base classifiers. Therefore, we propose a genetic algorithm-based search method for finding the optimum combination from a pool of base classifiers to form a heterogeneous ensemble. The algorithm, called GA-EoC, utilises 10-fold cross-validation on training data for evaluating the quality of each candidate ensemble. In order to combine the base classifiers' decisions into the ensemble’s output, we used the simple and widely used majority voting approach. The proposed algorithm, along with the random sub-sampling approach to balance the class distribution, has been used for classifying class-imbalanced datasets. Additionally, if a feature set was not available, we used the (α, β)-k Feature Set method to select a better subset of features for classification. We have tested GA-EoC with three benchmarking datasets from the UCI-Machine Learning repository, one Alzheimer’s disease dataset and a subset of the PubFig database of Columbia University. In general, the performance of the proposed method on the chosen datasets is robust and better than that of the constituent base classifiers and many other well-known ensembles. Based on our empirical study we claim that a genetic algorithm is a superior and reliable approach to heterogeneous ensemble construction and we expect that the proposed GA-EoC would perform consistently in other cases.
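The majority-voting step mentioned in the abstract can be sketched as follows. This is only an illustration of the combination rule, not the GA-EoC search itself; the function name and the toy predictions are hypothetical, and an odd number of binary classifiers is assumed so that no tie-breaking rule is needed.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample."""
    combined = []
    # zip(*predictions) yields, for each sample, the labels voted by all classifiers.
    for votes in zip(*predictions):
        winner, _count = Counter(votes).most_common(1)[0]
        combined.append(winner)
    return combined


# Hypothetical outputs of three base classifiers on four samples.
clf_a = [1, 0, 1, 1]
clf_b = [1, 1, 0, 1]
clf_c = [0, 0, 1, 1]

print(majority_vote([clf_a, clf_b, clf_c]))  # [1, 0, 1, 1]
```

A search procedure such as a genetic algorithm would evaluate many candidate subsets of base classifiers, scoring each subset by the cross-validated quality of the vote it produces.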
6
A New Combinatorial Optimization Approach for Integrated Feature Selection Using Different Datasets: A Prostate Cancer Transcriptomic Study. PLoS One 2015; 10:e0127702. [PMID: 26106884 PMCID: PMC4480358 DOI: 10.1371/journal.pone.0127702] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 11/03/2014] [Accepted: 04/17/2015] [Indexed: 12/26/2022] Open
Abstract
Background The joint study of multiple datasets has become a common technique for increasing statistical power in detecting biomarkers obtained from smaller studies. The approach generally followed is based on the fact that as the total number of samples increases, we expect to have greater power to detect associations of interest. This methodology has been applied to genome-wide association and transcriptomic studies due to the availability of datasets in the public domain. While this approach is well established in biostatistics, the introduction of new combinatorial optimization models to address this issue has not been explored in depth. In this study, we introduce a new model for the integration of multiple datasets and we show its application in transcriptomics. Methods We propose a new combinatorial optimization problem that addresses the core issue of biomarker detection in integrated datasets. Optimal solutions for this model deliver a feature selection from a panel of prospective biomarkers. The model we propose is a generalised version of the (α,β)-k-Feature Set problem. We illustrate the performance of this new methodology via a challenging meta-analysis task involving six prostate cancer microarray datasets. The results are then compared to the popular RankProd meta-analysis tool and to what can be obtained by analysing the individual datasets by statistical and combinatorial methods alone. Results Application of the integrated method resulted in a more informative signature than the rank-based meta-analysis or individual dataset results, and overcomes problems arising from real world datasets. The set of genes identified is highly significant in the context of prostate cancer. The method used does not rely on homogenisation or transformation of values to a common scale, and at the same time is able to capture markers associated with subgroups of the disease.
7
Evaluation of Different Normalization and Analysis Procedures for Illumina Gene Expression Microarray Data Involving Small Changes. MICROARRAYS 2013; 2:131-52. [PMID: 27605185 PMCID: PMC5003482 DOI: 10.3390/microarrays2020131] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Received: 03/25/2013] [Revised: 05/08/2013] [Accepted: 05/10/2013] [Indexed: 12/28/2022]
Abstract
While Illumina microarrays can be used successfully for detecting small gene expression changes due to their high degree of technical replicability, there is little information on how different normalization and differential expression analysis strategies affect outcomes. To evaluate this, we assessed concordance across gene lists generated by applying different combinations of normalization strategy and analytical approach to two Illumina datasets with modest expression changes. In addition to using traditional statistical approaches, we also tested an approach based on combinatorial optimization. We found that the choice of both normalization strategy and analytical approach considerably affected outcomes, in some cases leading to substantial differences in gene lists and subsequent pathway analysis results. Our findings suggest that important biological phenomena may be overlooked when there is a routine practice of using only one approach to investigate all microarray datasets. Analytical artefacts of this kind are likely to be especially relevant for datasets involving small fold changes, where inherent technical variation, if not adequately minimized by effective normalization, may overshadow true biological variation. This report provides some basic guidelines for optimizing outcomes when working with Illumina datasets involving small expression changes.
8
Mo J, Maudsley S, Martin B, Siddiqui S, Cheung H, Johnson CA. Classification of Alzheimer Diagnosis from ADNI Plasma Biomarker Data. 2013 ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICAL INFORMATICS : ACM - BCB 2013 : WASHINGTON, D.C., U.S.A., SEPTEMBER 22 - 25, 2013. ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICAL INFORMA... 2013; 2013:569. [PMID: 25599092 PMCID: PMC4295502 DOI: 10.1145/2506583.2506637] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 10/26/2022]
Abstract
Research into modeling the progression of Alzheimer's disease (AD) has made recent progress in identifying plasma proteomic biomarkers to identify the disease at the pre-clinical stage. In contrast with cerebrospinal fluid (CSF) biomarkers and PET imaging, plasma biomarker diagnoses have the advantage of being cost-effective and minimally invasive, thereby improving our understanding of AD and hopefully leading to early interventions as research into this subject advances. The Alzheimer's Disease Neuroimaging Initiative (ADNI) has collected data on 190 plasma analytes from individuals diagnosed with AD as well as subjects with mild cognitive impairment and cognitively normal (CN) controls. We propose an approach to classify subjects as AD or CN via an ensemble of classifiers trained and validated on ADNI data. Classifier performance is enhanced by augmenting a selective biomarker feature space with principal components obtained from the entire set of biomarkers. This procedure yields an accuracy of 89% and an area under the ROC curve of 94%.
Affiliation(s)
- Jue Mo
  - Division of Computational Bioscience, Center for Information Technology, National Institutes of Health, Bethesda, MD 20892, USA
- Stuart Maudsley
  - Receptor Pharmacology Unit, National Institute on Aging, National Institutes of Health, Baltimore, MD 21224, USA
- Bronwen Martin
  - Metabolism Unit, National Institute on Aging, National Institutes of Health, Baltimore, MD 21224, USA
- Sana Siddiqui
  - Receptor Pharmacology Unit, National Institute on Aging, National Institutes of Health, Baltimore, MD 21224, USA
- Huey Cheung
  - Division of Computational Bioscience, Center for Information Technology, National Institutes of Health, Bethesda, MD 20892, USA
- Calvin A. Johnson
  - Division of Computational Bioscience, Center for Information Technology, National Institutes of Health, Bethesda, MD 20892, USA
9
Unveiling clusters of RNA transcript pairs associated with markers of Alzheimer's disease progression. PLoS One 2012; 7:e45535. [PMID: 23029078 PMCID: PMC3448659 DOI: 10.1371/journal.pone.0045535] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Received: 11/22/2011] [Accepted: 08/23/2012] [Indexed: 12/17/2022] Open
Abstract
Background One primary goal of transcriptomic studies is identifying gene expression patterns correlating with disease progression. This is usually achieved by considering transcripts that independently pass an arbitrary threshold (e.g. p<0.05). In diseases involving severe perturbations of multiple molecular systems, such as Alzheimer’s disease (AD), this univariate approach often results in a large list of seemingly unrelated transcripts. We utilised a powerful multivariate clustering approach to identify clusters of RNA biomarkers strongly associated with markers of AD progression. We discuss the value of considering pairs of transcripts, which, in contrast to individual transcripts, help avoid natural human transcriptome variation that can overshadow disease-related changes. Methodology/Principal Findings We re-analysed a dataset of hippocampal transcript levels in nine controls and 22 patients with varying degrees of AD. A large-scale clustering approach determined groups of transcript probe sets that correlate strongly with measures of AD progression, including both clinical and neuropathological measures and quantifiers of the characteristic transcriptome shift from control to severe AD. This enabled identification of restricted groups of highly correlated probe sets from an initial list of 1,372 previously published by our group. We repeated this analysis on an expanded dataset that included all pair-wise combinations of the 1,372 probe sets. As clustering of this massive dataset is unfeasible using standard computational tools, we adapted and re-implemented a clustering algorithm that uses an external-memory algorithmic approach. This identified various pairs that strongly correlated with markers of AD progression and highlighted important biological pathways potentially involved in AD pathogenesis.
Conclusions/Significance Our analyses demonstrate that, although there exists a relatively large molecular signature of AD progression, only a small number of transcripts recurrently cluster with different markers of AD progression. Furthermore, considering the relationship between two transcripts can highlight important biological relationships that are missed when considering either transcript in isolation.
10
Arefin AS, Riveros C, Berretta R, Moscato P. GPU-FS-kNN: a software tool for fast and scalable kNN computation using GPUs. PLoS One 2012; 7:e44000. [PMID: 22937144 PMCID: PMC3429408 DOI: 10.1371/journal.pone.0044000] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.8] [Received: 01/30/2012] [Accepted: 07/27/2012] [Indexed: 12/05/2022] Open
Abstract
Background The analysis of biological networks has become a major challenge due to the recent development of high-throughput techniques that are rapidly producing very large data sets. The exploding volumes of biological data demand extreme computational power and special computing facilities (e.g. super-computers). An inexpensive solution, such as General Purpose computation based on Graphics Processing Units (GPGPU), can be adapted to tackle this challenge, but the limitation of the device internal memory can pose a new problem of scalability. Efficient data and computational parallelism with partitioning is required to provide a fast and scalable solution to this problem. Results We propose an efficient parallel formulation of the k-Nearest Neighbour (kNN) search problem, which is a popular method for classifying objects in several fields of research, such as pattern recognition, machine learning and bioinformatics. Although the kNN search is simple and straightforward, its performance degrades dramatically for large data sets, since the task is computationally intensive. The proposed approach is not only fast but also scalable to large-scale instances. Based on our approach, we implemented a software tool GPU-FS-kNN (GPU-based Fast and Scalable k-Nearest Neighbour) for CUDA enabled GPUs. The basic approach is simple and adaptable to other available GPU architectures. We observed speed-ups of 50–60 times compared with CPU implementation on a well-known breast microarray study and its associated data sets. Conclusion Our GPU-based Fast and Scalable k-Nearest Neighbour search technique (GPU-FS-kNN) provides a significant performance improvement for nearest neighbour computation in large-scale networks. Source code and the software tool are available under the GNU Public License (GPL) at https://sourceforge.net/p/gpufsknn/.
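To make the computational cost concrete, the following is a minimal CPU-only sketch of the brute-force kNN search that the abstract describes as expensive. It is not the GPU-FS-kNN implementation and uses none of its partitioning strategy; the function name, the use of Euclidean distance, and the toy points are assumptions made for illustration.

```python
import heapq
import math


def knn_brute_force(points, k):
    """Return, for each point, the indices of its k nearest other points."""
    neighbours = []
    for i, p in enumerate(points):
        # All n - 1 distances are computed for every point: O(n^2) work overall.
        dists = ((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbours.append([j for _, j in heapq.nsmallest(k, dists)])
    return neighbours


# Four illustrative 2-D points.
points = [(0, 0), (1, 0), (0, 3), (5, 5)]
print(knn_brute_force(points, k=2))  # [[1, 2], [0, 2], [0, 1], [2, 1]]
```

Each row of the distance computation is independent of the others, which is the property a parallel formulation can exploit by assigning chunks of rows to separate threads or devices.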
Affiliation(s)
- Ahmed Shamsul Arefin
  - Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, The University of Newcastle, Callaghan, New South Wales, Australia
- Carlos Riveros
  - Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, The University of Newcastle, Callaghan, New South Wales, Australia
  - Hunter Medical Research Institute, Information Based Medicine Program, John Hunter Hospital, New Lambton Heights, New South Wales, Australia
- Regina Berretta
  - Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, The University of Newcastle, Callaghan, New South Wales, Australia
  - Hunter Medical Research Institute, Information Based Medicine Program, John Hunter Hospital, New Lambton Heights, New South Wales, Australia
- Pablo Moscato
  - Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, The University of Newcastle, Callaghan, New South Wales, Australia
  - Hunter Medical Research Institute, Information Based Medicine Program, John Hunter Hospital, New Lambton Heights, New South Wales, Australia
  - Australian Research Council Centre of Excellence in Bioinformatics, Callaghan, New South Wales, Australia
11
Johnstone D, Milward EA, Berretta R, Moscato P. Multivariate protein signatures of pre-clinical Alzheimer's disease in the Alzheimer's disease neuroimaging initiative (ADNI) plasma proteome dataset. PLoS One 2012; 7:e34341. [PMID: 22485168 PMCID: PMC3317783 DOI: 10.1371/journal.pone.0034341] [Citation(s) in RCA: 58] [Impact Index Per Article: 4.8] [Received: 08/26/2011] [Accepted: 03/01/2012] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND Recent Alzheimer's disease (AD) research has focused on finding biomarkers to identify disease at the pre-clinical stage of mild cognitive impairment (MCI), allowing treatment to be initiated before irreversible damage occurs. Many studies have examined brain imaging or cerebrospinal fluid but there is also growing interest in blood biomarkers. The Alzheimer's Disease Neuroimaging Initiative (ADNI) has generated data on 190 plasma analytes in 566 individuals with MCI, AD or normal cognition. We conducted independent analyses of this dataset to identify plasma protein signatures predicting pre-clinical AD. METHODS AND FINDINGS We focused on identifying signatures that discriminate cognitively normal controls (n = 54) from individuals with MCI who subsequently progress to AD (n = 163). Based on p value, apolipoprotein E (APOE) showed the strongest difference between these groups (p = 2.3 × 10⁻¹³). We applied a multivariate approach based on combinatorial optimization ((α,β)-k Feature Set Selection), which retains information about individual participants and maintains the context of interrelationships between different analytes, to identify the optimal set of analytes (signature) to discriminate these two groups. We identified 11-analyte signatures achieving values of sensitivity and specificity between 65% and 86% for both MCI and AD groups, depending on whether APOE was included and other factors. Classification accuracy was improved by considering "meta-features," representing the difference in relative abundance of two analytes, with an 8-meta-feature signature consistently achieving sensitivity and specificity both over 85%. Generating signatures based on longitudinal rather than cross-sectional data further improved classification accuracy, returning sensitivities and specificities of approximately 90%.
CONCLUSIONS Applying these novel analysis approaches to the powerful and well-characterized ADNI dataset has identified sets of plasma biomarkers for pre-clinical AD. While studies of independent test sets are required to validate the signatures, these analyses provide a starting point for developing a cost-effective and minimally invasive test capable of diagnosing AD in its pre-clinical stages.
Affiliation(s)
- Daniel Johnstone
  - Priority Research Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, The University of Newcastle, Callaghan, New South Wales, Australia
  - School of Electrical Engineering and Computer Science, The University of Newcastle, Callaghan, New South Wales, Australia
- Elizabeth A. Milward
  - Priority Research Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, The University of Newcastle, Callaghan, New South Wales, Australia
  - School of Biomedical Sciences and Pharmacy, The University of Newcastle, Callaghan, New South Wales, Australia
- Regina Berretta
  - Priority Research Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, The University of Newcastle, Callaghan, New South Wales, Australia
  - School of Electrical Engineering and Computer Science, The University of Newcastle, Callaghan, New South Wales, Australia
- Pablo Moscato
  - Priority Research Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, The University of Newcastle, Callaghan, New South Wales, Australia
  - School of Electrical Engineering and Computer Science, The University of Newcastle, Callaghan, New South Wales, Australia
12
Johnstone D, Graham RM, Trinder D, Delima RD, Riveros C, Olynyk JK, Scott RJ, Moscato P, Milward EA. Brain transcriptome perturbations in the Hfe(-/-) mouse model of genetic iron loading. Brain Res 2012; 1448:144-52. [PMID: 22370144 DOI: 10.1016/j.brainres.2012.02.006] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 11/11/2011] [Revised: 01/31/2012] [Accepted: 02/02/2012] [Indexed: 12/14/2022]
Abstract
Severe disruption of brain iron homeostasis can cause fatal neurodegenerative disease; however, debate surrounds the neurologic effects of milder, more common iron loading disorders such as hereditary hemochromatosis, which is usually caused by loss-of-function polymorphisms in the HFE gene. There is evidence from both human and animal studies that HFE gene variants may affect brain function and modify risks of brain disease. To investigate how disruption of HFE influences brain transcript levels, we used microarray and real-time reverse transcription polymerase chain reaction to assess the brain transcriptome in Hfe(-/-) mice relative to wildtype AKR controls (age 10 weeks, n≥4/group). The Hfe(-/-) mouse brain showed numerous significant changes in transcript levels (p<0.05), although few of these related to proteins directly involved in iron homeostasis. There were robust changes of at least 2-fold in levels of transcripts for prominent genes relating to transcriptional regulation (FBJ osteosarcoma oncogene Fos, early growth response genes), neurotransmission (glutamate NMDA receptor Grin1, GABA receptor Gabbr1) and synaptic plasticity and memory (calcium/calmodulin-dependent protein kinase IIα Camk2a). As previously reported for dietary iron-supplemented mice, there were altered levels of transcripts for genes linked to neuronal ceroid lipofuscinosis, a disease characterized by excessive lipofuscin deposition. Labile iron is known to enhance lipofuscin generation, which may accelerate brain aging. The findings provide evidence that iron loading disorders can considerably perturb levels of transcripts for genes essential for normal brain function and may help explain some of the neurologic signs and symptoms reported in hemochromatosis patients.
Affiliation(s)
- Daniel Johnstone
  - School of Biomedical Sciences and Pharmacy, The University of Newcastle, Callaghan, Australia
13
Shin E, Yoon Y, Ahn J, Park S. TC-VGC: a tumor classification system using variations in genes' correlation. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2011; 104:e87-e101. [PMID: 21531474 DOI: 10.1016/j.cmpb.2011.03.002] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 06/24/2010] [Revised: 01/11/2011] [Accepted: 03/07/2011] [Indexed: 05/30/2023]
Abstract
Classification analysis of microarray data is widely used to reveal biological features and to diagnose various diseases, including cancers. Most existing approaches improve the performance of learning models by removing most irrelevant and redundant genes from the data. They select the marker genes that are differentially expressed in normal and tumor tissues. These techniques ignore the importance of the complex functional dependencies between genes. In this paper, we propose a new method for cancer classification which uses distinctive variations of gene-gene correlation between two sample groups. The cancer-specific genetic network composed of these gene pairs contains many literature-curated prostate cancer genes, and we identified new candidate prostate cancer genes inferred from them. Furthermore, this method achieved high accuracy with a small number of genes in cancer classification.
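The idea of scoring gene pairs by how much their correlation changes between two sample groups can be illustrated with a minimal Python sketch. This is not the TC-VGC implementation: the use of Pearson correlation, the absolute-difference score, the function names `pearson` and `correlation_shifts`, and the toy data are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant profile as uncorrelated
    return cov / sqrt(var_x * var_y)


def correlation_shifts(normal, tumor):
    """Rank gene pairs by |r_tumor - r_normal|, largest shift first.

    `normal` and `tumor` map a gene name to its expression values
    across the samples of that group.
    """
    genes = sorted(set(normal) & set(tumor))
    shifts = []
    for g1, g2 in combinations(genes, 2):
        r_normal = pearson(normal[g1], normal[g2])
        r_tumor = pearson(tumor[g1], tumor[g2])
        shifts.append((abs(r_tumor - r_normal), g1, g2))
    return sorted(shifts, reverse=True)


# Toy data, made up for illustration only: genes A and B are perfectly
# correlated in the "normal" group and anti-correlated in the "tumor" group.
normal = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [5, 3, 6, 2]}
tumor = {"A": [1, 2, 3, 4], "B": [8, 6, 4, 2], "C": [5, 3, 6, 2]}
top_shift, gene_1, gene_2 = correlation_shifts(normal, tumor)[0]
```

In this toy case the pair (A, B) ranks first, because its correlation flips sign between the groups even though neither gene changes its mean expression by much in a way a single-gene filter would need.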
Affiliation(s)
- Eunji Shin
- Department of Computer Science, Yonsei University, 134 Sinchon-dong, Seodaemun-gu, Seoul 120-749, South Korea
|
14
|
Rocha de Paula M, Gómez Ravetti M, Berretta R, Moscato P. Differences in abundances of cell-signalling proteins in blood reveal novel biomarkers for early detection of clinical Alzheimer's disease. PLoS One 2011; 6:e17481. [PMID: 21479255 PMCID: PMC3063784 DOI: 10.1371/journal.pone.0017481] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Received: 12/12/2010] [Accepted: 02/06/2011] [Indexed: 01/26/2023] [Open access]
Abstract
Background: In November 2007 a study published in Nature Medicine proposed a simple test based on the abundance of 18 proteins in blood to predict the onset of clinical symptoms of Alzheimer's Disease (AD) two to six years before these symptoms manifest. Later, another study, published in PLoS ONE, showed that only five proteins (IL-1α, IL-3, EGF, TNF-α and G-CSF) have overall better prediction accuracy. These classifiers are based on the abundance of 120 proteins. Such values were standardised by a Z-score transformation, which means that their values are relative to the average of all others. Methodology: The original datasets from the Nature Medicine paper are further studied using methods from combinatorial optimisation and Information Theory. We expand the original dataset by also including all pair-wise differences of z-score values of the original dataset ("metafeatures"). Using an exact algorithm to solve the resulting Feature Set problem, a formulation of the feature selection problem, we found signatures that contain only features, only metafeatures, or both, and evaluated their predictive performance on the independent test set. Conclusions: It was possible to show that a specific pattern of cell signalling imbalance in blood plasma has valuable information to distinguish between Non Demented Control (NDC) and AD samples. The obtained signatures were able to predict AD in patients that already had a Mild Cognitive Impairment (MCI) with up to 84% sensitivity, while also maintaining a strong prediction accuracy of 90% on an independent dataset with NDC and AD samples. The novel biomarkers uncovered with this method confirm ANG-2, IL-11, PDGF-BB and CCL15/MIP-1δ, and support the joint measurement of other signalling proteins not previously discussed: GM-CSF, NT-3, IGFBP-2 and VEGF-B.
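The expansion of a z-scored dataset with pairwise-difference "metafeatures" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each protein is standardised across samples using the population standard deviation (the abstract does not spell out the exact normalisation), and the names `z_scores`, `add_metafeatures` and the toy proteins `P1` to `P3` are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev


def z_scores(values):
    """Standardise one feature across samples (population std. dev.)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # assumption: constant features map to 0
    return [(v - mu) / sigma for v in values]


def add_metafeatures(abundances):
    """Return the z-scored features plus all pairwise differences.

    `abundances` maps a protein name to its values across samples.
    """
    features = {name: z_scores(vals) for name, vals in abundances.items()}
    expanded = dict(features)
    for a, b in combinations(sorted(features), 2):
        expanded[f"{a}-{b}"] = [x - y for x, y in zip(features[a], features[b])]
    return expanded


# Toy abundances for three hypothetical proteins across four samples.
toy = {
    "P1": [1.0, 2.0, 3.0, 4.0],
    "P2": [4.0, 3.0, 2.0, 1.0],
    "P3": [2.0, 2.0, 5.0, 5.0],
}
expanded = add_metafeatures(toy)
```

With n original features the expanded set has n + n(n-1)/2 columns, so the three toy proteins yield six. If all 120 proteins were expanded this way, that would add 120·119/2 = 7140 difference columns, which is why the subsequent feature selection step matters.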
Affiliation(s)
- Mateus Rocha de Paula
- Centre for Bioinformatics, Biomarker Discovery & Information-Based Medicine, The University of Newcastle, Callaghan, Australia
- Martín Gómez Ravetti
- Departamento de Engenharia de Produção, Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil
- Regina Berretta
- Centre for Bioinformatics, Biomarker Discovery & Information-Based Medicine, The University of Newcastle, Callaghan, Australia
- Pablo Moscato
- Centre for Bioinformatics, Biomarker Discovery & Information-Based Medicine, The University of Newcastle, Callaghan, Australia
|
15
|
Riveros C, Mellor D, Gandhi KS, McKay FC, Cox MB, Berretta R, Vaezpour SY, Inostroza-Ponta M, Broadley SA, Heard RN, Vucic S, Stewart GJ, Williams DW, Scott RJ, Lechner-Scott J, Booth DR, Moscato P. A transcription factor map as revealed by a genome-wide gene expression analysis of whole-blood mRNA transcriptome in multiple sclerosis. PLoS One 2010; 5:e14176. [PMID: 21152067 PMCID: PMC2995726 DOI: 10.1371/journal.pone.0014176] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.3] [Received: 03/25/2010] [Accepted: 10/20/2010] [Indexed: 12/03/2022] [Open access]
Abstract
Background: Several lines of evidence suggest that transcription factors are involved in the pathogenesis of Multiple Sclerosis (MS), but complete mapping of the whole network has been elusive. One of the reasons is that there are several clinical subtypes of MS, and transcription factors that may be involved in one subtype may not be involved in others. We investigate the possibility that this network could be mapped using microarray technologies and contemporary bioinformatics methods on a dataset derived from whole blood in 99 untreated MS patients (36 Relapse Remitting MS, 43 Primary Progressive MS, and 20 Secondary Progressive MS) and 45 age-matched healthy controls. Methodology/Principal Findings: We have used two different analytical methodologies: a non-standard differential expression analysis and a differential co-expression analysis, which have converged on a significant number of regulatory motifs that are statistically overrepresented in genes that are either differentially expressed or differentially co-expressed in cases and controls (e.g., V$KROX_Q6, p-value <3.31E-6; V$CREBP1_Q2, p-value <9.93E-6; V$YY1_02, p-value <1.65E-5). Conclusions/Significance: Our analysis uncovered a network of transcription factors that potentially dysregulate several genes in MS or one or more of its disease subtypes. The most significant transcription factor motifs were for the Early Growth Response EGR/KROX family, ATF2, YY1 (Yin and Yang 1), E2F-1/DP-1 and E2F-4/DP-2 heterodimers, SOX5, and CREB and ATF families. These transcription factors are involved in early T-lymphocyte specification and commitment as well as in oligodendrocyte dedifferentiation and development, both pathways that have significant biological plausibility in MS causation.
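The abstract does not say which statistical test was used to call a motif overrepresented. As a hedged illustration only, one common choice for this kind of question is a one-sided hypergeometric test, sketched below in Python. The function name `hypergeom_enrichment_p` and the toy counts are hypothetical and are not taken from the paper.

```python
from math import comb


def hypergeom_enrichment_p(universe, with_motif, selected, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    universe:   number of genes considered in total
    with_motif: genes in the universe that carry the motif
    selected:   genes in the list of interest (e.g. differentially expressed)
    overlap:    selected genes that carry the motif
    """
    upper = min(with_motif, selected)
    total = comb(universe, selected)
    tail = sum(
        comb(with_motif, k) * comb(universe - with_motif, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy counts, made up for illustration: 20 genes, 5 carry the motif,
# 5 are selected, and 3 of the selected ones carry the motif.
p_value = hypergeom_enrichment_p(universe=20, with_motif=5, selected=5, overlap=3)
```

A small p-value here says only that the overlap is larger than expected by chance under random sampling; when many motifs are screened, a multiple-testing correction would normally be applied on top.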
Affiliation(s)
- Carlos Riveros
- Centre for Bioinformatics, Biomarker Discovery & Information-Based Medicine, University of Newcastle, and Hunter Medical Research Institute, Newcastle, Australia
- Drew Mellor
- Centre for Bioinformatics, Biomarker Discovery & Information-Based Medicine, University of Newcastle, and Hunter Medical Research Institute, Newcastle, Australia
- School of Computer Science and Software Engineering, The University of Western Australia, Crawley, Australia
- Kaushal S. Gandhi
- Westmead Millennium Institute, University of Sydney, Westmead, Australia
- Fiona C. McKay
- Westmead Millennium Institute, University of Sydney, Westmead, Australia
- Mathew B. Cox
- Centre for Bioinformatics, Biomarker Discovery & Information-Based Medicine, University of Newcastle, and Hunter Medical Research Institute, Newcastle, Australia
- Hunter Medical Research Institute, Newcastle, Australia
- Regina Berretta
- Centre for Bioinformatics, Biomarker Discovery & Information-Based Medicine, University of Newcastle, and Hunter Medical Research Institute, Newcastle, Australia
- S. Yahya Vaezpour
- Centre for Bioinformatics, Biomarker Discovery & Information-Based Medicine, University of Newcastle, and Hunter Medical Research Institute, Newcastle, Australia
- Department of Computer Engineering, Amirkabir University of Technology, Tehran, Iran
- Mario Inostroza-Ponta
- Centre for Bioinformatics, Biomarker Discovery & Information-Based Medicine, University of Newcastle, and Hunter Medical Research Institute, Newcastle, Australia
- Departamento de Ingeniería Informática, Universidad de Santiago de Chile, Santiago, Chile
- Simon A. Broadley
- School of Medicine, Griffith University, Brisbane, Australia
- Department of Neurology, Gold Coast Hospital, Southport, Australia
- Robert N. Heard
- Westmead Millennium Institute, University of Sydney, Westmead, Australia
- Stephen Vucic
- Westmead Millennium Institute, University of Sydney, Westmead, Australia
- Graeme J. Stewart
- Westmead Millennium Institute, University of Sydney, Westmead, Australia
- Rodney J. Scott
- Centre for Bioinformatics, Biomarker Discovery & Information-Based Medicine, University of Newcastle, and Hunter Medical Research Institute, Newcastle, Australia
- Jeanette Lechner-Scott
- Centre for Bioinformatics, Biomarker Discovery & Information-Based Medicine, University of Newcastle, and Hunter Medical Research Institute, Newcastle, Australia
- David R. Booth
- Westmead Millennium Institute, University of Sydney, Westmead, Australia
- Pablo Moscato
- Centre for Bioinformatics, Biomarker Discovery & Information-Based Medicine, University of Newcastle, and Hunter Medical Research Institute, Newcastle, Australia
- Australian Research Council Centre of Excellence in Bioinformatics, St Lucia, Australia
|
16
|
Rosso OA, Mendes A, Berretta R, Rostas JA, Hunter M, Moscato P. Distinguishing childhood absence epilepsy patients from controls by the analysis of their background brain electrical activity (II): A combinatorial optimization approach for electrode selection. J Neurosci Methods 2009; 181:257-67. [DOI: 10.1016/j.jneumeth.2009.04.028] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Received: 04/01/2009] [Revised: 04/28/2009] [Accepted: 04/30/2009] [Indexed: 11/30/2022]
|
17
|
Gómez Ravetti M, Moscato P. Identification of a 5-protein biomarker molecular signature for predicting Alzheimer's disease. PLoS One 2008; 3:e3111. [PMID: 18769539 PMCID: PMC2518833 DOI: 10.1371/journal.pone.0003111] [Citation(s) in RCA: 81] [Impact Index Per Article: 5.1] [Received: 04/21/2008] [Accepted: 08/04/2008] [Indexed: 11/19/2022] [Open access]
Abstract
Background: Alzheimer's disease (AD) is a progressive brain disease with a huge cost to human lives. The impact of the disease is also a growing concern for the governments of developing countries, in particular due to the increasingly high number of elderly citizens at risk. Alzheimer's is the most common form of dementia, a common term for memory loss and other cognitive impairments. There is no current cure for AD, but there are drug and non-drug based approaches for its treatment. In general, the drug treatments are directed at slowing the progression of symptoms. They have proved to be effective in a large group of patients, but success is directly correlated with identifying the disease carriers at its early stages. This justifies the need for timely and accurate forms of diagnosis via molecular means. We report here a 5-protein biomarker molecular signature that achieves, on average, a 96% total accuracy in predicting clinical AD. The signature is composed of the abundances of IL-1α, IL-3, EGF, TNF-α and G-CSF. Methodology/Principal Findings: Our results are based on a recent molecular dataset that has attracted worldwide attention. Our paper illustrates that improved results can be obtained with the abundance of only five proteins. Our methodology consisted of the application of an integrative data analysis method. This four-step process included: a) abundance quantization, b) feature selection, c) literature analysis, d) selection of a classifier algorithm which is independent of the feature selection process. These steps were performed without using any sample of the test datasets. For the first two steps, we applied Fayyad and Irani's discretization algorithm for selection and quantization, which in turn creates an instance of the (α, β)-k-Feature Set problem; a numerical solution of this problem led to the selection of only 10 proteins.
Conclusions/Significance: The previous study has provided an extremely useful dataset for the identification of AD biomarkers. However, our subsequent analysis also revealed several important facts worth reporting: 1. A 5-protein signature (which is a subset of the 18-protein signature of Ray et al.) has the same overall performance (when using the same classifier). 2. Using more than 20 different classifiers available in the widely-used Weka software package, our 5-protein signature has, on average, a smaller prediction error, indicating independence from the classifier and the robustness of this set of biomarkers (i.e. 96% accuracy when predicting AD against non-demented control). 3. Using very simple classifiers, like Simple Logistic or Logistic Model Trees, we have achieved the following results on 92 samples: 100 percent success in predicting Alzheimer's Disease and 92 percent in predicting Non Demented Control on the AD dataset.
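The (α, β)-k-Feature Set problem mentioned above can be made concrete with a small Python sketch. It assumes the usual formulation: on discretised data, a set of k features is valid if every pair of samples from different classes differs in at least α of the chosen features, and every pair from the same class agrees on at least β of them. The brute-force search, the function names and the toy data are illustrative assumptions; the paper's own numerical solver is not reproduced here.

```python
from itertools import combinations


def is_alpha_beta_feature_set(samples, labels, chosen, alpha, beta):
    """Check the (alpha, beta) conditions for the feature indices in `chosen`.

    samples: list of equal-length tuples of discretised (e.g. 0/1) values
    labels:  class label of each sample, in the same order
    """
    for i, j in combinations(range(len(samples)), 2):
        differing = sum(samples[i][f] != samples[j][f] for f in chosen)
        agreeing = len(chosen) - differing
        if labels[i] != labels[j] and differing < alpha:
            return False  # two classes not separated strongly enough
        if labels[i] == labels[j] and agreeing < beta:
            return False  # same-class samples not similar enough
    return True


def smallest_feature_set(samples, labels, alpha, beta):
    """Brute-force search for a smallest valid feature set (toy sizes only)."""
    n_features = len(samples[0])
    for k in range(1, n_features + 1):
        for chosen in combinations(range(n_features), k):
            if is_alpha_beta_feature_set(samples, labels, chosen, alpha, beta):
                return chosen
    return None


# Toy discretised data, made up for illustration: features 0, 2 and 3
# separate the two classes, feature 1 is noise.
samples = [(0, 0, 1, 0), (0, 1, 1, 0), (1, 0, 0, 1), (1, 1, 0, 1)]
labels = ["AD", "AD", "NDC", "NDC"]
minimal = smallest_feature_set(samples, labels, alpha=1, beta=1)
robust = smallest_feature_set(samples, labels, alpha=2, beta=2)
```

Raising α and β asks for redundancy: with α = β = 1 a single separating feature suffices in this toy case, while α = β = 2 forces a second one. Exhaustive search grows combinatorially with the number of features, so real instances need dedicated exact or heuristic solvers.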
Affiliation(s)
- Martín Gómez Ravetti
- Centre for Bioinformatics, Biomarker Discovery & Information-Based Medicine, The University of Newcastle, Callaghan, Australia
- Pablo Moscato
- Centre for Bioinformatics, Biomarker Discovery & Information-Based Medicine, The University of Newcastle, Callaghan, Australia
|