1. Abu Zaher A, Berretta R, Noman N, Moscato P. An adaptive memetic algorithm for feature selection using proximity graphs. Comput Intell 2018. [DOI: 10.1111/coin.12196] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 12/01/2022]
Affiliation(s)
- Amer Abu Zaher
  - School of Electrical Engineering and Computing, The University of Newcastle, Callaghan, Australia
- Regina Berretta
  - School of Electrical Engineering and Computing, The University of Newcastle, Callaghan, Australia
- Nasimul Noman
  - School of Electrical Engineering and Computing, The University of Newcastle, Callaghan, Australia
- Pablo Moscato
  - School of Electrical Engineering and Computing, The University of Newcastle, Callaghan, Australia
2. A Novel Clustering Methodology Based on Modularity Optimisation for Detecting Authorship Affinities in Shakespearean Era Plays. PLoS One 2016; 11:e0157988. [PMID: 27571416 PMCID: PMC5003342 DOI: 10.1371/journal.pone.0157988] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 09/08/2015] [Accepted: 06/08/2016] [Indexed: 01/22/2023]
Abstract
In this study we propose a novel, unsupervised clustering methodology for analyzing large datasets. This new, efficient methodology converts the general clustering problem into the community detection problem in graphs by using the Jensen-Shannon distance, a dissimilarity measure originating in Information Theory. Moreover, we use graph theoretic concepts for the generation and analysis of proximity graphs. Our methodology is based on a newly proposed memetic algorithm (iMA-Net) for discovering clusters of data elements by maximizing the modularity function in proximity graphs of literary works. To test the effectiveness of this general methodology, we apply it to a text corpus dataset, which contains frequencies of approximately 55,114 unique words across all 168 plays written in the Shakespearean era (16th and 17th centuries), to analyze and detect clusters of similar plays. Experimental results and comparison with state-of-the-art clustering methods demonstrate the remarkable performance of our new method for identifying high quality clusters which reflect the commonalities in the literary style of the plays.
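The two quantities this abstract relies on, the Jensen-Shannon distance between word-frequency profiles and the modularity of a graph partition, can be illustrated with a minimal sketch. This is not the iMA-Net algorithm itself, which the abstract does not specify; the function names, the base-2 logarithm, and the toy graph are assumptions chosen for illustration.

```python
import math

def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base-2 logs), in [0, 1].

    p and q are non-negative frequency vectors of equal length
    (for example, word counts for two plays). Names are hypothetical.
    """
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]          # normalise counts to probabilities
    q = [x / sq for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]   # mixture distribution

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i == 0 contribute 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative round-off

def modularity(edges, community_of):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs; community_of: dict node -> community label.
    Q = sum over communities of (L_c / m) - (d_c / (2m))^2, where L_c is the
    number of intra-community edges and d_c the total degree in community c.
    """
    m = len(edges)
    intra, degree = {}, {}
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())

# Toy graph (an assumption): two triangles joined by a single bridge edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy_partition = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
toy_q = modularity(toy_edges, toy_partition)
```

In a pipeline of the kind described, pairwise distances would define a proximity graph, and a search procedure would then look for the partition with the highest modularity; the sketch only evaluates a given partition.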
3. Gasparyan AY, Yessirkepov M, Gerasimov AN, Kostyukova EI, Kitas GD. Scientific author names: errors, corrections, and identity profiles. Biochem Med (Zagreb) 2016; 26:169-73. [PMID: 27346960 PMCID: PMC4910270 DOI: 10.11613/bm.2016.017] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 03/30/2016] [Accepted: 05/07/2016] [Indexed: 12/11/2022]
Abstract
Authorship problems are deep-rooted in the field of science communication. Some of these relate to a lack of specific journal instructions. For decades, experts in journal editing and publishing have been exploring the authorship criteria and contributions deserving either co-authorship or acknowledgment. The issue of inconsistencies in listing and abbreviating author names has come to the fore lately. There are reports on the difficulties of figuring out Chinese surnames and given names of South Indians in scholarly articles. However, it seems that problems with correct listing and abbreviating author names are global. This article presents an example of swapping second (father’s) name with surname in a ‘predatory’ journal, where numerous instances of incorrectly identifying and crediting authors went unnoticed by the journal editors, and no correction has been published. Possible solutions are discussed in relation to identifying author profiles and adjusting editorial policies to the emerging problems. Correcting mistakes with author names post-publication and integrating with the Open Researcher and Contributor ID (ORCID) platform are among them.
Affiliation(s)
- Armen Yuri Gasparyan
  - Departments of Rheumatology and Research and Development, Dudley Group NHS Foundation Trust, Russells Hall Hospital, Dudley, West Midlands, UK
- Marlen Yessirkepov
  - Department of Biochemistry, Biology and Microbiology, South Kazakhstan State Pharmaceutical Academy, Shymkent, Kazakhstan
- Alexey N Gerasimov
  - Department of Statistics and Econometrics, Stavropol State Agrarian University, Stavropol, Russian Federation
- Elena I Kostyukova
  - Department of Accounting Management, Faculty of Accounting and Finance, Stavropol State Agrarian University, Stavropol, Russian Federation
- George D Kitas
  - Departments of Rheumatology and Research and Development, Dudley Group NHS Foundation Trust, Russells Hall Hospital, Dudley, West Midlands, UK; Arthritis Research UK Epidemiology Unit, University of Manchester, Manchester, UK
4. Iteratively refining breast cancer intrinsic subtypes in the METABRIC dataset. BioData Min 2016; 9:2. [PMID: 26770261 PMCID: PMC4712506 DOI: 10.1186/s13040-015-0078-9] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 07/16/2015] [Accepted: 12/25/2015] [Indexed: 01/28/2023]
Abstract
BACKGROUND Multi-gene lists and single sample predictor models are currently used to reduce the multidimensional complexity of breast cancers, and to identify intrinsic subtypes. The perceived inability of some models to deal with the challenges of processing high-dimensional data, however, limits the accurate characterisation of these subtypes. Towards the development of robust strategies, we designed an iterative approach to consistently discriminate intrinsic subtypes and improve class prediction in the METABRIC dataset. FINDINGS In this study, we employed the CM1 score to identify the most discriminative probes for each group, and an ensemble learning technique to assess the ability of these probes to assign subtype labels using 24 different classifiers. Our analysis comprises an iterative computation of these methods and statistical measures performed on a set of over 2000 samples. The refined labels assigned using this iterative approach proved to be more consistent and in better agreement with clinicopathological markers and patients' overall survival than those originally provided by the PAM50 method. CONCLUSIONS The assignment of intrinsic subtypes has a significant impact in translational research for both understanding and managing breast cancer. The refined labelling, therefore, provides more accurate and reliable information by improving the source of fundamental science prior to clinical applications in medicine.
5. FlexDM: Simple, parallel and fault-tolerant data mining using WEKA. Source Code for Biology and Medicine 2015; 10:13. [PMID: 26579209 PMCID: PMC4647584 DOI: 10.1186/s13029-015-0045-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 01/20/2015] [Accepted: 11/09/2015] [Indexed: 12/03/2022]
Abstract
Background: With the continued exponential growth in data volume, large-scale data mining and machine learning experiments have become a necessity for many researchers without programming or statistics backgrounds. WEKA (Waikato Environment for Knowledge Analysis) is a gold standard framework that facilitates and simplifies this task by allowing specification of algorithms, hyper-parameters and test strategies from a streamlined Experimenter GUI. Despite its popularity, the WEKA Experimenter exhibits several limitations that we address in our new FlexDM software. Results: FlexDM addresses four fundamental limitations of the WEKA Experimenter: reliance on a verbose and difficult-to-modify XML schema; inability to meta-optimise experiments over a large number of algorithm hyper-parameters; inability to recover from software or hardware failure during a large experiment; and failure to leverage modern multicore processor architectures. Direct comparisons between the FlexDM and default WEKA XML schemas demonstrate a 10-fold improvement in brevity for a specification that allows finer control of experimental procedures. The stability of FlexDM has been tested on a large biological dataset (approximately 450 k attributes by 150 samples), and automatic parallelisation of tasks yields a quasi-linear reduction in execution time when distributed across multiple processor cores. Conclusion: FlexDM is a powerful and easy-to-use extension to the WEKA package, which better handles the increased volume and complexity of data that has emerged during the 20 years since WEKA’s original development. FlexDM has been tested on Windows, OSX and Linux operating systems and is provided as a pre-configured virtual reference environment for trivial usage and extensibility. This software can substantially improve the productivity of any research group conducting large-scale data mining or machine learning tasks, in addition to providing non-programmers with improved control over specific aspects of their data analysis pipeline via a succinct and simplified XML schema. Electronic supplementary material: The online version of this article (doi:10.1186/s13029-015-0045-3) contains supplementary material, which is available to authorized users.
6. Milioli HH, Vimieiro R, Riveros C, Tishchenko I, Berretta R, Moscato P. The Discovery of Novel Biomarkers Improves Breast Cancer Intrinsic Subtype Prediction and Reconciles the Labels in the METABRIC Data Set. PLoS One 2015; 10:e0129711. [PMID: 26132585 PMCID: PMC4488510 DOI: 10.1371/journal.pone.0129711] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 03/31/2014] [Accepted: 05/12/2015] [Indexed: 02/07/2023]
Abstract
BACKGROUND The prediction of breast cancer intrinsic subtypes has been introduced as a valuable strategy to determine patient diagnosis and prognosis, and therapy response. The PAM50 method, based on the expression levels of 50 genes, uses a single sample predictor model to assign subtype labels to samples. Intrinsic errors reported within this assay demonstrate the challenge of identifying and understanding the breast cancer groups. In this study, we aim to: a) identify novel biomarkers for subtype individuation by exploring the competence of a newly proposed method named CM1 score, and b) apply ensemble learning, as opposed to the use of a single classifier, for sample subtype assignment. The overarching objective is to improve class prediction. METHODS AND FINDINGS The microarray transcriptome data sets used in this study are: the METABRIC breast cancer data recorded for over 2000 patients, and the public integrated source from the ROCK database with 1570 samples. We first computed the CM1 score to identify the probes with highly discriminative patterns of expression across samples of each intrinsic subtype. We further assessed the ability of 42 selected probes to assign correct subtype labels using 24 different classifiers from the Weka software suite. For comparison, the same method was applied to the list of 50 genes from the PAM50 method. CONCLUSIONS The CM1 score portrayed 30 novel biomarkers for predicting breast cancer subtypes, with the confirmation of the role of 12 well-established genes. Intrinsic subtypes assigned using the CM1 list and the ensemble of classifiers are more consistent and homogeneous than the original PAM50 labels. The new subtypes show accurate distributions of current clinical markers ER, PR and HER2, and survival curves in the METABRIC and ROCK data sets. Remarkably, the paradoxical attribution of the original labels reinforces the limitations of employing single sample classifiers to predict breast cancer intrinsic subtypes.
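The idea of assigning a subtype label from an ensemble rather than a single classifier can be sketched as follows. The abstract does not state how the 24 classifiers' outputs were combined, so plurality voting is an assumption here, as are the function names and the toy labels.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample.

    predictions: list of lists, one inner list per classifier, all of the
    same length (one label per sample). Ties are resolved in favour of the
    label that was seen first among the tied ones, which is a property of
    Counter.most_common and an arbitrary choice for this sketch.
    """
    n_samples = len(predictions[0])
    voted = []
    for i in range(n_samples):
        votes = Counter(clf[i] for clf in predictions)
        voted.append(votes.most_common(1)[0][0])
    return voted

def agreement(predictions, voted):
    """Fraction of classifiers that agree with the voted label, per sample."""
    n_clf = len(predictions)
    return [sum(clf[i] == label for clf in predictions) / n_clf
            for i, label in enumerate(voted)]

# Hypothetical outputs of three classifiers on four samples.
toy_predictions = [
    ["LumA", "LumB", "Basal", "Her2"],
    ["LumA", "LumA", "Basal", "Her2"],
    ["LumB", "LumA", "Basal", "Normal"],
]
toy_labels = majority_vote(toy_predictions)
toy_agreement = agreement(toy_predictions, toy_labels)
```

The per-sample agreement gives a simple measure of how contested a label is, which is one way to flag samples whose assigned subtype may deserve a second look.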
Affiliation(s)
- Heloisa Helena Milioli
  - Priority Research Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, Hunter Medical Research Institute, New Lambton Heights, NSW, Australia
  - School of Environmental and Life Science, The University of Newcastle, Callaghan, NSW, Australia
- Renato Vimieiro
  - Priority Research Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, Hunter Medical Research Institute, New Lambton Heights, NSW, Australia
  - Centro de Informática, Universidade Federal de Pernambuco, Recife, PE, Brazil
- Carlos Riveros
  - Priority Research Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, Hunter Medical Research Institute, New Lambton Heights, NSW, Australia
  - School of Electrical Engineering and Computer Science, The University of Newcastle, Callaghan, NSW, Australia
- Inna Tishchenko
  - Priority Research Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, Hunter Medical Research Institute, New Lambton Heights, NSW, Australia
  - School of Electrical Engineering and Computer Science, The University of Newcastle, Callaghan, NSW, Australia
- Regina Berretta
  - Priority Research Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, Hunter Medical Research Institute, New Lambton Heights, NSW, Australia
  - School of Electrical Engineering and Computer Science, The University of Newcastle, Callaghan, NSW, Australia
- Pablo Moscato
  - Priority Research Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine, Hunter Medical Research Institute, New Lambton Heights, NSW, Australia
  - School of Electrical Engineering and Computer Science, The University of Newcastle, Callaghan, NSW, Australia
7. de Vries NJ, Reis R, Moscato P. Clustering consumers based on trust, confidence and giving behaviour: data-driven model building for charitable involvement in the Australian not-for-profit sector. PLoS One 2015; 10:e0122133. [PMID: 25849547 PMCID: PMC4388642 DOI: 10.1371/journal.pone.0122133] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 12/01/2014] [Accepted: 02/11/2015] [Indexed: 11/19/2022]
Abstract
Organisations in the Not-for-Profit and charity sector face increasing competition to win time, money and efforts from a common donor base. Consequently, these organisations need to be more proactive than ever. The increased level of communications between individuals and organisations today heightens the need for investigating the drivers of charitable giving and understanding the various consumer groups, or donor segments, within a population. It is contended that 'trust' is the cornerstone of the not-for-profit sector's survival, making it an inevitable topic for research in this context. It has become imperative for charities and not-for-profit organisations to adopt the for-profit sector's research, marketing and targeting strategies. This study provides the not-for-profit sector with an easily-interpretable segmentation method based on a novel unsupervised clustering technique (MST-kNN) followed by a feature saliency method (the CM1 score). A sample of 1,562 respondents from a survey conducted by the Australian Charities and Not-for-profits Commission is analysed to reveal donor segments. Each cluster's most salient features are identified using the CM1 score. Furthermore, symbolic regression modelling is employed to find cluster-specific models to predict 'low' or 'high' involvement in clusters. The MST-kNN method found seven clusters. Based on their salient features they were labelled as: the 'non-institutionalist charities supporters', the 'resource allocation critics', the 'information-seeking financial sceptics', the 'non-questioning charity supporters', the 'non-trusting sceptics', the 'charity management believers' and the 'institutionalist charity believers'. Each cluster exhibits its own characteristics as well as different drivers of 'involvement'. The method in this study provides the not-for-profit sector with a guideline for clustering, segmenting, understanding and potentially targeting their donor base better. If charities and not-for-profit organisations adopt these strategies, they will be more successful in today's competitive environment.
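The abstract names the MST-kNN clustering technique without describing it. The sketch below is a simplified, single-pass illustration that assumes, from the name alone, that the method keeps the minimum spanning tree edges that also appear in a k-nearest-neighbour graph and treats the resulting connected components as clusters. The fixed `k`, the Euclidean distance, and all function names are assumptions, so this should not be read as the published procedure.

```python
import math

def mst_knn_clusters(points, k):
    """Cluster points by intersecting an MST with a kNN graph (one pass)."""
    n = len(points)

    def dist(i, j):
        return math.dist(points[i], points[j])  # Euclidean, Python 3.8+

    # Prim's algorithm: best[v] = (cheapest distance to the tree, tree node).
    best = {j: (dist(0, j), 0) for j in range(1, n)}
    mst = set()
    while best:
        j = min(best, key=lambda v: best[v][0])
        _, i = best.pop(j)
        mst.add(frozenset((i, j)))
        for v in best:
            d = dist(j, v)
            if d < best[v][0]:
                best[v] = (d, j)

    # kNN graph: an undirected edge exists if either endpoint lists the
    # other among its k nearest neighbours.
    knn = set()
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist(i, j))
        knn.update(frozenset((i, j)) for j in others[:k])

    # Keep MST edges supported by the kNN graph, then find components
    # with a small union-find structure.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for edge in mst & knn:
        a, b = tuple(edge)
        parent[find(a)] = find(b)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Two well-separated toy groups of three points each (hypothetical data).
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_clusters = mst_knn_clusters(toy_points, k=2)
```

On the toy data the only MST edge that bridges the two groups is not among any point's two nearest neighbours, so it is dropped and the groups separate. In a survey setting the points would be respondents' answer vectors and the distance would need to suit the data type.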
Affiliation(s)
- Natalie Jane de Vries
  - Centre for Bioinformatics, Biomarker Discovery & Information-Based Medicine, The University of Newcastle, Callaghan, New South Wales, Australia
- Rodrigo Reis
  - Centre for Bioinformatics, Biomarker Discovery & Information-Based Medicine, The University of Newcastle, Callaghan, New South Wales, Australia
  - Faculdade de Medicina de Ribeirao Preto, Universidade de São Paulo, São Paulo, Brazil
- Pablo Moscato
  - Centre for Bioinformatics, Biomarker Discovery & Information-Based Medicine, The University of Newcastle, Callaghan, New South Wales, Australia
8. An information theoretic clustering approach for unveiling authorship affinities in Shakespearean era plays and poems. PLoS One 2014; 9:e111445. [PMID: 25347727 PMCID: PMC4210181 DOI: 10.1371/journal.pone.0111445] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Received: 02/03/2014] [Accepted: 10/02/2014] [Indexed: 01/17/2023]
Abstract
In this paper we analyse the word frequency profiles of a set of works from the Shakespearean era to uncover patterns of relationship between them, highlighting the connections within authorial canons. We used a text corpus comprising 256 plays and poems from the 16th and 17th centuries, with 17 works of uncertain authorship. Our clustering approach is based on the Jensen-Shannon divergence and a graph partitioning algorithm, and our results show that authors' characteristic styles are very powerful factors in explaining the variation of word use, frequently transcending cross-cutting factors like the differences between tragedy and comedy, early and late works, and plays and poems. Our method also provides an empirical guide to the authorship of plays and poems where this is unknown or disputed.
9. Filiou MD, Arefin AS, Moscato P, Graeber MB. 'Neuroinflammation' differs categorically from inflammation: transcriptomes of Alzheimer's disease, Parkinson's disease, schizophrenia and inflammatory diseases compared. Neurogenetics 2014; 15:201-12. [PMID: 24928144 DOI: 10.1007/s10048-014-0409-x] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.4] [Received: 05/30/2014] [Accepted: 06/02/2014] [Indexed: 12/30/2022]
Abstract
'Neuroinflammation' has become a widely applied term in the basic and clinical neurosciences but there is no generally accepted neuropathological tissue correlate. Inflammation, which is characterized by the presence of perivascular infiltrates of cells of the adaptive immune system, is indeed seen in the central nervous system (CNS) under certain conditions. Authors who refer to microglial activation as neuroinflammation confuse this issue because autoimmune neuroinflammation serves as a synonym for multiple sclerosis, the prototypical inflammatory disease of the CNS. We have asked the question whether a data-driven, unbiased in silico approach may help to clarify the nomenclatorial confusion. Specifically, we have examined whether unsupervised analysis of microarray data obtained from human cerebral cortex of Alzheimer's, Parkinson's and schizophrenia patients would reveal a degree of relatedness between these diseases and recognized inflammatory conditions including multiple sclerosis. Our results using two different data analysis methods provide strong evidence against this hypothesis demonstrating that very different sets of genes are involved. Consequently, the designations inflammation and neuroinflammation are not interchangeable. They represent different categories not only at the histophenotypic but also at the transcriptomic level. Therefore, non-autoimmune neuroinflammation remains a term in need of definition.
Affiliation(s)
- Michaela D Filiou
- Max Planck Institute of Psychiatry, Kraepelinstraße 2, 80804, Munich, Germany