1
|
Mallik S, Seth S, Si A, Bhadra T, Zhao Z. Optimal ranking and directional signature classification using the integral strategy of multi-objective optimization-based association rule mining of multi-omics data. FRONTIERS IN BIOINFORMATICS 2023; 3:1182176. [PMID: 37576714 PMCID: PMC10415913 DOI: 10.3389/fbinf.2023.1182176] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2023] [Accepted: 06/19/2023] [Indexed: 08/15/2023] Open
Abstract
Introduction: Association rule mining (ARM) is a powerful tool for exploring the informative relationships among multiple items (genes) in any dataset. The main problem of ARM is that it generates many rules with different rule-informative values, which makes it challenging for the user to choose the effective rules. In addition, few works have addressed the integration of multiple biological datasets and variable cutoff values in ARM. Methods: To solve these problems, we developed a novel framework, MOOVARM (multi-objective optimized variable cutoff-based association rule mining), for multi-omics profiles. Results: We identified the positive ideal solution (PIS), which maximized the profit and minimized the loss, and the negative ideal solution (NIS), which minimized the profit and maximized the loss, for all gene sets (item sets) belonging to each extracted rule. Thereafter, we computed the distance (d+) from PIS and the distance (d-) from NIS for each gene set or product. These two distances played an important role in determining the optimized associations among various pairs of genes in the multi-omics dataset. We then globally estimated the relative closeness to PIS for ranking the gene sets. When the relative closeness score of a rule is greater than or equal to the pre-defined threshold value, the rule is considered a final resultant rule. Moreover, MOOVARM evaluated the relative score of the rule based on the status of all genes instead of individual genes. Conclusions: MOOVARM produced the final rank of the extracted (multi-objective optimized) rules of correlated genes, which had better disease classification than the state-of-the-art algorithms on gene signature identification.
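The ranking step described in this abstract (PIS, NIS, the two distances, and relative closeness) can be illustrated with a minimal Python sketch. The abstract gives no formulas, so the Euclidean distance, the closeness ratio d- / (d+ + d-), the absence of normalization and weighting, and the names `relative_closeness` and `select_rules` are all assumptions made for illustration, not the published MOOVARM implementation.

```python
import math

def relative_closeness(scores, is_profit):
    """Return one closeness value in [0, 1] per rule.

    scores:    one row per rule, one column per criterion (hypothetical layout).
    is_profit: True for criteria to maximize ("profit"), False for "loss".
    """
    columns = list(zip(*scores))
    # Positive ideal solution: best value per criterion; negative ideal: worst.
    pis = [max(col) if up else min(col) for col, up in zip(columns, is_profit)]
    nis = [min(col) if up else max(col) for col, up in zip(columns, is_profit)]
    closeness = []
    for row in scores:
        d_plus = math.dist(row, pis)    # distance from PIS (assumed Euclidean)
        d_minus = math.dist(row, nis)   # distance from NIS (assumed Euclidean)
        total = d_plus + d_minus
        closeness.append(d_minus / total if total else 0.0)
    return closeness

def select_rules(closeness, threshold):
    """Indices of rules with closeness >= threshold, best first."""
    kept = [i for i, c in enumerate(closeness) if c >= threshold]
    return sorted(kept, key=lambda i: closeness[i], reverse=True)
```

For example, `relative_closeness([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]], [True, False])` scores the first rule highest, because it coincides with the positive ideal solution on both criteria.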
Affiliation(s)
- Saurav Mallik
- Environmental Health, Harvard T. H. Chan School of Public Health, Boston, MA, United States
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
| | - Soumita Seth
- Department of Computer Science and Engineering, Brainware University, Kolkata, India
- Department of Computer Science and Engineering, Aliah University, Kolkata, India
| | - Amalendu Si
- School of Information Technology, Maulana Abul Kalam Azad University of Technology, Haringhata, India
| | - Tapas Bhadra
- Department of Computer Science and Engineering, Aliah University, Kolkata, India
| | - Zhongming Zhao
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
- Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, United States
| |
|
2
|
Đurasević M, Đumić M. Automated design of heuristics for the container relocation problem using genetic programming. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2022.109696] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2022]
|
3
|
A convex multi-class model via distance metric learning based class-to-instance confidence. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109791] [Citation(s) in RCA: 0] [Impact Index Per Article: 0]
|
4
|
An Algorithm Framework for Drug-Induced Liver Injury Prediction Based on Genetic Algorithm and Ensemble Learning. Molecules 2022; 27:3112. [PMID: 35630587 PMCID: PMC9147181 DOI: 10.3390/molecules27103112] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/13/2022] [Revised: 05/05/2022] [Accepted: 05/10/2022] [Indexed: 11/19/2022] Open
Abstract
In the process of drug discovery, drug-induced liver injury (DILI) is still an active research field and is one of the most common and important issues in toxicity evaluation research. It directly leads to a high attrition rate of drugs. At present, there are a variety of computer algorithms based on molecular representations to predict DILI. It has been found that a single molecular representation method is insufficient to complete the task of toxicity prediction, and multiple molecular fingerprint fusion methods have been used as model input. In order to solve the problem of high-dimensional and unbalanced DILI prediction data, this paper integrates existing datasets and designs a new algorithm framework, Rotation-Ensemble-GA (R-E-GA). The main idea is to find a feature subset with better predictive performance after rotating the fusion vector of high-dimensional molecular representation in the feature space. Then, an Adaboost-type ensemble learning method is integrated into R-E-GA to improve the prediction accuracy. The experimental results show that the performance of R-E-GA is better than other state-of-the-art algorithms, including ensemble learning-based and graph neural network-based methods. Through five-fold cross-validation, R-E-GA obtains an ACC of 0.77, an F1 score of 0.769, and an AUC of 0.842.
|
5
|
Ke S, Pollock NR, Wang XW, Chen X, Daugherty K, Lin Q, Xu H, Garey KW, Gonzales-Luna AJ, Kelly CP, Liu YY. Integrating gut microbiome and host immune markers to understand the pathogenesis of Clostridioides difficile infection. Gut Microbes 2021; 13:1-18. [PMID: 34132169 PMCID: PMC8210874 DOI: 10.1080/19490976.2021.1935186] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Indexed: 02/06/2023] Open
Abstract
Clostridioides difficile (C. difficile) infection is the most common cause of healthcare-associated infection and an important cause of morbidity and mortality among hospitalized patients. A comprehensive understanding of C. difficile infection (CDI) pathogenesis is crucial for disease diagnosis, treatment, and prevention. Here, we characterized gut microbial compositions and a broad panel of innate and adaptive immunological markers in 243 well-characterized human subjects (including 187 subjects with both microbiota and immune marker data), who were divided into four phenotype groups: CDI, Asymptomatic Carriage, Non-CDI Diarrhea, and Control. We found that the interactions between gut microbiota and host immune markers are very sensitive to the status of C. difficile colonization and infection. We demonstrated that incorporating both gut microbiome and host immune marker data into classification models can better distinguish CDI from other groups than can either type of data alone. Our classification models display robust diagnostic performance to differentiate CDI from Asymptomatic Carriage (AUC~0.916), Non-CDI Diarrhea (AUC~0.917), or Non-CDI that combines the other three groups (AUC~0.929). Finally, we performed symbolic classification using selected features to derive simple mathematical formulas that explicitly quantify the interactions between the gut microbiome and host immune markers. These findings support the potential roles of gut microbiota and host immune markers in the pathogenesis of CDI. Our study provides new insights for a microbiome-immune marker-derived signature to diagnose CDI and design therapeutic strategies for CDI.
Affiliation(s)
- Shanlin Ke
- Channing Division of Network Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, Massachusetts, USA; School of Animal Science and Technology, State Key Laboratory of Pig Genetic Improvement and Production Technology, Jiangxi Agricultural University 330045, China
| | - Nira R. Pollock
- Division of Infectious Diseases, Department of Medicine, Beth Israel Deaconess Medical Center, Boston, Massachusetts, USA; Department of Laboratory Medicine, Boston Children’s Hospital, Boston, Massachusetts, USA
| | - Xu-Wen Wang
- Channing Division of Network Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, Massachusetts, USA
| | - Xinhua Chen
- Division of Gastroenterology, Department of Medicine, Beth Israel Deaconess Medical Center, Boston, Massachusetts, USA
| | - Kaitlyn Daugherty
- Division of Gastroenterology, Department of Medicine, Beth Israel Deaconess Medical Center, Boston, Massachusetts, USA
| | - Qianyun Lin
- Division of Gastroenterology, Department of Medicine, Beth Israel Deaconess Medical Center, Boston, Massachusetts, USA
| | - Hua Xu
- Division of Gastroenterology, Department of Medicine, Beth Israel Deaconess Medical Center, Boston, Massachusetts, USA
| | - Kevin W. Garey
- Department of Pharmacy Practice and Translation Research, University of Houston College of Pharmacy, Houston, Texas, USA
| | - Anne J. Gonzales-Luna
- Department of Pharmacy Practice and Translation Research, University of Houston College of Pharmacy, Houston, Texas, USA
| | - Ciarán P. Kelly
- Division of Gastroenterology, Department of Medicine, Beth Israel Deaconess Medical Center, Boston, Massachusetts, USA
| | - Yang-Yu Liu
- Channing Division of Network Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, Massachusetts, USA; contact: Yang-Yu Liu, Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, Massachusetts, USA
| |
|
6
|
Ma J, Gao X. Designing genetic programming classifiers with feature selection and feature construction. Appl Soft Comput 2020. [DOI: 10.1016/j.asoc.2020.106826] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
|
7
|
A Genetic Programming Strategy to Induce Logical Rules for Clinical Data Analysis. Processes (Basel) 2020. [DOI: 10.3390/pr8121565] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Open
Abstract
This paper proposes a machine learning approach based on genetic programming to build classifiers through logical rule induction. In this context, we define and test a set of mutation operators across different clinical datasets to improve the performance of the proposal for each dataset. The use of genetic programming for rule induction has generated interesting results in machine learning problems. Hence, genetic programming represents a flexible and powerful evolutionary technique for automatic generation of classifiers. Since logical rules disclose knowledge from the analyzed data, we use such knowledge to interpret the results and filter the most important features from clinical data as a process of knowledge discovery. The ultimate goal of this proposal is to provide the experts in the data domain with prior knowledge (as a guide) about the structure of the data and the rules found for each class, especially to track dichotomies and inequality. The results reached by our proposal on the involved datasets have been very promising when used in classification tasks and compared with other methods.
|
8
|
Krzhizhanovskaya VV, Závodszky G, Lees MH, Dongarra JJ, Sloot PMA, Brissos S, Teixeira J. Performance Analysis of Binarization Strategies for Multi-class Imbalanced Data Classification. LECTURE NOTES IN COMPUTER SCIENCE 2020. [PMCID: PMC7303687 DOI: 10.1007/978-3-030-50423-6_11] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/28/2022]
Abstract
Multi-class imbalanced classification tasks are characterized by the skewed distribution of examples among the classes and, usually, strong overlapping between class regions in the feature space. Furthermore, frequently the goal of the final system is to obtain very high precision for each of the concepts. All of these factors contribute to the complexity of the task and increase the difficulty of building a quality data model by learning algorithms. One way of addressing these challenges is the use of so-called binarization strategies, which allow for decomposition of the multi-class problem into several binary tasks with lower complexity. Because of the different decomposition schemes used by each of those methods, some of them are considered to be better suited for handling imbalanced data than the others. In this study, we focus on the well-known binary approaches, namely One-Vs-All, One-Vs-One, and Error-Correcting Output Codes, and their effectiveness in multi-class imbalanced data classification, with respect to the base classifiers and various aggregation schemes for each of the strategies. We compare the performance of these approaches and try to boost the performance of seemingly weaker methods by sampling algorithms. The detailed comparative experimental study of the considered methods, supported by the statistical analysis, is presented. The results show the differences among various binarization strategies. We show how one can mitigate those differences using simple oversampling methods.
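To make the decomposition idea concrete, here is a small Python sketch of how One-Vs-All and One-Vs-One turn a multi-class label vector into binary tasks. The function names and the returned data layout are hypothetical choices for illustration. Error-Correcting Output Codes and the aggregation schemes studied in the paper are not covered.

```python
from itertools import combinations

def one_vs_all(labels):
    """One binary task per class: 1 for that class, 0 for every other class."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}

def one_vs_one(labels):
    """One binary task per class pair, keeping only examples of those two classes.

    Each task is a list of (example_index, binary_label) tuples, where the
    first class of the pair is encoded as 1 and the second as 0.
    """
    classes = sorted(set(labels))
    tasks = {}
    for a, b in combinations(classes, 2):
        tasks[(a, b)] = [
            (i, 1 if y == a else 0) for i, y in enumerate(labels) if y in (a, b)
        ]
    return tasks
```

With k classes, One-Vs-All yields k tasks that each use all examples, so a rare class faces every other example as negatives. One-Vs-One yields k(k-1)/2 smaller tasks, each restricted to two classes.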
|
9
|
|
10
|
Mallik S, Bhadra T, Mukherji A. DTFP-Growth: Dynamic Threshold-Based FP-Growth Rule Mining Algorithm Through Integrating Gene Expression, Methylation, and Protein-Protein Interaction Profiles. IEEE Trans Nanobioscience 2018; 17:117-125. [PMID: 29870335 DOI: 10.1109/tnb.2018.2803021] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 11/10/2022]
Abstract
Association rule mining is an important technique for identifying interesting relationships between gene pairs in a biological data set. Earlier methods basically work for a single biological data set, and, in most cases, a single minimum support cutoff can be applied globally, i.e., across all genesets/itemsets. To overcome this limitation, in this paper, we propose a dynamic threshold-based FP-growth rule mining algorithm that integrates gene expression, methylation and protein-protein interaction profiles based on weighted shortest distance to find novel associations among different pairs of genes in multi-view data sets. For this purpose, we introduce three new thresholds, namely, Distance-based Variable/Dynamic Supports (DVS), Distance-based Variable Confidences (DVC), and Distance-based Variable Lifts (DVL) for each rule by integrating co-expression, co-methylation, and protein-protein interactions existing in the multi-omics data set. We develop the proposed algorithm utilizing these three novel multiple threshold measures. In the proposed algorithm, the values of DVS, DVC, and DVL are computed for each rule separately, and subsequently it is verified whether the support, confidence, and lift of each evolved rule are greater than or equal to the corresponding individual DVS, DVC, and DVL values, respectively. If all three conditions hold for a rule, the rule is treated as a resultant rule. One of the major advantages of the proposed method compared with other related state-of-the-art methods is that it considers both the quantitative and interactive significance among all pairwise genes belonging to each rule. Moreover, the proposed method generates fewer rules, takes less running time, and provides greater biological significance for the resultant top-ranking rules compared to previous methods.
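The acceptance test in this abstract compares a rule's support, confidence, and lift with three per-rule thresholds. The sketch below shows that check in Python using the standard definitions of the three measures. How the distance-based thresholds themselves are derived is not described in the abstract, so they are passed in as plain numbers here. The function names are hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift of the rule antecedent -> consequent.

    transactions: list of sets of items (for example, gene identifiers).
    """
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= t)          # contain antecedent
    n_c = sum(1 for t in transactions if c <= t)          # contain consequent
    n_ac = sum(1 for t in transactions if (a | c) <= t)   # contain both
    support = n_ac / n if n else 0.0
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift

def is_resultant_rule(metrics, thresholds):
    """True when every measure meets its own per-rule threshold.

    thresholds: assumed (support, confidence, lift) cutoffs for this rule,
    standing in for the per-rule values the abstract describes.
    """
    return all(m >= t for m, t in zip(metrics, thresholds))
```

Because the cutoffs are supplied per rule rather than once for the whole run, two rules with identical support can receive different verdicts.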
|
11
|
A novel effective diagnosis model based on optimized least squares support machine for gene microarray. Appl Soft Comput 2018. [DOI: 10.1016/j.asoc.2018.02.009] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Indexed: 11/21/2022]
|
12
|
Liu KH, Zeng ZH, Ng VTY. A Hierarchical Ensemble of ECOC for cancer classification based on multi-class microarray data. Inf Sci (N Y) 2016. [DOI: 10.1016/j.ins.2016.02.028] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Indexed: 10/22/2022]
|
13
|
Sen A, Islam MM, Murase K, Yao X. Binarization With Boosting and Oversampling for Multiclass Classification. IEEE TRANSACTIONS ON CYBERNETICS 2016; 46:1078-1091. [PMID: 25955858 DOI: 10.1109/tcyb.2015.2423295] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 06/04/2023]
Abstract
Using a set of binary classifiers to solve multiclass classification problems has been a popular approach over the years. The decision boundaries learnt by binary classifiers (also called base classifiers) are much simpler than those learnt by multiclass classifiers. This paper proposes a new classification framework, termed binarization with boosting and oversampling (BBO), for efficiently solving multiclass classification problems. The new framework is devised based on the one-versus-all (OVA) binarization technique. Unlike most previous work, BBO employs boosting for solving the hard-to-learn instances and oversampling for handling the class-imbalance problem arising due to OVA binarization. These two features make BBO different from other existing works. Our new framework has been tested extensively on several multiclass supervised and semi-supervised classification problems using five different base classifiers, including neural networks, C4.5, k-nearest neighbor, repeated incremental pruning to produce error reduction, support vector machine, random forest, and learning with local and global consistency. Experimental results show that BBO can exhibit better performance compared to its counterparts on supervised and semi-supervised classification problems.
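OVA binarization makes each binary task imbalanced, since one class is set against all the others. The Python sketch below shows the simplest way to rebalance such a task, by duplicating randomly chosen minority examples. This random duplication is a stand-in chosen for illustration. The abstract does not specify which oversampling scheme BBO uses, the boosting component is omitted, and `oversample_binary` is a hypothetical name.

```python
import random

def oversample_binary(X, y, seed=0):
    """Duplicate random minority-class examples until both classes are equal in size.

    X: list of feature vectors; y: list of 0/1 labels from an OVA task.
    Returns new lists; the original examples come first, duplicates after.
    """
    rng = random.Random(seed)
    pos = [i for i, v in enumerate(y) if v == 1]
    neg = [i for i, v in enumerate(y) if v == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        # Nothing to duplicate when one class is absent.
        return list(X), list(y)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]
```

Oversampling should be applied to the training portion only, otherwise duplicated examples leak into the evaluation data.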
|
14
|
Nag K, Pal NR. A Multiobjective Genetic Programming-Based Ensemble for Simultaneous Feature Selection and Classification. IEEE TRANSACTIONS ON CYBERNETICS 2016; 46:499-510. [PMID: 25769178 DOI: 10.1109/tcyb.2015.2404806] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.5] [Indexed: 06/04/2023]
Abstract
We present an integrated algorithm for simultaneous feature selection (FS) and designing of diverse classifiers using a steady state multiobjective genetic programming (GP), which minimizes three objectives: 1) false positives (FPs); 2) false negatives (FNs); and 3) the number of leaf nodes in the tree. Our method divides a c-class problem into c binary classification problems. It evolves c sets of genetic programs to create c ensembles. During the mutation operation, our method exploits the fitness as well as unfitness of features, which dynamically change with generations, with a view to using a set of highly relevant features with low redundancy. The classifiers of the ith class determine the net belongingness of an unknown data point to the ith class using a weighted voting scheme, which makes use of the FP and FN mistakes made on the training data. We test our method on eight microarray and 11 text data sets with a diverse number of classes (from 2 to 44), a large number of features (from 2000 to 49 151), and a high feature-to-sample ratio (from 1.03 to 273.1). We compare our method with a bi-objective GP scheme that does not use any FS and rule size reduction strategy. The comparison demonstrates the effectiveness of the proposed FS and rule size reduction schemes. Furthermore, we compare our method with four classification methods in conjunction with six feature selection algorithms and the full feature set. Our scheme performs the best for 380 out of 474 combinations of data sets, algorithm and FS method.
|
15
|
Genetic programming based ensemble system for microarray data classification. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2015; 2015:193406. [PMID: 25810748 PMCID: PMC4355811 DOI: 10.1155/2015/193406] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 10/22/2014] [Revised: 01/01/2015] [Accepted: 01/19/2015] [Indexed: 11/18/2022]
Abstract
Recently, more and more machine learning techniques have been applied to microarray data analysis. The aim of this study is to propose a genetic programming (GP) based new ensemble system (named GPES), which can be used to effectively classify different types of cancers. Decision trees are deployed as base classifiers in this ensemble framework with three operators: Min, Max, and Average. Each individual of the GP is an ensemble system, and they become more and more accurate in the evolutionary process. The feature selection technique and balanced subsampling technique are applied to increase the diversity in each ensemble system. The final ensemble committee is selected by a forward search algorithm, which is shown to be capable of fitting data automatically. The performance of GPES is evaluated using five binary class and six multiclass microarray datasets, and results show that the algorithm can achieve better results in most cases compared with some other ensemble systems. By using elaborate base classifiers or applying other sampling techniques, the performance of GPES may be further improved.
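The three operators named in this abstract (Min, Max, and Average) can be read as rules for fusing the class scores of several base classifiers. The Python sketch below shows a flat, single-level version of that fusion. It is an assumption for illustration, since the abstract does not describe how GPES arranges the operators, and the names `combine` and `predict` are hypothetical.

```python
from statistics import mean

# The three fusion operators named in the abstract.
OPERATORS = {"min": min, "max": max, "average": mean}

def combine(probabilities, operator):
    """Fuse per-class scores from several base classifiers.

    probabilities: one list of class scores per base classifier, all the
    same length (hypothetical layout, e.g. decision-tree class frequencies).
    """
    op = OPERATORS[operator]
    n_classes = len(probabilities[0])
    return [op([p[k] for p in probabilities]) for k in range(n_classes)]

def predict(probabilities, operator):
    """Index of the class with the highest fused score."""
    fused = combine(probabilities, operator)
    return max(range(len(fused)), key=fused.__getitem__)
```

The choice of operator can change the decision. For the scores `[[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]]`, Min and Max both favor class 0, while Average favors class 1.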
|
16
|
Tiwari AK, Srivastava R. A survey of computational intelligence techniques in protein function prediction. INTERNATIONAL JOURNAL OF PROTEOMICS 2014; 2014:845479. [PMID: 25574395 PMCID: PMC4276698 DOI: 10.1155/2014/845479] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 09/10/2014] [Revised: 10/31/2014] [Accepted: 11/07/2014] [Indexed: 02/08/2023]
Abstract
In the past, knowledge of previously unknown proteins grew massively with the advancement of high-throughput microarray technologies. Protein function prediction is the most challenging problem in bioinformatics. Previously, homology-based approaches were used to predict protein function, but they failed when a new protein was different from the previous ones. Therefore, to alleviate the problems associated with traditional homology-based approaches, numerous computational intelligence techniques have been proposed in the recent past. This paper presents a state-of-the-art comprehensive review of various computational intelligence techniques for protein function prediction using sequence, structure, protein-protein interaction network, and gene expression data, used in wide areas of applications such as prediction of DNA and RNA binding sites, subcellular localization, enzyme functions, signal peptides, catalytic residues, nuclear/G-protein coupled receptors, membrane proteins, and pathway analysis from gene expression datasets. This paper also summarizes the results obtained by many researchers in solving these problems by using computational intelligence techniques with appropriate datasets to improve the prediction performance. The summary shows that ensemble classifiers and integration of multiple heterogeneous data are useful for protein function prediction.
Affiliation(s)
- Arvind Kumar Tiwari
- Department of Computer Science & Engineering, Indian Institute of Technology (BHU), Varanasi 221005, India
| | - Rajeev Srivastava
- Department of Computer Science & Engineering, Indian Institute of Technology (BHU), Varanasi 221005, India
| |
|
17
|
Reboiro-Jato M, Díaz F, Glez-Peña D, Fdez-Riverola F. A novel ensemble of classifiers that use biological relevant gene sets for microarray classification. Appl Soft Comput 2014. [DOI: 10.1016/j.asoc.2014.01.002] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 01/30/2023]
|
18
|
A Comparative Study of Cancer Classification Methods Using Microarray Gene Expression Profile. LECTURE NOTES IN ELECTRICAL ENGINEERING 2014. [DOI: 10.1007/978-981-4585-18-7_44] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Indexed: 01/20/2023]
|
19
|
Analyzing the presence of noise in multi-class problems: alleviating its influence with the One-vs-One decomposition. Knowl Inf Syst 2012. [DOI: 10.1007/s10115-012-0570-1] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.8] [Indexed: 10/27/2022]
|
20
|
Khan MW, Alam M. A survey of application: genomics and genetic programming, a new frontier. Genomics 2012; 100:65-71. [PMID: 22683715 DOI: 10.1016/j.ygeno.2012.05.014] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Received: 12/24/2011] [Revised: 05/22/2012] [Accepted: 05/29/2012] [Indexed: 11/15/2022]
Abstract
The aim of this paper is to provide an introduction to the rapidly developing field of genetic programming (GP). Particular emphasis is placed on the application of GP to genomics. First, the basic methodology of GP is introduced. This is followed by a review of applications in the areas of gene network inference, gene expression data analysis, SNP analysis, epistasis analysis and gene annotation. Finally, this paper concludes by suggesting potential avenues of future research on genetic programming, opportunities to extend the technique, and areas for possible practical applications.
Affiliation(s)
- Mohammad Wahab Khan
- Department of Computer Science, Jamia Millia Islamia, Maulana Mohammad Ali Jauhar Marg, New Delhi 110025, India.
| | | |
|
21
|
Genetic Programming as a tool for identification of analyte-specificity from complex response patterns using a non-specific whole-cell biosensor. Biosens Bioelectron 2012; 33:254-9. [PMID: 22325714 DOI: 10.1016/j.bios.2012.01.015] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 12/07/2011] [Revised: 01/09/2012] [Accepted: 01/13/2012] [Indexed: 11/23/2022]
Abstract
Whole-cell biosensors are mostly non-specific with respect to their detection capabilities for toxicants, and therefore offer an interesting perspective in environmental monitoring. However, to fully employ this feature, a robust classification method needs to be implemented into these sensor systems to allow further identification of detected substances. Substance-specific information can be extracted from signals derived from biosensors harbouring one or multiple biological components. Here, a major task is the identification of substance-specific information among considerable amounts of biosensor data. For this purpose, several approaches make use of statistical methods or machine learning algorithms. Genetic Programming (GP), a heuristic machine learning technique, offers several advantages compared to other machine learning approaches and consequently may be a promising tool for biosensor data classification. In the present study, we have evaluated the use of GP for the classification of herbicides and herbicide classes (chemical classes) by analysis of substance-specific patterns derived from a whole-cell multi-species biosensor. We re-analysed data from a previously described array-based biosensor system employing diverse microalgae (Podola and Melkonian, 2005), aiming at the identification of five individual herbicides as well as two herbicide classes. GP analyses were performed using the commercially available GP software 'Discipulus', resulting in classifiers (computer programs) for the binary classification of each individual herbicide or herbicide class. GP-generated classifiers both for individual herbicides and herbicide classes were able to perform a statistically significant identification of herbicides or herbicide classes, respectively. The majority of classifiers were able to perform correct classifications (sensitivity) of about 80-95% of test data sets, whereas the false positive rate (1 - specificity) was lower than 20% for most classifiers. Results suggest that a higher number of data sets may lead to a better classification performance. In the present paper, GP-based classification was combined with a biosensor for the first time. Our results demonstrate that GP was able to identify substance-specific information within complex biosensor response patterns and furthermore use this information for successful toxicant classification in unknown samples. This suggests further research to assess perspectives and limitations of this approach in the field of biosensors.
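The two figures quoted in this abstract, sensitivity and false positive rate, come straight from the confusion counts of a binary classifier. A minimal Python sketch of the standard definitions follows. The function name and the 0/1 label encoding are assumptions for illustration, and no data from the study is reproduced.

```python
def binary_rates(y_true, y_pred):
    """Sensitivity, false positive rate, and specificity for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        # Share of true positives that the classifier detects.
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        # Share of true negatives wrongly flagged as positive.
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        # Share of true negatives correctly rejected (1 - false positive rate).
        "specificity": tn / (fp + tn) if fp + tn else 0.0,
    }
```

Note that specificity is the complement of the false positive rate, so a false positive rate below 20% corresponds to a specificity above 80%.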
|
22
|
Liu YC, Cheng CP, Tseng VS. Discovering relational-based association rules with multiple minimum supports on microarray datasets. Bioinformatics 2011; 27:3142-8. [PMID: 21926125 DOI: 10.1093/bioinformatics/btr526] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.8] [Indexed: 11/12/2022]
Abstract
Motivation: Association rule analysis methods are important techniques applied to gene expression data for finding expression relationships between genes. However, previous methods implicitly assume that all genes have similar importance, or they ignore the individual importance of each gene. The relation intensity between any two items has never been taken into consideration. Therefore, we proposed a technique named REMMAR (RElational-based Multiple Minimum supports Association Rules) to tackle this problem. This method adjusts the minimum relation support (MRS) for each gene pair depending on the regulatory relation intensity to discover more important association rules with stronger biological meaning. Results: In the case study of this research, REMMAR utilized the shortest distance between any two genes in the Saccharomyces cerevisiae gene regulatory network (GRN) as the relation intensity to discover the association rules from two S. cerevisiae gene expression datasets. Under experimental evaluation, REMMAR can generate more rules with stronger relation intensity, and filter out rules without biological meaning in the protein-protein interaction network (PPIN). Furthermore, the proposed method has a higher precision (100%) than the reference Apriori method (87.5%) for the discovered rules, as verified by a literature survey. Therefore, the proposed REMMAR algorithm can discover stronger association rules in biological relationships that traditional methods fail to reveal, assisting biologists in complicated genetic exploration.
Affiliation(s)
- Yu-Cheng Liu
- Department of Computer Science and Information Engineering and Institute of Medical Informatics, National Cheng Kung University, Taiwan
|
23
|
Lê Cao KA, Boitard S, Besse P. Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinformatics 2011; 12:253. [PMID: 21693065 PMCID: PMC3133555 DOI: 10.1186/1471-2105-12-253] [Citation(s) in RCA: 562] [Impact Index Per Article: 43.2] [Received: 11/03/2010] [Accepted: 06/22/2011] [Indexed: 11/24/2022] Open
Abstract
Background: Variable selection on high throughput biological data, such as gene expression or single nucleotide polymorphisms (SNPs), becomes inevitable to select relevant information and, therefore, to better characterize diseases or assess genetic structure. There are different ways to perform variable selection in large data sets. Statistical tests are commonly used to identify differentially expressed features for explanatory purposes, whereas Machine Learning wrapper approaches can be used for predictive purposes. In the case of multiple highly correlated variables, another option is to use multivariate exploratory approaches to give more insight into cell biology, biological pathways or complex traits. Results: A simple extension of a sparse PLS exploratory approach is proposed to perform variable selection in a multiclass classification framework. Conclusions: sPLS-DA has a classification performance similar to other wrapper or sparse discriminant analysis approaches on public microarray and SNP data sets. More importantly, sPLS-DA is clearly competitive in terms of computational efficiency and superior in terms of interpretability of the results via valuable graphical outputs. sPLS-DA is available in the R package mixOmics, which is dedicated to the analysis of large biological data sets.
Affiliation(s)
- Kim-Anh Lê Cao
- Queensland Facility for Advanced Bioinformatics, University of Queensland, 4072 St Lucia, QLD, Australia.
|
24
|
Tapia E, Ornella L, Bulacio P, Angelone L. Multiclass classification of microarray data samples with a reduced number of genes. BMC Bioinformatics 2011; 12:59. [PMID: 21342522 PMCID: PMC3056725 DOI: 10.1186/1471-2105-12-59] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 07/21/2010] [Accepted: 02/22/2011] [Indexed: 01/05/2023] Open
Abstract
Background: Multiclass classification of microarray data samples with a reduced number of genes is a rich and challenging problem in Bioinformatics research. The problem gets harder as the number of classes is increased. In addition, the performance of most classifiers is tightly linked to the effectiveness of mandatory gene selection methods. Critical to gene selection is the availability of estimates about the maximum number of genes that can be handled by any classification algorithm. Lack of such estimates may lead to either computationally demanding explorations of a search space with thousands of dimensions or classification models based on gene sets of unrestricted size. In the former case, unbiased but possibly overfitted classification models may arise. In the latter case, biased classification models unable to support statistically significant findings may be obtained. Results: A novel bound on the maximum number of genes that can be handled by binary classifiers in binary mediated multiclass classification algorithms of microarray data samples is presented. The bound suggests that high-dimensional binary output domains might favor the existence of accurate and sparse binary mediated multiclass classifiers for microarray data samples. Conclusions: A comprehensive experimental work shows that the bound is indeed useful to induce accurate and sparse multiclass classifiers for microarray data samples.
Affiliation(s)
- Elizabeth Tapia
- CIFASIS-Conicet Institute, Bv. 27 de Febrero 210 Bis, Rosario, Argentina.
|
25
|
Liu GY, Liu KH, Zhang Y, Wang YZ, Wu XH, Lu YZ, Pan C, Yin P, Liao HF, Su JQ, Ge Q, Luo Q, Xiong B. Alterations of tumor-related genes do not exactly match the histopathological grade in gastric adenocarcinomas. World J Gastroenterol 2010; 16:1129-37. [PMID: 20205286 PMCID: PMC2835792 DOI: 10.3748/wjg.v16.i9.1129] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 02/06/2023] Open
Abstract
AIM: To investigate the diverse characteristics of different pathological gradings of gastric adenocarcinoma (GA) using tumor-related genes.
METHODS: GA tissues of different pathological gradings and normal tissues were subjected to tissue arrays. Expression of 15 major tumor-related genes was detected within the tissue array by RNA in situ hybridization using 3’ terminal digoxin-labeled anti-sense single stranded oligonucleotide and locked nucleic acid modifying probes. The data obtained were processed by support vector machines with four different feature selection methods to discover the respective critical genes or gene subsets contributing to the GA activities of different pathological gradings.
RESULTS: In the comparison of poorly differentiated GA with normal tissues, the tumor-related gene TP53 plays a key role, although six other tumor-related genes could each independently achieve an area under the receiver operating characteristic curve (AUC) of more than 80%. Comparing well differentiated GA with normal tissues, we found that 11 tumor-related genes could each independently achieve an AUC of more than 80%, but only the gene subset TP53, RB and PTEN plays a key role. Only the gene subset Bcl10, UVRAG, APC, Beclin1, NM23, PTEN and RB could distinguish between poorly differentiated and well differentiated GA. No single gene could achieve a valid distinction.
CONCLUSION: Contrary to the traditional view, well differentiated cancer tissues have more alterations of important tumor-related genes than poorly differentiated cancer tissues.
|
26
|
Peng S, Zeng X, Li X, Peng X, Chen L. Multi-class cancer classification through gene expression profiles: microRNA versus mRNA. J Genet Genomics 2009; 36:409-16. [PMID: 19631915 DOI: 10.1016/s1673-8527(08)60130-7] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Received: 01/23/2009] [Revised: 05/03/2009] [Accepted: 05/06/2009] [Indexed: 01/08/2023]
Abstract
Both microRNA (miRNA) and mRNA expression profiles are important means of cancer type classification. A comparative study of their classification performance will be helpful in choosing between them. Here we evaluated the classification performance of miRNA and mRNA profiles using a new data mining approach based on a novel SVM (Support Vector Machines) based recursive feature elimination (nRFE) algorithm. Computational experiments showed that the information encoded in miRNAs is not sufficient to classify cancers; that gut-derived samples cluster more accurately with mRNA expression profiles than with miRNA profiles; and that poorly differentiated tumors (PDT) could be classified with an accuracy of 100% using mRNA expression profiles versus 93.8% using miRNA profiles. Furthermore, we showed that mRNA expression profiles have a higher capacity than miRNA profiles for classifying normal tissues. We concluded that the classification performance of mRNA profiles is superior to that of miRNA profiles in multiple-class cancer classification.
Affiliation(s)
- Sihua Peng
- Department of Pathology, School of Medicine, Zhejiang University, Hangzhou 310058, China
|
27
|
Turner SD, Crawford DC, Ritchie MD. Methods for optimizing statistical analyses in pharmacogenomics research. Expert Rev Clin Pharmacol 2009; 2:559-570. [PMID: 20221410 PMCID: PMC2835152 DOI: 10.1586/ecp.09.32] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 01/10/2023]
Abstract
Pharmacogenomics is a rapidly developing sector of human genetics research with arguably the highest potential for immediate benefit. There is a considerable body of evidence demonstrating that variability in drug-treatment response can be explained in part by genetic variation. Subsequently, much research has ensued and is ongoing to identify genetic variants associated with drug-response phenotypes. To reap the full benefits of the data we collect we must give careful consideration to the study population under investigation, the phenotype being examined and the statistical methodology used in data analysis. Here, we discuss principles of study design and optimizing statistical methods for pharmacogenomic studies when the outcome of interest is a continuous measure. We review traditional hypothesis testing procedures, as well as novel approaches that may be capable of accounting for more variance in a quantitative pharmacogenomic trait. We give examples of studies that have employed the analytical methodologies discussed here, as well as resources for acquiring software to run the analyses.
Affiliation(s)
- Stephen D Turner
- Center for Human Genetics Research, Department of Molecular Physiology and Biophysics, Vanderbilt University, Nashville TN, 37232, USA, Tel.: +1 615 343 6549, Fax: +1 615 322 6974
- Dana C Crawford
- Center for Human Genetics Research, Assistant Professor, Department of Molecular Physiology and Biophysics, Vanderbilt University, Nashville TN, 37232, USA, Tel.: +1 615 343 7852, Fax: +1 615 322 6974
- Marylyn D Ritchie
- Center for Human Genetics Research, Department of Molecular Physiology and Biophysics, Vanderbilt University, Nashville TN, 37232, USA, Tel.: +1 615 343 5851, Fax: +1 615 322 6974
|
28
|
Liu KH, Li B, Wu QQ, Zhang J, Du JX, Liu GY. Microarray data classification based on ensemble independent component selection. Comput Biol Med 2009; 39:953-60. [PMID: 19716554 DOI: 10.1016/j.compbiomed.2009.07.006] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Received: 02/19/2008] [Revised: 01/06/2009] [Accepted: 07/14/2009] [Indexed: 11/26/2022]
Abstract
Independent component analysis (ICA) has been widely applied to the analysis of microarray datasets. Although it has been pointed out that, after ICA transformation, different independent components (ICs) have different biological significance, the IC selection problem is still far from fully explored. In this paper, we propose a genetic algorithm (GA) based ensemble independent component selection (EICS) system. In this system, GA is applied to select a set of optimal IC subsets, which are then used to build diverse and accurate base classifiers. Finally, all base classifiers are combined using the majority vote rule. To show the validity of the proposed method, we apply it to classify three DNA microarray data sets involving various human normal and tumor tissue samples. The experimental results show that our ensemble method obtains stable and satisfactory classification results compared with several existing methods.
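The final combination step mentioned in the abstract, a majority vote over base classifiers, can be sketched as follows. This is only an illustration of plain majority voting, not the EICS system: the function names, the toy predictions, and the tie-breaking rule (the label that appears first among the votes wins) are assumptions, and the GA-based IC subset selection is not modeled.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent label among the base-classifier votes for one sample.

    Ties go to the label that appears first in `predictions`
    (an arbitrary choice made for this sketch).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label

def ensemble_predict(per_classifier_predictions):
    """Combine predictions given as one list per base classifier, aligned by sample."""
    return [majority_vote(list(votes)) for votes in zip(*per_classifier_predictions)]

# Toy predictions from three hypothetical base classifiers on three samples.
votes = [
    ["tumor", "normal", "tumor"],   # base classifier 1
    ["tumor", "tumor", "normal"],   # base classifier 2
    ["normal", "normal", "tumor"],  # base classifier 3
]
combined = ensemble_predict(votes)
print(combined)  # ['tumor', 'normal', 'tumor']
```

Using an odd number of base classifiers avoids ties in two-class problems, which keeps the tie-breaking assumption from affecting the result.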
Affiliation(s)
- Kun-Hong Liu
- Software School of Xiamen University, Xiamen, Fujian, 361005, China.
|