1
Srivastava LK, Ehrlicher AJ. Sensing the squeeze: nuclear mechanotransduction in health and disease. Nucleus 2024; 15:2374854. [PMID: 38951951 PMCID: PMC11221475 DOI: 10.1080/19491034.2024.2374854] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2024] [Accepted: 06/26/2024] [Indexed: 07/03/2024]
Abstract
The nucleus is not only a repository for DNA but also a center of cellular and nuclear mechanotransduction. From nuclear deformation to the interplay between mechanosensing components and genetic control, the nucleus is poised at the nexus of mechanical forces and cellular function. Understanding the stresses acting on the nucleus, its mechanical properties, and their effects on gene expression is therefore crucial to appreciating its mechanosensitive function. In this review, we examine many elements of nuclear mechanotransduction and discuss their repercussions for cell health and disease states. By describing the processes that underlie nuclear mechanosensation and analyzing its effects on gene regulation, the review endeavors to open new avenues for studying nuclear mechanics in physiology and disease.
Affiliation(s)
- Allen J. Ehrlicher
- Department of Bioengineering, McGill University, Montreal, Canada
- Department of Biomedical Engineering, McGill University, Montreal, Canada
- Department of Anatomy and Cell Biology, McGill University, Montreal, Canada
- Centre for Structural Biology, McGill University, Montreal, Canada
- Department of Mechanical Engineering, McGill University, Montreal, Canada
- Rosalind and Morris Goodman Cancer Institute, McGill University, Montreal, Canada
2
Kuzudisli C, Bakir-Gungor B, Bulut N, Qaqish B, Yousef M. Review of feature selection approaches based on grouping of features. PeerJ 2023; 11:e15666. [PMID: 37483989 PMCID: PMC10358338 DOI: 10.7717/peerj.15666] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2022] [Accepted: 06/08/2023] [Indexed: 07/25/2023]
Abstract
With the rapid development in technology, large amounts of high-dimensional data have been generated. This high dimensionality including redundancy and irrelevancy poses a great challenge in data analysis and decision making. Feature selection (FS) is an effective way to reduce dimensionality by eliminating redundant and irrelevant data. Most traditional FS approaches score and rank each feature individually; and then perform FS either by eliminating lower ranked features or by retaining highly-ranked features. In this review, we discuss an emerging approach to FS that is based on initially grouping features, then scoring groups of features rather than scoring individual features. Despite the presence of reviews on clustering and FS algorithms, to the best of our knowledge, this is the first review focusing on FS techniques based on grouping. The typical idea behind FS through grouping is to generate groups of similar features with dissimilarity between groups, then select representative features from each cluster. Approaches under supervised, unsupervised, semi supervised and integrative frameworks are explored. The comparison of experimental results indicates the effectiveness of sequential, optimization-based (i.e., fuzzy or evolutionary), hybrid and multi-method approaches. When it comes to biological data, the involvement of external biological sources can improve analysis results. We hope this work's findings can guide effective design of new FS approaches using feature grouping.
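The idea of scoring groups of features rather than individual features can be sketched as follows. This is a minimal illustration under stated assumptions, not a method taken from the review: the function name `score_groups`, the predefined groups, and the scoring rule (absolute difference between class means of the group-averaged feature) are all hypothetical choices made for the example.

```python
from statistics import mean


def score_groups(X, y, groups):
    """Rank predefined feature groups by a simple class-separation score.

    X: list of samples, each a list of feature values
    y: list of 0/1 class labels, one per sample
    groups: dict mapping a group name to a list of feature indices
    """
    scores = {}
    for name, idx in groups.items():
        # Collapse the group into one value per sample (mean of its features).
        summary = [mean(row[j] for j in idx) for row in X]
        pos = [s for s, label in zip(summary, y) if label == 1]
        neg = [s for s, label in zip(summary, y) if label == 0]
        # Assumed score: absolute difference between the two class means.
        scores[name] = abs(mean(pos) - mean(neg))
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


# Toy data: group "A" separates the classes, group "B" does not.
X = [[1, 2, 5, 5], [1, 1, 6, 4], [4, 5, 5, 5], [5, 4, 6, 4]]
y = [0, 0, 1, 1]
groups = {"A": [0, 1], "B": [2, 3]}
ranking, scores = score_groups(X, y, groups)
```

In practice the groups would come from a clustering step or from external knowledge, and the score would usually be a cross-validated classifier metric rather than a difference of means.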
Affiliation(s)
- Cihan Kuzudisli
- Department of Computer Engineering, Hasan Kalyoncu University, Gaziantep, Turkey
- Department of Electrical and Computer Engineering, Abdullah Gul University, Kayseri, Turkey
- Burcu Bakir-Gungor
- Department of Computer Engineering, Abdullah Gul University, Kayseri, Turkey
- Nurten Bulut
- Department of Computer Engineering, Abdullah Gul University, Kayseri, Turkey
- Bahjat Qaqish
- Department of Biostatistics, University of North Carolina at Chapel Hill, North Carolina, Chapel Hill, United States of America
- Malik Yousef
- Department of Information Systems, Zefat Academic College, Zefat, Israel
- Galilee Digital Health Research Center, Zefat Academic College, Zefat, Israel
3
Naik D, Dharavath R, Qi L. Quantum-PSO based unsupervised clustering of users in social networks using attributes. Cluster Computing 2023:1-19. [PMID: 37359059 PMCID: PMC10099026 DOI: 10.1007/s10586-023-03993-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2023] [Revised: 02/22/2023] [Accepted: 03/18/2023] [Indexed: 06/28/2023]
Abstract
Unsupervised cluster detection in social network analysis involves grouping social actors into distinct groups, each distinct from the others. Users in the clusters are semantically very similar to those in the same cluster and dissimilar to those in different clusters. Social network clustering reveals a wide range of useful information about users and has many applications in daily life. Various approaches are developed to find social network users' clusters, using only links or attributes and links. This work proposes a method for detecting social network users' clusters based solely on their attributes. In this case, users' attributes are considered categorical values. The most popular clustering algorithm used for categorical data is the K-mode algorithm. However, it may suffer from local optimum due to its random initialization of centroids. To overcome this issue, this manuscript proposes a methodology named the Quantum PSO approach based on user similarity maximization. In the proposed approach, firstly, dimensionality reduction is conducted by performing the relevant attribute set selection followed by redundant attribute removal. Secondly, the QPSO technique is used to maximize the similarity score between users to get clusters. Three different similarity measures are used separately to perform the dimensionality reduction and similarity maximization processes. Experiments are conducted on two popular social network datasets; ego-Twitter, and ego-Facebook. The results show that the proposed approach performs better clustering results in terms of three different performance metrics than K-Mode and K-Mean algorithms.
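For context, the K-mode baseline mentioned in the abstract can be sketched as below. This is a minimal textbook-style version, assuming Hamming distance between categorical records and a per-attribute majority vote for the mode update; it is not the authors' QPSO method. The function names and toy records are hypothetical. The initial modes are passed in explicitly to make visible that the result depends on initialization.

```python
from collections import Counter


def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def k_modes(records, initial_modes, max_iter=10):
    """Basic k-modes loop: assign to nearest mode, then recompute modes."""
    modes = [tuple(m) for m in initial_modes]
    clusters = [[] for _ in modes]
    for _ in range(max_iter):
        # Assignment step: each record joins the cluster of its nearest mode.
        clusters = [[] for _ in modes]
        for r in records:
            nearest = min(range(len(modes)), key=lambda i: hamming(r, modes[i]))
            clusters[nearest].append(r)
        # Update step: the new mode takes the most frequent value per attribute.
        new_modes = []
        for old, members in zip(modes, clusters):
            if not members:
                new_modes.append(old)  # keep the old mode for an empty cluster
                continue
            new_modes.append(
                tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))
            )
        if new_modes == modes:
            break  # converged: modes no longer change
        modes = new_modes
    return modes, clusters


records = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
    ("b", "z", "r"), ("b", "z", "q"), ("c", "z", "r"),
]
modes, clusters = k_modes(records, [records[0], records[3]])
```

Starting from a different pair of initial modes can lead to a different partition, which is the local-optimum issue the abstract refers to.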
Affiliation(s)
- Lianyong Qi
- China University of Petroleum (East China), Dongying, China
4
The ability to classify patients based on gene-expression data varies by algorithm and performance metric. PLoS Comput Biol 2022; 18:e1009926. [PMID: 35275931 PMCID: PMC8942277 DOI: 10.1371/journal.pcbi.1009926] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/10/2021] [Revised: 03/23/2022] [Accepted: 02/15/2022] [Indexed: 01/02/2023]
Abstract
By classifying patients into subgroups, clinicians can provide more effective care than using a uniform approach for all patients. Such subgroups might include patients with a particular disease subtype, patients with a good (or poor) prognosis, or patients most (or least) likely to respond to a particular therapy. Transcriptomic measurements reflect the downstream effects of genomic and epigenomic variations. However, high-throughput technologies generate thousands of measurements per patient, and complex dependencies exist among genes, so it may be infeasible to classify patients using traditional statistical models. Machine-learning classification algorithms can help with this problem. However, hundreds of classification algorithms exist (and most support diverse hyperparameters), so it is difficult for researchers to know which are optimal for gene-expression biomarkers. We performed a benchmark comparison, applying 52 classification algorithms to 50 gene-expression datasets (143 class variables). We evaluated algorithms that represent diverse machine-learning methodologies and have been implemented in general-purpose, open-source, machine-learning libraries. When available, we combined clinical predictors with gene-expression data. Additionally, we evaluated the effects of performing hyperparameter optimization and feature selection using nested cross validation. Kernel- and ensemble-based algorithms consistently outperformed other types of classification algorithms; however, even the top-performing algorithms performed poorly in some cases. Hyperparameter optimization and feature selection typically improved predictive performance, and univariate feature-selection algorithms typically outperformed more sophisticated methods. Together, our findings illustrate that algorithm performance varies considerably when other factors are held constant and thus that algorithm selection is a critical step in biomarker studies.
5
Zhang Y, Cheung YM. A New Distance Metric Exploiting Heterogeneous Interattribute Relationship for Ordinal-and-Nominal-Attribute Data Clustering. IEEE Transactions on Cybernetics 2022; 52:758-771. [PMID: 32340972 DOI: 10.1109/tcyb.2020.2983073] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 06/11/2023]
Abstract
Ordinal attribute has all the common characteristics of a nominal one but it differs from the nominal one by having naturally ordered possible values (also called categories interchangeably). In clustering analysis tasks, categorical data composed of both ordinal and nominal attributes (also called mixed-categorical data interchangeably) are common. Under this circumstance, existing distance and similarity measures suffer from at least one of the following two drawbacks: 1) directly treat ordinal attributes as nominal ones, and thus ignore the order information from them and 2) suppose all the attributes are independent of each other, measure the distance between two categories from a target attribute without considering the valuable information provided by the other attributes that correlate with the target one. These two drawbacks may twist the natural distances of attributes and further lead to unsatisfactory clustering results. This article, therefore, presents an entropy-based distance metric that quantifies the distance between categories by exploiting the information provided by different attributes that correlate with the target one. It also preserves the order relationship among ordinal categories during the distance measurement. Since attributes are usually correlated in different degrees, we also define the interdependence between different types of attributes to weight their contributions in forming distances. The proposed metric overcomes the two above-mentioned drawbacks for mixed-categorical data clustering. More important, it conceptually unifies the distances of ordinal and nominal attributes to avoid information loss during clustering. Moreover, it is parameter free, and will not bring extra computational cost compared to the existing state-of-the-art counterparts. Extensive experiments show the superiority of the proposed distance metric.
6
Bose S, Das C, Banerjee A, Ghosh K, Chattopadhyay M, Chattopadhyay S, Barik A. An ensemble machine learning model based on multiple filtering and supervised attribute clustering algorithm for classifying cancer samples. PeerJ Comput Sci 2021; 7:e671. [PMID: 34616883 PMCID: PMC8459790 DOI: 10.7717/peerj-cs.671] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/18/2020] [Accepted: 07/20/2021] [Indexed: 06/13/2023]
Abstract
BACKGROUND Machine learning is one kind of machine intelligence technique that learns from data and detects inherent patterns from large, complex datasets. Due to this capability, machine learning techniques are widely used in medical applications, especially where large-scale genomic and proteomic data are used. Cancer classification based on bio-molecular profiling data is a very important topic for medical applications since it improves the diagnostic accuracy of cancer and enables a successful culmination of cancer treatments. Hence, machine learning techniques are widely used in cancer detection and prognosis. METHODS In this article, a new ensemble machine learning classification model named Multiple Filtering and Supervised Attribute Clustering algorithm based Ensemble Classification model (MFSAC-EC) is proposed which can handle class imbalance problem and high dimensionality of microarray datasets. This model first generates a number of bootstrapped datasets from the original training data where the oversampling procedure is applied to handle the class imbalance problem. The proposed MFSAC method is then applied to each of these bootstrapped datasets to generate sub-datasets, each of which contains a subset of the most relevant/informative attributes of the original dataset. The MFSAC method is a feature selection technique combining multiple filters with a new supervised attribute clustering algorithm. Then for every sub-dataset, a base classifier is constructed separately, and finally, the predictive accuracy of these base classifiers is combined using the majority voting technique forming the MFSAC-based ensemble classifier. Also, a number of most informative attributes are selected as important features based on their frequency of occurrence in these sub-datasets. RESULTS To assess the performance of the proposed MFSAC-EC model, it is applied on different high-dimensional microarray gene expression datasets for cancer sample classification. 
The proposed model is compared with well-known existing models to establish its effectiveness with respect to other models. From the experimental results, it has been found that the generalization performance/testing accuracy of the proposed classifier is significantly better compared to other well-known existing models. Apart from that, it has been also found that the proposed model can identify many important attributes/biomarker genes.
Affiliation(s)
- Shilpi Bose
- Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, West Bengal, India
- Chandra Das
- Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, West Bengal, India
- Abhik Banerjee
- Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, West Bengal, India
- Kuntal Ghosh
- Machine Intelligence Unit & Center for Soft Computing Research, Indian Statistical Institute, Kolkata, West Bengal, India
- Samiran Chattopadhyay
- Department of Information Technology, Jadavpur University, Kolkata, West Bengal, India
- Aishwarya Barik
- Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, West Bengal, India
7
A Constrained Feature Selection Approach Based on Feature Clustering and Hypothesis Margin Maximization. Computational Intelligence and Neuroscience 2021. [DOI: 10.1155/2021/5554873] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/18/2022]
Abstract
In this paper, we propose a semisupervised feature selection approach that is based on feature clustering and hypothesis margin maximization. The aim is to improve the classification accuracy by choosing the right feature subset and to allow building more interpretable models. Our approach handles the two core aspects of feature selection, i.e., relevance and redundancy, and is divided into three steps. First, the similarity weights between features are represented by a sparse graph where each feature can be reconstructed from the sparse linear combination of the others. Second, features are then hierarchically clustered identifying groups of the most similar ones. Finally, a semisupervised margin-based objective function is optimized to select the most data discriminative feature from within each cluster, hence maximizing relevance while minimizing redundancy among features. Eventually, we empirically validate our proposed approach on multiple well-known UCI benchmark datasets in terms of classification accuracy and representation entropy, where it proved to outperform four other semisupervised and unsupervised methods and competed with two widely used supervised ones.
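The abstract does not spell out its margin-based objective. As a rough illustration only, the sketch below computes the standard supervised hypothesis margin of one sample under a 1-nearest-neighbour rule: half the difference between the distance to the nearest sample of another class and the distance to the nearest sample of the same class, measured on a candidate feature subset. The function name and toy data are hypothetical, and the paper's semisupervised objective is not reproduced here.

```python
from math import sqrt


def hypothesis_margin(X, y, i, features):
    """1-NN hypothesis margin of sample i using only the given feature indices."""

    def distance(a, b):
        # Euclidean distance restricted to the candidate feature subset.
        return sqrt(sum((a[j] - b[j]) ** 2 for j in features))

    # Distances to same-class samples (hits) and other-class samples (misses).
    hits = [distance(X[i], X[k]) for k in range(len(X)) if k != i and y[k] == y[i]]
    misses = [distance(X[i], X[k]) for k in range(len(X)) if y[k] != y[i]]
    return 0.5 * (min(misses) - min(hits))


# Toy data: feature 0 separates the classes, feature 1 is constant.
X = [[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [5.0, 0.0]]
y = [0, 0, 1, 1]
margin_informative = hypothesis_margin(X, y, 0, [0])
margin_constant = hypothesis_margin(X, y, 0, [1])
```

A feature that yields a larger margin keeps same-class samples close and other-class samples far, so a margin of this kind could serve to pick one discriminative feature per cluster.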
8
Cao F, Wu X, Yu L, Liang J. An outlier detection algorithm for categorical matrix-object data. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2021.107182] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/27/2022]
9
EKNN: Ensemble classifier incorporating connectivity and density into kNN with application to cancer diagnosis. Artif Intell Med 2020; 111:101985. [PMID: 33461685 DOI: 10.1016/j.artmed.2020.101985] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 12/14/2019] [Revised: 11/02/2020] [Accepted: 11/02/2020] [Indexed: 11/20/2022]
Abstract
In the microarray-based approach for automated cancer diagnosis, the application of the traditional k-nearest neighbors kNN algorithm suffers from several difficulties such as the large number of genes (high dimensionality of the feature space) with many irrelevant genes (noise) relative to the small number of available samples and the imbalance in the size of the samples of the target classes. This research provides an ensemble classifier based on decision models derived from kNN that is applicable to problems characterized by imbalanced small size datasets. The proposed classification method is an ensemble of the traditional kNN algorithm and four novel classification models derived from it. The proposed models exploit the increase in density and connectivity using K1-nearest neighbors table (KNN-table) created during the training phase. In the density model, an unseen sample u is classified as belonging to a class t if it achieves the highest increase in density when this sample is added to it i.e. the unseen sample can replace more neighbors in the KNN-table for samples of class t than other classes. In the other three connectivity models, the mean and standard deviation of the distribution of the average, minimum as well the maximum distance to the K neighbors of the members of each class are computed in the training phase. The class t to which u achieves the highest possibility of belongness to its distribution is chosen, i.e. the addition of u to the samples of this class produces the least change to the distribution of the corresponding decision model for class t. Combining the predicted results of the four individual models along with traditional kNN makes the decision space more discriminative. With the help of the KNN-table which can be updated online in the training phase, an improved performance has been achieved compared to the traditional kNN algorithm with slight increase in classification time. 
The proposed ensemble method achieves significant increase in accuracy compared to the accuracy achieved using any of its base classifiers on Kentridge, GDS3257, Notterman, Leukemia and CNS datasets. The method is also compared to several existing ensemble methods and state of the art techniques using different dimensionality reduction techniques on several standard datasets. The results prove clear superiority of EKNN over several individual and ensemble classifiers regardless of the choice of the gene selection strategy.
10
Peralta D, Saeys Y. Robust unsupervised dimensionality reduction based on feature clustering for single-cell imaging data. Appl Soft Comput 2020. [DOI: 10.1016/j.asoc.2020.106421] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 01/18/2023]
11
Zhang Y, Cheung YM, Tan KC. A Unified Entropy-Based Distance Metric for Ordinal-and-Nominal-Attribute Data Clustering. IEEE Transactions on Neural Networks and Learning Systems 2020; 31:39-52. [PMID: 30908240 DOI: 10.1109/tnnls.2019.2899381] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 06/09/2023]
Abstract
Ordinal data are common in many data mining and machine learning tasks. Compared to nominal data, the possible values (also called categories interchangeably) of an ordinal attribute are naturally ordered. Nevertheless, since the data values are not quantitative, the distance between two categories of an ordinal attribute is generally not well defined, which surely has a serious impact on the result of the quantitative analysis if an inappropriate distance metric is utilized. From the practical perspective, ordinal-and-nominal-attribute categorical data, i.e., categorical data associated with a mixture of nominal and ordinal attributes, is common, but the distance metric for such data has yet to be well explored in the literature. In this paper, within the framework of clustering analysis, we therefore first propose an entropy-based distance metric for ordinal attributes, which exploits the underlying order information among categories of an ordinal attribute for the distance measurement. Then, we generalize this distance metric and propose a unified one accordingly, which is applicable to ordinal-and-nominal-attribute categorical data. Compared with the existing metrics proposed for categorical data, the proposed metric is simple to use and nonparametric. More importantly, it reasonably exploits the underlying order information of ordinal attributes and statistical information of nominal attributes for distance measurement. Extensive experiments show that the proposed metric outperforms the existing counterparts on both the real and benchmark data sets.
12
Oladipupo O, Olugbara O. Evaluation of data analytics based clustering algorithms for knowledge mining in a student engagement data. Intell Data Anal 2019. [DOI: 10.3233/ida-184254] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/15/2022]
13
Mahfouz MA, Nepomuceno JA. Graph coloring for extracting discriminative genes in cancer data. Ann Hum Genet 2019; 83:141-159. [PMID: 30644085 DOI: 10.1111/ahg.12297] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2018] [Revised: 10/12/2018] [Accepted: 11/15/2018] [Indexed: 11/29/2022]
Abstract
BACKGROUND AND OBJECTIVE The major difficulty of the analysis of the input gene expression data in a microarray-based approach for an automated diagnosis of cancer is the large number of genes (high dimensionality) with many irrelevant genes (noise) compared to the very small number of samples. This research study tackles the dimensionality reduction challenge in this area. METHODS This research study introduces a dimension-reduction technique termed graph coloring approach (GCA) for microarray data-based cancer classification based on analyzing the absolute correlation between gene-gene pairs and partitioning genes into several hubs using graph coloring. GCA starts by a gene-selection step in which top relevant genes are selected using a biserial correlation. Each time, a gene from an ordered list of top relevant genes is selected as the hub gene (representative) and redundant genes are added to its group; the process is repeated recursively for the remaining genes. A gene is considered redundant if its absolute correlation with the hub gene is greater than a controlling threshold. A suitable range for the threshold is estimated by computing a percentage graph for the absolute correlation between gene-gene pairs. Each value in the estimated range for the threshold can efficiently produce a new feature subset. RESULTS GCA achieved significant improvement over several existing techniques in terms of higher accuracy and a smaller number of features. Also, genes selected by this method are relevant genes according to the information stored in scientific repositories. CONCLUSIONS The proposed dimension-reduction technique can help biologists accurately predict cancer in several areas of the body.
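The hub-and-redundancy grouping described in the METHODS part of the abstract can be sketched as follows. This is a simplified reading, not the authors' implementation: relevance is assumed to be the absolute Pearson correlation with a 0/1 class label (the point-biserial case), the threshold is supplied by hand rather than estimated from a percentage graph, and the function names and toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def hub_grouping(genes, labels, top_k, threshold):
    """Greedy grouping: each hub gene absorbs the genes highly correlated with it.

    genes: dict mapping gene name to a list of expression values
    labels: list of 0/1 class labels, one per sample
    """
    # Gene-selection step: keep the top_k genes most correlated with the label.
    relevance = {g: abs(pearson(v, labels)) for g, v in genes.items()}
    remaining = sorted(genes, key=lambda g: relevance[g], reverse=True)[:top_k]
    groups = {}
    while remaining:
        hub = remaining.pop(0)  # most relevant gene not yet grouped
        redundant = [
            g for g in remaining
            if abs(pearson(genes[hub], genes[g])) > threshold
        ]
        groups[hub] = redundant
        remaining = [g for g in remaining if g not in redundant]
    return groups


labels = [0, 0, 0, 1, 1, 1]
genes = {
    "g1": [1, 1, 1, 5, 5, 5],  # perfectly follows the label
    "g2": [1, 2, 1, 5, 6, 5],  # strongly correlated with g1
    "g3": [3, 1, 2, 1, 3, 2],  # uncorrelated with the label
    "g4": [4, 6, 5, 2, 1, 3],  # relevant, but below the threshold with g1
}
groups = hub_grouping(genes, labels, top_k=3, threshold=0.9)
```

The keys of `groups` are the hub genes that would form the reduced feature subset; varying `threshold` yields different subsets, as the abstract notes.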
Affiliation(s)
- Mohamed A Mahfouz
- Department of Computer and Systems Engineering, Faculty of Engineering, Alexandria University, Alexandria, Egypt
- Juan A Nepomuceno
- Departamento de Lenguajes y Sistemas Informáticos, Higher Technical School of Computer Engineering, University of Seville, Seville, Spain
14
15
Feng S, Xu J, Xu T. An efficient gene selection technique based on Self-organizing Map and Particle Swarm Optimization. Journal of Intelligent & Fuzzy Systems 2017. [DOI: 10.3233/jifs-161887] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/15/2022]
Affiliation(s)
- Sen Feng
- College of Computer and Information Engineering, Henan Normal University, Xinxiang, P.R. China
- Engineering and Technology Research Center for Computational Intelligence and Data Mining of Universities of Henan Province, Xinxiang, P.R. China
- Jiucheng Xu
- College of Computer and Information Engineering, Henan Normal University, Xinxiang, P.R. China
- Engineering and Technology Research Center for Computational Intelligence and Data Mining of Universities of Henan Province, Xinxiang, P.R. China
- Tianhe Xu
- College of Computer and Information Engineering, Henan Normal University, Xinxiang, P.R. China
- Engineering and Technology Research Center for Computational Intelligence and Data Mining of Universities of Henan Province, Xinxiang, P.R. China
16
Lin HY. Gene discretization based on EM clustering and adaptive sequential forward gene selection for molecular classification. Appl Soft Comput 2016. [DOI: 10.1016/j.asoc.2016.07.015] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 10/21/2022]
17
Abstract
BACKGROUND Development of biologically relevant models from gene expression data notably, microarray data has become a topic of great interest in the field of bioinformatics and clinical genetics and oncology. Only a small number of gene expression data compared to the total number of genes explored possess a significant correlation with a certain phenotype. Gene selection enables researchers to obtain substantial insight into the genetic nature of the disease and the mechanisms responsible for it. Besides improvement of the performance of cancer classification, it can also cut down the time and cost of medical diagnoses. METHODS This study presents a modified Artificial Bee Colony Algorithm (ABC) to select minimum number of genes that are deemed to be significant for cancer along with improvement of predictive accuracy. The search equation of ABC is believed to be good at exploration but poor at exploitation. To overcome this limitation we have modified the ABC algorithm by incorporating the concept of pheromones which is one of the major components of Ant Colony Optimization (ACO) algorithm and a new operation in which successive bees communicate to share their findings. RESULTS The proposed algorithm is evaluated using a suite of ten publicly available datasets after the parameters are tuned scientifically with one of the datasets. Obtained results are compared to other works that used the same datasets. The performance of the proposed method is proved to be superior. CONCLUSION The method presented in this paper can provide subset of genes leading to more accurate classification results while the number of selected genes is smaller. Additionally, the proposed modified Artificial Bee Colony Algorithm could conceivably be applied to problems in other areas as well.
Affiliation(s)
- Rameen Shakur
- Wellcome Trust - Medical Research Council Cambridge Stem Cell Institute, University of Cambridge, Cambridge, UK
- Mohammad Kaykobad
- A ℓEDA Group, Department of CSE, BUET, Dhaka-1205, Dhaka, Bangladesh
18
Jiang S, Wang L. A clustering-based feature selection via feature separability. Journal of Intelligent & Fuzzy Systems 2016. [DOI: 10.3233/jifs-169022] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/15/2022]
Affiliation(s)
- Shengyi Jiang
- School of Informatics, Guangdong University of Foreign Studies, Guangzhou, China
- Laboratory of Language Engineering and Computing, Guangzhou, China
- Lianxi Wang
- School of Information Management, Sun Yat-Sen University, Guangzhou, China
19
Jia H, Cheung YM, Liu J. A New Distance Metric for Unsupervised Learning of Categorical Data. IEEE Transactions on Neural Networks and Learning Systems 2016; 27:1065-79. [PMID: 26068881 DOI: 10.1109/tnnls.2015.2436432] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Indexed: 05/20/2023]
Abstract
Distance metric is the basis of many learning algorithms, and its effectiveness usually has a significant influence on the learning results. In general, measuring distance for numerical data is a tractable task, but it could be a nontrivial problem for categorical data sets. This paper, therefore, presents a new distance metric for categorical data based on the characteristics of categorical values. In particular, the distance between two values from one attribute measured by this metric is determined by both the frequency probabilities of these two values and the values of other attributes that have high interdependence with the calculated one. Dynamic attribute weight is further designed to adjust the contribution of each attribute-distance to the distance between the whole data objects. Promising experimental results on different real data sets have shown the effectiveness of the proposed distance metric.
20
Kamkar I, Gupta SK, Phung D, Venkatesh S. Stabilizing l1-norm prediction models by supervised feature grouping. J Biomed Inform 2015; 59:149-68. [PMID: 26689771 DOI: 10.1016/j.jbi.2015.11.012] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 05/14/2015] [Revised: 11/18/2015] [Accepted: 11/23/2015] [Indexed: 01/05/2023]
Abstract
Emerging Electronic Medical Records (EMRs) have reformed modern healthcare. These records have great potential to be used for building clinical prediction models. However, a problem in using them is their high dimensionality. Since a lot of information may not be relevant for prediction, the underlying complexity of the prediction models may not be high. A popular way to deal with this problem is to employ feature selection. Lasso and l1-norm based feature selection methods have shown promising results. However, in the presence of correlated features, these methods select features that change considerably with small changes in data. This prevents clinicians from obtaining a stable feature set, which is crucial for clinical decision making. Grouping correlated variables together can improve the stability of feature selection; however, such grouping is usually not known and needs to be estimated for optimal performance. Addressing this problem, we propose a new model that can simultaneously learn the grouping of correlated features and perform stable feature selection. We formulate the model as a constrained optimization problem and provide an efficient solution with guaranteed convergence. Our experiments with both synthetic and real-world datasets show that the proposed model is significantly more stable than Lasso and many existing state-of-the-art shrinkage and classification methods. We further show that in terms of prediction performance, the proposed method consistently outperforms Lasso and other baselines. Our model can be used for selecting stable risk factors for a variety of healthcare problems, so it can assist clinicians toward accurate decision making.
Affiliation(s)
- Iman Kamkar
- Centre for Pattern Recognition and Data Analytics, Deakin University, Australia.
| | - Sunil Kumar Gupta
- Centre for Pattern Recognition and Data Analytics, Deakin University, Australia.
| | - Dinh Phung
- Centre for Pattern Recognition and Data Analytics, Deakin University, Australia.
| | - Svetha Venkatesh
- Centre for Pattern Recognition and Data Analytics, Deakin University, Australia.
|
21
|
Meng J, Li R, Luan Y. Classification by integrating plant stress response gene expression data with biological knowledge. Math Biosci 2015; 266:65-72. [PMID: 26092610 DOI: 10.1016/j.mbs.2015.06.005] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 03/24/2015] [Revised: 05/03/2015] [Accepted: 06/05/2015] [Indexed: 12/01/2022]
Abstract
Classification of microarray data has always been a challenging task because of the enormous number of genes. In this study, a clustering method that integrates plant stress response gene expression data with biological knowledge is presented. Clustering is one of the promising tools for attribute reduction, but gene clusters are biologically uninformative. So we integrated biological knowledge into genomic analysis to help improve the interpretation of the results. Biological similarity based on gene ontology (GO) semantic similarity was combined with gene expression data to find biologically meaningful clusters. The affinity propagation clustering algorithm was chosen to analyze the impact of the biological similarity on the results. Based on the clustering result, a neighborhood rough set was used to select representative genes for each cluster. The prediction accuracy of classifiers built on reduced gene subsets indicated that our approach outperformed other classical methods. The information fusion was proven to be effective through quantitative analysis, as it could select gene subsets with high biological significance and select significant genes.
Affiliation(s)
- Jun Meng
- School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116023, China.
| | - Rui Li
- School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116023, China.
| | - Yushi Luan
- School of Life Science and Biotechnology, Dalian University of Technology, Dalian, Liaoning 116023, China.
|
22
|
Xie H, Li Q, Mao X, Li X, Cai Y, Rao Y. Community-aware user profile enrichment in folksonomy. Neural Netw 2014; 58:111-21. [PMID: 24907893 DOI: 10.1016/j.neunet.2014.05.009] [Citation(s) in RCA: 61] [Impact Index Per Article: 6.1] [Received: 08/30/2013] [Revised: 05/04/2014] [Accepted: 05/17/2014] [Indexed: 11/29/2022]
Abstract
In the era of big data, collaborative tagging (a.k.a. folksonomy) systems have proliferated as a consequence of the growth of Web 2.0 communities. Constructing user profiles from folksonomy systems is useful for many applications such as personalized search and recommender systems. The identification of latent user communities is one way to better understand and meet user needs. The behavior of users is highly influenced by the behavior of their neighbors or community members, and this can be utilized in constructing user profiles. However, conventional user profiling techniques often encounter data sparsity problems as data from a single user is insufficient to build a powerful profile. Hence, in this paper we propose a method of enriching user profiles based on latent user communities in folksonomy data. Specifically, the proposed approach contains four sub-processes: (i) tag-based user profiles are extracted from a folksonomy tripartite graph; (ii) a multi-faceted folksonomy graph is constructed by integrating tag and image affinity subgraphs with the folksonomy tripartite graph; (iii) random walk distance is used to unify various relationships and measure user similarities; (iv) a novel prototype-based clustering method based on user similarities is used to identify user communities, which are further used to enrich the extracted user profiles. To evaluate the proposed method, we conducted experiments using a public dataset, the results of which show that our approach outperforms previous ones in user profile enrichment.
Affiliation(s)
- Haoran Xie
- Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong Special Administrative Region
| | - Qing Li
- Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong Special Administrative Region; Multimedia Software Engineering Research Centre, City University of Hong Kong, Kowloon, Hong Kong Special Administrative Region
| | - Xudong Mao
- Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong Special Administrative Region
| | - Xiaodong Li
- Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong Special Administrative Region
| | - Yi Cai
- School of Software Engineering, South China University of Technology, Guangzhou 510006, China.
| | - Yanghui Rao
- Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong Special Administrative Region
|
23
|
|
24
|
Masciari E, Mazzeo G, Zaniolo C. Analysing microarray expression data through effective clustering. Inf Sci (N Y) 2014. [DOI: 10.1016/j.ins.2013.12.003] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Indexed: 10/25/2022]
|
25
|
Hijazi H, Chan C. A classification framework applied to cancer gene expression profiles. Journal of Healthcare Engineering 2013; 4:255-83. [PMID: 23778014 DOI: 10.1260/2040-2295.4.2.255] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Indexed: 11/03/2022]
Abstract
Classification of cancer based on gene expression has provided insight into possible treatment strategies. Thus, developing machine learning methods that can successfully distinguish among cancer subtypes or normal versus cancer samples is important. This work discusses supervised learning techniques that have been employed to classify cancers. Furthermore, a two-step feature selection method based on an attribute estimation method (e.g., ReliefF) and a genetic algorithm was employed to find a set of genes that can best differentiate between cancer subtypes or normal versus cancer samples. The application of different classification methods (e.g., decision tree, k-nearest neighbor, support vector machine (SVM), bagging, and random forest) on 5 cancer datasets shows that no classification method universally outperforms all the others. However, k-nearest neighbor and linear SVM generally improve the classification performance over other classifiers. Finally, incorporating diverse types of genomic data (e.g., protein-protein interaction data and gene expression) increases the prediction accuracy as compared to using gene expression alone.
Affiliation(s)
- Hussein Hijazi
- Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, USA.
|
26
|
Effectiveness of Different Partition Based Clustering Algorithms for Estimation of Missing Values in Microarray Gene Expression Data. Advances in Computing and Information Technology 2013. [DOI: 10.1007/978-3-642-31552-7_5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 01/28/2023]
|
27
|
Zhao X, Deng W, Shi Y. Feature Selection with Attributes Clustering by Maximal Information Coefficient. Procedia Computer Science 2013. [DOI: 10.1016/j.procs.2013.05.011] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.0] [Indexed: 11/25/2022]
|
28
|
Mitra S, Ghosh S. Feature Selection and Clustering of Gene Expression Profiles Using Biological Knowledge. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 2012. [DOI: 10.1109/tsmcc.2012.2209416] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Indexed: 11/10/2022]
|
29
|
Durston KK, Chiu DKY, Wong AKC, Li GCL. Statistical discovery of site inter-dependencies in sub-molecular hierarchical protein structuring. EURASIP Journal on Bioinformatics & Systems Biology 2012; 2012:8. [PMID: 22793672 PMCID: PMC3524763 DOI: 10.1186/1687-4153-2012-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 11/02/2011] [Accepted: 05/29/2012] [Indexed: 11/10/2022]
Abstract
BACKGROUND Much progress has been made in understanding the 3D structure of proteins using methods such as NMR and X-ray crystallography. The resulting 3D structures are extremely informative, but do not always reveal which sites and residues within the structure are of special importance. Recently, there are indications that multiple-residue, sub-domain structural relationships within the larger 3D consensus structure of a protein can be inferred from the analysis of the multiple sequence alignment data of a protein family. These intra-dependent clusters of associated sites are used to indicate hierarchical inter-residue relationships within the 3D structure. To reveal the patterns of associations among individual amino acids or sub-domain components within the structure, we apply a k-modes attribute (aligned site) clustering algorithm to the ubiquitin and transthyretin families in order to discover associations among groups of sites within the multiple sequence alignment. We then observe what these associations imply within the 3D structure of these two protein families. RESULTS The k-modes site clustering algorithm we developed maximizes the intra-group interdependencies based on a normalized mutual information measure. The clusters formed correspond to sub-structural components or binding and interface locations. Applying this data-directed method to the ubiquitin and transthyretin protein family multiple sequence alignments as a test bed, we located numerous interesting associations of interdependent sites. These clusters were then arranged into cluster tree diagrams which revealed four structural sub-domains within the single domain structure of ubiquitin and a single large sub-domain within transthyretin associated with the interface among transthyretin monomers.
In addition, several clusters of mutually interdependent sites were discovered for each protein family, each of which appear to play an important role in the molecular structure and/or function. CONCLUSIONS Our results demonstrate that the method we present here using a k-modes site clustering algorithm based on interdependency evaluation among sites obtained from a sequence alignment of homologous proteins can provide significant insights into the complex, hierarchical inter-residue structural relationships within the 3D structure of a protein family.
Affiliation(s)
- Kirk K Durston
- School of Computer Science, University of Guelph, 50 Stone Road East, Guelph, ON, N1G 2W1, Canada
| | - David KY Chiu
- School of Computer Science, University of Guelph, 50 Stone Road East, Guelph, ON, N1G 2W1, Canada
| | - Andrew KC Wong
- Department of System Design Engineering, University of Waterloo, 200 University Ave. W, Waterloo, ON, N2L 3G1, Canada
| | - Gary CL Li
- Department of System Design Engineering, University of Waterloo, 200 University Ave. W, Waterloo, ON, N2L 3G1, Canada
|
30
|
Maji P, Das C. Relevant and Significant Supervised Gene Clusters for Microarray Cancer Classification. IEEE Trans Nanobioscience 2012; 11:161-8. [DOI: 10.1109/tnb.2012.2193590] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Indexed: 11/07/2022]
|
31
|
Van Hulse J, Khoshgoftaar TM, Napolitano A, Wald R. Threshold-based feature selection techniques for high-dimensional bioinformatics data. Network Modeling Analysis in Health Informatics and Bioinformatics 2012. [DOI: 10.1007/s13721-012-0006-6] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Indexed: 10/28/2022]
|
32
|
Salome JJ. Efficient Retrieval Technique for Microarray Gene Expression. International Journal of Information Retrieval Research 2012. [DOI: 10.4018/ijirr.2012040104] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
DNA microarray gene data consist of the expression levels of thousands of genes for a small number of samples. Extracting the required knowledge from microarray gene data remains an open challenge. Acquiring knowledge from such gene data is intricate, although a growing number of studies aim to acquire information from these data. Gene classification is vital for retrieving the required information; however, the task is complex because of the data characteristics, namely high dimensionality and small sample size. Initially, dimensionality reduction is carried out with the aid of the LPP and PCA techniques in order to shrink the microarray data without losing information, and the reduced data are used for information retrieval. In this paper, we propose an effective gene retrieval technique based on LPP and PCA, called LPCA. LPP and PCA are chosen for dimensionality reduction to enable efficient retrieval of microarray gene data. An application of microarray gene data to classification by SVM is included. The SVM is trained on the dimensionality-reduced gene data for effective classification. A comparative study is made of these dimensionality reduction techniques.
Affiliation(s)
- J. Jacinth Salome
- Department of Computer Science, Arignar Anna Government Arts College, Walajapet, Tamil Nadu, India
|
33
|
|
34
|
Romdhane LB, Shili H, Ayeb B. P3M: possibilistic multi-step maxmin and merging algorithm with application to gene expression data mining. Int J Artif Intell Tools 2011. [DOI: 10.1142/s0218213009000263] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Gene expression data generated by DNA microarray experiments provide a vast resource of medical diagnostic and disease understanding. Unfortunately, the large amount of data makes it hard, sometimes impossible, to understand the correct behavior of genes. In this work, we develop a possibilistic approach for mining gene microarray data. Our model consists of two steps. In the first step, we use possibilistic clustering to partition the data into groups (or clusters). The optimal number of clusters is evaluated automatically from data using the Partition Information Entropy as a validity measure. In the second step, we select from each computed cluster the most representative genes and model them as a graph called a proximity graph. This set of graphs (or hyper-graph) will be used to predict the function of new and previously unknown genes. Benchmark results on real-world data sets reveal a good performance of our model in computing optimal partitions even in the presence of noise; and a high prediction accuracy on unknown genes.
Affiliation(s)
- Lotfi Ben Romdhane
- PRINCE Research Group (PRINCE stands for Pole de Recherche en INformatique du CEntre. It is a multi-disciplinary research group working in the fields of data mining, distributed computing, and intelligent networks.), Department of Computer Science, Faculty of Sciences of Monastir, University of Monastir, Monastir 5019, Tunisia
| | - Hechmi Shili
- PRINCE Research Group, Department of Computer Science, Faculty of Sciences of Monastir, University of Monastir, Monastir 5019, Tunisia
| | - Bechir Ayeb
- PRINCE Research Group, Department of Computer Science, Faculty of Sciences of Monastir, University of Monastir, Monastir 5019, Tunisia
|
35
|
Papadimitriou S, Mavroudi S, Likothanassis SD. Mutual information clustering for efficient mining of fuzzy association rules with application to gene expression data analysis. Int J Artif Intell Tools 2011. [DOI: 10.1142/s0218213006002643] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/18/2022]
Abstract
Fuzzy association rules can reveal useful dependencies and interactions hidden in large gene expression data sets. However, their derivation involves very difficult combinatorial problems that depend heavily on the size of these sets. The paper follows a divide and conquer approach to the problem that obtains computationally manageable solutions. Initially we cluster genes that are more likely to be associated. Thereafter, the fuzzy association rule extraction algorithms confront many, but significantly reduced, computational problems that can usually be processed very fast. The clustering phase is accomplished by means of an approach based on mutual information (MI). This approach uses the mutual information as a similarity measure. However, the numerical evaluation of the MI is subtle. We experiment with the main methods and compare them. As the device that implements the mutual information clustering we use a SOM (Self-Organized Map) based approach that is capable of effectively incorporating supervised bias. After the mutual information clustering phase the fuzzy association rules are extracted locally on a per cluster basis. The paper presents an application of the techniques for mining the gene expression data. However, the presented techniques can easily be adapted and can be fruitful for intelligent exploration of any other similar data set as well.
Affiliation(s)
- Stergios Papadimitriou
- Department of Information Management, Technological Educational Institute of Kavala, 65404 Kavala, Greece
| | - Seferina Mavroudi
- Pattern Recognition Laboratory, Department of Computer Engineering and Informatics, School of Engineering, University of Patras, Rion, Patras, 26500, Greece
| | - Spiridon D. Likothanassis
- Pattern Recognition Laboratory, Department of Computer Engineering and Informatics, School of Engineering, University of Patras, Rion, Patras, 26500, Greece
|
36
|
|
37
|
Abstract
BACKGROUND Discovering patterns from gene expression levels is regarded as a classification problem when tissue classes of the samples are given, and is solved as a discrete-data problem by discretizing the expression levels of each gene into intervals maximizing the interdependence between that gene and the class labels. However, when class information is unavailable, discovering gene expression patterns becomes difficult. METHODS For a gene pool with a large number of genes, we first cluster the genes into smaller groups. In each group, we use the representative gene, the one with the highest interdependence with others in the group, to drive the discretization of the gene expression levels of other genes. Treating intervals as discrete events, association patterns of events can be discovered. If the gene groups obtained are crisp gene clusters, significant patterns overlapping different gene clusters cannot be found. This paper presents a new method of "fuzzifying" the crisp gene clusters to overcome this problem. RESULTS To evaluate the effectiveness of our approach, we first apply the above described procedure on a synthetic data set and then a gene expression data set with known class labels. The class labels are not used in either analysis but are used later as the ground truth in a classificatory problem for assessing the algorithm's effectiveness in fuzzy gene clustering and discretization. The results show the efficacy of the proposed method. The existence of correlation among continuous valued gene expression levels suggests that certain genes in the gene groups have high interdependence with other genes in the group. Fuzzification of a crisp gene cluster allows the cluster to take in genes from other clusters so that overlapping relationships among gene clusters can be uncovered. Hence, previously unknown hidden patterns residing in overlapping gene clusters are discovered.
From the experimental results, the high order patterns discovered reveal multiple gene interaction patterns in cancerous tissues not found in normal tissues. It was also found that for the colon cancer experiment, 70% of the top patterns and most of the discriminative patterns between cancerous and normal tissues are among those spanning across different crisp gene clusters. CONCLUSIONS We show that the proposed method for analyzing the error-prone microarray is effective even without the presence of tissue class information. A unified framework is presented, allowing fast and accurate pattern discovery for gene expression data. For a large gene set, to discover a comprehensive set of patterns, gene clustering, gene expression discretization and gene cluster fuzzification are absolutely necessary.
Affiliation(s)
- Gene PK Wu
- Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong
| | - Keith CC Chan
- Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong
| | - Andrew KC Wong
- Department of Systems Design Engineering, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada
|
38
|
Lam WWM, Chan KCC. Discovering functional interdependence relationship in PPI networks for protein complex identification. IEEE Trans Biomed Eng 2010; 59:899-908. [PMID: 21095855 DOI: 10.1109/tbme.2010.2093524] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Indexed: 11/05/2022]
Abstract
Protein molecules interact with each other in protein complexes to perform many vital functions, and different computational techniques have been developed to identify protein complexes in protein-protein interaction (PPI) networks. These techniques are developed to search for subgraphs of high connectivity in PPI networks under the assumption that the proteins in a protein complex are highly interconnected. While these techniques have been shown to be quite effective, it is also possible that the matching rate between the protein complexes they discover and those that are previously determined experimentally is relatively low and the "false-alarm" rate is relatively high. This is especially the case when the assumption that proteins in protein complexes are more highly interconnected is relatively invalid. To increase the matching rate and reduce the false-alarm rate, we have developed a technique that can work effectively without having to make this assumption. The technique, called protein complex identification by discovering functional interdependence (PCIFI), searches for protein complexes in PPI networks by taking into consideration both the functional interdependence relationship between protein molecules and the topology of the network. The PCIFI works in several steps. The first step is to construct a multiple-function protein network graph by labeling each vertex with one or more of the molecular functions it performs. The second step is to filter out protein interactions between protein pairs that are not functionally interdependent in the statistical sense. The third step is to make use of an information-theoretic measure to determine the strength of the functional interdependence between all remaining interacting protein pairs. Finally, the last step is to try to form protein complexes based on the measure of the strength of functional interdependence and the connectivity between proteins.
For performance evaluation, PCIFI was used to identify protein complexes in real PPI network data and the protein complexes it found were matched against those that were previously known in MIPS. The results show that PCIFI can be an effective technique for the identification of protein complexes. The protein complexes it found can match more known protein complexes with a smaller false-alarm rate and can provide useful insights into the understanding of the functional interdependence relationships between proteins in protein complexes.
Affiliation(s)
- Winnie W M Lam
- Department of Computing, The Hong Kong Polytechnic University, Hung Hom 999077, Hong Kong.
|
39
|
He Z, Yu W. Stable feature selection for biomarker discovery. Comput Biol Chem 2010; 34:215-25. [PMID: 20702140 DOI: 10.1016/j.compbiolchem.2010.07.002] [Citation(s) in RCA: 131] [Impact Index Per Article: 9.4] [Received: 05/09/2010] [Revised: 06/27/2010] [Accepted: 07/10/2010] [Indexed: 12/27/2022]
|
40
|
Maji P. Fuzzy-rough supervised attribute clustering algorithm and classification of microarray data. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 2010; 41:222-33. [PMID: 20542768 DOI: 10.1109/tsmcb.2010.2050684] [Citation(s) in RCA: 62] [Impact Index Per Article: 4.4] [Indexed: 11/07/2022]
Abstract
One of the major tasks with gene expression data is to find groups of coregulated genes whose collective expression is strongly associated with sample categories. In this regard, a new clustering algorithm, termed as fuzzy-rough supervised attribute clustering (FRSAC), is proposed to find such groups of genes. The proposed algorithm is based on the theory of fuzzy-rough sets, which directly incorporates the information of sample categories into the gene clustering process. A new quantitative measure is introduced based on fuzzy-rough sets that incorporates the information of sample categories to measure the similarity among genes. The proposed algorithm is based on measuring the similarity between genes using the new quantitative measure, whereby redundancy among the genes is removed. The clusters are refined incrementally based on sample categories. The effectiveness of the proposed FRSAC algorithm, along with a comparison with existing supervised and unsupervised gene selection and clustering algorithms, is demonstrated on six cancer and two arthritis data sets based on the class separability index and predictive accuracy of the naive Bayes' classifier, the K-nearest neighbor rule, and the support vector machine.
Affiliation(s)
- Pradipta Maji
- Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700 108, India.
|
41
|
Liu H, Liu L, Zhang H. Ensemble gene selection by grouping for microarray data classification. J Biomed Inform 2009; 43:81-7. [PMID: 19699316 DOI: 10.1016/j.jbi.2009.08.010] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.7] [Received: 12/05/2008] [Revised: 07/01/2009] [Accepted: 08/17/2009] [Indexed: 10/20/2022]
Abstract
Selecting relevant and discriminative genes for sample classification is a common and critical task in gene expression analysis (e.g., disease diagnosis). It is desirable that gene selection can effectively improve the classification performance of the learning algorithm. In general, for most gene selection methods widely used in practice, an individual gene subset is chosen according to its discriminative power. One deficiency of an individual gene subset is that its contribution to classification is limited. This issue can be alleviated to some extent by ensemble gene selection based on random selection. However, the random approach requires an unnecessarily large number of candidate gene subsets and its reliability is a problem. In this study, we propose a new ensemble method, called ensemble gene selection by grouping (EGSG), to select multiple gene subsets for classification. Rather than selecting randomly, our method chooses salient gene subsets from microarray data by virtue of information theory and approximate Markov blanket. The effectiveness and accuracy of our method are validated by experiments on five publicly available microarray data sets. The experimental results show that our ensemble gene selection method has comparable classification performance to other gene selection methods, and is more stable than the random one.
Affiliation(s)
- Huawen Liu
- College of Computer Science, Jilin University, Changchun 130012, China.
|
42
|
Chatterjee S, Bhattacharjee K, Konar A. A simple and robust algorithm for microarray data clustering based on gene population-variance ratio metric. Biotechnol J 2009; 4:1357-61. [PMID: 19579218 DOI: 10.1002/biot.200800219] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/05/2022]
Abstract
With the advent of the microarray technology, the field of life science has been greatly revolutionized, since this technique allows the simultaneous monitoring of the expression levels of thousands of genes in a particular organism. However, the statistical analysis of expression data has its own challenges, primarily because of the huge amount of data that is to be dealt with, and also because of the presence of noise, which is almost an inherent characteristic of microarray data. Clustering is one tool used to mine meaningful patterns from microarray data. In this paper, we present a novel method of clustering yeast microarray data, which is robust and yet simple to implement. It identifies the best clusters from a given dataset on the basis of the population of the clusters as well as the variance of the feature values of the members from the cluster-center. It has been found to yield satisfactory results even in the presence of noisy data.
|
43
|
Romdhane LB, Shili H, Ayeb B. Mining microarray gene expression data with unsupervised possibilistic clustering and proximity graphs. Appl Intell 2009. [DOI: 10.1007/s10489-009-0161-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 10/21/2022]
|
44
|
Wong AKC, Au WH, Chan KCC. Discovering high-order patterns of gene expression levels. J Comput Biol 2008; 15:625-37. [PMID: 18631025 DOI: 10.1089/cmb.2007.0147] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/13/2022] Open
Abstract
This paper reports the discovery of statistically significant association patterns of gene expression levels from microarray data. By association patterns, we mean certain gene expression intensity intervals having statistically significant associations among themselves and with the tissue classes, such as cancerous and normal tissues. We describe how the significance of the associations among gene expression levels can be evaluated using a statistical measure in an objective manner. If an association is found to be significant based on the measure, we say that it is statistically significant. Given a gene expression data set, we first cluster the entire gene pool comprising all the genes into groups by optimizing the correlation (or more precisely, interdependence) among the gene expression levels within gene groups. From each group, we select one or several genes that are most correlated with other genes within that group to form a smaller gene pool. This gene pool then constitutes the most representative genes from the original pool. Our pattern discovery algorithm is then used, for the first time, to discover the significant association patterns of gene expression levels among the genes from the small pool. With our method, it is more effective to discover and express the associations in terms of their intensity intervals. Hence, we discretize each gene's expression levels into intervals maximizing the interdependence between the gene expression and the tissue classes. From this data set of gene expression intervals, we discover the association patterns representing statistically significant associations, some positively and some negatively, with different tissue classes. We apply our pattern discovery methodology to the colon-cancer microarray gene expression data set. It consists of 2000 genes and 62 samples taken from colon cancer or normal subjects. The statistically significant combinations of gene expression levels that repress or activate colon cancer are revealed in the colon-cancer data set. The discovered association patterns are ranked according to their statistical significance and displayed for interpretation and further analysis.
Affiliation(s)
- Andrew K C Wong
- Department of Systems Design Engineering, University of Waterloo, Waterloo, Ontario, Canada
|
45
|
Yin ZX, Chiang JH. Novel algorithm for coexpression detection in time-varying microarray data sets. IEEE/ACM Trans Comput Biol Bioinform 2008; 5:120-135. [PMID: 18245881 DOI: 10.1109/tcbb.2007.1052] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 05/25/2023]
Abstract
When analyzing the results of microarray experiments, biologists generally use unsupervised categorization tools. However, such tools regard each time point as an independent dimension and utilize the Euclidean distance to compute the similarities between expressions. Furthermore, some of these methods require the number of clusters to be determined in advance, which is clearly impossible in the case of a new dataset. Therefore, this study proposes a novel scheme, designated as the Variation-based Coexpression Detection (VCD) algorithm, to analyze the trends of expressions based on their variation over time. The proposed algorithm has two advantages. First, it is unnecessary to determine the number of clusters in advance since the algorithm automatically detects those genes whose profiles are grouped together and creates patterns for these groups. Second, the algorithm features a new measurement criterion for calculating the degree of change of the expressions between adjacent time points and evaluating their trend similarities. Three real-world microarray datasets are employed to evaluate the performance of the proposed algorithm.
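The abstract contrasts Euclidean distance with a criterion based on the change between adjacent time points. The sketch below illustrates that general idea only; it is not the VCD algorithm, whose exact criterion is not given here. The names `adjacent_changes` and `trend_similarity`, and the choice to compare only the direction of change, are assumptions made for illustration.

```python
def adjacent_changes(profile):
    """Differences between consecutive time points of one expression profile."""
    return [later - earlier for earlier, later in zip(profile, profile[1:])]


def trend_similarity(p, q):
    """Hypothetical measure: fraction of adjacent intervals in which two
    profiles move in the same direction (up, down, or flat)."""
    def sign(x):
        return (x > 0) - (x < 0)

    dp, dq = adjacent_changes(p), adjacent_changes(q)
    if len(dp) != len(dq) or not dp:
        raise ValueError("profiles need equal length and at least two time points")
    agreements = sum(sign(a) == sign(b) for a, b in zip(dp, dq))
    return agreements / len(dp)


gene_a = [1.0, 2.0, 3.0, 2.0]
gene_b = [10.0, 20.0, 30.0, 25.0]  # far from gene_a in Euclidean terms, same trend
gene_c = [3.0, 2.0, 1.0, 2.0]      # close in magnitude, opposite trend

print(trend_similarity(gene_a, gene_b))  # 1.0
print(trend_similarity(gene_a, gene_c))  # 0.0
```

Here `gene_a` and `gene_b` differ greatly in magnitude yet share every up and down move, which is the kind of coexpression a point-wise Euclidean comparison would miss.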
|
46
|
Abstract
With the advent of microarray technology it has been possible to measure thousands of expression values of genes in a single experiment. Biclustering or simultaneous clustering of both genes and conditions is challenging particularly for the analysis of high-dimensional gene expression data in information retrieval, knowledge discovery, and data mining. The objective here is to find sub-matrices, i.e., maximal subgroups of genes and subgroups of conditions where the genes exhibit highly correlated activities over a range of conditions while maximizing the volume simultaneously. Since these two objectives are mutually conflicting, they become suitable candidates for multi-objective modeling. In this study, we will describe some recent literature on biclustering as well as a multi-objective evolutionary biclustering framework for gene expression data along with the experimental results.
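The two conflicting objectives described above can be made concrete with a small sketch. The abstract does not name its coherence measure, so this example assumes the mean squared residue of Cheng and Church, a common choice in the biclustering literature, as the coherence objective and uses the number of cells as the volume. The function names are hypothetical.

```python
def mean_squared_residue(matrix, rows, cols):
    """Assumed coherence measure (Cheng and Church style): lower is more coherent.

    matrix: list of lists, genes x conditions
    rows, cols: index lists selecting the bicluster (sub-matrix)
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = sum(
        (sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    )
    return total / (n_rows * n_cols)


def volume(rows, cols):
    """Size objective: number of cells in the bicluster."""
    return len(rows) * len(cols)


expr = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],  # gene 0 shifted by +1: perfectly coherent with gene 0
    [5.0, 0.0, 9.0],  # unrelated profile
]

small = ([0, 1], [0, 1, 2])
full = ([0, 1, 2], [0, 1, 2])

print(mean_squared_residue(expr, *small), volume(*small))
print(mean_squared_residue(expr, *full), volume(*full))
```

Adding the third gene increases the volume but also raises the residue, which shows why the two goals pull against each other and lend themselves to multi-objective treatment.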
Affiliation(s)
- Haider Banka
- Center for Soft Computing Research: A National Facility, Indian Statistical Institute, Kolkata
- Sushmita Mitra
- Machine Intelligence Unit, Indian Statistical Institute, Kolkata
|