1. Wang S, Kim SY, Sohn KA. ClearF++: Improved Supervised Feature Scoring Using Feature Clustering in Class-Wise Embedding and Reconstruction. Bioengineering (Basel) 2023; 10:824. [PMID: 37508851; PMCID: PMC10376817; DOI: 10.3390/bioengineering10070824] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2023] [Revised: 06/28/2023] [Accepted: 07/04/2023] [Indexed: 07/30/2023] Open
Abstract
Feature selection methods are essential for accurate disease classification and identifying informative biomarkers. While information-theoretic methods have been widely used, they often exhibit limitations such as high computational costs. Our previously proposed method, ClearF, addresses these issues by using reconstruction error from low-dimensional embeddings as a proxy for the entropy term in the mutual information. However, ClearF still has limitations, including a nontransparent bottleneck layer selection process, which can result in unstable feature selection. To address these limitations, we propose ClearF++, which simplifies the bottleneck layer selection and incorporates feature-wise clustering to enhance biomarker detection. We compare its performance with other commonly used methods such as MultiSURF and IFS, as well as ClearF, across multiple benchmark datasets. Our results demonstrate that ClearF++ consistently outperforms these methods in terms of prediction accuracy and stability, even with limited samples. We also observe that employing the Deep Embedded Clustering (DEC) algorithm for feature-wise clustering improves performance, indicating its suitability for handling complex data structures with limited samples. ClearF++ offers an improved biomarker prioritization approach with enhanced prediction performance and faster execution. Its stability and effectiveness with limited samples make it particularly valuable for biomedical data analysis.
Affiliation(s)
- Sehee Wang
  - Department of Artificial Intelligence, Ajou University, Suwon 16499, Republic of Korea
- So Yeon Kim
  - Department of Artificial Intelligence, Ajou University, Suwon 16499, Republic of Korea
  - Department of Software and Computer Engineering, Ajou University, Suwon 16499, Republic of Korea
- Kyung-Ah Sohn
  - Department of Artificial Intelligence, Ajou University, Suwon 16499, Republic of Korea
  - Department of Software and Computer Engineering, Ajou University, Suwon 16499, Republic of Korea

2. Ensemble filters with harmonize PSO-SVM algorithm for optimal hearing disorder prediction. Neural Comput Appl 2023; 35:10473-10496. [PMID: 36747886; PMCID: PMC9894525; DOI: 10.1007/s00521-023-08244-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2021] [Accepted: 01/06/2023] [Indexed: 02/05/2023]
Abstract
Detecting a hearing disorder early is critical for reducing the effects of hearing loss, because approaches that preserve the remaining hearing ability can then be applied to support the successful development of human communication. Recently, the rapid growth in the number of dataset features has made it harder for audiologists to decide on the proper treatment for a patient. In most cases, irrelevant features and improper classifier parameters substantially reduce the accuracy of an audiometry system. The two processes are interdependent, so classification accuracy can worsen if feature selection and parameter tuning are conducted independently. Although filter algorithms can eliminate irrelevant features, they do not account for dependencies between features and can therefore select significant features poorly. Improper kernel parameter settings may also contribute to poor accuracy. In this paper, an ensemble filter feature selection method based on Information Gain (IG), Gain Ratio (GR), Chi-squared (CS), and Relief-F (RF), with harmonized optimization of Particle Swarm Optimization (PSO) and Support Vector Machine (SVM), is presented to mitigate these problems. Ensemble filters are used so that the top-ranked features relevant for classification are considered first. Then, PSO and SVM are optimized simultaneously to achieve the optimal solution. The results on a standard Audiology dataset show that the proposed method achieves 96.50% accuracy with the optimal solution, compared with classical SVM, which indicates that the method is effective in handling high-dimensional data for hearing disorder prediction.

3. Yang Q, Li B, Wang P, Xie J, Feng Y, Liu Z, Zhu F. LargeMetabo: an out-of-the-box tool for processing and analyzing large-scale metabolomic data. Brief Bioinform 2022; 23:6768054. [PMID: 36274234; DOI: 10.1093/bib/bbac455] [Citation(s) in RCA: 30] [Impact Index Per Article: 15.0] [Received: 06/04/2022] [Revised: 09/06/2022] [Accepted: 09/24/2022] [Indexed: 12/14/2022] Open
Abstract
Large-scale metabolomics is a powerful technique that has attracted widespread attention in biomedical studies focused on identifying biomarkers and interpreting the mechanisms of complex diseases. Despite a rapid increase in the number of large-scale metabolomic studies, the analysis of metabolomic data remains a key challenge. Specifically, diverse unwanted variations and batch effects in processing many samples have a substantial impact on identifying true biological markers, and it is a daunting challenge to annotate a plethora of peaks as metabolites in untargeted mass spectrometry-based metabolomics. Therefore, the development of an out-of-the-box tool is urgently needed to realize data integration and to accurately annotate metabolites with enhanced functions. In this study, the LargeMetabo package based on R code was developed for processing and analyzing large-scale metabolomic data. This package is unique because it is capable of (1) integrating multiple analytical experiments to effectively boost the power of statistical analysis; (2) selecting the appropriate biomarker identification method by intelligent assessment for large-scale metabolic data and (3) providing metabolite annotation and enrichment analysis based on an enhanced metabolite database. The LargeMetabo package can facilitate flexibility and reproducibility in large-scale metabolomics. The package is freely available from https://github.com/LargeMetabo/LargeMetabo.
Affiliation(s)
- Qingxia Yang
  - Department of Bioinformatics, Smart Health Big Data Analysis and Location Services Engineering Lab of Jiangsu Province, School of Geographic and Biologic Information, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China
- Bo Li
  - College of Life Sciences, Chongqing Normal University, Chongqing, Chongqing 401331, China
- Panpan Wang
  - College of Chemistry and Pharmaceutical Engineering, Huanghuai University, Zhumadian 463000, China
- Jicheng Xie
  - Department of Bioinformatics, Smart Health Big Data Analysis and Location Services Engineering Lab of Jiangsu Province, School of Geographic and Biologic Information, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China
- Yuhao Feng
  - Department of Bioinformatics, Smart Health Big Data Analysis and Location Services Engineering Lab of Jiangsu Province, School of Geographic and Biologic Information, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China
- Ziqiang Liu
  - Department of Bioinformatics, Smart Health Big Data Analysis and Location Services Engineering Lab of Jiangsu Province, School of Geographic and Biologic Information, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China
- Feng Zhu
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China

4. AlMazrua H, AlShamlan H. A New Algorithm for Cancer Biomarker Gene Detection Using Harris Hawks Optimization. Sensors (Basel) 2022; 22:7273. [PMID: 36236372; PMCID: PMC9572901; DOI: 10.3390/s22197273] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/28/2022] [Revised: 09/01/2022] [Accepted: 09/09/2022] [Indexed: 05/29/2023]
Abstract
This paper presents two novel swarm intelligence algorithms for gene selection, HHO-SVM and HHO-KNN. Both of these algorithms are based on Harris Hawks Optimization (HHO), one in conjunction with support vector machines (SVM) and the other in conjunction with k-nearest neighbors (k-NN). In both algorithms, the goal is to determine a small gene subset that can be used to classify samples with a high degree of accuracy. The proposed algorithms are divided into two phases. To obtain an accurate gene set and to deal with the challenge of high-dimensional data, the redundancy analysis and relevance calculation are conducted in the first phase. To solve the gene selection problem, the second phase applies SVM and k-NN with leave-one-out cross-validation. A performance evaluation was performed on six microarray data sets using the two proposed algorithms. A comparison of the two proposed algorithms with several known algorithms indicates that both of them perform quite well in terms of classification accuracy and the number of selected genes.
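As an illustration of the second phase described in this abstract, the following minimal Python sketch scores a candidate gene subset by leave-one-out cross-validation. It assumes a 1-nearest-neighbour classifier with squared Euclidean distance. The function name `loocv_1nn_accuracy`, the toy data, and the choice of k = 1 are hypothetical and not taken from the paper, and the HHO search that would propose the subsets is not shown.

```python
def loocv_1nn_accuracy(samples, labels, gene_subset):
    """Leave-one-out accuracy of a 1-NN classifier restricted to gene_subset.

    samples: list of expression vectors (one list of floats per sample)
    labels: class label of each sample
    gene_subset: indices of the genes used to measure distance
    """
    def squared_distance(a, b):
        return sum((a[g] - b[g]) ** 2 for g in gene_subset)

    correct = 0
    for i, held_out in enumerate(samples):
        # Hold out sample i and find its nearest neighbour among the rest.
        nearest = min(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: squared_distance(held_out, samples[j]),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(samples)


# Toy data (hypothetical): gene 0 separates the classes, gene 1 does not.
samples = [[1.0, 5.0], [1.2, 1.0], [3.0, 5.2], [3.1, 0.9]]
labels = ["a", "a", "b", "b"]

informative = loocv_1nn_accuracy(samples, labels, gene_subset=[0])
uninformative = loocv_1nn_accuracy(samples, labels, gene_subset=[1])
print(informative, uninformative)
```

In a wrapper method of this kind, an accuracy such as this would typically serve as the fitness of a candidate subset, possibly combined with a penalty on subset size.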

5. Reel PS, Reel S, Pearson E, Trucco E, Jefferson E. Using machine learning approaches for multi-omics data analysis: A review. Biotechnol Adv 2021; 49:107739. [PMID: 33794304; DOI: 10.1016/j.biotechadv.2021.107739] [Citation(s) in RCA: 265] [Impact Index Per Article: 88.3] [Received: 12/15/2020] [Revised: 03/01/2021] [Accepted: 03/25/2021] [Indexed: 02/06/2023]
Abstract
With the development of modern high-throughput omic measurement platforms, it has become essential for biomedical studies to undertake an integrative (combined) approach to fully utilise these data to gain insights into biological systems. Data from various omics sources such as genetics, proteomics, and metabolomics can be integrated to unravel the intricate working of systems biology using machine learning-based predictive algorithms. Machine learning methods offer novel techniques to integrate and analyse the various omics data enabling the discovery of new biomarkers. These biomarkers have the potential to help in accurate disease prediction, patient stratification and delivery of precision medicine. This review paper explores different integrative machine learning methods which have been used to provide an in-depth understanding of biological systems during normal physiological functioning and in the presence of a disease. It provides insight and recommendations for interdisciplinary professionals who envisage employing machine learning skills in multi-omics studies.
Affiliation(s)
- Parminder S Reel
  - Division of Population Health and Genomics, School of Medicine, University of Dundee, Dundee, United Kingdom
- Smarti Reel
  - Division of Population Health and Genomics, School of Medicine, University of Dundee, Dundee, United Kingdom
- Ewan Pearson
  - Division of Population Health and Genomics, School of Medicine, University of Dundee, Dundee, United Kingdom
- Emanuele Trucco
  - VAMPIRE project, Computing, School of Science and Engineering, University of Dundee, Dundee, United Kingdom
- Emily Jefferson
  - Division of Population Health and Genomics, School of Medicine, University of Dundee, Dundee, United Kingdom

6. Zhang G, Xue Z, Yan C, Wang J, Luo H. A Novel Biomarker Identification Approach for Gastric Cancer Using Gene Expression and DNA Methylation Dataset. Front Genet 2021; 12:644378. [PMID: 33868380; PMCID: PMC8044773; DOI: 10.3389/fgene.2021.644378] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 12/21/2020] [Accepted: 02/16/2021] [Indexed: 01/09/2023] Open
Abstract
As a complex disease, gastric cancer has a high mortality rate, and there are few effective treatments for patients at an advanced stage. With the development of biological technology, a large amount of multi-omics data on gastric cancer has been generated, which enables computational methods to discover potential biomarkers of gastric cancer. Such biomarkers are important for detecting gastric cancer at earlier stages and thus for providing timely treatment. However, most biological data are high-dimensional with small sample sizes and are hard to process directly without feature selection. Moreover, using only one type of omics data, such as gene expression data, provides limited evidence for investigating biomarkers associated with gastric cancer. In this research, gene expression data and DNA methylation data are integrated to analyze gastric cancer, and a feature selection approach is proposed to identify possible biomarkers of gastric cancer. After the original data are pre-processed, mutual information (MI) is applied to select the top genes. Then, fold change (FC) and the t-test are used to identify differentially expressed genes (DEG). In particular, the false discovery rate (FDR) is used to adjust the p-values to further screen genes. For the chosen genes, a deep neural network (DNN) model is used as the classifier to measure the quality of classification. The experimental results show that the approach achieves superior performance in terms of accuracy and other metrics. Biological analysis of the chosen genes further validates the effectiveness of the approach.
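The screening step in this abstract (fold change plus FDR-adjusted p-values) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the abstract does not name the FDR procedure, so the Benjamini-Hochberg adjustment is assumed here. The thresholds `alpha` and `min_abs_log2_fc`, the function names, and the toy numbers are all hypothetical. The p-values are taken as given (for example, from a t-test computed elsewhere).

```python
import math


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone in the rank of the raw p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def log2_fold_change(case_values, control_values):
    """log2 of the ratio of mean expression in cases to mean in controls."""
    mean_case = sum(case_values) / len(case_values)
    mean_control = sum(control_values) / len(control_values)
    return math.log2(mean_case / mean_control)


def screen_genes(genes, p_values, log2_fcs, alpha=0.05, min_abs_log2_fc=1.0):
    """Keep genes whose adjusted p-value and |log2 FC| both pass the thresholds."""
    q_values = benjamini_hochberg(p_values)
    return [
        gene
        for gene, q, fc in zip(genes, q_values, log2_fcs)
        if q <= alpha and abs(fc) >= min_abs_log2_fc
    ]


# Toy inputs (hypothetical): four genes with raw p-values and log2 fold changes.
genes = ["g1", "g2", "g3", "g4"]
p_values = [0.01, 0.04, 0.03, 0.005]
log2_fcs = [2.0, 0.2, -1.5, 0.5]

q_values = benjamini_hochberg(p_values)
selected = screen_genes(genes, p_values, log2_fcs)
print(q_values, selected)
```

Filtering on both criteria keeps only genes that are statistically significant after multiple-testing correction and that also change by a meaningful amount.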
Affiliation(s)
- Ge Zhang
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
- Zijing Xue
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
- Chaokun Yan
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
- Jianlin Wang
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
- Huimin Luo
  - School of Computer and Information Engineering, Henan University, Kaifeng, China

7. Leary E, Stoker AM, Cook JL. Classification, Categorization, and Algorithms for Articular Cartilage Defects. J Knee Surg 2020; 33:1069-1077. [PMID: 32663886; DOI: 10.1055/s-0040-1713778] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/07/2023]
Abstract
There is a critical unmet need in the clinical implementation of valid preventative and therapeutic strategies for patients with articular cartilage pathology based on the significant gap in understanding of the relationships between diagnostic data, disease progression, patient-related variables, and symptoms. In this article, the current state of classification and categorization for articular cartilage pathology is discussed with particular focus on machine learning methods and the authors propose a bedside-bench-bedside approach with highly quantitative techniques as a solution to these hurdles. Leveraging computational learning with available data toward articular cartilage pathology patient phenotyping holds promise for clinical research and will likely be an important tool to identify translational solutions into evidence-based clinical applications to benefit patients. Recommendations for successful implementation of these approaches include using standardized definitions of articular cartilage, to include characterization of depth, size, location, and number; using measurements that minimize subjectivity or validated patient-reported outcome measures; considering not just the articular cartilage pathology but the whole joint, and the patient perception and perspective. Application of this approach through a multistep process by a multidisciplinary team of clinicians and scientists holds promise for validating disease mechanism-based phenotypes toward clinically relevant understanding of articular cartilage pathology for evidence-based application to orthopaedic practice.
Affiliation(s)
- Emily Leary
  - Thompson Laboratory for Regenerative Orthopaedics, University of Missouri, Columbia, Missouri
  - Department of Orthopaedic Surgery, University of Missouri, Columbia, Missouri
- Aaron M Stoker
  - Thompson Laboratory for Regenerative Orthopaedics, University of Missouri, Columbia, Missouri
  - Department of Orthopaedic Surgery, University of Missouri, Columbia, Missouri
- James L Cook
  - Thompson Laboratory for Regenerative Orthopaedics, University of Missouri, Columbia, Missouri
  - Department of Orthopaedic Surgery, University of Missouri, Columbia, Missouri

8. Wang S, Jeong HH, Sohn KA. ClearF: a supervised feature scoring method to find biomarkers using class-wise embedding and reconstruction. BMC Med Genomics 2019; 12:95. [PMID: 31296201; PMCID: PMC6624178; DOI: 10.1186/s12920-019-0512-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND: Feature selection or scoring methods for the detection of biomarkers are essential in bioinformatics. Various feature selection methods have been developed for the detection of biomarkers, and several studies have employed information-theoretic approaches. However, most of these methods generally require a long processing time. In addition, information-theoretic methods discretize continuous features, which is a drawback that can lead to the loss of information. RESULTS: In this paper, a novel supervised feature scoring method named ClearF is proposed. The proposed method is suitable for continuous-valued data, which is similar to the principle of feature selection using mutual information, with the added advantage of a reduced computation time. The proposed score calculation is motivated by the association between the reconstruction error and the information-theoretic measurement. Our method is based on class-wise low-dimensional embedding and the resulting reconstruction error. Given multi-class datasets such as a case-control study dataset, low-dimensional embedding is first applied to each class to obtain a compressed representation of the class, and also for the entire dataset. Reconstruction is then performed to calculate the error of each feature and the final score for each feature is defined in terms of the reconstruction errors. The correlation between the information theoretic measurement and the proposed method is demonstrated using a simulation. For performance validation, we compared the classification performance of the proposed method with those of various algorithms on benchmark datasets. CONCLUSIONS: The proposed method showed higher accuracy and lower execution time than the other established methods. Moreover, an experiment was conducted on the TCGA breast cancer dataset, and it was confirmed that the genes with the highest scores were highly associated with subtypes of breast cancer.
Affiliation(s)
- Sehee Wang
  - Department of Computer Engineering, Ajou University, Suwon, 16499 South Korea
- Hyun-Hwan Jeong
  - Jan and Dan Duncan Neurological Research Institute, Texas Children’s Hospital, Houston, TX 77030 USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030 USA
- Kyung-Ah Sohn
  - Department of Computer Engineering, Ajou University, Suwon, 16499 South Korea

9. Pereira T, Vilaprinyo E, Belli G, Herrero E, Salvado B, Sorribas A, Altés G, Alves R. Quantitative Operating Principles of Yeast Metabolism during Adaptation to Heat Stress. Cell Rep 2019; 22:2421-2430. [PMID: 29490277; DOI: 10.1016/j.celrep.2018.02.020] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 09/06/2017] [Revised: 01/15/2018] [Accepted: 02/05/2018] [Indexed: 11/18/2022] Open
Abstract
Microorganisms evolved adaptive responses to survive stressful challenges in ever-changing environments. Understanding the relationships between the physiological/metabolic adjustments allowing cellular stress adaptation and gene expression changes being used by organisms to achieve such adjustments may significantly impact our ability to understand and/or guide evolution. Here, we studied those relationships during adaptation to various stress challenges in Saccharomyces cerevisiae, focusing on heat stress responses. We combined dozens of independent experiments measuring whole-genome gene expression changes during stress responses with a simplified kinetic model of central metabolism. We identified alternative quantitative ranges for a set of physiological variables in the model (production of ATP, trehalose, NADH, etc.) that are specific for adaptation to either heat stress or desiccation/rehydration. Our approach is scalable to other adaptive responses and could assist in developing biotechnological applications to manipulate cells for medical, biotechnological, or synthetic biology purposes.
Affiliation(s)
- Tania Pereira
  - Institute of Biomedical Research of Lleida IRBLleida, 25198, Lleida, Catalunya, Spain
  - Departament de Ciències Mèdiques Bàsiques, University of Lleida, 25198, Lleida, Catalunya, Spain
- Ester Vilaprinyo
  - Institute of Biomedical Research of Lleida IRBLleida, 25198, Lleida, Catalunya, Spain
  - Departament de Ciències Mèdiques Bàsiques, University of Lleida, 25198, Lleida, Catalunya, Spain
- Gemma Belli
  - Institute of Biomedical Research of Lleida IRBLleida, 25198, Lleida, Catalunya, Spain
  - Departament de Ciències Mèdiques Bàsiques, University of Lleida, 25198, Lleida, Catalunya, Spain
- Enric Herrero
  - Departament de Ciències Mèdiques Bàsiques, University of Lleida, 25198, Lleida, Catalunya, Spain
- Baldiri Salvado
  - Institute of Biomedical Research of Lleida IRBLleida, 25198, Lleida, Catalunya, Spain
  - Departament de Ciències Mèdiques Bàsiques, University of Lleida, 25198, Lleida, Catalunya, Spain
- Albert Sorribas
  - Institute of Biomedical Research of Lleida IRBLleida, 25198, Lleida, Catalunya, Spain
  - Departament de Ciències Mèdiques Bàsiques, University of Lleida, 25198, Lleida, Catalunya, Spain
- Gisela Altés
  - Institute of Biomedical Research of Lleida IRBLleida, 25198, Lleida, Catalunya, Spain
  - Departament de Ciències Mèdiques Bàsiques, University of Lleida, 25198, Lleida, Catalunya, Spain
- Rui Alves
  - Institute of Biomedical Research of Lleida IRBLleida, 25198, Lleida, Catalunya, Spain
  - Departament de Ciències Mèdiques Bàsiques, University of Lleida, 25198, Lleida, Catalunya, Spain

10. Moon WK, Chen IL, Chang JM, Shin SU, Lo CM, Chang RF. The adaptive computer-aided diagnosis system based on tumor sizes for the classification of breast tumors detected at screening ultrasound. Ultrasonics 2017; 76:70-77. [PMID: 28086107; DOI: 10.1016/j.ultras.2016.12.017] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Received: 06/17/2016] [Revised: 12/06/2016] [Accepted: 12/26/2016] [Indexed: 06/06/2023]
Abstract
Screening ultrasound (US) is increasingly used as a supplement to mammography in women with dense breasts, and more than 80% of cancers detected by US alone are 1cm or smaller. An adaptive computer-aided diagnosis (CAD) system based on tumor size was proposed to classify breast tumors detected at screening US images using quantitative morphological and textural features. In the present study, a database containing 156 tumors (78 benign and 78 malignant) was separated into two subsets of different tumor sizes (<1cm and ⩾1cm) to explore the improvement in the performance of the CAD system. After adaptation, the accuracies, sensitivities, specificities and Az values of the CAD for the entire database increased from 73.1% (114/156), 73.1% (57/78), 73.1% (57/78), and 0.790 to 81.4% (127/156), 83.3% (65/78), 79.5% (62/78), and 0.852, respectively. In the data subset of tumors larger than 1cm, the performance improved from 66.2% (51/77), 68.3% (28/41), 63.9% (23/36), and 0.703 to 81.8% (63/77), 85.4% (35/41), 77.8% (28/36), and 0.855, respectively. The proposed CAD system can be helpful to classify breast tumors detected at screening US.
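The whole-database percentages quoted in this abstract can be re-derived from the reported counts with a short Python check. This is only a worked-arithmetic sketch: it assumes the usual definitions (sensitivity over the 78 malignant tumors, specificity over the 78 benign tumors, accuracy over all 156), and the function name `rates` is hypothetical.

```python
def rates(true_positives, n_malignant, true_negatives, n_benign):
    """Return (accuracy, sensitivity, specificity) from raw counts."""
    sensitivity = true_positives / n_malignant
    specificity = true_negatives / n_benign
    accuracy = (true_positives + true_negatives) / (n_malignant + n_benign)
    return accuracy, sensitivity, specificity


# Counts reported for the entire database (78 malignant, 78 benign).
before = rates(57, 78, 57, 78)  # before size-based adaptation
after = rates(65, 78, 62, 78)   # after size-based adaptation

for label, (acc, sens, spec) in (("before", before), ("after", after)):
    print(f"{label}: accuracy={acc:.1%} sensitivity={sens:.1%} specificity={spec:.1%}")
```

The correct counts add up as expected: 57 + 57 = 114 of 156 before adaptation and 65 + 62 = 127 of 156 after, matching the accuracies of 73.1% and 81.4% given in the abstract.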
Affiliation(s)
- Woo Kyung Moon
  - Department of Radiology, Seoul National University College of Medicine and Seoul National University Hospital, Seoul, Republic of Korea
- I-Ling Chen
  - Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
- Jung Min Chang
  - Department of Radiology, Seoul National University College of Medicine and Seoul National University Hospital, Seoul, Republic of Korea
- Sung Ui Shin
  - Department of Radiology, Seoul National University College of Medicine and Seoul National University Hospital, Seoul, Republic of Korea
- Chung-Ming Lo
  - Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
  - Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan
- Ruey-Feng Chang
  - Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
  - Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan

11. Alkuhlani A, Nassef M, Farag I. Multistage feature selection approach for high-dimensional cancer data. Soft Comput 2016. [DOI: 10.1007/s00500-016-2439-9] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Indexed: 01/20/2023]

12. Lawlor N, Fabbri A, Guan P, George J, Karuturi RKM. multiClust: An R-package for Identifying Biologically Relevant Clusters in Cancer Transcriptome Profiles. Cancer Inform 2016; 15:103-14. [PMID: 27330269; PMCID: PMC4907340; DOI: 10.4137/cin.s38000] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 11/27/2015] [Revised: 03/28/2016] [Accepted: 04/03/2016] [Indexed: 12/26/2022] Open
Abstract
Clustering is carried out to identify patterns in transcriptomics profiles to determine clinically relevant subgroups of patients. Feature (gene) selection is a critical and an integral part of the process. Currently, there are many feature selection and clustering methods to identify the relevant genes and perform clustering of samples. However, choosing an appropriate methodology is difficult. In addition, extensive feature selection methods have not been supported by the available packages. Hence, we developed an integrative R-package called multiClust that allows researchers to experiment with the choice of combination of methods for gene selection and clustering with ease. Using multiClust, we identified the best performing clustering methodology in the context of clinical outcome. Our observations demonstrate that simple methods such as variance-based ranking perform well on the majority of data sets, provided that the appropriate number of genes is selected. However, different gene ranking and selection methods remain relevant as no methodology works for all studies.
Affiliation(s)
- Nathan Lawlor
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Alec Fabbri
  - Department of Biomedical Engineering, University of Connecticut, Storrs, CT, USA
- Peiyong Guan
  - Genome Institute of Singapore, A*STAR (Agency for Science, Technology and Research), Singapore
  - School of Computer Science and Engineering, Nanyang Technological University, Singapore
- Joshy George
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA

13. Network regularised Cox regression and multiplex network models to predict disease comorbidities and survival of cancer. Comput Biol Chem 2015; 59 Pt B:15-31. [DOI: 10.1016/j.compbiolchem.2015.08.010] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Received: 06/13/2015] [Revised: 08/21/2015] [Accepted: 08/25/2015] [Indexed: 12/17/2022]

14. Yildirim P. Filter Based Feature Selection Methods for Prediction of Risks in Hepatitis Disease. Int J Mach Learn Comput 2015. [DOI: 10.7763/ijmlc.2015.v5.517] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Indexed: 11/01/2022]

15. A comparative analysis of swarm intelligence techniques for feature selection in cancer classification. ScientificWorldJournal 2014; 2014:693831. [PMID: 25157377; PMCID: PMC4137534; DOI: 10.1155/2014/693831] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 03/20/2014] [Accepted: 06/18/2014] [Indexed: 11/17/2022] Open
Abstract
Feature selection in cancer classification is a central area of research in the field of bioinformatics and used to select the informative genes from thousands of genes of the microarray. The genes are ranked based on T-statistics, signal-to-noise ratio (SNR), and F-test values. The swarm intelligence (SI) technique finds the informative genes from the top-m ranked genes. These selected genes are used for classification. In this paper the shuffled frog leaping with Lévy flight (SFLLF) is proposed for feature selection. In SFLLF, the Lévy flight is included to avoid premature convergence of shuffled frog leaping (SFL) algorithm. The SI techniques such as particle swarm optimization (PSO), cuckoo search (CS), SFL, and SFLLF are used for feature selection which identifies informative genes for classification. The k-nearest neighbour (k-NN) technique is used to classify the samples. The proposed work is applied on 10 different benchmark datasets and examined with SI techniques. The experimental results show that the results obtained from k-NN classifier through SFLLF feature selection method outperform PSO, CS, and SFL.
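The filter stage in this abstract, which ranks genes and passes the top-m to the swarm search, can be sketched in Python as follows. The abstract does not give the SNR formula, so this sketch assumes the definition commonly used for two-class microarray data: the difference of the class means divided by the sum of the class standard deviations. The function names and the toy expression values are hypothetical, and the swarm intelligence stage itself is not shown.

```python
from statistics import mean, stdev


def signal_to_noise(class_a, class_b):
    """SNR of one gene: difference of class means over the sum of class standard deviations."""
    return (mean(class_a) - mean(class_b)) / (stdev(class_a) + stdev(class_b))


def top_m_genes(expression, m):
    """Rank genes by |SNR|, largest first, and return the first m names plus all scores."""
    scores = {gene: signal_to_noise(a, b) for gene, (a, b) in expression.items()}
    ranked = sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)
    return ranked[:m], scores


# Toy data (hypothetical): per gene, expression values in class A and in class B.
expression = {
    "gene_a": ([5.0, 6.0, 7.0], [1.0, 2.0, 3.0]),
    "gene_b": ([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]),
    "gene_c": ([4.0, 6.0, 8.0], [3.0, 5.0, 7.0]),
}

selected, scores = top_m_genes(expression, m=2)
print(selected, scores)
```

Ranking on the absolute value treats genes that are higher in either class as equally informative; only the reduced top-m set would then be searched by the optimizer.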

16. Mandal M, Mukhopadhyay A. A graph-theoretic approach for identifying non-redundant and relevant gene markers from microarray data using multiobjective binary PSO. PLoS One 2014; 9:e90949. [PMID: 24625895; PMCID: PMC3953335; DOI: 10.1371/journal.pone.0090949] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.0] [Received: 08/20/2013] [Accepted: 02/05/2014] [Indexed: 11/18/2022] Open
Abstract
The purpose of feature selection is to identify the relevant and non-redundant features from a dataset. In this article, the feature selection problem is organized as a graph-theoretic problem where a feature-dissimilarity graph is shaped from the data matrix. The nodes represent features and the edges represent their dissimilarity. Both nodes and edges are given weight according to the feature's relevance and dissimilarity among the features, respectively. The problem of finding relevant and non-redundant features is then mapped into densest subgraph finding problem. We have proposed a multiobjective particle swarm optimization (PSO)-based algorithm that optimizes average node-weight and average edge-weight of the candidate subgraph simultaneously. The proposed algorithm is applied for identifying relevant and non-redundant disease-related genes from microarray gene expression data. The performance of the proposed method is compared with that of several other existing feature selection techniques on different real-life microarray gene expression datasets.
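The two objectives named in this abstract, the average node-weight and the average edge-weight of a candidate subgraph, can be illustrated with a small Python sketch. It assumes a complete feature-dissimilarity graph and averages the edge weights over all pairs of selected features. The function names, the Pareto-dominance helper, and the toy weights are hypothetical, and the multiobjective binary PSO that would search over subsets is not shown.

```python
from itertools import combinations


def subgraph_objectives(selected, node_weight, edge_weight):
    """Return (average node weight, average edge weight) of the selected features."""
    nodes = sorted(selected)
    avg_node = sum(node_weight[n] for n in nodes) / len(nodes)
    pairs = list(combinations(nodes, 2))
    # A single-node subgraph has no edges, so its edge objective is set to 0.
    avg_edge = sum(edge_weight[frozenset(p)] for p in pairs) / len(pairs) if pairs else 0.0
    return avg_node, avg_edge


def dominates(a, b):
    """True if objective pair a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


# Toy graph (hypothetical): node weight = relevance, edge weight = dissimilarity.
relevance = {"g1": 0.9, "g2": 0.8, "g3": 0.3}
dissimilarity = {
    frozenset(("g1", "g2")): 0.1,  # g1 and g2 are highly redundant
    frozenset(("g1", "g3")): 0.7,
    frozenset(("g2", "g3")): 0.6,
}

redundant_pair = subgraph_objectives({"g1", "g2"}, relevance, dissimilarity)
diverse_pair = subgraph_objectives({"g1", "g3"}, relevance, dissimilarity)
print(redundant_pair, diverse_pair)
print(dominates(redundant_pair, diverse_pair), dominates(diverse_pair, redundant_pair))
```

In this toy case neither subset dominates the other: one is more relevant, the other less redundant. This kind of trade-off is why the problem is treated as multiobjective instead of being collapsed into a single score.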
Affiliation(s)
- Monalisa Mandal
- Department of Computer Science and Engineering, University of Kalyani, Kalyani, West Bengal, India
- Anirban Mukhopadhyay
- Department of Computer Science and Engineering, University of Kalyani, Kalyani, West Bengal, India
|
17
|
Prasartvit T, Banharnsakun A, Kaewkamnerdpong B, Achalakul T. Reducing bioinformatics data dimension with ABC-kNN. Neurocomputing 2013. [DOI: 10.1016/j.neucom.2012.01.045] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Indexed: 11/26/2022]
|
18
|
An Improved Minimum Redundancy Maximum Relevance Approach for Feature Selection in Gene Expression Data. Procedia Technology 2013. [DOI: 10.1016/j.protcy.2013.12.332] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Indexed: 11/21/2022]
|
19
|
Zhou J, Wu D, Liu X, Yuan S, Yang X, Wang X. Translational medicine as a permanent glue and force of clinical medicine and public health: perspectives (1) from 2012 Sino-American symposium on clinical and translational medicine. Clin Transl Med 2012; 1:21. [PMID: 23369646 PMCID: PMC3560983 DOI: 10.1186/2001-1326-1-21] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 07/25/2012] [Accepted: 08/21/2012] [Indexed: 12/01/2022] Open
Abstract
Health systems globally face challenges and opportunities in balancing quality, access, and cost, where clinical and translational medicine (CTM) should play a more important and powerful role in the identification, development, and validation of solutions and strategies. Strategic collaboration can gather global strengths and resources and improve health systems, care delivery, regulations, and policies. CTM-driven innovation and development has the potential to achieve step-change improvements across three dimensions. Thus, we have reason to believe that CTM will play an even greater role in the development of new diagnostics, therapies, healthcare, and policies, and that SAS-CTM will become an increasingly important platform for following the latest developments in CTM internationally and exploring new opportunities for international collaboration.
Affiliation(s)
- Jiebai Zhou
- Department of Pulmonary Medicine, Fudan University School of Medicine, Zhongshan Hospital, Shanghai, China.
|
20
|
Wu X, Chen H, Wang X. Can lung cancer stem cells be targeted for therapies? Cancer Treat Rev 2012; 38:580-8. [DOI: 10.1016/j.ctrv.2012.02.013] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.9] [Received: 12/27/2011] [Revised: 02/26/2012] [Accepted: 02/28/2012] [Indexed: 12/26/2022]
|
21
|
Liu KQ, Liu ZP, Hao JK, Chen L, Zhao XM. Identifying dysregulated pathways in cancers from pathway interaction networks. BMC Bioinformatics 2012; 13:126. [PMID: 22676414 PMCID: PMC3443452 DOI: 10.1186/1471-2105-13-126] [Citation(s) in RCA: 100] [Impact Index Per Article: 8.3] [Received: 11/20/2011] [Accepted: 05/21/2012] [Indexed: 12/04/2022] Open
Abstract
Background: Cancers, a group of multifactorial complex diseases, are generally caused by mutation of multiple genes or dysregulation of pathways. Identifying biomarkers that can characterize cancers would help to understand and diagnose cancers. Traditional computational methods that detect genes differentially expressed between cancer and normal samples fail to work because of small sample sizes and the assumption of independence among genes. Genes, however, work in concert to perform their functions. Therefore, dysregulated pathways are expected to serve as better biomarkers than single genes. Results: In this paper, we propose a novel approach to identify dysregulated pathways in cancer based on a pathway interaction network. Our contribution is three-fold. Firstly, we present a new method to construct a pathway interaction network based on gene expression, protein-protein interactions, and cellular pathways. Secondly, the identification of dysregulated pathways in cancer is treated as a feature selection problem, which is biologically reasonable and easy to interpret. Thirdly, the dysregulated pathways are identified as subnetworks of the pathway interaction network, where the subnetworks characterize the functional dependency or crosstalk between pathways very well. The benchmarking results on several distinct cancer datasets demonstrate that our method obtains more reliable and accurate results than existing state-of-the-art methods. Further functional analysis and independent literature evidence also confirm that the potential pathogenic pathways we identified are biologically reasonable, indicating the effectiveness of our method. Conclusions: Dysregulated pathways can serve as better biomarkers than single genes. In this work, by utilizing pathway interaction networks and gene expression data, we propose a novel approach that effectively identifies dysregulated pathways, which can not only be used as biomarkers to diagnose cancers but also serve as potential drug targets in the future.
Affiliation(s)
- Ke-Qin Liu
- Institute of Systems Biology, Shanghai University, Shanghai 200444, China
|
22
|
Zheng Y, Bai C, Wang X. Potential significance of telocytes in the pathogenesis of lung diseases. Expert Rev Respir Med 2012; 6:45-9. [PMID: 22283578 DOI: 10.1586/ers.11.91] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.0] [Indexed: 12/16/2022]
Abstract
Multiple cells play critical roles in the pathogenesis of lung diseases, even though the exact mechanisms are still not clear. Telocytes are characterized by telopodes, which are thin and long prolongations, and a small amount of cytoplasm rich with mitochondria, as shown by immune-positive staining against CD34, c-kit and vimentin. Telocytes have been found in many organs of mammals, including the trachea and lung. This report summarizes the latest findings associated with telocytes, with a special focus on the lung, and demonstrates that telocytes exist in the smooth muscle layer under cartilage and bronchiole in the lung, and also in the interstitial space of alveoli. Telocytes have a mediate connection with epithelial cells and direct connection with smooth muscle cells both in blood vessels and bronchiole in the lung. Telocytes also have a close relationship with other cell types, such as immune cells and stem cells. Telopodes appear with dichotomous branching and alternation of podom and podomer, forming a 3D network structure with complex homo- and hetero-cellular junctions. All characteristics of telocytes in lung tissue indicate that telocytes may play a potential, but important, role in the pathogenesis of lung diseases.
Affiliation(s)
- Yonghua Zheng
- Zhongshan Hospital, Fudan University, Shanghai, China
|
23
|
Yang WH, Gu HB, Chen B, Li J, Fan QW, Yuan YF, Wang X. Evaluation of SLOG/TCI-III pediatric system on target control infusion of propofol. J Transl Med 2011; 9:187. [PMID: 22044738 PMCID: PMC3221635 DOI: 10.1186/1479-5876-9-187] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2011] [Accepted: 11/01/2011] [Indexed: 11/24/2022]
Abstract
Background: The target-controlled infusion-III (SLOG/TCI-III) system was derived from a model built on the local pediatric population for target-controlled infusion of propofol. Methods: The current study aimed to evaluate the performance of the SLOG/TCI-III system in children by comparing target and measured concentrations of propofol. Thirty children with American Society of Anesthesiologists (ASA) physical status I-II were enrolled in the study. The target plasma concentration of propofol was fed into the SLOG/TCI-III system and compared with the measured concentrations of propofol. Blood samples were collected and analyzed by high-performance liquid chromatography with a fluorescence detector. The performance error (PE) was determined for each measured blood propofol concentration. The performance of the TCI-III system was characterized by the median performance error (MDPE), the median absolute performance error (MDAPE), and Wobble (the median absolute deviation of each PE from the MDPE). Results: Measured concentration showed a good linear correlation with target concentration: measured concentration = 1.3428 × target concentration - 0.2633 (r = 0.8667). The MDPE and MDAPE of the pediatric system were 10% and 22%, respectively, and the median value for Wobble was 24%. MDPE and MDAPE were less than 15% and 30%, respectively. Conclusions: The performance of the TCI-III system appears to be within the accepted limits for clinical practice in children.
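The three performance measures named in this abstract can be computed as in the sketch below. It assumes the conventional definition of the performance error, PE = 100 × (measured - target) / target, which the abstract does not spell out, and it treats all samples passed in as belonging to one subject. The function name `tci_performance` is hypothetical.

```python
from statistics import median

def tci_performance(measured, target):
    """Return (MDPE, MDAPE, Wobble) in percent for paired concentrations.

    measured: measured blood concentrations
    target:   corresponding target concentrations (same length, non-zero)
    """
    # Performance error per sample, in percent of the target (assumed definition).
    pe = [100.0 * (m - t) / t for m, t in zip(measured, target)]
    mdpe = median(pe)                           # bias
    mdape = median(abs(e) for e in pe)          # inaccuracy
    wobble = median(abs(e - mdpe) for e in pe)  # variability around the bias
    return mdpe, mdape, wobble
```

Studies of this kind usually compute these values per subject and then summarize across subjects; how the authors aggregated over the thirty children is not stated in the abstract.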
|