1
Nourian R, Motamedi SA, Pourfard M. BHBA-GRNet: Cancer detection through improved gene expression profiling using Binary Honey Badger Algorithm and Gene Residual-based Network. Comput Biol Med 2025; 184:109348. [PMID: 39615230 DOI: 10.1016/j.compbiomed.2024.109348] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/14/2024] [Revised: 10/29/2024] [Accepted: 10/30/2024] [Indexed: 12/22/2024]
Abstract
Cancer, a pervasive and devastating disease, remains a leading global cause of mortality, emphasizing the growing urgency for effective detection methods. Gene Expression Microarray (GEM) data has emerged as a crucial tool in this context, offering insights into early cancer detection and treatment. While deep learning methods offer promise in detecting various cancers through GEM analysis, they suffer from high dimensionality inherent in gene sequences, preventing optimal detection performance across diverse cancer types. Additionally, existing methods often resort to synthetic features and data augmentation to enhance performance. To address these challenges and enhance accuracy, a novel Binary Honey Badger Algorithm (BHBA) integrated with the Gene Residual Network (GRNet) method has been proposed. Our approach capitalizes on BHBA's feature reduction mechanism, eliminating the need for additional preprocessing steps. Comprehensive evaluations on three well-established datasets representing lung and blood-type cancers demonstrate that our method reduces GEM data size by approximately 40 % and achieves a superior accuracy improvement of around 1 % in lung cancer types compared to state-of-the-art methods.
Affiliation(s)
- Reza Nourian
  - Electrical Engineering Department, Amirkabir University of Technology, No. 350, Hafez Ave, Valiasr Square, 15875-4413, Tehran, 159163-4311, Iran.
- Seyed Ahmad Motamedi
  - Electrical Engineering Department, Amirkabir University of Technology, No. 350, Hafez Ave, Valiasr Square, 15875-4413, Tehran, 159163-4311, Iran.
- Mohammadreza Pourfard
  - Electrical Engineering Department, Amirkabir University of Technology, No. 350, Hafez Ave, Valiasr Square, 15875-4413, Tehran, 159163-4311, Iran.

2
Rao J, Wang X, Wang Z. Integration of Microarray Data and Single-Cell Sequencing Analysis to Explore Key Genes Associated with Macrophage Infiltration in Heart Failure. J Inflamm Res 2024; 17:11257-11274. [PMID: 39717663 PMCID: PMC11665153 DOI: 10.2147/jir.s475633] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/26/2024] [Accepted: 12/14/2024] [Indexed: 12/25/2024] Open
Abstract
Background Cardiac macrophages are a heterogeneous population with high plasticity and adaptability, and their mechanisms in heart failure (HF) remain poorly elucidated. Methods We used single-cell and bulk RNA sequencing data to reveal the heterogeneity of non-cardiomyocytes and assess the immunoreactivity of each subpopulation. Additionally, we employed four integrated machine learning algorithms to identify macrophage-related genes with diagnostic value, and in vivo validation was performed. To assess the immune infiltration characteristics in HF, we utilized the CIBERSORT and single sample gene set enrichment analysis (ssGSEA). An unsupervised consensus clustering algorithm was applied to identify the macrophage-related HF subtypes. Furthermore, the scMetabolism was employed to explore the specific metabolic patterns of the macrophage subtypes. Finally, CellChat was used to investigate cell-cell interactions among the identified subtypes. Results The immunoreactivity score of macrophages in the HF was higher than that in the other cell types. GSEA of macrophage clusters indicated a significant enrichment of leukocyte-mediated immune processes, antigen processing, and presentation. The intersection of the results from machine learning revealed that SERPINA3, GPAT3, ANPEP, and FCER1G can serve as feature genes and form a diagnostic model with a good predictive capability. Unsupervised consensus clustering algorithms reveal the immune and metabolic subtypes of macrophages. The metabolic heterogeneity of macrophage subpopulations can lead to macrophage polarization into different types, which may be related to the metabolic reprogramming between glycolysis and mitochondrial oxidative phosphorylation. Cellular communication revealed that macrophages form a network of interactions with neutrophils to support each other's functions and maintenance. The complex efferent and afferent signals are closely associated with myocardial fibrosis. 
Conclusion SERPINA3, GPAT3, ANPEP, and FCER1G can potentially serve as immune therapeutic targets and central biomarkers. The immunological and metabolic heterogeneity of macrophages may offer a more precise direction to explore the mechanisms underlying HF and novel immunotherapies.
Affiliation(s)
- Jin Rao
  - Department of Cardiothoracic Surgery, Changzheng Hospital, Naval Medical University, Shanghai, People’s Republic of China
- Xuefu Wang
  - School of Health Science and Engineering, University of Shanghai for Science and Technology, Shanghai, People’s Republic of China
- Zhinong Wang
  - Department of Cardiothoracic Surgery, Changzheng Hospital, Naval Medical University, Shanghai, People’s Republic of China

3
Djordjilović V, Ponzi E, Nøst TH, Thoresen M. penalizedclr: an R package for penalized conditional logistic regression for integration of multiple omics layers. BMC Bioinformatics 2024; 25:226. [PMID: 38937668 PMCID: PMC11212437 DOI: 10.1186/s12859-024-05850-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Accepted: 06/20/2024] [Indexed: 06/29/2024] Open
Abstract
BACKGROUND The matched case-control design, up until recently mostly pertinent to epidemiological studies, is becoming customary in biomedical applications as well. For instance, in omics studies, it is quite common to compare cancer and healthy tissue from the same patient. Furthermore, researchers today routinely collect data from various and variable sources that they wish to relate to the case-control status. This highlights the need to develop and implement statistical methods that can take these tendencies into account. RESULTS We present an R package penalizedclr, that provides an implementation of the penalized conditional logistic regression model for analyzing matched case-control studies. It allows for different penalties for different blocks of covariates, and it is therefore particularly useful in the presence of multi-source omics data. Both L1 and L2 penalties are implemented. Additionally, the package implements stability selection for variable selection in the considered regression model. CONCLUSIONS The proposed method fills a gap in the available software for fitting high-dimensional conditional logistic regression models accounting for the matched design and block structure of predictors/features. The output consists of a set of selected variables that are significantly associated with case-control status. These variables can then be investigated in terms of functional interpretation or validation in further, more targeted studies.
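The stability selection idea mentioned in this abstract can be illustrated with a small sketch. It is written in Python and is not the penalizedclr API (an R package). As a loudly labeled assumption, the base selector below is a simple ranking by mean within-pair difference, swapped in for the penalized conditional logistic regression; the function names, the toy data, and the 0.6 threshold are all hypothetical.

```python
import random


def top_k_by_pair_difference(pairs, k):
    """Stand-in selector (assumption): rank features by the absolute mean
    case-minus-control difference within matched pairs, keep the top k."""
    n_features = len(pairs[0][0])
    scores = []
    for j in range(n_features):
        mean_diff = sum(case[j] - ctrl[j] for case, ctrl in pairs) / len(pairs)
        scores.append((abs(mean_diff), j))
    # Highest score first; ties broken by feature index for determinism.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return {j for _, j in scores[:k]}


def stability_selection(pairs, k=1, n_subsamples=50, threshold=0.6, seed=0):
    """Run the selector on random halves of the matched pairs and keep the
    features whose selection frequency reaches the threshold."""
    rng = random.Random(seed)
    n_pairs = len(pairs)
    counts = [0] * len(pairs[0][0])
    for _ in range(n_subsamples):
        # Subsample whole pairs so the matched design is preserved.
        idx = rng.sample(range(n_pairs), n_pairs // 2)
        for j in top_k_by_pair_difference([pairs[i] for i in idx], k):
            counts[j] += 1
    freqs = [c / n_subsamples for c in counts]
    selected = [j for j, f in enumerate(freqs) if f >= threshold]
    return selected, freqs


# Hypothetical toy data: feature 0 differs by 10 in every pair,
# feature 1 never differs, feature 2 differs only slightly.
pairs = [
    ([10.0 + i, 0.5, float(i % 3)], [0.0 + i, 0.5, float((i + 1) % 3)])
    for i in range(20)
]
selected, freqs = stability_selection(pairs)
print(selected, freqs)
```

Subsampling whole pairs rather than individual samples keeps each case with its matched control, which is the property a conditional model relies on.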
Affiliation(s)
- Vera Djordjilović
  - Department of Economics, Ca' Foscari University of Venice, Venice, Italy.
  - Department of Biostatistics, University of Oslo, Oslo, Norway.
- Erica Ponzi
  - Department of Biostatistics, University of Oslo, Oslo, Norway
- Therese Haugdahl Nøst
  - Department of Public Health and Nursing, Norwegian University of Science and Technology, Trondheim, Norway
  - Department of Community Medicine, Faculty of Health Sciences, The Arctic University of Norway, Tromsø, Norway
- Magne Thoresen
  - Department of Biostatistics, University of Oslo, Oslo, Norway

4
Teghipco A, Newman-Norlund R, Gibson M, Bonilha L, Absher J, Fridriksson J, Rorden C. Stable multivariate lesion symptom mapping. APERTURE NEURO 2024; 4:10.52294/001c.117311. [PMID: 39364269 PMCID: PMC11449259 DOI: 10.52294/001c.117311] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 10/05/2024]
Abstract
Multivariate lesion-symptom mapping (MLSM) considers lesion information across the entire brain to predict impairments. The strength of this approach is also its weakness: considering many brain features together synergistically can uncover complex brain-behavior relationships but exposes a high-dimensional feature space that a model is expected to learn. Successfully distinguishing between features in this landscape can be difficult for models, particularly in the presence of irrelevant or redundant features. Here, we propose stable multivariate lesion-symptom mapping (sMLSM), which integrates the identification of reliable features with stability selection into conventional MLSM, and describe our open-source MATLAB implementation. Usage is showcased with our publicly available dataset of chronic stroke survivors (N = 167) and further validated in our independent public acute stroke dataset (N = 1106). We demonstrate that sMLSM eliminates inconsistent features highlighted by MLSM, reduces variation in feature weights, enables the model to learn more complex patterns of brain damage, and improves model accuracy for predicting aphasia severity in a way that tends to be robust regarding the choice of parameters for identifying reliable features. Critically, sMLSM more consistently outperforms predictions based on lesion size alone. This advantage is evident starting at modest sample sizes (N > 75). Spatial distribution of feature importance is different in sMLSM, which highlights the features identified by univariate lesion symptom mapping while also implicating select regions emphasized by MLSM. Beyond improved prediction accuracy, sMLSM can offer deeper insight into reliable biomarkers of impairment, informing our understanding of neurobiology.
Affiliation(s)
- Alex Teghipco
  - Communication Sciences & Disorders, University of South Carolina
- Leonardo Bonilha
  - Communication Sciences & Disorders, University of South Carolina
  - Neurology, University of South Carolina School of Medicine
- John Absher
  - Neurology, University of South Carolina School of Medicine
  - School of Health Research, Clemson University
  - Medicine, Neurosurgery and Radiology, Prisma Health

5
Tu D, Xu Q, Zuo X, Ma C. Uncovering hub genes and immunological characteristics for heart failure utilizing RRA, WGCNA and Machine learning. IJC HEART & VASCULATURE 2024; 51:101335. [PMID: 38371312 PMCID: PMC10869931 DOI: 10.1016/j.ijcha.2024.101335] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/18/2023] [Revised: 12/24/2023] [Accepted: 01/02/2024] [Indexed: 02/20/2024]
Abstract
Background Heart failure (HF) is a major public health issue with high mortality and morbidity. This study aimed to find potential diagnostic markers for HF by the combination of bioinformatics analysis and machine learning, as well as analyze the role of immune infiltration in the pathological process of HF. Methods The gene expression profiles of 124 HF patients and 135 nonfailing donors (NFDs) were obtained from six datasets in the NCBI Gene Expression Omnibus (GEO) public database. We applied robust rank aggregation (RRA) and weighted gene co-expression network analysis (WGCNA) method to identify critical genes in HF. To discover novel diagnostic markers in HF, three machine learning methods were employed, including best subset regression, regularization technique, and support vector machine-recursive feature elimination (SVM-RFE). Besides, immune infiltration was investigated in HF by single-sample gene set enrichment analysis (ssGSEA). Results Combining RRA with WGCNA method, we recognized 39 critical genes associated with HF. Through integrating three machine learning methods, FCN3 and SMOC2 were determined as novel diagnostic markers in HF. Differences in immune infiltration signature were also found between HF patients and NFDs. Moreover, we explored the potential associations between two diagnostic markers and immune response in the pathogenesis of HF. Conclusions In summary, FCN3 and SMOC2 can be used as diagnostic markers of HF, and immune infiltration plays an important role in the initiation and progression of HF.
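Integrating several feature selection methods, as described in this abstract, is commonly done by keeping only the features that every method selects. The sketch below shows that intersection step in Python. It is an illustration under assumptions, not the authors' pipeline: the gene identifiers are placeholders and the per-method lists are invented for the example, not results from the study.

```python
def consensus_features(selections):
    """Return the features chosen by every method, sorted for stable output."""
    sets = [set(chosen) for chosen in selections.values()]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


# Hypothetical per-method selections; GENE_A ... GENE_E are placeholders.
selections = {
    "best_subset_regression": ["GENE_A", "GENE_B", "GENE_C"],
    "regularization": ["GENE_B", "GENE_C", "GENE_D"],
    "svm_rfe": ["GENE_C", "GENE_B", "GENE_E"],
}
markers = consensus_features(selections)
print(markers)
```

Taking the intersection is deliberately conservative: a feature survives only if all methods agree, which trades sensitivity for robustness to any single method's quirks.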
Affiliation(s)
- Dingyuan Tu
  - Cardiovascular Research Institute and Department of Cardiology, General Hospital of Northern Theater Command, State Key Laboratory of Frigid Zone Cardiovascular Diseases (SKLFZCD), Shenyang, 110000 Liaoning, China
  - Department of Cardiology, The 961st Hospital of Joint Logistic Support Force of PLA, 71 Youzheng Road, Qiqihar, 161000 Heilongjiang, China
- Qiang Xu
  - Department of Cardiology, Navy 905 Hospital, Naval Medical University, 1328 Huashan Road, Changning District, Shanghai 200052, China
- Xiaoli Zuo
  - Department of Cardiology, The 961st Hospital of Joint Logistic Support Force of PLA, 71 Youzheng Road, Qiqihar, 161000 Heilongjiang, China
- Chaoqun Ma
  - Cardiovascular Research Institute and Department of Cardiology, General Hospital of Northern Theater Command, State Key Laboratory of Frigid Zone Cardiovascular Diseases (SKLFZCD), Shenyang, 110000 Liaoning, China

6
Wang X, Rao J, Zhang L, Liu X, Zhang Y. Identification of circadian rhythm-related gene classification patterns and immune infiltration analysis in heart failure based on machine learning. Heliyon 2024; 10:e27049. [PMID: 38509983 PMCID: PMC10950509 DOI: 10.1016/j.heliyon.2024.e27049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2023] [Revised: 12/17/2023] [Accepted: 02/22/2024] [Indexed: 03/22/2024] Open
Abstract
Background Circadian rhythms play a key role in the failing heart, but the exact molecular mechanisms linking changes in the expression of circadian rhythm-related genes to heart failure (HF) remain unclear. Methods By intersecting differentially expressed genes (DEGs) between normal and HF samples in the Gene Expression Omnibus (GEO) database with circadian rhythm-related genes (CRGs), differentially expressed circadian rhythm-related genes (DE-CRGs) were obtained. Machine learning algorithms were used to screen for feature genes, and diagnostic models were constructed based on these feature genes. Subsequently, consensus clustering algorithms and non-negative matrix factorization (NMF) algorithms were used for clustering analysis of HF samples. On this basis, immune infiltration analysis was used to score the immune infiltration status between HF and normal samples as well as among different subclusters. Gene Set Variation Analysis (GSVA) evaluated the biological functional differences among subclusters. Results 13 CRGs showed differential expression between HF patients and normal samples. Nine feature genes were obtained through cross-referencing results from four distinct machine learning algorithms. Multivariate LASSO regression and external dataset validation were performed to select five key genes with diagnostic value, including NAMPT, SERPINA3, MAPK10, NPPA, and SLC2A1. Moreover, consensus clustering analysis could divide HF patients into two distinct clusters, which exhibited different biological functions and immune characteristics. Additionally, two subgroups were distinguished using the NMF algorithm based on circadian rhythm associated differentially expressed genes. Studies on immune infiltration showed marked variances in levels of immune infiltration between these subgroups. Subgroup A had higher immune scores and more widespread immune infiltration. 
Finally, the Weighted Gene Co-expression Network Analysis (WGCNA) method was utilized to discern the modules that had the closest association with the two observed subgroups, and hub genes were pinpointed via protein-protein interaction (PPI) networks. GRIN2A, DLG1, ERBB4, LRRC7, and NRG1 were circadian rhythm-related hub genes closely associated with HF. Conclusion This study provides valuable references for further elucidating the pathogenesis of HF and offers beneficial insights for targeting circadian rhythm mechanisms to regulate immune responses and energy metabolism in HF treatment. Five genes identified by us as diagnostic features could be potential targets for therapy for HF.
Affiliation(s)
- Xuefu Wang
  - School of Health Science and Engineering, University of Shanghai for Science and Technology, Shanghai, China
- Jin Rao
  - Department of Cardiothoracic Surgery, Shanghai Changzheng Hospital, Naval Medical University, Shanghai, China
- Li Zhang
  - Guangxi University, Nanning, China
- Yufeng Zhang
  - School of Health Science and Engineering, University of Shanghai for Science and Technology, Shanghai, China
  - Department of Cardiothoracic Surgery, Shanghai Changzheng Hospital, Naval Medical University, Shanghai, China

7
Tu D, Xu Q, Luan Y, Sun J, Zuo X, Ma C. Integrative analysis of bioinformatics and machine learning to identify cuprotosis-related biomarkers and immunological characteristics in heart failure. Front Cardiovasc Med 2024; 11:1349363. [PMID: 38562184 PMCID: PMC10982316 DOI: 10.3389/fcvm.2024.1349363] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/04/2023] [Accepted: 03/07/2024] [Indexed: 04/04/2024] Open
Abstract
Backgrounds Cuprotosis is a newly discovered programmed cell death by modulating tricarboxylic acid cycle. Emerging evidence showed that cuprotosis-related genes (CRGs) are implicated in the occurrence and progression of multiple diseases. However, the mechanism of cuprotosis in heart failure (HF) has not been investigated yet. Methods The HF microarray datasets GSE16499, GSE26887, GSE42955, GSE57338, GSE76701, and GSE79962 were downloaded from the Gene Expression Omnibus (GEO) database to identify differentially expressed CRGs between HF patients and nonfailing donors (NFDs). Four machine learning models were used to identify key CRGs features for HF diagnosis. The expression profiles of key CRGs were further validated in a merged GEO external validation dataset and human samples through quantitative reverse-transcription polymerase chain reaction (qRT-PCR). In addition, Gene Ontology (GO) function enrichment, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment, and immune infiltration analysis were used to investigate potential biological functions of key CRGs. Results We discovered nine differentially expressed CRGs in heart tissues from HF patients and NFDs. With the aid of four machine learning algorithms, we identified three indicators of cuprotosis (DLAT, SLC31A1, and DLST) in HF, which showed good diagnostic properties. In addition, their differential expression between HF patients and NFDs was confirmed through qRT-PCR. Moreover, the results of enrichment analyses and immune infiltration exhibited that these diagnostic markers of CRGs were strongly correlated to energy metabolism and immune activity. Conclusions Our study discovered that cuprotosis was strongly related to the pathogenesis of HF, probably by regulating energy metabolism-associated and immune-associated signaling pathways.
Affiliation(s)
- Dingyuan Tu
  - Cardiovascular Research Institute and Department of Cardiology, General Hospital of Northern Theater Command, State Key Laboratory of Frigid Zone Cardiovascular Diseases (SKLFZCD), Shenyang, Liaoning, China
  - Department of Cardiology, The 961st Hospital of PLA Joint Logistic Support Force, Qiqihar, Heilongjiang, China
- Qiang Xu
  - Department of Cardiology, Changhai Hospital, Naval Medical University, Shanghai, China
  - Department of Cardiology, Navy 905 Hospital, Naval Medical University, Shanghai, China
- Yanmin Luan
  - Reproductive Medicine Center, Changhai Hospital, Naval Medical University, Shanghai, China
- Jie Sun
  - Hospital-Acquired Infection Control Department, Yantai Ludong Hospital, Yantai, Shandong, China
- Xiaoli Zuo
  - Department of Cardiology, The 961st Hospital of PLA Joint Logistic Support Force, Qiqihar, Heilongjiang, China
- Chaoqun Ma
  - Cardiovascular Research Institute and Department of Cardiology, General Hospital of Northern Theater Command, State Key Laboratory of Frigid Zone Cardiovascular Diseases (SKLFZCD), Shenyang, Liaoning, China

8
Bailey R, Sarkar A, Singh A, Dobra A, Kahveci T. Optimal Supervised Reduction of High Dimensional Transcription Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:3093-3105. [PMID: 37276117 DOI: 10.1109/tcbb.2023.3280557] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
The plight of navigating high-dimensional transcription datasets remains a persistent problem. This problem is further amplified for complex disorders, such as cancer, as these disorders are often multigenic traits with multiple subsets of genes collectively affecting the type, stage, and severity of the trait. We are often faced with a trade-off between reducing the dimensionality of our datasets and maintaining the integrity of our data. To accomplish both tasks simultaneously for very high dimensional transcriptome data for complex multigenic traits, we propose a new supervised technique, Class Separation Transformation (CST). CST accomplishes both tasks simultaneously by significantly reducing the dimensionality of the input space into a one-dimensional transformed space that provides optimal separation between the differing classes. Furthermore, CST offers a means of explainable ML, as it computes the relative importance of each feature for its contribution to class distinction, which can thus lead to deeper insights and discovery. We compare our method with existing state-of-the-art methods using both real and synthetic datasets, demonstrating that CST is the more accurate, robust, scalable, and computationally advantageous technique relative to existing methods. Code used in this paper is available on https://github.com/richiebailey74/CST.

9
Afrash MR, Mirbagheri E, Mashoufi M, Kazemi-Arpanahi H. Optimizing prognostic factors of five-year survival in gastric cancer patients using feature selection techniques with machine learning algorithms: a comparative study. BMC Med Inform Decis Mak 2023; 23:54. [PMID: 37024885 PMCID: PMC10080884 DOI: 10.1186/s12911-023-02154-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2022] [Accepted: 03/15/2023] [Indexed: 04/08/2023] Open
Abstract
BACKGROUND Gastric cancer is the most common malignant tumor worldwide and a leading cause of cancer deaths. This neoplasm has a poor prognosis and heterogeneous outcomes. Survivability prediction may help select the best treatment plan based on an individual's prognosis. Numerous clinical and pathological features are generally used in predicting gastric cancer survival, and their influence on the survival of this cancer has not been fully elucidated. Moreover, the five-year survivability prognosis performances of feature selection methods with machine learning (ML) classifiers for gastric cancer have not been fully benchmarked. Therefore, we adopted several well-known feature selection methods and ML classifiers together to determine the best-paired feature selection-classifier for this purpose. METHODS This was a retrospective study on a dataset of 974 patients diagnosed with gastric cancer in the Ayatollah Talleghani Hospital, Abadan, Iran. First, four feature selection algorithms, including Relief, Boruta, least absolute shrinkage and selection operator (LASSO), and minimum redundancy maximum relevance (mRMR) were used to select a set of relevant features that are very informative for five-year survival prediction in gastric cancer patients. Then, each feature set was fed to three classifiers: XG Boost (XGB), hist gradient boosting (HGB), and support vector machine (SVM) to develop predictive models. Finally, paired feature selection-classifier methods were evaluated to select the best-paired method using the area under the curve (AUC), accuracy, sensitivity, specificity, and f1-score metrics. RESULTS The LASSO feature selection algorithm combined with the XG Boost classifier achieved an accuracy of 89.10%, a specificity of 87.15%, a sensitivity of 89.42%, an AUC of 89.37%, and an f1-score of 90.8%. 
Tumor stage, history of other cancers, lymphatic invasion, tumor site, type of treatment, body weight, histological type, and addiction were identified as the most significant factors affecting gastric cancer survival. CONCLUSIONS This study proved the worth of the paired feature selection-classifier to identify the best path that could augment the five-year survival prediction in gastric cancer patients. Our results were better than those of previous studies, both in terms of the time required to form the models and the performance measurement criteria of the algorithms. These findings may be very promising and can, therefore, inform clinical decision-making and shed light on future studies.
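The threshold-based evaluation metrics named in this abstract (accuracy, sensitivity, specificity, and f1-score) all derive from the four cells of a binary confusion matrix. A minimal Python sketch of those formulas follows; the counts in the example are hypothetical and are not taken from the study, and AUC is omitted because it needs ranked prediction scores rather than counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (both classes present and at
    least one positive prediction)."""
    sensitivity = tp / (tp + fn)   # recall / true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=80, fp=10, tn=90, fn=20)
print(metrics)
```

Because f1 combines precision and sensitivity, it can exceed accuracy when the positive class dominates, which is one reason the individual metrics are usually reported side by side.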
Affiliation(s)
- Mohammad Reza Afrash
  - Department of Artificial Intelligence, Smart University of Medical Sciences, Tehran, Iran
- Esmat Mirbagheri
  - Department of Health Information Management, School of Health Management and Information Sciences, Iran University of Medical Sciences, Tehran, Iran
- Mehrnaz Mashoufi
  - Department of Health Information Management, Ardabil University of Medical Sciences, Ardabil, Iran
- Hadi Kazemi-Arpanahi
  - Department of Health Information Technology, Abadan University of Medical Sciences, Abadan, Iran.

10
Zhang L. A Feature Selection Method Using Conditional Correlation Dispersion and Redundancy Analysis. Neural Process Lett 2023. [DOI: 10.1007/s11063-023-11256-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/07/2023]

11
Ma C, Tu D, Xu Q, Wu Y, Song X, Guo Z, Zhao X. Identification of m7G regulator-mediated RNA methylation modification patterns and related immune microenvironment regulation characteristics in heart failure. Clin Epigenetics 2023; 15:22. [PMID: 36782329 PMCID: PMC9926673 DOI: 10.1186/s13148-023-01439-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2022] [Accepted: 02/05/2023] [Indexed: 02/15/2023] Open
Abstract
BACKGROUND N7-methylguanosine (m7G) modification has been reported to regulate RNA expression in multiple pathophysiological processes. However, little is known about its role and association with immune microenvironment in heart failure (HF). RESULTS One hundred twenty-four HF patients and 135 nonfailing donors (NFDs) from six microarray datasets in the gene expression omnibus (GEO) database were included to evaluate the expression profiles of m7G regulators. Results revealed that 14 m7G regulators were differentially expressed in heart tissues from HF patients and NFDs. Furthermore, a five-gene m7G regulator diagnostic signature, NUDT16, NUDT4, CYFIP1, LARP1, and DCP2, which can easily distinguish HF patients and NFDs, was established by cross-combination of three machine learning methods, including best subset regression, regularization techniques, and random forest algorithm. The diagnostic value of five-gene m7G regulator signature was further validated in human samples through quantitative reverse-transcription polymerase chain reaction (qRT-PCR). In addition, consensus clustering algorithms were used to categorize HF patients into distinct molecular subtypes. We identified two distinct m7G subtypes of HF with unique m7G modification pattern, functional enrichment, and immune characteristics. Additionally, two gene subgroups based on m7G subtype-related genes were further discovered. Single-sample gene-set enrichment analysis (ssGSEA) was utilized to assess the alterations of immune microenvironment. Finally, utilizing protein-protein interaction network and weighted gene co-expression network analysis (WGCNA), we identified UQCRC1, NDUFB6, and NDUFA13 as m7G methylation-associated hub genes with significant clinical relevance to cardiac functions. CONCLUSIONS Our study discovered for the first time that m7G RNA modification and immune microenvironment are closely correlated in HF development. 
A five-gene m7G regulator diagnostic signature for HF (NUDT16, NUDT4, CYFIP1, LARP1, and DCP2) and three m7G methylation-associated hub genes (UQCRC1, NDUFB6, and NDUFA13) were identified, providing new insights into the underlying mechanisms and effective treatments of HF.
Affiliation(s)
- Chaoqun Ma
  - Cardiovascular Research Institute and Department of Cardiology, General Hospital of Northern Theater Command, Shenyang, 110000, Liaoning, China
- Dingyuan Tu
  - Cardiovascular Research Institute and Department of Cardiology, General Hospital of Northern Theater Command, Shenyang, 110000, Liaoning, China
  - Department of Cardiology, Changhai Hospital, Naval Medical University, 168 Changhai Rd, Shanghai, 200433, China
- Qiang Xu
  - Department of Cardiology, Navy 905 Hospital, Naval Medical University, Shanghai, 200052, China
- Yan Wu
  - Department of Cardiology, Navy 905 Hospital, Naval Medical University, Shanghai, 200052, China
- Xiaowei Song
  - Department of Cardiology, Changhai Hospital, Naval Medical University, 168 Changhai Rd, Shanghai, 200433, China.
- Zhifu Guo
  - Department of Cardiology, Changhai Hospital, Naval Medical University, 168 Changhai Rd, Shanghai, 200433, China.
- Xianxian Zhao
  - Department of Cardiology, Changhai Hospital, Naval Medical University, 168 Changhai Rd, Shanghai, 200433, China.

12
Han K, Wang J, Wang Y, Zhang L, Yu M, Xie F, Zheng D, Xu Y, Ding Y, Wan J. A review of methods for predicting DNA N6-methyladenine sites. Brief Bioinform 2023; 24:6887111. [PMID: 36502371 DOI: 10.1093/bib/bbac514] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 08/18/2022] [Revised: 10/07/2022] [Accepted: 10/27/2022] [Indexed: 12/14/2022] Open
Abstract
Deoxyribonucleic acid (DNA) N6-methyladenine (6mA) plays a vital role in various biological processes, and the accurate identification of its sites can provide a more comprehensive understanding of its biological effects. There are several methods for 6mA site prediction. With the continuous development of technology, traditional techniques with high costs and low efficiencies are gradually being replaced by computer methods. Computer methods that are widely used can be divided into two categories: traditional machine learning and deep learning methods. We first list some existing experimental methods for predicting the 6mA site, then analyze the general process from sequence input to results in computer methods and review existing model architectures. Finally, the results were summarized and compared to facilitate subsequent researchers in choosing the most suitable method for their work.
Collapse
Affiliation(s)
- Ke Han
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China.,College of Pharmacy, Harbin University of Commerce, Harbin, 150076, China
| | - Jianchun Wang
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Yu Wang
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Lei Zhang
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Mengyao Yu
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Fang Xie
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Dequan Zheng
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Yaoqun Xu
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, 324000, China
| | - Jie Wan
- Laboratory for Space Environment and Physical Sciences, Harbin Institute of Technology, Harbin, 150001, China
|
13
|
Gerolami J, Wong JJM, Zhang R, Chen T, Imtiaz T, Smith M, Jamaspishvili T, Koti M, Glasgow JI, Mousavi P, Renwick N, Tyryshkin K. A Computational Approach to Identification of Candidate Biomarkers in High-Dimensional Molecular Data. Diagnostics (Basel) 2022; 12:diagnostics12081997. [PMID: 36010347 PMCID: PMC9407361 DOI: 10.3390/diagnostics12081997] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 07/05/2022] [Revised: 08/16/2022] [Accepted: 08/17/2022] [Indexed: 12/13/2022] Open
Abstract
Complex high-dimensional datasets that are challenging to analyze are frequently produced through ‘-omics’ profiling. Typically, these datasets contain more genomic features than samples, limiting the use of multivariable statistical and machine learning-based approaches to analysis. Therefore, effective alternative approaches are urgently needed to identify features-of-interest in ‘-omics’ data. In this study, we present the molecular feature selection tool, a novel, ensemble-based, feature selection application for identifying candidate biomarkers in ‘-omics’ data. As proof-of-principle, we applied the molecular feature selection tool to identify a small set of immune-related genes as potential biomarkers of three prostate adenocarcinoma subtypes. Furthermore, we tested the selected genes in a model to classify the three subtypes and compared the results to models built using all genes and all differentially expressed genes. Genes identified with the molecular feature selection tool performed better than the other models in this study in all comparison metrics: accuracy, precision, recall, and F1-score using a significantly smaller set of genes. In addition, we developed a simple graphical user interface for the molecular feature selection tool, which is available for free download. This user-friendly interface is a valuable tool for the identification of potential biomarkers in gene expression datasets and is an asset for biomarker discovery studies.
Affiliation(s)
- Justin Gerolami
- School of Computing, Queen’s University, Kingston, ON K7L 3N6, Canada
| | - Justin Jong Mun Wong
- Department of Pathology and Molecular Medicine, Queen’s University, Kingston, ON K7L 3N6, Canada
| | - Ricky Zhang
- School of Computing, Queen’s University, Kingston, ON K7L 3N6, Canada
| | - Tong Chen
- School of Computing, Queen’s University, Kingston, ON K7L 3N6, Canada
| | - Tashifa Imtiaz
- Department of Pathology and Molecular Medicine, Queen’s University, Kingston, ON K7L 3N6, Canada
| | - Miranda Smith
- School of Computing, Queen’s University, Kingston, ON K7L 3N6, Canada
| | - Tamara Jamaspishvili
- Department of Pathology and Molecular Medicine, Queen’s University, Kingston, ON K7L 3N6, Canada
- Department of Pathology & Laboratory Medicine, SUNY Upstate Medical University, Syracuse, NY 13210, USA
| | - Madhuri Koti
- Department of Biomedical and Molecular Sciences, Queen’s University, Kingston, ON K7L 3N6, Canada
| | | | - Parvin Mousavi
- School of Computing, Queen’s University, Kingston, ON K7L 3N6, Canada
| | - Neil Renwick
- Department of Pathology and Molecular Medicine, Queen’s University, Kingston, ON K7L 3N6, Canada
| | - Kathrin Tyryshkin
- School of Computing, Queen’s University, Kingston, ON K7L 3N6, Canada
- Department of Pathology and Molecular Medicine, Queen’s University, Kingston, ON K7L 3N6, Canada
- Correspondence: Tel.: +1-613-533-2345
|
14
|
Li F, Zhou Y, Zhang Y, Yin J, Qiu Y, Gao J, Zhu F. POSREG: proteomic signature discovered by simultaneously optimizing its reproducibility and generalizability. Brief Bioinform 2022; 23:6532538. [PMID: 35183059 DOI: 10.1093/bib/bbac040] [Citation(s) in RCA: 91] [Impact Index Per Article: 45.5] [Received: 11/30/2021] [Revised: 01/21/2022] [Accepted: 01/27/2022] [Indexed: 12/17/2022] Open
Abstract
Mass spectrometry-based proteomics has become indispensable in the current exploration of complex and dynamic biological processes. Instrument development has largely ensured the effective production of proteomic data, which necessitates commensurate advances in the statistical frameworks used to discover the optimal proteomic signature. Current frameworks mainly emphasize the generalizability of the identified signature in predicting independent data but neglect the reproducibility among signatures identified from independently repeated trials on different sub-datasets. These problems seriously restrict the wide application of proteomic techniques in molecular biology and related fields. Thus, it is crucial to enable the generalizable and reproducible discovery of proteomic signatures, with subsequent indication of phenotype association. However, no such tool is yet available. Herein, an online tool, POSREG, was constructed to identify the optimal signature for a set of proteomic data. It works by (i) identifying proteomic signatures of good reproducibility and aggregating them into an ensemble feature ranking by ensemble learning, (ii) assessing the generalizability of the ensemble feature ranking to acquire the optimal signature and (iii) indicating the phenotype association of the discovered signature. POSREG is unique in its capacity to discover the proteomic signature by simultaneously optimizing its reproducibility and generalizability. It is accessible free of charge, without any registration or login requirement, at https://idrblab.org/posreg/.
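As a rough illustration of step (i), the sketch below aggregates feature rankings from repeated trials into a single ensemble ranking by mean rank position. The abstract does not state how POSREG aggregates rankings, so the mean-rank rule and the function name `aggregate_rankings` are assumptions made for this example only.

```python
from collections import defaultdict


def aggregate_rankings(rankings):
    """Combine several feature rankings (best feature first) by mean rank.

    Hypothetical helper: assumes every ranking contains the same features.
    Mean-rank aggregation is an illustrative choice, not POSREG's documented rule.
    """
    totals = defaultdict(float)
    for ranking in rankings:
        for position, feature in enumerate(ranking):
            totals[feature] += position
    n_trials = len(rankings)
    # Sort by mean position; ties are broken alphabetically for determinism.
    return sorted(totals, key=lambda f: (totals[f] / n_trials, f))


# Rankings from three repeated trials on different sub-datasets (toy data).
trials = [["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]]
ensemble_ranking = aggregate_rankings(trials)
```

A feature that ranks consistently high across trials ends up near the top, which is one simple way to reward reproducibility before generalizability is assessed in step (ii).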
Affiliation(s)
- Fengcheng Li
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Ying Zhou
- State Key Laboratory for Diagnosis and Treatment of Infectious Disease, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, Zhejiang Provincial Key Laboratory for Drug Clinical Research and Evaluation, The First Affiliated Hospital, Zhejiang University, Hangzhou, Zhejiang 310000, China
| | - Ying Zhang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Jiayi Yin
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Yunqing Qiu
- State Key Laboratory for Diagnosis and Treatment of Infectious Disease, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, Zhejiang Provincial Key Laboratory for Drug Clinical Research and Evaluation, The First Affiliated Hospital, Zhejiang University, Hangzhou, Zhejiang 310000, China
| | - Jianqing Gao
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang, China
| | - Feng Zhu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
|
15
|
Alhenawi E, Al-Sayyed R, Hudaib A, Mirjalili S. Feature selection methods on gene expression microarray data for cancer classification: A systematic review. Comput Biol Med 2022; 140:105051. [PMID: 34839186 DOI: 10.1016/j.compbiomed.2021.105051] [Citation(s) in RCA: 37] [Impact Index Per Article: 18.5] [Received: 07/09/2021] [Revised: 11/01/2021] [Accepted: 11/15/2021] [Indexed: 11/29/2022]
Abstract
This systematic review provides researchers interested in feature selection (FS) for microarray data with comprehensive information about the main research directions in gene expression classification over the past seven years. A set of 132 studies published by three different publishers is reviewed. The papers are categorized into nine directions based on their objectives, and the FS directions that received various levels of attention are then summarized. The review revealed that 'propose hybrid FS methods' was the most prominent research direction, with a share of 34.9%, while the other directions ranged from 13.6% down to 3%. This can guide researchers in selecting the most competitive research direction. Papers in each category are thoroughly reviewed from six perspectives: method(s), classifier(s), dataset(s), dataset dimension(s) range, performance metric(s), and result(s) achieved.
Affiliation(s)
- Esra'a Alhenawi
- King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan.
| | - Rizik Al-Sayyed
- King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan.
| | - Amjad Hudaib
- King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan.
| | - Seyedali Mirjalili
- Center for Artificial Intelligence Research and Optimization, Torrens University Australia, Fortitude Valley, Brisbane, 4006, QLD, Australia; Yonsei Frontier Lab, Yonsei University, Seoul, South Korea.
|
16
|
Wu X, Cheng Q. Algorithmic Stability and Generalization of an Unsupervised Feature Selection Algorithm. Advances in Neural Information Processing Systems 2021; 34:19860-19875. [PMID: 36187051 PMCID: PMC9524443] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
Feature selection, as a vital dimension reduction technique, reduces data dimension by identifying an essential subset of input features, which can facilitate interpretable insights into learning and inference processes. Algorithmic stability is a key characteristic of an algorithm regarding its sensitivity to perturbations of input samples. In this paper, we propose an innovative unsupervised feature selection algorithm attaining this stability with provable guarantees. The architecture of our algorithm consists of a feature scorer and a feature selector. The scorer trains a neural network (NN) to globally score all the features, and the selector adopts a dependent sub-NN to locally evaluate the representation abilities for selecting features. Further, we present algorithmic stability analysis and show that our algorithm has a performance guarantee via a generalization error bound. Extensive experimental results on real-world datasets demonstrate superior generalization performance of our proposed algorithm to strong baseline methods. Also, the properties revealed by our theoretical analysis and the stability of our algorithm-selected features are empirically confirmed.
|
17
|
Feature selection in a neighborhood decision information system with application to single cell RNA data classification. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2021.107876] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 02/04/2023]
|
18
|
Solorio-Fernández S, Carrasco-Ochoa JA, Martínez-Trinidad JF. A survey on feature selection methods for mixed data. Artif Intell Rev 2021. [DOI: 10.1007/s10462-021-10072-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/20/2022]
|
19
|
Agrawal S, Sisodia DS, Nagwani NK. Augmented sequence features and subcellular localization for functional characterization of unknown protein sequences. Med Biol Eng Comput 2021; 59:2297-2310. [PMID: 34545514 DOI: 10.1007/s11517-021-02436-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/08/2020] [Accepted: 08/29/2021] [Indexed: 11/24/2022]
Abstract
Advances in high-throughput techniques have produced a large number of unknown protein sequences (UPS). Functional characterization of UPS is significant for the investigation of disease symptoms and drug repositioning, and protein subcellular localization is imperative for this characterization. Diverse techniques are used for feature extraction from protein sequences. However, a single feature extraction technique often leads to poor prediction performance. In this paper, two feature augmentations are described that combine sequence-induced, physicochemical, and evolutionary information of the amino acid residues. The augmented features preserve sequence-order information and protein-residue properties. Two bacterial protein datasets, Gram-positive (G+) and Gram-negative (G-), are used for the experimental work. After essential preprocessing of the protein datasets, two sets of feature vectors are obtained. These feature vectors are used separately to train individual and ensemble classifiers, namely decision tree (C4.5), k-nearest neighbor (k-NN), multi-layer perceptron (MLP), Naïve Bayes (NB), support vector machine (SVM), AdaBoost, gradient boosting machine (GBM), and random forest (RF), with fivefold cross-validation. The results show that C4.5 reports the highest overall accuracy on known protein sequences: 99.57% on the G+ dataset and 97.47% on the G- dataset. For the UPS, the overall accuracy is 85.17% on the G+ dataset with SVM and 82.45% on the G- dataset with MLP.
Affiliation(s)
- Saurabh Agrawal
- Department of Computer Science & Engineering, National Institute of Technology Raipur, GE Road, Raipur, Chhattisgarh, 492010, India.
| | - Dilip Singh Sisodia
- Department of Computer Science & Engineering, National Institute of Technology Raipur, GE Road, Raipur, Chhattisgarh, 492010, India
| | - Naresh Kumar Nagwani
- Department of Computer Science & Engineering, National Institute of Technology Raipur, GE Road, Raipur, Chhattisgarh, 492010, India
|
20
|
Del Giudice M, Peirone S, Perrone S, Priante F, Varese F, Tirtei E, Fagioli F, Cereda M. Artificial Intelligence in Bulk and Single-Cell RNA-Sequencing Data to Foster Precision Oncology. Int J Mol Sci 2021; 22:ijms22094563. [PMID: 33925407 PMCID: PMC8123853 DOI: 10.3390/ijms22094563] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 03/20/2021] [Revised: 04/21/2021] [Accepted: 04/23/2021] [Indexed: 02/01/2023] Open
Abstract
Artificial intelligence, the discipline of developing computational algorithms able to perform tasks that require human intelligence, offers the opportunity to improve our concept and delivery of precision medicine. Here, we provide an overview of artificial intelligence approaches for the analysis of large-scale RNA-sequencing datasets in cancer. We present the major solutions to disentangle inter- and intra-tumor heterogeneity of transcriptome profiles for an effective improvement of patient management. We outline the contributions of learning algorithms to the needs of cancer genomics, from identifying rare cancer subtypes to personalizing therapeutic treatments.
Affiliation(s)
- Marco Del Giudice
- Cancer Genomics and Bioinformatics Unit, IIGM—Italian Institute for Genomic Medicine, c/o IRCCS, Str. Prov.le 142, km 3.95, 10060 Candiolo, TO, Italy
- Candiolo Cancer Institute, FPO—IRCCS, Str. Prov.le 142, km 3.95, 10060 Candiolo, TO, Italy
| | - Serena Peirone
- Cancer Genomics and Bioinformatics Unit, IIGM—Italian Institute for Genomic Medicine, c/o IRCCS, Str. Prov.le 142, km 3.95, 10060 Candiolo, TO, Italy
- Department of Physics and INFN, Università degli Studi di Torino, via P.Giuria 1, 10125 Turin, Italy
| | - Sarah Perrone
- Cancer Genomics and Bioinformatics Unit, IIGM—Italian Institute for Genomic Medicine, c/o IRCCS, Str. Prov.le 142, km 3.95, 10060 Candiolo, TO, Italy
- Department of Physics, Università degli Studi di Torino, via P.Giuria 1, 10125 Turin, Italy
| | - Francesca Priante
- Cancer Genomics and Bioinformatics Unit, IIGM—Italian Institute for Genomic Medicine, c/o IRCCS, Str. Prov.le 142, km 3.95, 10060 Candiolo, TO, Italy
- Department of Physics, Università degli Studi di Torino, via P.Giuria 1, 10125 Turin, Italy
| | - Fabiola Varese
- Cancer Genomics and Bioinformatics Unit, IIGM—Italian Institute for Genomic Medicine, c/o IRCCS, Str. Prov.le 142, km 3.95, 10060 Candiolo, TO, Italy
- Department of Life Science and System Biology, Università degli Studi di Torino, via Accademia Albertina 13, 10123 Turin, Italy
| | - Elisa Tirtei
- Paediatric Onco-Haematology Division, Regina Margherita Children’s Hospital, City of Health and Science of Turin, 10126 Turin, Italy
| | - Franca Fagioli
- Paediatric Onco-Haematology Division, Regina Margherita Children’s Hospital, City of Health and Science of Turin, 10126 Turin, Italy
- Department of Public Health and Paediatric Sciences, University of Torino, 10124 Turin, Italy
| | - Matteo Cereda
- Cancer Genomics and Bioinformatics Unit, IIGM—Italian Institute for Genomic Medicine, c/o IRCCS, Str. Prov.le 142, km 3.95, 10060 Candiolo, TO, Italy
- Candiolo Cancer Institute, FPO—IRCCS, Str. Prov.le 142, km 3.95, 10060 Candiolo, TO, Italy
- Correspondence: Tel.: +39-011-993-3969
|
21
|
Abstract
Biology has become a data-driven science, largely due to technological advances that have generated large volumes of data. Extracting meaningful information from these data sets requires sophisticated modeling approaches, and artificial neural network (ANN) based modeling is playing an increasingly important role. However, the "black box" nature of ANNs is a barrier to biological interpretation of the model. Here, the basic steps toward building models for biological systems and interpreting them using a calliper randomization approach to capture complex information are described.
|
22
|
Wang C, Wu J, Xu L, Zou Q. NonClasGP-Pred: robust and efficient prediction of non-classically secreted proteins by integrating subset-specific optimal models of imbalanced data. Microb Genom 2020; 6:mgen000483. [PMID: 33245691 PMCID: PMC8116686 DOI: 10.1099/mgen.0.000483] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/14/2020] [Accepted: 11/06/2020] [Indexed: 01/01/2023] Open
Abstract
Non-classically secreted proteins (NCSPs) are proteins that are located in the extracellular environment despite lacking known signal peptides or secretion motifs. They usually perform different biological functions in intracellular and extracellular environments, and several of these functions are linked to bacterial virulence and cell defence. Accurate protein localization is essential for all living organisms; however, the performance of existing methods for NCSP identification has been unsatisfactory, and in particular they suffer from data deficiency and possible overfitting. Further improvement is desirable, especially to address the lack of informative features and to mine subset-specific features in imbalanced datasets. In the present study, a new computational predictor was developed for NCSP prediction in gram-positive bacteria. First, to address the possible prediction bias caused by data imbalance, ten balanced subdatasets were generated for ensemble model construction. Second, the F-score algorithm combined with sequential forward search was used to strengthen the feature representation ability for each of the training subdatasets. Third, a subset-specific optimal feature combination process was adopted to characterize the original data from different aspects, and all subdataset-based models were integrated into a unified model, NonClasGP-Pred, which achieved an accuracy of 93.23%, a sensitivity of 100%, a specificity of 89.01%, a Matthews correlation coefficient of 87.68% and an area under the curve of 0.9975 in ten-fold cross-validation. On the independent test dataset, the proposed model outperformed state-of-the-art available toolkits. For availability and implementation, see: http://lab.malab.cn/~wangchao/softwares/NonClasGP/.
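The first step, building balanced subdatasets for an ensemble, might look like the following minimal sketch. The abstract does not say how the ten subsets were drawn, so random undersampling of the majority class, the majority-vote combiner, and the names `make_balanced_subsets` and `majority_vote` are all assumptions for illustration rather than the authors' procedure.

```python
import random


def make_balanced_subsets(minority, majority, n_subsets=10, seed=0):
    """Pair all minority samples with an equal-sized random draw of the majority.

    Hypothetical helper: requires len(majority) >= len(minority), otherwise
    random.sample raises ValueError.
    """
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_subsets):
        sampled = rng.sample(majority, len(minority))  # no repeats within a subset
        subsets.append(list(minority) + sampled)
    return subsets


def majority_vote(predictions):
    """Combine 0/1 predictions from the per-subset models; ties go to 0."""
    return int(sum(predictions) * 2 > len(predictions))


# Toy data: 5 positive (minority) and 50 negative (majority) sample identifiers.
positives = list(range(5))
negatives = list(range(100, 150))
subsets = make_balanced_subsets(positives, negatives)
```

Each subset would then train its own model, and the per-subset predictions can be merged, for example with `majority_vote`.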
Affiliation(s)
- Chao Wang
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, PR China
| | - Jin Wu
- School of Management, Shenzhen Polytechnic, Shenzhen, PR China
| | - Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, PR China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, PR China
- Hainan Key Laboratory for Computational Science and Application, Hainan Normal University, Haikou, PR China
|
23
|
Abstract
Background:
Thermophilic proteins can maintain good activity at high temperature; studying them is therefore important for understanding the thermal stability of proteins.
Objective:
To address the low precision and low efficiency of thermophilic protein prediction, a prediction method based on feature fusion and machine learning is proposed in this paper.
Methods:
For the selected thermophilic data sets, the thermophilic protein sequences were first characterized by feature fusion, combining g-gap dipeptide, entropy density and autocorrelation coefficient features. Then, Kernel Principal Component Analysis (KPCA) was used to reduce the dimension of the resulting sequence features in order to reduce training time and improve efficiency. Finally, the classification model was built using a classification algorithm.
Results:
A variety of classification algorithms were trained and tested on the selected thermophilic dataset. By comparison, the accuracy of the Support Vector Machine (SVM) under the jackknife method was over 92%. Other evaluation indicators also showed that the SVM performed best.
Conclusion:
Owing to the choice of an effective feature representation method and a robust classifier, the proposed method is suitable for predicting thermophilic proteins and is superior to most reported methods.
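For readers unfamiliar with the g-gap dipeptide descriptor named in the Methods, a minimal sketch follows. It counts residue pairs separated by g intervening positions and normalizes to frequencies over the 400 possible pairs. The function name, the default g, and the decision to skip pairs containing non-standard residues are assumptions for this example; the abstract does not give the authors' exact settings.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def g_gap_dipeptide(sequence, g=1):
    """Return a 400-dimensional g-gap dipeptide frequency vector.

    A pair is (sequence[i], sequence[i + g + 1]), so g=0 gives adjacent
    dipeptides. Pairs with non-standard residues are skipped (an assumption).
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = 0
    for i in range(len(sequence) - g - 1):
        pair = sequence[i] + sequence[i + g + 1]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:  # sequence too short or no valid pairs
        return [0.0] * len(pairs)
    return [counts[p] / total for p in pairs]


# "ACAC" with g=1 yields the pairs (A, A) and (C, C).
features = g_gap_dipeptide("ACAC", g=1)
```

Vectors like this would be concatenated with the other descriptors before dimensionality reduction.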
Affiliation(s)
- Xian-Fang Wang
- School of Computer and Information Engineering, Henan Normal University, Henan, China
| | - Peng Gao
- School of Computer and Information Engineering, Henan Normal University, Henan, China
| | - Yi-Feng Liu
- School of Computer and Information Engineering, Henan Normal University, Henan, China
| | - Hong-Fei Li
- School of Computer and Information Engineering, Henan Normal University, Henan, China
| | - Fan Lu
- School of Computer and Information Engineering, Henan Normal University, Henan, China
|
24
|
Vijayasarveswari V, Andrew AM, Jusoh M, Sabapathy T, Raof RAA, Yasin MNM, Ahmad RB, Khatun S, Rahim HA. Multi-stage feature selection (MSFS) algorithm for UWB-based early breast cancer size prediction. PLoS One 2020; 15:e0229367. [PMID: 32790672 PMCID: PMC7425918 DOI: 10.1371/journal.pone.0229367] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 01/30/2020] [Accepted: 07/28/2020] [Indexed: 12/14/2022] Open
Abstract
Breast cancer is the most common cancer among women and one of the main causes of death for women worldwide. Early detection is crucial for optimal medical treatment of breast cancer. This paper proposes a multi-stage feature selection method that extracts statistically significant features for breast cancer size detection using proposed data normalization techniques. Ultra-wideband (UWB) signals, controlled using a microcontroller, are transmitted via an antenna from one end of the breast phantom and received at the other end. These ultra-wideband analogue signals are represented in both the time and frequency domains. The preprocessed digital data is passed to the proposed multi-stage feature selection algorithm. This algorithm has four stages: data normalization, feature extraction, data dimensionality reduction and feature fusion. The output data is fused to form the proposed datasets, namely the 8-HybridFeature, 9-HybridFeature and 10-HybridFeature datasets. The classification performance of these datasets is tested using the Support Vector Machine, Probabilistic Neural Network and Naïve Bayes classifiers for breast cancer size classification. The research findings indicate that the 8-HybridFeature dataset performs better than the other two datasets. For the 8-HybridFeature dataset, the Naïve Bayes classifier (91.98%) outperformed the Support Vector Machine (90.44%) and Probabilistic Neural Network (80.05%) classifiers in terms of classification accuracy. The finalized method is tested and visualized in a MATLAB-based 2D and 3D environment.
Affiliation(s)
- V. Vijayasarveswari
- Advanced Communication Engineering (ACE) Centre of Excellence, Universiti Malaysia Perlis, Kangar, Perlis, West Malaysia
| | - A. M. Andrew
- Advanced Communication Engineering (ACE) Centre of Excellence, Universiti Malaysia Perlis, Kangar, Perlis, West Malaysia
| | - M. Jusoh
- Advanced Communication Engineering (ACE) Centre of Excellence, Universiti Malaysia Perlis, Kangar, Perlis, West Malaysia
| | - T. Sabapathy
- Advanced Communication Engineering (ACE) Centre of Excellence, Universiti Malaysia Perlis, Kangar, Perlis, West Malaysia
| | - R. A. A. Raof
- Advanced Communication Engineering (ACE) Centre of Excellence, Universiti Malaysia Perlis, Kangar, Perlis, West Malaysia
| | - M. N. M. Yasin
- Advanced Communication Engineering (ACE) Centre of Excellence, Universiti Malaysia Perlis, Kangar, Perlis, West Malaysia
| | - R. B. Ahmad
- Advanced Communication Engineering (ACE) Centre of Excellence, Universiti Malaysia Perlis, Kangar, Perlis, West Malaysia
| | - S. Khatun
- Faculty of Electrical & Electronic Engineering, Universiti Malaysia Pahang, Pekan, Pahang
| | - H. A. Rahim
- Advanced Communication Engineering (ACE) Centre of Excellence, Universiti Malaysia Perlis, Kangar, Perlis, West Malaysia
|
25
|
Zeng R, Liao M. Developing a Multi-Layer Deep Learning Based Predictive Model to Identify DNA N4-Methylcytosine Modifications. Front Bioeng Biotechnol 2020; 8:274. [PMID: 32373597 PMCID: PMC7186498 DOI: 10.3389/fbioe.2020.00274] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 01/15/2020] [Accepted: 03/16/2020] [Indexed: 12/21/2022] Open
Abstract
DNA N4-methylcytosine modification (4mC) plays an essential role in a variety of biological processes. Therefore, accurate identification of the 4mC distribution at genome scale is important for systematically understanding its biological functions. In this study, we present Deep4mcPred, a multi-layer deep learning based predictive model to identify DNA N4-methylcytosine modifications. In this predictor, we integrate, for the first time, a residual network and a recurrent neural network to build a multi-layer deep learning predictive system. Compared to existing predictors using traditional machine learning, our proposed method has two advantages. First, our deep learning framework does not need features to be specified when training the predictive model: it automatically learns high-level features and captures the characteristic specificity of 4mC sites, which helps distinguish true 4mC sites from non-4mC sites. Second, our deep learning method outperforms traditional machine learning predictors in a benchmarking comparison, demonstrating that Deep4mcPred is more effective for DNA 4mC site prediction. Moreover, via experimental comparison, we found that the attention mechanism introduced into the deep learning framework is useful for capturing critical features. Additionally, we developed a webserver implementing the proposed method for academic use by the research community, which is available at http://server.malab.cn/Deep4mcPred.
Affiliation(s)
- Rao Zeng
- Department of Software Engineering, School of Informatics, Xiamen University, Xiamen, China
| | - Minghong Liao
- Department of Software Engineering, School of Informatics, Xiamen University, Xiamen, China
|
26
|
Kalina J, Matonoha C. A sparse pair-preserving centroid-based supervised learning method for high-dimensional biomedical data or images. Biocybern Biomed Eng 2020. [DOI: 10.1016/j.bbe.2020.03.008] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 11/25/2022]
|
27
|
|
28
|
Cai J, Wang D, Chen R, Niu Y, Ye X, Su R, Xiao G, Wei L. A Bioinformatics Tool for the Prediction of DNA N6-Methyladenine Modifications Based on Feature Fusion and Optimization Protocol. Front Bioeng Biotechnol 2020; 8:502. [PMID: 32582654 PMCID: PMC7287168 DOI: 10.3389/fbioe.2020.00502] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 02/17/2020] [Accepted: 04/29/2020] [Indexed: 01/04/2023] Open
Abstract
DNA N6-methyladenine (6mA) is closely involved in various biological processes. Identifying the distribution of 6mA modifications at genome scale is of great significance for an in-depth understanding of their functions. In recent years, various experimental and computational methods have been proposed for this purpose. Unfortunately, existing methods cannot provide accurate and fast 6mA prediction. In this study, we present 6mAPred-FO, a bioinformatics tool that enables researchers to make predictions based on sequences only. To sufficiently capture the characteristics of 6mA sites, we integrate sequence-order information with nucleotide positional specificity information for feature encoding, and further improve the feature representation capacity with an analysis of variance (ANOVA)-based feature optimization protocol. The experimental results show that this feature protocol significantly improves predictive performance. Through further feature analysis, we found that the sequence-order information and positional specificity information are complementary, contributing to the performance improvement. The improvement is also due to the feature optimization protocol, which effectively captures the most informative features from the original feature space. Moreover, benchmarking comparison results demonstrate that 6mAPred-FO outperforms several existing predictors. Finally, we established a web server that implements the proposed method for researchers' convenience, which is currently available at http://server.malab.cn/6mAPred-FO.
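As a rough illustration of what an analysis of variance-based feature ranking can look like, the sketch below scores each feature by its one-way ANOVA F statistic and sorts features by that score. It is not the 6mAPred-FO implementation: the function names `anova_f_score` and `rank_features` are hypothetical, and the abstract does not say how the ranked features are subsequently filtered.

```python
from statistics import mean


def anova_f_score(values, labels):
    """One-way ANOVA F statistic of a single feature across class labels.

    Assumes at least two classes and more samples than classes.
    """
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand = mean(values)
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_features(X, y):
    """Return (feature indices sorted by descending F-score, raw scores)."""
    n_features = len(X[0])
    scores = [anova_f_score([row[j] for row in X], y) for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


# Toy data (illustrative only): feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0], [1.2, 1.0], [3.0, 5.2], [3.1, 0.9]]
y = [0, 0, 1, 1]
order, scores = rank_features(X, y)
print(order)
```

A feature whose class means differ strongly relative to its within-class spread receives a high F-score, so the ranking puts the most class-discriminative features first.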
Affiliation(s)
- Jianhua Cai
- Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, College of Computer and Control Engineering, Minjiang University, Fuzhou, China
- College of Mathematics and Computer Science, Fuzhou University, Fuzhou, China
| | - Donghua Wang
- Department of General Surgery, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
| | - Riqing Chen
- College of Computer and Information Sciences, Fujian Agriculture and Forestry University, Fuzhou, China
| | - Yuzhen Niu
- Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, College of Computer and Control Engineering, Minjiang University, Fuzhou, China
| | - Xiucai Ye
- Department of Computer Science, University of Tsukuba, Tsukuba, Japan
| | - Ran Su
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Guobao Xiao
- Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, College of Computer and Control Engineering, Minjiang University, Fuzhou, China
- *Correspondence: Guobao Xiao
| | - Leyi Wei
- Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, College of Computer and Control Engineering, Minjiang University, Fuzhou, China
- School of Software, Shandong University, Jinan, China
- Leyi Wei
| |
|
29
|
Krzhizhanovskaya VV, Závodszky G, Lees MH, Dongarra JJ, Sloot PMA, Brissos S, Teixeira J. Analysis of Ensemble Feature Selection for Correlated High-Dimensional RNA-Seq Cancer Data. LECTURE NOTES IN COMPUTER SCIENCE 2020. [PMCID: PMC7304026 DOI: 10.1007/978-3-030-50420-5_39] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022]
Abstract
Discovery of diagnostic and prognostic molecular markers is an important and actively pursued field in cancer research. For complex diseases, this process is often performed using machine learning. The current study compares two approaches to the discovery of relevant variables: applying a single feature selection algorithm versus applying an ensemble of diverse algorithms. These approaches are used to identify variables that are relevant for discerning four cancer types using RNA-seq profiles from The Cancer Genome Atlas. The comparison is carried out in two directions: evaluating the predictive performance of models and monitoring the stability of the selected variables. The most informative features are identified using four feature selection algorithms, namely the U-test, ReliefF, and two variants of the MDFS algorithm. Normal and tumor tissues are discerned using the Random Forest algorithm. The highest stability of the feature set was obtained when the U-test was used. Unfortunately, models built on feature sets obtained from the ensemble of feature selection algorithms were no better than models built on feature sets obtained from individual algorithms. On the other hand, the feature selectors leading to the best classification results varied between data sets.
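The two quantities this abstract revolves around, a U-test ranking of features and the stability of the selected feature set, can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's pipeline: the helper names are hypothetical, features are scored by how far the Mann-Whitney U statistic lies from its null expectation rather than by a p-value, and the Jaccard index is assumed as the stability measure, which the abstract does not specify.

```python
def mann_whitney_u(a, b):
    """U statistic of sample a versus sample b, using midranks for ties."""
    pooled = sorted((v, i) for i, v in enumerate(list(a) + list(b)))
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[pooled[k][1]] = midrank
        i = j + 1
    rank_sum_a = sum(ranks[: len(a)])
    return rank_sum_a - len(a) * (len(a) + 1) / 2


def separation(a, b):
    """Distance of U from its null expectation, scaled to the range [0, 1]."""
    half = len(a) * len(b) / 2
    return abs(mann_whitney_u(a, b) - half) / half


def top_k_features(X, y, k):
    """Indices of the k features that best separate class 1 from class 0."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = [
        separation([r[j] for r in pos], [r[j] for r in neg])
        for j in range(len(X[0]))
    ]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def jaccard(s, t):
    """Overlap of two selected feature sets (assumed stability measure)."""
    return len(s & t) / len(s | t) if s | t else 1.0
```

Stability could then be estimated by running `top_k_features` on different resamples of the data and averaging `jaccard` over all pairs of resulting sets.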
|
30
|
Pirgazi J, Alimoradi M, Esmaeili Abharian T, Olyaee MH. An Efficient hybrid filter-wrapper metaheuristic-based gene selection method for high dimensional datasets. Sci Rep 2019; 9:18580. [PMID: 31819106 PMCID: PMC6901457 DOI: 10.1038/s41598-019-54987-1] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 09/05/2019] [Accepted: 11/22/2019] [Indexed: 01/06/2023] Open
Abstract
The feature selection problem is one of the most significant issues in data classification. The purpose of feature selection is to select the smallest number of features in order to increase accuracy and decrease the cost of data classification. In recent years, with the appearance of high-dimensional datasets that contain few samples, classification models have encountered the over-fitting problem. There is therefore a need for feature selection methods that remove redundant and irrelevant features. Although various methods have recently been proposed for selecting the optimal subset of features with high precision, these methods have encountered problems such as instability, long convergence time, and selection of a suboptimal solution as the final result. In other words, they have not been able to fully extract the effective features. In this paper, a hybrid method based on the IWSSr method and the Shuffled Frog Leaping Algorithm (SFLA) is proposed to select effective features in a large-scale gene dataset. The proposed algorithm is implemented in two phases: filtering and wrapping. In the filter phase, the Relief method is used to weight features. Then, in the wrapping phase, the SFLA and IWSSr algorithms are used to search for effective features in a feature-rich area. The proposed method is evaluated on several standard gene expression datasets. The experimental results confirm that, compared with similar methods, the proposed approach achieves a more compact set of features along with high accuracy. The source code and testing datasets are available at https://github.com/jimy2020/SFLA_IWSSr-Feature-Selection.
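The two-phase structure described here (a filter that weights features, then a wrapper that searches among the best-weighted ones) can be sketched in a few lines. This is only an illustration under stated assumptions and not the authors' code: the filter is a simplified Relief for two classes with features on comparable scales, the wrapper is a plain greedy incremental pass that stands in for the SFLA and IWSSr search, and the evaluator is a leave-one-out 1-nearest-neighbour classifier chosen for brevity. All function names are hypothetical.

```python
def relief_weights(X, y):
    """Simplified Relief: a feature gains weight when it differs on the
    nearest miss and loses weight when it differs on the nearest hit.
    Assumes two classes with at least two samples each."""
    n, d = len(X), len(X[0])

    def dist(a, b):
        return sum(abs(p - q) for p, q in zip(a, b))

    w = [0.0] * d
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        h = min(hits, key=lambda j: dist(X[i], X[j]))
        m = min(misses, key=lambda j: dist(X[i], X[j]))
        for f in range(d):
            w[f] += (abs(X[i][f] - X[m][f]) - abs(X[i][f] - X[h][f])) / n
    return w


def loo_accuracy(X, y, feats):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on feats."""
    correct = 0
    for i in range(len(X)):
        others = [j for j in range(len(X)) if j != i]
        nn = min(others, key=lambda j: sum(abs(X[i][f] - X[j][f]) for f in feats))
        correct += y[nn] == y[i]
    return correct / len(X)


def filter_wrapper(X, y, keep):
    """Filter phase: keep the top `keep` features by Relief weight.
    Wrapper phase: add a feature only if it improves accuracy."""
    weights = relief_weights(X, y)
    ranked = sorted(range(len(weights)), key=lambda f: weights[f], reverse=True)
    selected, best = [], 0.0
    for f in ranked[:keep]:
        acc = loo_accuracy(X, y, selected + [f])
        if acc > best:
            selected, best = selected + [f], acc
    return selected, best
```

The filter phase keeps the wrapper cheap: only the `keep` best-weighted features are ever passed to the classifier, which matters when the dataset has thousands of genes and few samples.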
Affiliation(s)
- Jamshid Pirgazi
- Faculty of Engineering, Department of Computer Engineering, University of Gonabad, Gonabad, Iran.
| | - Mohsen Alimoradi
- Faculty of Electronic, Computer & IT Department of Computer, Qazvin Islamic Azad University, Qazvin, Iran
| | - Tahereh Esmaeili Abharian
- Faculty of Electronic, Computer & IT Department of Computer, Qazvin Islamic Azad University, Qazvin, Iran
| | - Mohammad Hossein Olyaee
- Faculty of Engineering, Department of Computer Engineering, University of Gonabad, Gonabad, Iran
| |
|
31
|
A novel matched-pairs feature selection method considering with tumor purity for differential gene expression analyses. Math Biosci 2019; 311:39-48. [DOI: 10.1016/j.mbs.2019.02.007] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 01/17/2019] [Revised: 02/21/2019] [Accepted: 02/22/2019] [Indexed: 12/13/2022]
|