1
Doherty T, Dempster E, Hannon E, Mill J, Poulton R, Corcoran D, Sugden K, Williams B, Caspi A, Moffitt TE, Delany SJ, Murphy TM. A comparison of feature selection methodologies and learning algorithms in the development of a DNA methylation-based telomere length estimator. BMC Bioinformatics 2023; 24:178. [PMID: 37127563 PMCID: PMC10152624 DOI: 10.1186/s12859-023-05282-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 03/29/2022] [Accepted: 04/11/2023] [Indexed: 05/03/2023] Open
Abstract
BACKGROUND The field of epigenomics holds great promise in understanding and treating disease, with advances in machine learning (ML) and artificial intelligence being vitally important in this pursuit. Increasingly, research utilises DNA methylation measures at cytosine-guanine dinucleotides (CpG) to detect disease and estimate biological traits such as aging. Given the challenge of high dimensionality of DNA methylation data, feature-selection techniques are commonly employed to reduce dimensionality and identify the most important subset of features. In this study, our aim was to test and compare a range of feature-selection methods and ML algorithms in the development of a novel DNA methylation-based telomere length (TL) estimator. We utilised both nested cross-validation and two independent test sets for the comparisons. RESULTS We found that principal component analysis in advance of elastic net regression led to the overall best performing estimator when evaluated using a nested cross-validation analysis and two independent test cohorts. This approach achieved a correlation between estimated and actual TL of 0.295 (83.4% CI [0.201, 0.384]) on the EXTEND test data set. In contrast, the baseline model of elastic net regression with no prior feature reduction stage performed less well in general, suggesting that a prior feature-selection stage may have important utility. A previously developed TL estimator, DNAmTL, achieved a correlation of 0.216 (83.4% CI [0.118, 0.310]) on the EXTEND data. Additionally, we observed that different DNA methylation-based TL estimators, which have few common CpGs, are associated with many of the same biological entities. CONCLUSIONS The variance in performance across tested approaches shows that estimators are sensitive to data set heterogeneity and the development of an optimal DNA methylation-based estimator should benefit from the robust methodological approach used in this study.
Moreover, our methodology which utilises a range of feature-selection approaches and ML algorithms could be applied to other biological markers and disease phenotypes, to examine their relationship with DNA methylation and predictive value.
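The nested cross-validation design mentioned in this abstract can be illustrated with a minimal sketch. It assumes plain k-fold splitting at both levels; the function names (`k_fold`, `nested_cv`), the sample count and the fold counts are hypothetical and are not taken from the study, whose actual splitting scheme and software are not stated here.

```python
from typing import Iterator, List, Tuple


def k_fold(indices: List[int], k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, held_out) index lists for k roughly equal folds."""
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out


def nested_cv(n_samples: int, outer_k: int, inner_k: int):
    """Yield (inner_splits, outer_test) for each outer fold.

    The inner splits are drawn only from the outer training portion, so
    feature selection and hyperparameter tuning never see the outer test fold.
    """
    for outer_train, outer_test in k_fold(list(range(n_samples)), outer_k):
        inner_splits = list(k_fold(outer_train, inner_k))
        yield inner_splits, outer_test


# Hypothetical sizes, chosen only to keep the example small.
splits = list(nested_cv(n_samples=12, outer_k=3, inner_k=2))
```

In such a design, any dimensionality-reduction step (for example PCA) would be fitted inside each inner training split, which is what keeps the outer-fold estimate free of information leakage.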
Affiliation(s)
- Trevor Doherty
  - School of Biological, Health and Sports Sciences, Technological University Dublin, Dublin, Ireland
  - SFI Centre for Research Training in Machine Learning, Technological University Dublin, Dublin, Ireland
- Emma Dempster
  - University of Exeter Medical School, University of Exeter, Exeter, UK
- Eilis Hannon
  - University of Exeter Medical School, University of Exeter, Exeter, UK
- Jonathan Mill
  - University of Exeter Medical School, University of Exeter, Exeter, UK
- Richie Poulton
  - Department of Psychology, University of Otago, Dunedin, 9016, New Zealand
- David Corcoran
  - Center for Genomic and Computational Biology, Duke University, Durham, NC, 27708, USA
- Karen Sugden
  - Department of Psychology and Neuroscience, Duke University, Durham, NC, USA
- Ben Williams
  - Department of Psychology and Neuroscience, Duke University, Durham, NC, USA
- Avshalom Caspi
  - Social, Genetic and Developmental Psychiatry Centre, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK
  - Department of Psychology and Neuroscience, Duke University, Durham, NC, USA
- Terrie E Moffitt
  - Social, Genetic and Developmental Psychiatry Centre, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK
  - Department of Psychology and Neuroscience, Duke University, Durham, NC, USA
- Sarah Jane Delany
  - School of Computer Science, Technological University Dublin, Dublin, Ireland
- Therese M Murphy
  - School of Biological, Health and Sports Sciences, Technological University Dublin, Dublin, Ireland

2
Bernabeu E, McCartney DL, Gadd DA, Hillary RF, Lu AT, Murphy L, Wrobel N, Campbell A, Harris SE, Liewald D, Hayward C, Sudlow C, Cox SR, Evans KL, Horvath S, McIntosh AM, Robinson MR, Vallejos CA, Marioni RE. Refining epigenetic prediction of chronological and biological age. Genome Med 2023; 15:12. [PMID: 36855161 PMCID: PMC9976489 DOI: 10.1186/s13073-023-01161-y] [Citation(s) in RCA: 24] [Impact Index Per Article: 24.0] [Received: 09/08/2022] [Accepted: 02/06/2023] [Indexed: 03/02/2023] Open
Abstract
BACKGROUND Epigenetic clocks can track both chronological age (cAge) and biological age (bAge). The latter is typically defined by physiological biomarkers and risk of adverse health outcomes, including all-cause mortality. As cohort sample sizes increase, estimates of cAge and bAge become more precise. Here, we aim to develop accurate epigenetic predictors of cAge and bAge, whilst improving our understanding of their epigenomic architecture. METHODS First, we perform large-scale (N = 18,413) epigenome-wide association studies (EWAS) of chronological age and all-cause mortality. Next, to create a cAge predictor, we use methylation data from 24,674 participants from the Generation Scotland study, the Lothian Birth Cohorts (LBC) of 1921 and 1936, and 8 other cohorts with publicly available data. In addition, we train a predictor of time to all-cause mortality as a proxy for bAge using the Generation Scotland cohort (1214 observed deaths). For this purpose, we use epigenetic surrogates (EpiScores) for 109 plasma proteins and the 8 component parts of GrimAge, one of the current best epigenetic predictors of survival. We test this bAge predictor in four external cohorts (LBC1921, LBC1936, the Framingham Heart Study and the Women's Health Initiative study). RESULTS Through the inclusion of linear and non-linear age-CpG associations from the EWAS, feature pre-selection in advance of elastic net regression, and a leave-one-cohort-out (LOCO) cross-validation framework, we obtain cAge prediction with a median absolute error equal to 2.3 years. Our bAge predictor was found to slightly outperform GrimAge in terms of the strength of its association to survival (HR_GrimAge = 1.47 [1.40, 1.54] with p = 1.08 × 10^-52, and HR_bAge = 1.52 [1.44, 1.59] with p = 2.20 × 10^-60). Finally, we introduce MethylBrowsR, an online tool to visualise epigenome-wide CpG-age associations.
CONCLUSIONS The integration of multiple large datasets, EpiScores, non-linear DNAm effects, and new approaches to feature selection has facilitated improvements to the blood-based epigenetic prediction of biological and chronological age.
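The leave-one-cohort-out (LOCO) framework and the median absolute error metric named in this abstract can be sketched as follows. This is an illustrative outline only: the function names, cohort labels and ages below are hypothetical, and the model-fitting step (elastic net in the abstract) is deliberately left out.

```python
from statistics import median
from typing import Iterator, List, Sequence, Tuple


def leave_one_cohort_out(
    cohorts: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_cohort, train_indices, test_indices), one fold per cohort."""
    for held_out in sorted(set(cohorts)):
        train = [i for i, c in enumerate(cohorts) if c != held_out]
        test = [i for i, c in enumerate(cohorts) if c == held_out]
        yield held_out, train, test


def median_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Median of |actual - predicted| over all samples."""
    return median(abs(a - p) for a, p in zip(actual, predicted))


# Hypothetical cohort labels for five samples.
cohort_labels = ["A", "A", "B", "C", "C"]
folds = list(leave_one_cohort_out(cohort_labels))

# Hypothetical chronological ages and predictions, in years.
error_years = median_absolute_error([30, 40, 50], [32, 39, 55])
```

Holding out whole cohorts, rather than random samples, tests whether a predictor transfers to data generated by a different study, which is a stricter check than ordinary k-fold cross-validation.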
Affiliation(s)
- Elena Bernabeu
  - Centre for Genomic and Experimental Medicine, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
- Daniel L McCartney
  - Centre for Genomic and Experimental Medicine, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
- Danni A Gadd
  - Centre for Genomic and Experimental Medicine, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
- Robert F Hillary
  - Centre for Genomic and Experimental Medicine, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
- Ake T Lu
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
  - Altos Labs, San Diego, USA
- Lee Murphy
  - Edinburgh Clinical Research Facility, University of Edinburgh, Edinburgh, UK
- Nicola Wrobel
  - Edinburgh Clinical Research Facility, University of Edinburgh, Edinburgh, UK
- Archie Campbell
  - Centre for Genomic and Experimental Medicine, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
- Sarah E Harris
  - Department of Psychology, Lothian Birth Cohorts, University of Edinburgh, Edinburgh, UK
- David Liewald
  - Department of Psychology, Lothian Birth Cohorts, University of Edinburgh, Edinburgh, UK
- Caroline Hayward
  - Centre for Genomic and Experimental Medicine, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
  - Medical Research Council Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
- Cathie Sudlow
  - Centre for Clinical Brain Sciences, University of Edinburgh, Edinburgh, UK
  - BHF Data Science Centre, Health Data Research UK, London, UK
  - Edinburgh Medical School, Usher Institute, University of Edinburgh, Edinburgh, UK
- Simon R Cox
  - Department of Psychology, Lothian Birth Cohorts, University of Edinburgh, Edinburgh, UK
- Kathryn L Evans
  - Centre for Genomic and Experimental Medicine, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
- Steve Horvath
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
  - Altos Labs, San Diego, USA
- Andrew M McIntosh
  - Centre for Genomic and Experimental Medicine, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
  - Division of Psychiatry, University of Edinburgh, Royal Edinburgh Hospital, Edinburgh, UK
- Catalina A Vallejos
  - Medical Research Council Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
  - The Alan Turing Institute, London, UK
- Riccardo E Marioni
  - Centre for Genomic and Experimental Medicine, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK

3
A Heuristic Machine Learning-Based Optimization Technique to Predict Lung Cancer Patient Survival. Computational Intelligence and Neuroscience 2023; 2023:4506488. [PMID: 36776617 PMCID: PMC9911240 DOI: 10.1155/2023/4506488] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2022] [Revised: 08/26/2022] [Accepted: 11/24/2022] [Indexed: 02/05/2023]
Abstract
Cancer has long been a significant threat to human health and well-being. The high death rate in cancer patients is primarily due to the complexity of the disease and the wide range of clinical outcomes. Improving the accuracy of survival prediction for cancer patients has become a key issue in cancer research. Many models have been proposed. However, most of them use only a single type of genetic or clinical data to construct prediction models for cancer survival. Present survival studies place a lot of emphasis on determining whether or not a patient will survive five years. The individual question of how long a lung cancer patient will survive remains unanswered. The proposed technique, which combines Naive Bayes and SSA, estimates the overall survival time of lung cancer patients. Two machine learning tasks are derived from a single customized query. The first is a simple binary question: whether a patient will survive for more than five years. The second is to develop a five-year survival model using regression analysis. When forecasting how long a lung cancer patient would survive within five years, the mean absolute error (MAE) of this technique's predictions is within one month. Several biomarker genes have been associated with lung cancers. The accuracy, recall, and precision achieved by this algorithm are 98.78%, 98.4%, and 98.6%, respectively.

4
Construction of genetic classification model for coronary atherosclerosis heart disease using three machine learning methods. BMC Cardiovasc Disord 2022; 22:42. [PMID: 35151267 PMCID: PMC8840658 DOI: 10.1186/s12872-022-02481-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 11/05/2021] [Accepted: 01/24/2022] [Indexed: 12/05/2022] Open
Abstract
Background Although diagnostic methods for coronary atherosclerosis heart disease (CAD) are constantly being improved, early-stage CAD is still missed because symptoms are absent. Gene expression levels vary during disease development; therefore, a classifier based on gene expression might contribute to CAD diagnosis. This study aimed to construct genetic classification models for CAD using gene expression data, which may provide new insight into the understanding of its pathogenesis. Methods All statistical analyses were performed in R 3.4.4. Three raw gene expression datasets (GSE12288, GSE7638 and GSE66360) related to CAD were downloaded from the Gene Expression Omnibus database and included for analysis. The limma package was used to identify differentially expressed genes (DEGs) between CAD samples and healthy controls. The WGCNA package was used to identify CAD-related gene modules and hub genes, followed by recursive feature elimination analysis to select the optimal feature genes (OFGs). The genetic classification models were established using support vector machine (SVM), random forest (RF) and logistic regression (LR), respectively. Further validation and receiver operating characteristic (ROC) curve analysis were conducted to evaluate the classification performance. Results In total, 374 DEGs, eight gene modules, 33 hub genes and 12 OFGs (HTR4, KISS1, CA12, CAMK2B, KLK2, DDC, CNGB1, DERL1, BCL6, LILRA2, HCK, MTF2) were identified. ROC curve analysis showed that the accuracies of SVM, RF and LR were 75.58%, 63.57% and 63.95% in validation, with areas under the curve of 0.813 (95% confidence interval, 95% CI 0.761–0.866, P < 0.0001), 0.727 (95% CI 0.665–0.788, P < 0.0001) and 0.783 (95% CI 0.725–0.841, P < 0.0001), respectively. Conclusions In conclusion, this study found 12 gene signatures involved in the pathogenic mechanism of CAD.
Among the CAD classifiers constructed by three machine learning methods, the SVM model has the best performance. Supplementary Information The online version contains supplementary material available at 10.1186/s12872-022-02481-4.
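The area under the ROC curve reported for each classifier can be computed directly from labels and scores. The sketch below uses the rank-based (Mann-Whitney) formulation in plain Python; the function name `roc_auc` and the toy labels and scores are hypothetical and are not data from the study, which was carried out in R.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical case (1) / control (0) labels and classifier scores.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

Unlike accuracy, this quantity does not depend on a decision threshold, which is one reason a model's ranking by AUC can differ from its ranking by accuracy, as it does for RF and LR in the abstract above.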

5
Ghosh S, Samanta G, De la Sen M. Feature selection and classification approaches in gene expression of breast cancer. AIMS Biophysics 2021. [DOI: 10.3934/biophy.2021029] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022] Open
Abstract
DNA microarray technology can monitor the expression levels of thousands of genes simultaneously. Microarray data analysis is important in phenotype classification of diseases. In this work, the computational part predicts the tendency towards mortality using different classification techniques by identifying features from the high-dimensional dataset. We have analyzed breast cancer transcriptional genomic data of 1554 transcripts captured from 272 samples. This work presents effective methods for gene classification using Logistic Regression (LR), Random Forest (RF) and Decision Tree (DT), and constructs a classifier with higher accuracy than one using all features together. The performance of these methods is also compared with a dimension reduction method, namely Principal Component Analysis (PCA). Feature reduction with RF, LR and DT provides better performance than PCA. It is observed that both LR and RF identify the TYMP, ERS1, C-MYB and TUBA1a genes. However, some features corresponding to genes such as ARID4B, DNMT3A, TOX3, RGS17 and PNLIP, which play a significant role in breast cancer, are identified only by the LR method. The simulation is based on R software.

6
Zhang R, Ye J, Huang H, Du X. Mining featured biomarkers associated with vascular invasion in HCC by bioinformatics analysis with TCGA RNA sequencing data. Biomed Pharmacother 2019; 118:109274. [DOI: 10.1016/j.biopha.2019.109274] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 01/13/2019] [Revised: 07/03/2019] [Accepted: 07/25/2019] [Indexed: 12/20/2022] Open

7
Tang Z, Wei G, Zhang L, Xu Z. Signature microRNAs and long noncoding RNAs in laryngeal cancer recurrence identified using a competing endogenous RNA network. Mol Med Rep 2019; 19:4806-4818. [PMID: 31059106 PMCID: PMC6522811 DOI: 10.3892/mmr.2019.10143] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 05/28/2018] [Accepted: 03/25/2019] [Indexed: 12/20/2022] Open
Abstract
The aim of the present study was to identify novel microRNA (miRNA) or long noncoding RNA (lncRNA) signatures of laryngeal cancer recurrence and to investigate the regulatory mechanisms associated with this malignancy. Datasets of recurrent and nonrecurrent laryngeal cancer samples were downloaded from The Cancer Genome Atlas (TCGA) and the Gene Expression Omnibus database (GSE27020 and GSE25727) to examine differentially expressed miRNAs (DE-miRs), lncRNAs (DE-lncRs) and mRNAs (DEGs). miRNA-mRNA and lncRNA-miRNA networks were constructed by investigating the associations among these RNAs in various databases. Subsequently, the interactions identified were combined into a competing endogenous RNA (ceRNA) regulatory network. Feature genes in the miRNA-mRNA network were identified via topological analysis and a recursive feature elimination algorithm. A support vector machine (SVM) classifier was established using the betweenness centrality values in the miRNA-mRNA network, consisting of 32 optimal feature-coding genes. The classification effect was tested using two validation datasets. Furthermore, coding genes in the ceRNA network were examined via pathway enrichment analyses. In total, 21 DE-lncRs, 507 DEGs and 55 DE-miRs were selected. The SVM classifier exhibited an accuracy of 94.05% (79/84) for sample classification prediction in the TCGA dataset, and 92.66 and 91.07% in the two validation datasets. The ceRNA regulatory network comprised 203 nodes, corresponding to mRNAs, miRNAs and lncRNAs, and 346 lines, corresponding to the interactions among RNAs. In particular, the interactions with the highest scores were HLA complex group 4 (HCG4)-miR-33b, HOX transcript antisense RNA (HOTAIR)-miR-1-MAGE family member A2 (MAGEA2), EMX2 opposite strand/antisense RNA (EMX2OS)-miR-124-calcitonin related polypeptide α (CALCA) and EMX2OS-miR-124-γ-aminobutyric acid type A receptor γ2 subunit (GABRG2). 
Gene enrichment analysis of the genes in the ceRNA network identified that 11 pathway terms and 16 molecular function terms were significantly enriched. The SVM classifier based on 32 feature coding genes exhibited high accuracy in the classification of laryngeal cancer samples. miR-1, miR-33b, miR-124, HOTAIR, HCG4 and EMX2OS may be novel biomarkers of recurrent laryngeal cancer, and HCG4-miR-33b, HOTAIR-miR-1-MAGEA2 and EMX2OS-miR-124-CALCA/GABRG2 may be associated with the molecular mechanisms regulating recurrent laryngeal cancer.
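The abstract mentions betweenness centrality values in the miRNA-mRNA network as the basis for the classifier's features. As a hedged illustration of how that topological measure is obtained, the sketch below implements Brandes' algorithm for an unweighted, undirected graph in plain Python. The function name and the four-node toy network (`miRNA_a`, `gene_1`, and so on) are hypothetical and do not come from the study.

```python
from collections import deque
from typing import Dict, List


def betweenness_centrality(adjacency: Dict[str, List[str]]) -> Dict[str, float]:
    """Unnormalised betweenness for an undirected, unweighted graph (Brandes)."""
    centrality = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search, counting shortest paths from the source.
        visit_order: List[str] = []
        predecessors: Dict[str, List[str]] = {node: [] for node in adjacency}
        n_paths = {node: 0 for node in adjacency}
        n_paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            visit_order.append(v)
            for w in adjacency[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    n_paths[w] += n_paths[v]
                    predecessors[w].append(v)
        # Accumulate path dependencies, farthest nodes first.
        dependency = {node: 0.0 for node in adjacency}
        while visit_order:
            w = visit_order.pop()
            for v in predecessors[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                centrality[w] += dependency[w]
    # Each unordered node pair was counted from both endpoints.
    return {node: value / 2 for node, value in centrality.items()}


# Hypothetical chain-shaped network: gene_1 - miRNA_a - gene_2 - miRNA_b.
toy_network = {
    "gene_1": ["miRNA_a"],
    "miRNA_a": ["gene_1", "gene_2"],
    "gene_2": ["miRNA_a", "miRNA_b"],
    "miRNA_b": ["gene_2"],
}
scores = betweenness_centrality(toy_network)
```

Nodes with high scores sit on many shortest paths between other nodes, which is why such a measure can be used to shortlist candidate feature genes before a selection step such as recursive feature elimination.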
Affiliation(s)
- Zhengyi Tang
  - Department of Otolaryngology Head and Neck Surgery, The First Affiliated Hospital of Guangxi Medical University, Nanning, Guangxi 530021, P.R. China
- Ganguan Wei
  - Department of Otolaryngology Head and Neck Surgery, 923 Hospital of People's Liberation Army, Nanning, Guangxi 530021, P.R. China
- Longcheng Zhang
  - Department of Otolaryngology Head and Neck Surgery, 923 Hospital of People's Liberation Army, Nanning, Guangxi 530021, P.R. China
- Zhiwen Xu
  - Department of Otolaryngology Head and Neck Surgery, The First Affiliated Hospital of Guangxi Medical University, Nanning, Guangxi 530021, P.R. China

8
Su R, Liu X, Wei L. MinE-RFE: determine the optimal subset from RFE by minimizing the subset-accuracy–defined energy. Brief Bioinform 2019; 21:687-698. [DOI: 10.1093/bib/bbz021] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 11/24/2018] [Revised: 01/24/2019] [Accepted: 02/02/2019] [Indexed: 01/18/2023] Open
Abstract
Recursive feature elimination (RFE), as one of the most popular feature selection algorithms, has been extensively applied to bioinformatics. During the training, a group of candidate subsets are generated by iteratively eliminating the least important features from the original features. However, how to determine the optimal subset from them still remains ambiguous. Among most current studies, either overall accuracy or subset size (SS) is used to select the most predictive features. Using which one or both and how they affect the prediction performance are still open questions. In this study, we proposed MinE-RFE, a novel RFE-based feature selection approach by sufficiently considering the effect of both factors. Subset decision problem was reflected into subset-accuracy space and became an energy-minimization problem. We also provided a mathematical description of the relationship between the overall accuracy and SS using Gaussian Mixture Models together with spline fitting. Besides, we comprehensively reviewed a variety of state-of-the-art applications in bioinformatics using RFE. We compared their approaches of deciding the final subset from all the candidate subsets with MinE-RFE on diverse bioinformatics data sets. Additionally, we also compared MinE-RFE with some well-used feature selection algorithms. The comparative results demonstrate that the proposed approach exhibits the best performance among all the approaches. To facilitate the use of MinE-RFE, we further established a user-friendly web server with the implementation of the proposed approach, which is accessible at http://qgking.wicp.net/MinE/. We expect this web server will be a useful tool for research community.
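The way RFE produces its list of candidate subsets, as described above, can be sketched in a few lines. This is a generic outline and not the MinE-RFE method itself: the `importance` callback stands in for refitting a model on the current subset, and the function name, feature names and fixed weights are hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def rfe_candidate_subsets(
    features: Sequence[str],
    importance: Callable[[List[str]], Dict[str, float]],
    step: int = 1,
) -> List[List[str]]:
    """Return every candidate subset produced while eliminating features.

    `importance` is called on the current subset at each round, mimicking
    the model refit that RFE performs before each elimination.
    """
    current = list(features)
    candidates = [list(current)]
    while len(current) > 1:
        scores = importance(current)
        n_drop = min(step, len(current) - 1)  # always keep at least one feature
        weakest = set(sorted(current, key=lambda f: scores[f])[:n_drop])
        current = [f for f in current if f not in weakest]
        candidates.append(list(current))
    return candidates


# Hypothetical fixed weights; a real run would refit a model each round.
FIXED_WEIGHTS = {"f1": 0.9, "f2": 0.1, "f3": 0.5, "f4": 0.3}


def toy_importance(subset: List[str]) -> Dict[str, float]:
    return {name: FIXED_WEIGHTS[name] for name in subset}


subsets = rfe_candidate_subsets(["f1", "f2", "f3", "f4"], toy_importance)
```

The output is only a nested sequence of candidates. Deciding which one to keep, by accuracy, by size, or by a combination of both, is the separate question this paper addresses.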
Affiliation(s)
- Ran Su
  - School of Computer Software, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Xinyi Liu
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Leyi Wei
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China

9
Kakade A, Kumari B, Dholaniya PS. Feature selection using logistic regression in case-control DNA methylation data of Parkinson's disease: A comparative study. J Theor Biol 2018; 457:14-18. [PMID: 30120951 DOI: 10.1016/j.jtbi.2018.08.018] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 04/12/2018] [Revised: 08/12/2018] [Accepted: 08/13/2018] [Indexed: 01/18/2023]
Abstract
Parkinson's disease (PD) is a progressive neurological disorder caused by the degeneration of dopaminergic neurons in the substantia nigra pars compacta. The pathogenesis of the disease is not fully understood, but it has been linked with complex genetic, epigenetic and environmental interactions. A substantial number of studies have shown a role for epigenetic modifications in the progression of PD. In the present study, we analyzed methylation patterns of 1726 transcripts from 66 samples of 450k data, comprising 43 controls and 23 diseased samples. We used Logistic Regression (LR) for feature reduction and built a classifier with better accuracy than one using all features together. The performance of the classifier was compared with other feature reduction approaches, viz. Random Forest (RF) and Principal Component Analysis (PCA). Feature reduction with LR and RF performed better than PCA. Some of the features corresponding to genes such as COMT, DCTN1 and PRNP were uniquely identified by LR and are reported to play a significant role in PD.
Affiliation(s)
- Aishwarya Kakade
  - Department of Biotechnology, School of Life Sciences, University of Hyderabad, Hyderabad, Telangana 500 046, India
- Baby Kumari
  - Department of Biotechnology, School of Life Sciences, University of Hyderabad, Hyderabad, Telangana 500 046, India
- Pankaj Singh Dholaniya
  - Department of Biotechnology, School of Life Sciences, University of Hyderabad, Hyderabad, Telangana 500 046, India

10
Wang X, Han L, Zhou L, Wang L, Zhang LM. Prediction of candidate RNA signatures for recurrent ovarian cancer prognosis by the construction of an integrated competing endogenous RNA network. Oncol Rep 2018; 40:2659-2673. [PMID: 30226545 PMCID: PMC6151886 DOI: 10.3892/or.2018.6707] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 03/13/2018] [Accepted: 09/10/2018] [Indexed: 12/28/2022] Open
Abstract
Tumor recurrence hinders treatment of ovarian cancer. The present study aimed to identify potential biomarkers for ovarian cancer recurrence prognosis and explore relevant mechanisms. RNA-sequencing of data from the TCGA database and GSE17260 dataset was carried out. Samples of the data were grouped according to tumor recurrence information. Following data normalization, differentially expressed genes/micro RNAs (miRNAs)/long non-coding (lncRNAs) (DEGs/DEMs/DELs) were selected between recurrent and non-recurrent samples. Their correlations with clinical information were analyzed to identify prognostic RNAs. A support vector machine classifier was used to find the optimal gene set with feature genes that could conclusively distinguish different samples. A protein-protein interaction (PPI) network was established for DEGs using relevant protein databases. An integrated ‘lncRNA/miRNA/mRNA’ competing endogenous RNA (ceRNA) network was constructed to reveal potential regulatory relationships among different RNAs. We identified 36 feature genes (e.g. TP53 and RBPMS) for the classification of recurrent and non-recurrent ovarian cancer samples. Prediction with this gene set had a high accuracy (91.8%). Three DELs (WT1-AS, NBR2 and ZNF883) were highly associated with the prognosis of recurrent ovarian cancer. Predominant DEMs with their targets were hsa-miR-375 (target: RBPMS), hsa-miR-141 (target: RBPMS), and hsa-miR-27b (target: TP53). Highlighted interactions in the ceRNA network were ‘WT1-AS-hsa-miR-375-RBPMS’ and ‘WT1-AS-hsa-miR-27b-TP53’. TP53, RBPMS, hsa-miR-375, hsa-miR-141, hsa-miR-27b, and WT1-AS may be biomarkers for recurrent ovarian cancer. The interactions of ‘WT1-AS-hsa-miR-375-RBPMS’ and ‘WT1-AS-hsa-miR-27b-TP53’ may be potential regulatory mechanisms during cancer recurrence.
Affiliation(s)
- Xin Wang
  - Department of Gynecology and Obstetrics, The 306 Hospital of PLA, Beijing 100101, P.R. China
- Lei Han
  - Department of Gynecology and Obstetrics, The 306 Hospital of PLA, Beijing 100101, P.R. China
- Ling Zhou
  - Department of Gynecology and Obstetrics, The 306 Hospital of PLA, Beijing 100101, P.R. China
- Li Wang
  - Department of Gynecology and Obstetrics, The 306 Hospital of PLA, Beijing 100101, P.R. China
- Lan-Mei Zhang
  - Department of Gynecology and Obstetrics, The 306 Hospital of PLA, Beijing 100101, P.R. China

11
Chen Q, Meng Z, Liu X, Jin Q, Su R. Decision Variants for the Automatic Determination of Optimal Feature Subset in RF-RFE. Genes (Basel) 2018; 9:genes9060301. [PMID: 29914084 PMCID: PMC6027449 DOI: 10.3390/genes9060301] [Citation(s) in RCA: 52] [Impact Index Per Article: 8.7] [Received: 04/25/2018] [Revised: 05/30/2018] [Accepted: 06/06/2018] [Indexed: 11/24/2022] Open
Abstract
Feature selection, which identifies a set of most informative features from the original feature space, has been widely used to simplify the predictor. Recursive feature elimination (RFE), as one of the most popular feature selection approaches, is effective in data dimension reduction and efficiency increase. A ranking of features, as well as candidate subsets with the corresponding accuracy, is produced through RFE. The subset with highest accuracy (HA) or a preset number of features (PreNum) are often used as the final subset. However, this may lead to a large number of features being selected, or if there is no prior knowledge about this preset number, it is often ambiguous and subjective regarding final subset selection. A proper decision variant is in high demand to automatically determine the optimal subset. In this study, we conduct pioneering work to explore the decision variant after obtaining a list of candidate subsets from RFE. We provide a detailed analysis and comparison of several decision variants to automatically select the optimal feature subset. Random forest (RF)-recursive feature elimination (RF-RFE) algorithm and a voting strategy are introduced. We validated the variants on two totally different molecular biology datasets, one for a toxicogenomic study and the other one for protein sequence analysis. The study provides an automated way to determine the optimal feature subset when using RF-RFE.
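The two conventional rules named in this abstract, highest accuracy (HA) and a preset number of features (PreNum), can be written down as simple selection functions over the RFE candidate list. The sketch below is illustrative only: the function names, the tie-breaking rule for HA and the candidate values are assumptions, and the paper's own decision variants and voting strategy are not reproduced.

```python
from typing import List, Tuple

# Each candidate is (subset_size, accuracy) for one RFE candidate subset.
Candidate = Tuple[int, float]


def pick_highest_accuracy(candidates: List[Candidate]) -> Candidate:
    """HA rule: highest accuracy; ties go to the smaller subset (an assumption)."""
    return max(candidates, key=lambda c: (c[1], -c[0]))


def pick_preset_number(candidates: List[Candidate], preset: int) -> Candidate:
    """PreNum rule: the candidate whose size equals a number fixed in advance."""
    for candidate in candidates:
        if candidate[0] == preset:
            return candidate
    raise ValueError(f"no candidate subset has exactly {preset} features")


# Hypothetical candidates, not results from the paper.
rfe_candidates = [(50, 0.90), (20, 0.92), (10, 0.92), (5, 0.85)]
by_accuracy = pick_highest_accuracy(rfe_candidates)
by_preset = pick_preset_number(rfe_candidates, preset=5)
```

The toy values show the tension the abstract describes: HA may keep a fairly large subset, while PreNum depends entirely on a number chosen without reference to the accuracy curve.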
Affiliation(s)
- Qi Chen
  - School of Computer Software, Tianjin University, Tianjin 300350, China
  - The Military Transportation Command Department, Army Military Transportation University, Tianjin 300361, China
- Zhaopeng Meng
  - School of Computer Software, Tianjin University, Tianjin 300350, China
  - Tianjin University of Traditional Chinese Medicine, Tianjin 300193, China
- Xinyi Liu
  - School of Computer Software, Tianjin University, Tianjin 300350, China
- Qianguo Jin
  - School of Computer Software, Tianjin University, Tianjin 300350, China
- Ran Su
  - School of Computer Software, Tianjin University, Tianjin 300350, China
  - State Key Laboratory of Medicinal Chemical Biology, Nankai University, Tianjin 300074, China

12
Zhao J, Cheng W, He X, Liu Y, Li J, Sun J, Li J, Wang F, Gao Y. Construction of a specific SVM classifier and identification of molecular markers for lung adenocarcinoma based on lncRNA-miRNA-mRNA network. Onco Targets Ther 2018; 11:3129-3140. [PMID: 29872324 PMCID: PMC5975616 DOI: 10.2147/ott.s151121] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Indexed: 12/18/2022] Open
Abstract
Background Novel diagnostic predictors and drug targets are needed for LUAD (lung adenocarcinoma). We aimed to build a specific SVM (support vector machine) classifier for diagnosis of LUAD and identify molecular markers with prognostic value for LUAD. Methods The expression differences of miRNAs, lncRNAs and mRNAs between LUAD and normal samples were compared using data from TCGA (The Cancer Genome Atlas) database. A LUAD related miRNA-lncRNA-mRNA network was constructed, based on which feature genes were selected for the construction of LUAD specific SVM classifier. The robustness and transferability of SVM classifier were validated using gene expression profile datasets GSE43458 and GSE10072. Prognostic markers were identified from the network. A set of LUAD-related differentially expressed miRNAs, lncRNAs and miRNAs were identified and a LUAD related miRNA-lncRNA-mRNA network was obtained. The LUAD specific SVM classifier constructed on the basis of the network was robust and efficient for classification of samples from TCGA dataset and two independent validation datasets. Results Eight RNAs with prognostic value were identified, including hsa-miR-96, hsa-miR-204, PGM5P2 (phosphoglucomutase 5 pseudogene 2), SFTA1P (surfactant associated 1), RGS20 (regulator of G protein signaling 20), RGS9BP (RGS9-binding protein), FGB (fibrinogen beta chain) and INA (alpha-internexin). Among them, RGS20 and INA were regulated by hsa-miR-96. RGS20 was also regulated by hsa-miR-204, which was a potential target of SFTA1P. Conclusion The LUAD specific SVM classifier may serve as a novel diagnostic predictor. hsa-miR-96, hsa-miR-204, PGM5P2, SFTA1P, RGS20, RGS9BP, FGB and INA may serve as prognostic markers in clinical practice.
Affiliation(s)
- Jingming Zhao
  - Department of Respiratory Medicine, The Affiliated Hospital of Qingdao University, Qingdao, P.R. China
- Wei Cheng
  - Department of Respiratory Medicine, The Affiliated Hospital of Qingdao University, Qingdao, P.R. China
- Xigang He
  - Department of Respiratory Medicine, People's Hospital of Rizhao Lanshan, Lanshan District, Rizhao, P.R. China
- Yanli Liu
  - Department of Respiratory Medicine, The Affiliated Hospital of Qingdao University, Qingdao, P.R. China
- Ji Li
  - Department of Pharmacy, Qilu Hospital of Shandong University (Qingdao), Qingdao, P.R. China
- Jiaxing Sun
  - Department of Respiratory Medicine, The Affiliated Hospital of Qingdao University, Qingdao, P.R. China
- Jinfeng Li
  - Department of Respiratory Medicine, The Affiliated Hospital of Qingdao University, Qingdao, P.R. China
- Fangfang Wang
  - Department of Respiratory Medicine, The Affiliated Hospital of Qingdao University, Qingdao, P.R. China
- Yufang Gao
  - Department of President's Office, The Affiliated Hospital of Qingdao University, Qingdao, P.R. China
|
13
|
Chen S, Fan X, Gu H, Zhang L, Zhao W. Competing endogenous RNA regulatory network in papillary thyroid carcinoma. Mol Med Rep 2018; 18:695-704. [PMID: 29767230 PMCID: PMC6059698 DOI: 10.3892/mmr.2018.9009] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 03/13/2017] [Accepted: 11/15/2017] [Indexed: 11/16/2022]
Abstract
The present study aimed to screen all types of RNAs involved in the development of papillary thyroid carcinoma (PTC). RNA-sequencing data of PTC and normal samples were used for screening differentially expressed (DE) microRNAs (DE-miRNAs), long non-coding RNAs (DE-lncRNAs) and genes (DEGs). Subsequently, lncRNA-miRNA, miRNA-gene (that is, miRNA-mRNA) and gene-gene interaction pairs were extracted and used to construct regulatory networks. Feature genes in the miRNA-mRNA network were identified by topological analysis and recursive feature elimination analysis. A support vector machine (SVM) classifier was built using 15 feature genes, and its classification performance was validated using two microarray data sets that were downloaded from the Gene Expression Omnibus (GEO) database. In addition, Gene Ontology function and Kyoto Encyclopedia of Genes and Genomes pathway enrichment analyses were conducted for genes identified in the ceRNA network. A total of 506 samples, including 447 tumor samples and 59 normal samples, were obtained from The Cancer Genome Atlas (TCGA); 16 DE-lncRNAs, 917 DEGs and 30 DE-miRNAs were screened. The miRNA-mRNA regulatory network comprised 353 nodes and 577 interactions. From these data, 15 feature genes with high predictive precision (>95%) were extracted from the network and were used to form an SVM classifier with an accuracy of 96.05% (486/506) for PTC samples downloaded from TCGA, and accuracies of 96.81 and 98.46% for GEO downloaded data sets. The ceRNA regulatory network comprised 596 lines (or interactions) and 365 nodes. Genes in the ceRNA network were significantly enriched in ‘neuron development’, ‘differentiation’, ‘neuroactive ligand-receptor interaction’, ‘metabolism of xenobiotics by cytochrome P450’, ‘drug metabolism’ and ‘cytokine-cytokine receptor interaction’ pathways. Hox transcript antisense RNA, miRNA-206 and kallikrein-related peptidase 10 were nodes in the ceRNA regulatory network of the selected feature gene, and they may serve important roles in the development of PTC.
Affiliation(s)
- Shouhua Chen
  - Department of Breast and Thyroid Surgery, Shandong Provincial Qianfoshan Hospital, Shandong University, Jinan, Shandong 250014, P.R. China
- Xiaobin Fan
  - Department of Operation Room, Shandong Provincial Qianfoshan Hospital, Shandong University, Jinan, Shandong 250014, P.R. China
- He Gu
  - Department of Breast and Thyroid Surgery, Shandong Provincial Qianfoshan Hospital, Shandong University, Jinan, Shandong 250014, P.R. China
- Lili Zhang
  - Department of Breast and Thyroid Surgery, Shandong Provincial Qianfoshan Hospital, Shandong University, Jinan, Shandong 250014, P.R. China
- Wenhua Zhao
  - Department of Oncology, Shandong Provincial Qianfoshan Hospital, Shandong University, Jinan, Shandong 250014, P.R. China
|
14
|
|
15
|
He Y, Ma J, Wang A, Wang W, Luo S, Liu Y, Ye X. A support vector machine and a random forest classifier indicates a 15-miRNA set related to osteosarcoma recurrence. Onco Targets Ther 2018; 11:253-269. [PMID: 29379305 PMCID: PMC5759858 DOI: 10.2147/ott.s148394] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Indexed: 12/29/2022]
Abstract
BACKGROUND Osteosarcoma, which originates in the mesenchymal tissue, is the prevalent primary solid malignancy of the bone. It is of great importance to explore the mechanisms of metastasis and recurrence, which are the two primary reasons for the high death rate in osteosarcoma. DATA AND METHODS Three miRNA expression profiles related to osteosarcoma were downloaded from GEO DataSets. Differentially expressed miRNAs (DEmiRs) were screened using MetaDE.ES of the MetaDE package. A support vector machine (SVM) classifier was constructed using optimal miRNAs, and its prediction efficiency for recurrence was assessed in independent datasets. Finally, a co-expression network was constructed based on the DEmiRs and their target genes. RESULTS In total, 78 significant DEmiRs were screened. The SVM classifier constructed from 15 miRNAs accurately classified 58 of 65 samples (89.2%) in the GSE39040 dataset, and was validated in another two datasets, GSE39052 (84.62%, 22/26) and GSE79181 (91.3%, 21/23). Cox regression showed that four miRNAs, hsa-miR-10b, hsa-miR-1227, hsa-miR-146b-3p, and hsa-miR-873, significantly correlated with tumor recurrence time. There were 137, 147, 145, and 77 target genes of these four miRNAs, respectively, which were assigned to 17 gene ontology functionally annotated terms and 14 Kyoto Encyclopedia of Genes and Genomes pathways. Among them, the "Osteoclast differentiation" pathway contained a total of seven target genes and was analyzed further. CONCLUSION The 15-miRNA-based SVM classifier provides a potentially useful tool to predict the recurrence of osteosarcoma. Our results suggest possible mechanisms of osteosarcoma metastasis and recurrence and provide fresh DEmiRs as potential biomarkers or therapeutic targets for osteosarcoma.
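The accuracies quoted in this abstract are simple ratios of correctly classified samples to total samples, so they can be recomputed from the counts given. The following is a minimal sketch of that arithmetic only; the function names and the 0.05 percentage-point rounding tolerance are assumptions made for illustration and do not come from the paper.

```python
from fractions import Fraction

# (dataset, correctly classified, total samples, percentage stated in the abstract)
reported = [
    ("GSE39040", 58, 65, 89.2),
    ("GSE39052", 22, 26, 84.62),
    ("GSE79181", 21, 23, 91.3),
]


def accuracy_percent(correct: int, total: int) -> float:
    """Return classification accuracy as a percentage."""
    return float(Fraction(correct, total) * 100)


def check_reported(rows, tolerance=0.05):
    """Map each dataset to (recomputed accuracy, whether it matches the stated value).

    `tolerance` is a hypothetical allowance for rounding in the abstract.
    """
    results = {}
    for name, correct, total, stated in rows:
        acc = accuracy_percent(correct, total)
        results[name] = (acc, abs(acc - stated) <= tolerance)
    return results


if __name__ == "__main__":
    for name, (acc, ok) in check_reported(reported).items():
        print(f"{name}: {acc:.2f}% (matches abstract: {ok})")
```

Under these assumptions, each recomputed percentage agrees with the figure stated in the abstract to within rounding.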
Affiliation(s)
- Yunfei He
  - Department of Orthopaedics, Changzheng Hospital Affiliated with Second Military Medical University, Shanghai
  - Department of Orthopaedics, Lanzhou General Hospital of Lanzhou Military Command Region, Lanzhou
- Jun Ma
  - Department of Orthopaedics, Changzheng Hospital Affiliated with Second Military Medical University, Shanghai
- An Wang
  - Department of Orthopaedics, Changzheng Hospital Affiliated with Second Military Medical University, Shanghai
  - Department of Orthopaedics, Shanghai Armed Police Force Hospital, Shanghai, People’s Republic of China
- Weiheng Wang
  - Department of Orthopaedics, Changzheng Hospital Affiliated with Second Military Medical University, Shanghai
- Shengchang Luo
  - Department of Orthopaedics, Changzheng Hospital Affiliated with Second Military Medical University, Shanghai
- Yaoming Liu
  - Department of Orthopaedics, Lanzhou General Hospital of Lanzhou Military Command Region, Lanzhou
- Xiaojian Ye
  - Department of Orthopaedics, Changzheng Hospital Affiliated with Second Military Medical University, Shanghai
|
16
|
Cheng M, An S, Li J. CDKN2B-AS may indirectly regulate coronary artery disease-associated genes via targeting miR-92a. Gene 2017; 629:101-107. [DOI: 10.1016/j.gene.2017.07.070] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 02/15/2017] [Revised: 06/20/2017] [Accepted: 07/27/2017] [Indexed: 12/27/2022]
|