1
Fu M, He R, Zhang Z, Ma F, Shen L, Zhang Y, Duan M, Zhang Y, Wang Y, Zhu L, He J. Multinomial machine learning identifies independent biomarkers by integrated metabolic analysis of acute coronary syndrome. Sci Rep 2023; 13:20535. [PMID: 37996510 PMCID: PMC10667512 DOI: 10.1038/s41598-023-47783-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2023] [Accepted: 11/18/2023] [Indexed: 11/25/2023]
Abstract
A multi-class classification model for acute coronary syndrome (ACS) based on multi-fluid metabolomics has yet to be constructed, and major confounders may exert spurious effects on the relationship between metabolism and ACS. This study aims to identify an independent biomarker panel for the multiclass classification of healthy controls (HC), unstable angina (UA), and acute myocardial infarction (AMI) by integrating serum and urinary metabolomics. We performed a liquid chromatography-tandem mass spectrometry (LC-MS/MS)-based metabolomics study on 300 serum and urine samples from 44 patients with UA, 77 with AMI, and 29 HC. Multinomial machine learning approaches, including multinomial adaptive least absolute shrinkage and selection operator (LASSO) regression and random forest (RF), together with an assessment of the confounders, were applied to integrate a multi-class classification biomarker panel for HC, UA, and AMI. Different metabolic landscapes were portrayed during the transition from HC to UA and then to AMI, with glycerophospholipid metabolism and arginine biosynthesis predominant during this progression. The multiclass metabolic diagnostic model (MDM) dependent on ACS, including 2-ketobutyric acid, LysoPC(18:2(9Z,12Z)), argininosuccinic acid, and cyclic GMP, demarcated HC, UA, and AMI, providing a C-index of 0.84 (HC vs. UA), 0.98 (HC vs. AMI), and 0.89 (UA vs. AMI). The diagnostic value of MDM largely derives from the contribution of 2-ketobutyric acid and LysoPC(18:2(9Z,12Z)) in serum. Higher 2-ketobutyric acid and cyclic GMP levels were positively correlated with ACS risk and atherosclerosis plaque burden, while LysoPC(18:2(9Z,12Z)) and argininosuccinic acid showed the reverse relationship. An independent multiclass biomarker panel for HC, UA, and AMI was constructed using multinomial machine learning methods based on serum and urinary metabolite signatures.
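The abstract reports one C-index per pair of classes. As a rough illustration of how such pairwise values can be computed, the sketch below assumes each sample has a single ordinal risk score (an assumption made here, not something the abstract states) and counts concordant pairs for every two-class comparison. The function names, class labels, and scores are hypothetical and are not the authors' code or data.

```python
from itertools import combinations


def c_index(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample in each class")
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


def pairwise_c_index(class_labels, scores, ordered_classes):
    """C-index for every pair of classes; the later class in the order is 'positive'."""
    result = {}
    for low, high in combinations(ordered_classes, 2):
        # Keep only the samples that belong to the two classes being compared.
        subset = [(s, int(c == high)) for s, c in zip(scores, class_labels)
                  if c in (low, high)]
        result[(low, high)] = c_index([s for s, _ in subset],
                                      [y for _, y in subset])
    return result
```

With the toy input `["HC", "HC", "UA", "UA", "AMI", "AMI"]` and scores `[0.1, 0.4, 0.35, 0.8, 0.7, 0.9]`, the HC vs. AMI pair is perfectly separated while the other two pairs each have one discordant pair out of four.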
Affiliation(s)
- Meijiao Fu
  - Ningxia Medical University, Yinchuan, 750004, Ningxia, China
- Ruhua He
  - Department of Cardiology, General Hospital of Ningxia Medical University, Yinchuan, 750004, Ningxia, China
- Zhihan Zhang
  - Department of Cardiology, Hanzhong Central Hospital, Hanzhong, 723200, Shanxi, China
- Fuqing Ma
  - Department of Cardiology, The Fifth People's Hospital of Ningxia, Shizuishan, 753000, Ningxia, China
- Libo Shen
  - Center for Cardiovascular Diseases, People's Hospital of Ningxia Hui Autonomous Region, Yinchuan, 750002, Ningxia, China
- Yu Zhang
  - Ningxia Medical University, Yinchuan, 750004, Ningxia, China
- Mingyu Duan
  - Ningxia Medical University, Yinchuan, 750004, Ningxia, China
- Yameng Zhang
  - Department of Cardiology, The Second Affiliated Hospital of Henan University of Science and Technology, Luoyang, 471000, Henan, China
- Yifan Wang
  - Department of Radiology, General Hospital of Ningxia Medical University, Yinchuan, 750004, Ningxia, China
- Li Zhu
  - Department of Radiology, General Hospital of Ningxia Medical University, Yinchuan, 750004, Ningxia, China
- Jun He
  - Department of Cardiology, General Hospital of Ningxia Medical University, Yinchuan, 750004, Ningxia, China
2
Yang Q, Gong Y, Zhu F. Critical Assessment of the Biomarker Discovery and Classification Methods for Multiclass Metabolomics. Anal Chem 2023; 95:5542-5552. [PMID: 36944135 DOI: 10.1021/acs.analchem.2c04402] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Indexed: 03/23/2023]
Abstract
Multiclass metabolomics has been widely applied in clinical practice to understand pathophysiological processes involved in disease progression and diagnostic biomarkers of various disorders. In contrast to the binary problem, the multiclass classification problem is more difficult in terms of obtaining reliable and stable results due to the increase in the complexity of determining exact class decision boundaries. In particular, methods of biomarker discovery and classification have a significant effect on the multiclass model because different methods with significantly varied theories produce conflicting results even for the same dataset. However, a systematic assessment for selecting the most appropriate methods of biomarker discovery and classification for multiclass metabolomics is still lacking. Therefore, a comprehensive assessment is essential to measure the suitability of methods in multiclass classification models from multiple perspectives. In this study, five biomarker discovery methods and nine classification methods were assessed based on four benchmark datasets of multiclass metabolomics. The performance assessment of the biomarker discovery and classification methods was performed using three evaluation criteria: assessment a (cluster analysis of sample grouping), assessment b (biomarker consistency in multiple subgroups), and assessment c (accuracy in the classification model). As a result, 13 combining strategies with superior performance were selected under multiple criteria based on these benchmark datasets. In conclusion, superior strategies that performed consistently well are suggested for the discovery of biomarkers and the construction of a classification model for multiclass metabolomics.
Affiliation(s)
- Qingxia Yang
  - Department of Bioinformatics, School of Geographic and Biologic Information, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Yaguo Gong
  - School of Pharmacy, Macau University of Science and Technology, Macau 999078, China
- Feng Zhu
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
3
Nagpal S, Pinna NK, Pant N, Singh R, Srivastava D, Mande SS. Can machines learn the mutation signatures of SARS-CoV-2 and enable viral-genotype guided predictive prognosis? J Mol Biol 2022; 434:167684. [PMID: 35700770 PMCID: PMC9188262 DOI: 10.1016/j.jmb.2022.167684] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/25/2021] [Revised: 06/05/2022] [Accepted: 06/08/2022] [Indexed: 11/30/2022]
Abstract
MOTIVATION Continuous emergence of new variants through the appearance, accumulation, and disappearance of mutations is a hallmark of many viral diseases. SARS-CoV-2 variants in particular have exerted tremendous pressure on the global healthcare system owing to their life-threatening and debilitating implications. The sheer plurality of variants and the huge scale of genomic data have added to the challenges of tracing the mutations/variants and their relationship (if any) to infection severity. RESULTS We explored the suitability of virus-genotype guided machine learning for infection prognosis and for identifying features/mutations-of-interest. A total of 199,519 outcome-traced genomes, representing 45,625 nucleotide mutations, were employed. Among these, after data cleaning, Low and High severity genomes were classified using an integrated model (employing virus genotype, epitopic-influence and patient-age) with consistently high ROC-AUC (Asia: 0.97 ± 0.01, Europe: 0.94 ± 0.01, N. America: 0.92 ± 0.02, Africa: 0.94 ± 0.07, S. America: 0.93 ± 03). Although virus-genotype alone could enable high predictivity (0.97 ± 0.01, 0.89 ± 0.02, 0.86 ± 0.04, 0.95 ± 0.06, 0.9 ± 0.04), the performance was not consistent, and the models for a few geographies displayed significant improvement in predictivity when the influence of age and/or epitope was incorporated with virus-genotype (Wilcoxon p_BH < 0.05). Neither age, epitopic-influence, nor clade information alone could outperform the integrated features. A sparse model (6 features), developed using patient-age and epitopic-influence of the mutations, performed reasonably well (>0.87 ± 0.03, 0.91 ± 0.01, 0.87 ± 0.03, 0.84 ± 0.08, 0.89 ± 0.05). High-performance models were employed for inferring the important mutations-of-interest using Shapley Additive exPlanations (SHAP). The changes in HLA interactions of the mutated epitopes of reference SARS-CoV-2 were subsequently probed.
Notably, we also describe the significance of a 'temporal-modeling approach' to benchmark the models linked with continuously evolving pathogens. We conclude that while machine learning can play a vital role in identifying relevant mutations and factors driving the severity, caution should be exercised in using the genotypic signatures for predictive prognosis.
Affiliation(s)
- Sunil Nagpal
  - Tata Consultancy Services Ltd, Pune 411013, India; CSIR-Institute of Genomics and Integrative Biology (CSIR-IGIB), New Delhi 110025, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad 201002, India. https://twitter.com/NagpalSun
- Nishal Kumar Pinna
  - Tata Consultancy Services Ltd, Pune 411013, India. https://twitter.com/nishal_pinna
- Namrata Pant
  - Tata Consultancy Services Ltd, Pune 411013, India
- Rohan Singh
  - Tata Consultancy Services Ltd, Pune 411013, India
4
Prognostic value of a microRNA-pair signature in laryngeal squamous cell carcinoma patients. Eur Arch Otorhinolaryngol 2022; 279:4451-4460. [PMID: 35478043 DOI: 10.1007/s00405-022-07404-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2022] [Accepted: 04/13/2022] [Indexed: 11/03/2022]
Abstract
PURPOSE Predicting the prognosis in laryngeal squamous cell carcinoma (LSCC) patients will improve clinical decision-making. Here, we aimed to identify a qualitative signature based on the within-sample relative expression orderings (REOs) of microRNA (miRNA) pairs to predict the overall survival (OS) of LSCC patients. METHODS First, we constructed non-repeating miRNA pairs based on differentially expressed miRNAs (DEmiRNAs) between LSCC and normal tissues. Then, we applied a bootstrap-based feature selection method to identify a robust miRNA-pair signature. The bootstrap-based feature selection improved the stability of feature selection by an ensemble based on the data perturbation. Furthermore, a series of bioinformatics analyses were carried out to explore the potential mechanisms of the signature and potential drug targets for LSCC. RESULTS Based on the REOs of miRNA pairs, we identified a qualitative signature that consisted of 12 miRNA pairs. The constructed signature has good performance in predicting the OS of LSCC patients. It is robust against batch effects and more suitable for individual clinical applications. Furthermore, we identified several hub genes that may be potential drug targets for LSCC. CONCLUSION Overall, our findings provided a promising signature for predicting the OS of LSCC patients.
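The robustness to batch effects claimed for relative expression orderings (REOs) follows from the fact that a within-sample ordering is unchanged by any monotone increasing transformation of that sample's measurements. A minimal sketch of the pair-encoding step is given below; the miRNA names and values are hypothetical, and the encoding (1 if the first miRNA is higher than the second) is one plausible convention rather than the authors' documented one.

```python
from itertools import combinations


def reo_features(expression):
    """Encode one sample as binary within-sample orderings of all feature pairs.

    `expression` maps a feature name to its measured level in a single sample.
    Each unordered pair (a, b), with a < b alphabetically, becomes 1 if a is
    expressed above b and 0 otherwise.
    """
    names = sorted(expression)
    return {(a, b): int(expression[a] > expression[b])
            for a, b in combinations(names, 2)}


# Hypothetical sample; the names are placeholders, not miRNAs from the study.
sample = {"miR-a": 3.0, "miR-b": 1.0, "miR-c": 2.0}

# A batch effect modelled as a monotone increasing rescaling of the sample.
rescaled = {name: 10.0 * level + 5.0 for name, level in sample.items()}
```

Calling `reo_features(sample)` and `reo_features(rescaled)` gives identical dictionaries, which is why a signature built on such pairs can be applied to a single sample without cross-sample normalization.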
5
Sevakula RK, Singh V, Verma NK, Kumar C, Cui Y. Transfer Learning for Molecular Cancer Classification Using Deep Neural Networks. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:2089-2100. [PMID: 29993662 DOI: 10.1109/tcbb.2018.2822803] [Citation(s) in RCA: 41] [Impact Index Per Article: 6.8] [Indexed: 06/08/2023]
Abstract
The emergence of deep learning has impacted numerous machine learning based applications and research. The reason for its success lies in two main advantages: 1) it provides the ability to learn very complex non-linear relationships between features and 2) it allows one to leverage information from unlabeled data that does not belong to the problem being handled. This paper presents a transfer learning procedure for cancer classification, which uses feature selection and normalization techniques in conjunction with sparse auto-encoders on gene expression data. While classifying any two tumor types, data of other tumor types were used in an unsupervised manner to improve the feature representation. The performance of our algorithm was tested on 36 two-class benchmark datasets from the GEMLeR repository. Statistical tests show that our algorithm significantly outperforms several commonly used cancer classification approaches. Deep learning based molecular disease classification can be used to guide decisions made on the diagnosis and treatment of diseases, and therefore may have important applications in precision medicine.
6
Alanni R, Hou J, Azzawi H, Xiang Y. Cancer adjuvant chemotherapy prediction model for non‐small cell lung cancer. IET Syst Biol 2019; 13:129-135. [DOI: 10.1049/iet-syb.2018.5060] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 01/21/2023]
Affiliation(s)
- Russul Alanni
  - School of Information Technology, Deakin University, Burwood, Australia
- Jingyu Hou
  - School of Information Technology, Deakin University, Burwood, Australia
- Hasseeb Azzawi
  - School of Information Technology, Deakin University, Burwood, Australia
- Yong Xiang
  - School of Information Technology, Deakin University, Burwood, Australia
7
Singh V, Verma NK, Cui Y. Type-2 Fuzzy PCA Approach in Extracting Salient Features for Molecular Cancer Diagnostics and Prognostics. IEEE Trans Nanobioscience 2019; 18:482-489. [PMID: 31107656 DOI: 10.1109/tnb.2019.2917814] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 11/07/2022]
Abstract
Machine learning is becoming a powerful tool for cancer diagnosis and prognosis based on classification using high dimensional molecular data. However, extracting classification features from high-dimensional datasets remains a challenging problem. Principal component analysis (PCA) is a widely used method for dimensionality reduction. However, it is well-known that PCA and most PCA-based feature extraction methods are sensitive to noise, which may affect the accuracy of the subsequent classification. To address this problem, here we have proposed a robust fuzzy principal component analysis (PCA) with interval type-2 (IT-2) fuzzy membership functions for feature extraction. We have tested the performance of three widely used classifiers using the features extracted by proposed approaches and other feature extraction methods - PCA-based feature extraction methods (i.e. conventional PCA and fuzzy PCA), linear discriminant analysis (LDA), and support vector machine recursive feature elimination (SVM-RFE). The proposed feature extraction approaches showed better performance on cancer transcriptome and proteome datasets.
8
Li Y, Pan Y, Liu Z. Multiclass Nonnegative Matrix Factorization for Comprehensive Feature Pattern Discovery. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2019; 30:615-629. [PMID: 30010601 DOI: 10.1109/tnnls.2018.2849932] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/08/2023]
Abstract
In this big data era, interpretable machine learning models are strongly demanded for the comprehensive analytics of large-scale multiclass data. Characterizing all features from such data is a key but challenging step to understand the complexity. However, existing feature selection methods do not meet this need. In this paper, to address this problem, we propose a Bayesian multiclass nonnegative matrix factorization (MC-NMF) model with structured sparsity that is able to discover ubiquitous and class-specific features. Variational update rules were derived for efficient decomposition. In order to relieve the need of model selection and stably describe feature patterns, we further propose MC-NMF with stability selection, an ensemble method that collectively detects feature patterns from many runs of MC-NMF using different hyperparameter values and training subsets. We assessed our models on both simulated count data and multitumor ribonucleic acid-seq data. The experiments revealed that our models were able to recover predefined feature patterns from the simulated data and identify biologically meaningful patterns from the pan-cancer data.
9
Abstract
The advent of DNA microarray datasets has stimulated a new line of research both in bioinformatics and in machine learning. This type of data is used to collect information from tissue and cell samples regarding gene expression differences that could be useful for disease diagnosis or for distinguishing specific types of tumor. Microarray data classification is a difficult challenge for machine learning researchers due to its high number of features and small sample sizes. This chapter is devoted to reviewing the microarray databases most frequently used in the literature. We also make the interested reader aware of problematic data characteristics in this domain, such as class imbalance, data complexity, and the so-called dataset shift.
10
Al-Anni R, Hou J, Abdu-Aljabar RD, Xiang Y. Prediction of NSCLC recurrence from microarray data with GEP. IET Syst Biol 2017; 11:77-85. [PMID: 28518058 DOI: 10.1049/iet-syb.2016.0033] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 01/22/2023]
Abstract
Lung cancer is one of the deadliest diseases in the world. Non-small cell lung cancer (NSCLC) is the most common and dangerous type of lung cancer. Despite the fact that NSCLC is preventable and curable for some cases if diagnosed at early stages, the vast majority of patients are diagnosed very late. Furthermore, NSCLC usually recurs sometime after treatment. Therefore, it is of paramount importance to predict NSCLC recurrence, so that specific and suitable treatments can be sought. Nonetheless, conventional methods of predicting cancer recurrence rely solely on histopathology data and predictions are not reliable in many cases. The microarray gene expression (GE) technology provides a promising and reliable way to predict NSCLC recurrence by analysing the GE of sample cells. This study proposes a new model from GE programming to use microarray datasets for NSCLC recurrence prediction. To this end, the authors also propose a hybrid method to rank and select relevant prognostic genes that are related to NSCLC recurrence prediction. The proposed model was evaluated on real NSCLC microarray datasets and compared with other representational models. The results demonstrated the effectiveness of the proposed model.
Affiliation(s)
- Russul Al-Anni
  - School of Information Technology, Deakin University, Victoria, Australia
- Jingyu Hou
  - School of Information Technology, Deakin University, Victoria, Australia
- Yong Xiang
  - School of Information Technology, Deakin University, Victoria, Australia
11
Urda D, Luque-Baena RM, Franco L, Jerez JM, Sanchez-Marono N. Machine learning models to search relevant genetic signatures in clinical context. 2017 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN) 2017:1649-1656. [DOI: 10.1109/ijcnn.2017.7966049] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/04/2025]
12
A Meta-Review of Feature Selection Techniques in the Context of Microarray Data. BIOINFORMATICS AND BIOMEDICAL ENGINEERING 2017. [DOI: 10.1007/978-3-319-56148-6_3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 12/18/2022]
13
Weighted doubly regularized support vector machine and its application to microarray classification with noise. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2015.08.002] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Indexed: 11/20/2022]
14
Vu T, Sima C, Braga-Neto UM, Dougherty ER. Unbiased bootstrap error estimation for linear discriminant analysis. EURASIP JOURNAL ON BIOINFORMATICS & SYSTEMS BIOLOGY 2014; 2014:15. [PMID: 28194165 PMCID: PMC5270504 DOI: 10.1186/s13637-014-0015-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 02/17/2014] [Accepted: 08/18/2014] [Indexed: 11/26/2022]
Abstract
Convex bootstrap error estimation is a popular tool for classifier error estimation in gene expression studies. A basic question is how to determine the weight for the convex combination between the basic bootstrap estimator and the resubstitution estimator such that the resulting estimator is unbiased at finite sample sizes. The well-known 0.632 bootstrap error estimator uses asymptotic arguments to propose a fixed 0.632 weight, whereas the more recent 0.632+ bootstrap error estimator attempts to set the weight adaptively. In this paper, we study the finite sample problem in the case of linear discriminant analysis under Gaussian populations. We derive exact expressions for the weight that guarantee unbiasedness of the convex bootstrap error estimator in the univariate and multivariate cases, without making asymptotic simplifications. Using exact computation in the univariate case and an accurate approximation in the multivariate case, we obtain the required weight and show that it can deviate significantly from the constant 0.632 weight, depending on the sample size and Bayes error for the problem. The methodology is illustrated by application on data from a well-known cancer classification study.
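The convex estimator discussed in this abstract combines the resubstitution error and the basic bootstrap error as (1 - w) * resubstitution + w * bootstrap, with w = 0.632 in the classical fixed-weight version. The sketch below illustrates that combination for a univariate two-class problem using a midpoint-of-means rule (univariate LDA under equal priors and a common variance). It is a simplified illustration, not the paper's exact finite-sample weight derivation: the bootstrap term here averages per-replicate held-out error rates, and all function names are hypothetical.

```python
import random
from statistics import mean


def fit_threshold(xs, ys):
    """Univariate LDA with equal priors and common variance: midpoint of class means."""
    m0 = mean(x for x, y in zip(xs, ys) if y == 0)
    m1 = mean(x for x, y in zip(xs, ys) if y == 1)
    return (m0 + m1) / 2.0, m1 >= m0


def error_rate(model, xs, ys):
    """Misclassification rate of a fitted threshold rule on the given points."""
    threshold, one_is_high = model
    wrong = sum(int((x > threshold) == one_is_high) != y for x, y in zip(xs, ys))
    return wrong / len(xs)


def basic_bootstrap_error(xs, ys, n_boot=200, seed=0):
    """Average error on points left out of each bootstrap resample."""
    rng = random.Random(seed)
    n = len(xs)
    errors = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        drawn = set(idx)
        held_out = [i for i in range(n) if i not in drawn]
        train_y = [ys[i] for i in idx]
        # Skip degenerate resamples (nothing held out, or only one class drawn).
        if not held_out or len(set(train_y)) < 2:
            continue
        model = fit_threshold([xs[i] for i in idx], train_y)
        errors.append(error_rate(model,
                                 [xs[i] for i in held_out],
                                 [ys[i] for i in held_out]))
    return mean(errors)


def convex_bootstrap_error(resub_error, boot_error, weight=0.632):
    """Convex combination of resubstitution and basic bootstrap error."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return (1.0 - weight) * resub_error + weight * boot_error
```

The `weight` argument is where a sample-size-dependent value of the kind the paper derives would be substituted for the constant 0.632.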
Affiliation(s)
- Thang Vu
  - Department of Electrical and Computer Engineering, Texas A&M University, 3128 TAMU, College Station, 77843 TX USA
- Chao Sima
  - Center for Bioinformatics and Genomic Systems Engineering, Texas A&M University, 101 Gateway, Suite A, College Station, 77845 TX USA
- Ulisses M Braga-Neto
  - Department of Electrical and Computer Engineering, Texas A&M University, 3128 TAMU, College Station, 77843 TX USA
  - Center for Bioinformatics and Genomic Systems Engineering, Texas A&M University, 101 Gateway, Suite A, College Station, 77845 TX USA
- Edward R Dougherty
  - Department of Electrical and Computer Engineering, Texas A&M University, 3128 TAMU, College Station, 77843 TX USA
  - Center for Bioinformatics and Genomic Systems Engineering, Texas A&M University, 101 Gateway, Suite A, College Station, 77845 TX USA
15
Bolón-Canedo V, Sánchez-Maroño N, Alonso-Betanzos A, Benítez J, Herrera F. A review of microarray datasets and applied feature selection methods. Inf Sci (N Y) 2014. [DOI: 10.1016/j.ins.2014.05.042] [Citation(s) in RCA: 386] [Impact Index Per Article: 35.1] [Indexed: 11/28/2022]
16
Luque-Baena RM, Urda D, Subirats JL, Franco L, Jerez JM. Application of genetic algorithms and constructive neural networks for the analysis of microarray cancer data. Theor Biol Med Model 2014; 11 Suppl 1:S7. [PMID: 25077572 PMCID: PMC4108856 DOI: 10.1186/1742-4682-11-s1-s7] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Indexed: 11/10/2022]
Abstract
Background Extracting relevant information from microarray data is a very complex task due to the characteristics of the data sets, which comprise a large number of features while few samples are generally available. Feature selection is therefore a very important aspect of the analysis, helping both to identify relevant genes and to maximize predictive information. Methods Due to its simplicity and speed, Stepwise Forward Selection (SFS) is a widely used feature selection technique. In this work, we carry out a comparative study of SFS and Genetic Algorithms (GA) as general frameworks for the analysis of microarray data, with the aim of identifying groups of genes with high predictive capability and biological relevance. Six standard and machine learning-based techniques (Linear Discriminant Analysis (LDA), Support Vector Machines (SVM), Naive Bayes (NB), C-MANTEC Constructive Neural Network, K-Nearest Neighbors (kNN) and Multilayer perceptron (MLP)) are used within both frameworks on six freely available public datasets for the task of predicting cancer outcome. Results Better cancer outcome prediction results were obtained using the GA framework, noting that this approach, in comparison to SFS, leads to a larger selection set, uses a large number of comparisons between genetic profiles, and is thus computationally more intensive. The GA framework also yielded a set of genes that can be considered more biologically relevant. Regarding the different classifiers used, standard feedforward neural networks (MLP), LDA and SVM led to similar and best results, while C-MANTEC and kNN followed closely but with lower accuracy. Further, C-MANTEC, MLP and LDA yielded a more limited set of genes in comparison to SVM, NB and kNN, and C-MANTEC in particular proved the most robust classifier with respect to changes in the parameter settings.
Conclusions This study shows that if prediction accuracy is the objective, the GA-based approach leads to better results than the SFS approach, independently of the classifier used. Regarding classifiers, even though C-MANTEC did not achieve the best overall results, its performance was competitive and very robust with respect to the parameters of the algorithm, and thus it can be considered a candidate technique for future studies.
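For readers unfamiliar with stepwise forward selection, the greedy loop it refers to can be sketched as follows: starting from an empty set, repeatedly add the single feature that most improves a score, and stop when no candidate helps. The scoring function here is a hypothetical stand-in (in the study it would be a classifier's estimated predictive performance), and the gene names and weights are invented for illustration.

```python
def forward_selection(features, score, max_features):
    """Greedy forward selection.

    `score` takes a list of feature names and returns a number to maximize.
    Selection stops when no remaining feature improves the current best score
    or when `max_features` features have been chosen.
    """
    selected = []
    best = float("-inf")
    remaining = list(features)
    while remaining and len(selected) < max_features:
        # Evaluate every candidate extension of the current subset.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda pair: pair[0])
        if top_score <= best:
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


# Hypothetical additive score: each gene contributes a fixed amount.
toy_weights = {"g1": 3.0, "g2": 2.0, "g3": -1.0}


def toy_score(subset):
    return sum(toy_weights[g] for g in subset)
```

On the toy score, `forward_selection(["g1", "g2", "g3"], toy_score, 3)` picks `g1`, then `g2`, and stops because adding `g3` would lower the score. Each round costs one score evaluation per remaining feature, which is the source of SFS's speed relative to a population-based search such as a GA.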
17
Tian S, Chang HH, Wang C, Jiang J, Wang X, Niu J. Multi-TGDR, a multi-class regularization method, identifies the metabolic profiles of hepatocellular carcinoma and cirrhosis infected with hepatitis B or hepatitis C virus. BMC Bioinformatics 2014; 15:97. [PMID: 24707821 PMCID: PMC4234477 DOI: 10.1186/1471-2105-15-97] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 09/05/2013] [Accepted: 03/25/2014] [Indexed: 01/18/2023]
Abstract
Background Over the last decade, metabolomics has evolved into a mainstream enterprise utilized by many laboratories globally. Like other “omics” data, metabolomics data has the characteristics of a smaller sample size compared to the number of features evaluated. Thus the selection of an optimal subset of features with a supervised classifier is imperative. We extended an existing feature selection algorithm, threshold gradient descent regularization (TGDR), to handle multi-class classification of “omics” data, and proposed two such extensions referred to as multi-TGDR. Both multi-TGDR frameworks were used to analyze a metabolomics dataset that compares the metabolic profiles of hepatocellular carcinoma (HCC) infected with hepatitis B (HBV) or C virus (HCV) with that of cirrhosis induced by HBV/HCV infection; the goal was to improve early-stage diagnosis of HCC. Results We applied two multi-TGDR frameworks to the HCC metabolomics data that determined TGDR thresholds either globally across classes, or locally for each class. Multi-TGDR global model selected 45 metabolites with a 0% misclassification rate (the error rate on the training data) and had a 3.82% 5-fold cross-validation (CV-5) predictive error rate. Multi-TGDR local selected 48 metabolites with a 0% misclassification rate and a 5.34% CV-5 error rate. Conclusions One important advantage of multi-TGDR local is that it allows inference for determining which feature is related specifically to the class/classes. Thus, we recommend multi-TGDR local be used because it has similar predictive performance and requires the same computing time as multi-TGDR global, but may provide class-specific inference.
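The distinction between global and local thresholds can be made concrete with a single update step. In threshold gradient descent regularization as it is commonly described, only coefficients whose gradient magnitude reaches a fraction tau of the largest gradient magnitude are updated. The sketch below assumes that rule and contrasts taking the maximum over all classes (global) with taking it separately per class (local). It is an illustration under those assumptions, not the authors' implementation, and the function name and numbers are hypothetical.

```python
def tgdr_step(gradients, tau, step, local=False):
    """One thresholded gradient step for a multi-class model.

    `gradients` is a list of rows, one per class, each holding the gradient of
    the objective with respect to that class's feature coefficients. A
    coefficient is updated only if |gradient| >= tau * reference, where the
    reference is the largest |gradient| over all classes (global) or within
    the same class (local). Returns the coefficient increments.
    """
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    global_max = max(abs(g) for row in gradients for g in row)
    updates = []
    for row in gradients:
        reference = max(abs(g) for g in row) if local else global_max
        cutoff = tau * reference
        updates.append([step * g if abs(g) >= cutoff else 0.0 for g in row])
    return updates


# Hypothetical gradients for two classes and three features.
toy_gradients = [[0.9, 0.1, -0.5],
                 [0.2, -0.3, 0.05]]
```

With `tau=0.5`, the global variant leaves the second class untouched because none of its gradients reach half of 0.9, while the local variant updates its two largest entries. This per-class behaviour is what allows the local version to attribute a selected feature to specific classes.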
Affiliation(s)
- Suyan Tian
  - Division of Clinical Epidemiology, First Hospital of the Jilin University, 71 Xinmin Street, Changchun, Jilin 130021, China
18
Kinoshita R, Iwadate M, Umeyama H, Taguchi YH. Genes associated with genotype-specific DNA methylation in squamous cell carcinoma as candidate drug targets. BMC SYSTEMS BIOLOGY 2014; 8 Suppl 1:S4. [PMID: 24565165 PMCID: PMC4080267 DOI: 10.1186/1752-0509-8-s1-s4] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.8] [Indexed: 12/12/2022]
Abstract
Background Aberrant DNA methylation is often associated with cancers. Thus, screening genes with cancer-associated aberrant DNA methylation is a useful method to identify candidate cancer-causing genes. Aberrant DNA methylation is also genotype dependent. Thus, the selection of genes with genotype-specific aberrant DNA methylation in cancers is potentially important for tailor-made medicine. The selected genes are important candidate drug targets. Results The recently proposed principal component analysis based selection of genes with aberrant DNA methylation was applied to genotype and DNA methylation patterns in squamous cell carcinoma measured using single nucleotide polymorphism (SNP) arrays. SNPs that are frequently found in cancers are usually highly methylated, and the genes that were selected using this method were reported previously to be related to cancers. Thus, genes with genotype-specific DNA methylation patterns will be good therapeutic candidates. The tertiary structures of the proteins encoded by the selected genes were successfully inferred using two profile-based protein structure servers, FAMS and Phyre2. Candidate drugs for three of these proteins, tyrosine kinase receptor (ALK), EGLN3 protein, and NUAK family SNF1-like kinase 1 (NUAK1), were identified by ChooseLD. Conclusions We detected genes with genotype-specific DNA methylation in squamous cell carcinoma that are candidate drug targets. Using in silico drug discovery, we successfully identified several candidate drugs for the ALK, EGLN3 and NUAK1 genes that displayed genotype-specific DNA methylation.
19
Applications of Bayesian gene selection and classification with mixtures of generalized singular g-priors. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2014; 2013:420412. [PMID: 24382981 PMCID: PMC3870637 DOI: 10.1155/2013/420412] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/04/2013] [Revised: 11/10/2013] [Accepted: 11/10/2013] [Indexed: 11/17/2022]
Abstract
Recent advancement in microarray technologies has led to a collection of an enormous number of genetic markers in disease association studies, and yet scientists are interested in selecting a smaller set of genes to explore the relation between genes and disease. Current approaches either adopt a single marker test which ignores the possible interaction among genes or consider a multistage procedure that reduces the large size of genes before evaluation of the association. Among the latter, Bayesian analysis can further accommodate the correlation between genes through the specification of a multivariate prior distribution and estimate the probabilities of association through latent variables. The covariance matrix, however, depends on an unknown parameter. In this research, we suggested a reference hyperprior distribution for such uncertainty, outlined the implementation of its computation, and illustrated this fully Bayesian approach with a colon and leukemia cancer study. Comparison with other existing methods was also conducted. The classification accuracy of our proposed model is higher with a smaller set of selected genes. The results not only replicated findings in several earlier studies, but also provided the strength of association with posterior probabilities.
20
Shaik R, Ramakrishna W. Machine learning approaches distinguish multiple stress conditions using stress-responsive genes and identify candidate genes for broad resistance in rice. PLANT PHYSIOLOGY 2014; 164:481-95. [PMID: 24235132 PMCID: PMC3875824 DOI: 10.1104/pp.113.225862] [Citation(s) in RCA: 74] [Impact Index Per Article: 6.7] [Received: 07/31/2013] [Accepted: 11/13/2013] [Indexed: 05/18/2023]
Abstract
Abiotic and biotic stress responses are traditionally thought to be regulated by discrete signaling mechanisms. Recent experimental evidence revealed a more complex picture where these mechanisms are highly entangled and can have synergistic and antagonistic effects on each other. In this study, we identified shared stress-responsive genes between abiotic and biotic stresses in rice (Oryza sativa) by performing meta-analyses of microarray studies. About 70% of the 1,377 common differentially expressed genes showed conserved expression status, and the majority of the rest were down-regulated in abiotic stresses and up-regulated in biotic stresses. Using dimension reduction techniques, principal component analysis, and partial least squares discriminant analysis, we were able to segregate abiotic and biotic stresses into separate entities. The supervised machine learning model, recursive-support vector machine, could classify abiotic and biotic stresses with 100% accuracy using a subset of differentially expressed genes. Furthermore, using a random forests decision tree model, eight out of 10 stress conditions were classified with high accuracy. Comparison of genes contributing most to the accurate classification by partial least squares discriminant analysis, recursive-support vector machine, and random forests revealed 196 common genes with a dynamic range of expression levels in multiple stresses. Functional enrichment and coexpression network analysis revealed the different roles of transcription factors and genes responding to phytohormones or modulating hormone levels in the regulation of stress responses. We envisage the top-ranked genes identified in this study, which highly discriminate abiotic and biotic stresses, as key components to further our understanding of the inherently complex nature of multiple stress responses in plants.