1
Zhang ZG, Xu L, Zhang PJ, Han L. Evaluation of the value of multiparameter combined analysis of serum markers in the early diagnosis of gastric cancer. World J Gastrointest Oncol 2020; 12:483-491. [PMID: 32368325] [PMCID: PMC7191329] [DOI: 10.4251/wjgo.v12.i4.483]
Abstract
BACKGROUND In early gastric cancer (GC), tumor marker levels in the blood are elevated, and these levels have been used as important indexes for GC screening, early diagnosis, and prognostic evaluation. However, no tumor marker specific to GC has yet been discovered; diagnosis based on a single tumor marker has limited significance, and the detection rate of GC remains very low.
AIM To improve the diagnostic value of blood markers for GC.
METHODS We performed a multiparameter joint analysis of 77 indexes comparing malignant GC with gastric polyp (GP) and 64 indexes comparing GC with healthy controls (Ctrls).
RESULTS In the Ctrls vs GC comparison, 27 indexes had P values < 0.01; albumin showed the largest area under the curve (AUC), 0.907. In the GP vs GC comparison, 30 indexes had P values < 0.01; among them, D-dimer showed an AUC of 0.729. These 27 and 30 indexes were used to build binary logistic regression, discriminant analysis, classification tree, and artificial neural network models. For distinguishing both Ctrls vs GC and GP vs GC, the artificial neural network had better diagnostic value than the classification tree, binary logistic regression, and discriminant analysis. For Ctrls vs GC, the overall prediction accuracy was 92.9% and the AUC was 0.992 (0.980, 1.000); for GP vs GC, the overall prediction accuracy was 77.9% and the AUC was 0.969 (0.948, 0.990).
CONCLUSION The diagnostic value of multiparameter joint artificial neural network analysis is significantly better than that of any single-index test, and it may provide an auxiliary method for the detection of GC.
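The single-marker AUC values quoted in this abstract are equivalent to the Mann-Whitney statistic: the probability that a randomly chosen case scores above a randomly chosen control. As a minimal illustrative sketch (not the authors' code; function and variable names are ours), the AUC can be computed directly from the two score samples:

```python
import numpy as np

def auc(case_scores, control_scores):
    """AUC as the fraction of (case, control) score pairs ranked
    correctly; tied pairs count as one half."""
    case = np.asarray(case_scores, dtype=float)
    ctrl = np.asarray(control_scores, dtype=float)
    diff = case[:, None] - ctrl[None, :]  # all pairwise comparisons
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size
```

A marker that ranks every case above every control yields 1.0; an uninformative marker hovers near 0.5.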
Affiliation(s)
- Zhi-Guo Zhang
- Department of Oncology, Beijing Daxing District People’s Hospital, Beijing 102600, China
- Liang Xu
- Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education/Beijing), Interventional Therapy Department, Peking University Cancer Hospital and Institute, Beijing 100142, China
- Peng-Jun Zhang
- Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education/Beijing), Interventional Therapy Department, Peking University Cancer Hospital and Institute, Beijing 100142, China
- Lei Han
- Department of Oncology, Beijing Daxing District People’s Hospital, Beijing 102600, China
2
Liu Z, Elashoff D, Piantadosi S. Sparse support vector machines with L0 approximation for ultra-high dimensional omics data. Artif Intell Med 2019; 96:134-141. [PMID: 31164207] [DOI: 10.1016/j.artmed.2019.04.004]
Abstract
Omics data usually have ultra-high dimension (p) and small sample size (n). Standard support vector machines (SVMs), which minimize the L2 norm of the primal variables, only lead to sparse solutions for the dual variables. L1-based SVMs, directly minimizing the L1 norm, have been used for feature selection with omics data. However, most current methods directly solve the primal formulation of the problem, which is not computationally scalable: the computational complexity increases with the number of features. In addition, the L1 norm is known to be asymptotically biased and inconsistent for feature selection. In this paper, we develop an efficient method for sparse support vector machines with L0-norm approximation. The proposed method approximates the L0 minimization by solving a series of L2 optimization problems, which can be formulated with dual variables. It finds the optimal solution for the p primal variables by estimating n dual variables, which is more efficient when the sample size is small. The L0 approximation leads to sparsity in both dual and primal variables and can be used for both feature and sample selection. The proposed method identifies far fewer features while achieving similar performance in simulations. We apply the proposed method to feature selection with metagenomic sequencing and gene expression data, where it efficiently identifies biologically important genes and taxa.
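For contrast with the L0 approach described above, the L1-based sparse SVM that the abstract references can be sketched with a simple proximal subgradient loop. This is an illustrative toy implementation, not the authors' method; all names and parameters are ours:

```python
import numpy as np

def sparse_svm_l1(X, y, lam=0.1, lr=0.01, n_iter=2000):
    """Hinge-loss SVM with an L1 penalty, fit by proximal subgradient
    descent; soft-thresholding drives irrelevant weights to exact zero.
    y must contain labels in {-1, +1}."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        margin = y * (X @ w)
        # subgradient of the average hinge loss over margin violators
        grad = -(X.T @ (y * (margin < 1))) / n
        w -= lr * grad
        # proximal step for the L1 penalty (soft-thresholding)
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w
```

Nonzero entries of the returned weight vector are the selected features; this is the primal formulation whose cost grows with p, which is exactly the scalability limitation the paper's dual L0 method targets.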
Affiliation(s)
- Zhenqiu Liu
- Department of Public Health Sciences, Penn State College of Medicine, Hershey, PA 17033, USA.
- David Elashoff
- Department of Medicine, University of California at Los Angeles, CA 90024, USA
- Steven Piantadosi
- Samuel Oschin Cancer Center, Cedars-Sinai Medical Center, Los Angeles, CA 90048, USA
3
Li J, Dong W, Meng D. Grouped gene selection of cancer via adaptive sparse group lasso based on conditional mutual information. IEEE/ACM Trans Comput Biol Bioinform 2018; 15:2028-2038. [PMID: 29028206] [DOI: 10.1109/tcbb.2017.2761871]
Abstract
This paper deals with the problems of cancer classification and grouped gene selection. A weighted gene co-expression network built on cancer microarray data is employed to identify modules corresponding to biological pathways, based on which a strategy for dividing genes into groups is presented. Using the conditional mutual information within each group, an integrated criterion is proposed and data-driven weights are constructed; these weights are shown to capture both the significance of individual genes and their influence on improving the pairwise correlations among the other genes in each group. Furthermore, an adaptive sparse group lasso is proposed, for which an improved blockwise descent algorithm is developed. Results on four cancer datasets demonstrate that the proposed adaptive sparse group lasso can effectively perform classification and grouped gene selection.
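The blockwise descent referenced above repeatedly applies, group by group, the proximal operator of the sparse group lasso penalty. A sketch of that operator for the standard (non-adaptive) penalty, with illustrative names of our own:

```python
import numpy as np

def sgl_prox(w, lam1, lam2):
    """Proximal operator of lam1*||w||_1 + lam2*||w||_2 for one group:
    elementwise soft-thresholding, then group-level shrinkage that can
    zero out the entire group at once."""
    u = np.sign(w) * np.maximum(np.abs(w) - lam1, 0.0)
    norm = np.linalg.norm(u)
    if norm <= lam2:
        return np.zeros_like(w)      # whole group deselected
    return (1.0 - lam2 / norm) * u   # group survives, shrunk
```

Weak groups collapse to all zeros at once, while individual genes inside a surviving group can still be zeroed — the grouped-selection behavior the abstract describes.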
4
Gui J, Sun Z, Ji S, Tao D, Tan T. Feature selection based on structured sparsity: A comprehensive study. IEEE Trans Neural Netw Learn Syst 2017; 28:1490-1507. [PMID: 28287983] [DOI: 10.1109/tnnls.2016.2551724]
Abstract
Feature selection (FS) is an important component of many pattern recognition tasks in which one is often confronted with very high-dimensional data. FS algorithms are designed to identify the relevant feature subset from the original features, which can facilitate subsequent analysis such as clustering and classification. Structured sparsity-inducing feature selection (SSFS) methods have been widely studied in the last few years, and a number of algorithms have been proposed. However, there is no comprehensive study of the connections between different SSFS methods or of how they have evolved. In this paper, we attempt to provide a survey of various SSFS methods, including their motivations and mathematical representations. We then explore the relationships among different formulations and propose a taxonomy to elucidate their evolution. We group the existing SSFS methods into two categories: vector-based feature selection (based on the lasso) and matrix-based feature selection (based on the l_{r,p}-norm). Furthermore, FS has been combined with other machine learning algorithms for specific applications, such as multitask learning, multilabel learning, multiview learning, classification, and clustering. This paper not only compares the differences and commonalities of these methods in terms of regression and regularization strategies, but also provides useful guidelines for practitioners in related fields on how to perform feature selection.
5
Efficient regularized regression with L0 penalty for variable selection and network construction. Comput Math Methods Med 2016; 2016:3456153. [PMID: 27843486] [PMCID: PMC5098106] [DOI: 10.1155/2016/3456153]
Abstract
Variable selection for regression with high-dimensional big data has found many applications in bioinformatics and computational biology. One appealing approach is L0-regularized regression, which directly penalizes the number of nonzero features in the model. However, it is well known that L0 optimization is NP-hard and computationally challenging. In this paper, we propose efficient EM (L0EM) and dual L0EM (DL0EM) algorithms that directly approximate the L0 optimization problem. While L0EM is efficient with large sample sizes, DL0EM is efficient with high-dimensional (n ≪ m) data. They also provide a natural solution to all Lp (p ∈ [0,2]) problems, including the lasso with p = 1 and the elastic net with p ∈ [1,2]. The regularization parameter λ can be determined through cross-validation or AIC and BIC. We demonstrate our methods through simulation and high-dimensional genomic data. The results indicate that L0 has better performance than the lasso, SCAD, and MC+, and that L0 with AIC or BIC performs similarly to computationally intensive cross-validation. The proposed algorithms are efficient in identifying the nonzero variables with less bias and in constructing biologically important networks with high-dimensional big data.
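The E/M cycle described above can be sketched as an iteratively reweighted ridge regression solved in the n-dimensional dual, which is what keeps the dual variant cheap when samples are far fewer than features. This is an illustrative reconstruction under our own parameter choices and thresholds, not the authors' implementation:

```python
import numpy as np

def l0em(X, y, lam=0.5, n_iter=100, tol=1e-8):
    """Approximate L0-penalized regression by iteratively reweighted
    ridge: each step solves a weighted L2 problem via an n x n dual
    system, so cost stays low when features vastly outnumber samples."""
    n, p = X.shape
    # ridge initialization via the dual form
    w = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)
    for _ in range(n_iter):
        d = w * w                                   # weights from current estimate
        A = X @ (d[:, None] * X.T) + lam * np.eye(n)
        w_new = d * (X.T @ np.linalg.solve(A, y))   # weighted dual ridge update
        if np.max(np.abs(w_new - w)) < tol:
            w = w_new
            break
        w = w_new
    w[np.abs(w) < 1e-6] = 0.0  # declare fully shrunk coefficients exact zeros
    return w
```

Each update multiplies coefficients by their own squared magnitude, so small coefficients shrink toward exact zero while large ones are left nearly unpenalized — the bias-reduction property the abstract claims over the lasso.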
6
Peng JX, Rafferty K, Ferguson S. Building support vector machines in the context of regularized least squares. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2016.03.087]
7
Wang L, Wang Y, Chang Q. Feature selection methods for big data bioinformatics: A survey from the search perspective. Methods 2016; 111:21-31. [PMID: 27592382] [DOI: 10.1016/j.ymeth.2016.08.014]
Abstract
This paper surveys the main principles of feature selection and their recent applications in big data bioinformatics. Instead of the commonly used categorization into filter, wrapper, and embedded approaches, we formulate feature selection as a combinatorial optimization (search) problem and categorize feature selection methods into exhaustive search, heuristic search, and hybrid methods, where heuristic search methods may be further categorized into those with or without data-distilled feature ranking measures.
Affiliation(s)
- Lipo Wang
- School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore.
- Yaoli Wang
- College of Information Engineering, Taiyuan University of Technology, Taiyuan, China.
- Qing Chang
- College of Information Engineering, Taiyuan University of Technology, Taiyuan, China.
8
Ghanat Bari M, Ma X, Zhang J. PeakLink: a new peptide peak linking method in LC-MS/MS using wavelet and SVM. Bioinformatics 2014; 30:2464-2470. [PMID: 24813213] [DOI: 10.1093/bioinformatics/btu299]
Abstract
MOTIVATION In liquid chromatography-mass spectrometry/tandem mass spectrometry (LC-MS/MS), it is necessary to link tandem-MS-identified peptide peaks across runs so that protein expression changes between the two runs can be tracked. However, only a small number of peptides can be identified and linked by tandem MS in both runs, so peptide peaks with tandem identification in one run must be linked to their corresponding peaks in another run without identification. In the past, peptide peaks have been linked based on similarities in retention time (rt), mass, or peak shape after rt alignment, which corrects mean rt shifts between runs. However, linking accuracy is still limited, especially for complex samples collected under different conditions. Consequently, large-scale proteomics studies that require comparing the protein expression profiles of hundreds of patients cannot be carried out effectively. METHOD In this article, we consider the problem of linking peptides from a pair of LC-MS/MS runs and propose a new method, PeakLink (PL), which uses information in both the time and frequency domains as inputs to a nonlinear support vector machine (SVM) classifier. The PL algorithm first uses a threshold on an rt likelihood ratio score to remove candidate corresponding peaks with excessively large elution time shifts; PL then calculates the correlation between a pair of candidate peaks after reducing noise through wavelet transformation. After converting rt and peak-shape correlation to statistical scores, an SVM classifier is trained and applied to differentiate corresponding and non-corresponding peptide peaks. RESULTS PL is tested in multiple challenging cases in which LC-MS/MS samples are collected from different disease states, different instruments, and different laboratories. Testing results show significant improvement in linking accuracy compared with other algorithms.
AVAILABILITY AND IMPLEMENTATION M files for the PL alignment method are available at http://compgenomics.utsa.edu/zgroup/PeakLink. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mehrab Ghanat Bari
- Department of Electrical and Computer Engineering, The University of Texas at San Antonio, San Antonio, TX 78246, USA
- Xuepo Ma
- Department of Electrical and Computer Engineering, The University of Texas at San Antonio, San Antonio, TX 78246, USA
- Jianqiu Zhang
- Department of Electrical and Computer Engineering, The University of Texas at San Antonio, San Antonio, TX 78246, USA
9
Liu Z, Chen D, Sheng L, Liu AY. Class prediction and feature selection with linear optimization for metagenomic count data. PLoS One 2013; 8:e53253. [PMID: 23555553] [PMCID: PMC3608598] [DOI: 10.1371/journal.pone.0053253]
Abstract
The amount of metagenomic data is growing rapidly while the computational methods for metagenome analysis are still in their infancy. It is important to develop novel statistical learning tools for the prediction of associations between bacterial communities and disease phenotypes and for the detection of differentially abundant features. In this study, we present a novel statistical learning method for simultaneous association prediction and feature selection with metagenomic samples from two or multiple treatment populations on the basis of count data. We developed a linear programming based support vector machine with L1 and joint L1,∞ penalties for binary and multiclass classification with metagenomic count data (metalinprog). We evaluated the performance of our method on several real and simulated datasets. The proposed method can simultaneously identify features and predict classes with metagenomic count data.
Affiliation(s)
- Zhenqiu Liu
- University of Maryland Greenebaum Cancer Center, Baltimore, Maryland, USA.
10
Irsoy O, Yildiz OT, Alpaydin E. Design and analysis of classifier learning experiments in bioinformatics: survey and case studies. IEEE/ACM Trans Comput Biol Bioinform 2012; 9:1663-1675. [PMID: 22908127] [DOI: 10.1109/tcbb.2012.117]
Abstract
In many bioinformatics applications, it is important to assess and compare the performance of algorithms trained from data, so as to draw conclusions that are unaffected by chance and are therefore significant. Both the design of such experiments and the analysis of the resulting data using statistical tests should be done carefully for the results to carry significance. In this paper, we first review the performance measures used in classification and the basics of experiment design and statistical testing. We then give the results of our survey of over 1,500 papers published in the last two years in three bioinformatics journals (including this one). Although the basics of experiment design are well understood, such as resampling instead of using a single training set and using different performance metrics instead of error alone, only 21 percent of the papers use any statistical test for comparison. In the third part, we analyze four different scenarios that we encounter frequently in the bioinformatics literature, discussing the proper statistical methodology and showing an example case study for each. With the supplementary software, we hope that the guidelines we discuss will play an important role in future studies.
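The simplest of the statistical tests this survey calls for is a paired t-test on per-fold accuracy differences between two classifiers. The sketch below is illustrative (our own naming; the paper itself discusses more careful designs than a plain paired test):

```python
import numpy as np

def paired_t_statistic(acc_a, acc_b):
    """t statistic for paired per-fold accuracies of two classifiers;
    compare against t critical values with k-1 degrees of freedom,
    where k is the number of folds."""
    d = np.asarray(acc_a, dtype=float) - np.asarray(acc_b, dtype=float)
    k = d.size
    return d.mean() / (d.std(ddof=1) / np.sqrt(k))
```

With k = 10 folds, |t| > 2.262 rejects "equal accuracy" at the 5% level (t distribution, 9 degrees of freedom).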
Affiliation(s)
- Ozan Irsoy
- Department of Computer Engineering, Boğaziçi University, Bebek 34342, Istanbul, Turkey.
11
Wu MY, Dai DQ, Shi Y, Yan H, Zhang XF. Biomarker identification and cancer classification based on microarray data using Laplace naive Bayes model with mean shrinkage. IEEE/ACM Trans Comput Biol Bioinform 2012; 9:1649-1662. [PMID: 22868679] [DOI: 10.1109/tcbb.2012.105]
Abstract
Biomarker identification and cancer classification are two closely related problems. In gene expression datasets, the correlation between genes can be high when they share the same biological pathway. Moreover, gene expression datasets may contain outliers due to either chemical or electrical reasons. A good gene selection method should take group effects into account and be robust to outliers. In this paper, we propose a Laplace naive Bayes model with mean shrinkage (LNB-MS). The Laplace distribution instead of the normal distribution is used as the conditional distribution of the samples because it is less sensitive to outliers and has been applied in many fields. The key technique is the L1 penalty imposed on the mean of each class to achieve automatic feature selection. The objective function of the proposed model is a piecewise linear function with respect to the mean of each class, whose optimal value can be evaluated simply at the breakpoints. An efficient algorithm is designed to estimate the parameters in the model. A new strategy that uses the number of selected features to control the regularization parameter is introduced. Experimental results on simulated datasets and 17 publicly available cancer datasets attest to the accuracy, sparsity, efficiency, and robustness of the proposed algorithm. Many biomarkers identified with our method have been verified in biochemical or biomedical research. An analysis of the biological and functional correlation of the genes based on Gene Ontology (GO) terms shows that the proposed method guarantees the selection of highly correlated genes simultaneously.
Affiliation(s)
- Meng-Yun Wu
- Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University,Guangzhou 510275, China.
12
Liu Z, Bensmail H, Tan M. Efficient feature selection and multiclass classification with integrated instance and model based learning. Evol Bioinform Online 2012; 8:197-205. [PMID: 22577297] [PMCID: PMC3347893] [DOI: 10.4137/ebo.s9407]
Abstract
Multiclass classification and feature (variable) selection are commonly encountered in many biological and medical applications. However, extending binary classification approaches to multiclass problems is not trivial. Instance-based methods such as the K nearest neighbor (KNN) classifier naturally extend to multiclass problems and usually perform well with unbalanced data, but suffer from the curse of dimensionality: their performance degrades when applied to high-dimensional data. On the other hand, model-based methods such as logistic regression require decomposing the multiclass problem into several binary problems with one-vs.-one or one-vs.-rest schemes. Even though they can be applied to high-dimensional data with L1- or Lp-penalized methods, such approaches can only select independent features, and the features selected in different binary problems usually differ. They also produce unbalanced classification problems with the one-vs.-rest scheme even if the original multiclass problem is balanced. By combining instance-based and model-based learning, we propose an efficient learning method with integrated KNN and constrained logistic regression (KNNLog) for simultaneous multiclass classification and feature selection. Our proposed method simultaneously minimizes the intra-class distance and maximizes the inter-class distance with fewer estimated parameters. It is very efficient for problems with small sample sizes and unbalanced classes, a case common in many real applications. In addition, our model-based feature selection method can identify highly correlated features simultaneously, avoiding the multiplicity problem due to multiple tests. The proposed method is evaluated with simulated and real data, including an unbalanced microRNA dataset for leukemia and a multiclass metagenomic dataset from the Human Microbiome Project (HMP). It performs well in limited computational experiments.
Affiliation(s)
- Zhenqiu Liu
- Greenebaum Cancer Center and Department of Epidemiology and Public Health, University of Maryland at Baltimore, 655 W. Baltimore Street, Baltimore, MD 21201, USA
- Halima Bensmail
- Qatar Computing Research Institute, PO Box 5825, Doha, Qatar
- Ming Tan
- Greenebaum Cancer Center and Department of Epidemiology and Public Health, University of Maryland at Baltimore, 655 W. Baltimore Street, Baltimore, MD 21201, USA
13
Liu Z, Hsiao W, Cantarel BL, Drábek EF, Fraser-Liggett C. Sparse distance-based learning for simultaneous multiclass classification and feature selection of metagenomic data. Bioinformatics 2011; 27:3242-3249. [PMID: 21984758] [PMCID: PMC3223360] [DOI: 10.1093/bioinformatics/btr547]
Abstract
MOTIVATION Direct sequencing of microbes in human ecosystems (the human microbiome) has complemented single-genome cultivation and sequencing to understand and explore the impact of commensal microbes on human health. As sequencing technologies improve and costs decline, the sophistication of the data has outgrown available computational methods. While several existing machine learning methods have recently been adapted for analyzing microbiome data, there is not yet an efficient, dedicated algorithm available for multiclass classification of human microbiota. RESULTS By combining instance-based and model-based learning, we propose a novel sparse distance-based learning method for simultaneous class prediction and feature (variable or taxon, used interchangeably) selection from multiple treatment populations on the basis of 16S rRNA sequence count data. Our proposed method simultaneously minimizes the intraclass distance and maximizes the interclass distance with many fewer estimated parameters than other methods. It is very efficient for problems with small sample sizes and unbalanced classes, which are common in metagenomic studies. We implemented this method in a MATLAB toolbox called MetaDistance, in which we also propose several approaches for data normalization and variance-stabilizing transformation. We validate this method on several real and simulated 16S rRNA datasets to show that it outperforms existing methods for classifying metagenomic data. This article is the first to address simultaneous multifeature selection and class prediction with metagenomic count data. AVAILABILITY The MATLAB toolbox is freely available online at http://metadistance.igs.umaryland.edu/. CONTACT zliu@umm.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Zhenqiu Liu
- Department of Epidemiology and Public Health, University of Maryland Greenebaum Cancer Center, University of Maryland School of Medicine, Baltimore, MD 21201, USA.