1
Abbasi M, Sanderford CR, Raghu N, Pasha M, Bartelle BB. Sparse representation learning derives biological features with explicit gene weights from the Allen Mouse Brain Atlas. PLoS One 2023; 18:e0282171. [PMID: 36877707 PMCID: PMC9987823 DOI: 10.1371/journal.pone.0282171] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/26/2022] [Accepted: 02/08/2023] [Indexed: 03/07/2023]
Abstract
Unsupervised learning methods are commonly used to detect features within transcriptomic data and ultimately derive meaningful representations of biology. The contributions of individual genes to any feature, however, become convolved with each learning step, requiring follow-up analysis and validation to understand what biology might be represented by a cluster on a low-dimensional plot. We sought learning methods that could preserve the gene information of detected features, using the spatial transcriptomic data and anatomical labels of the Allen Mouse Brain Atlas as a test dataset with verifiable ground truth. We established metrics for accurate representation of molecular anatomy and found that sparse learning approaches were uniquely capable of generating anatomical representations and gene weights in a single learning step. Fit to labeled anatomy was highly correlated with intrinsic properties of the data, offering a means to optimize parameters without established ground truth. Once representations were derived, complementary gene lists could be further compressed to generate a low-complexity dataset, or to probe for individual features with >95% accuracy. We demonstrate the utility of sparse learning as a means to derive biologically meaningful representations from transcriptomic data and reduce the complexity of large datasets while preserving intelligible gene information throughout the analysis.
Affiliation(s)
- Mohammad Abbasi
  - School for Biological and Health Systems Engineering, Arizona State University, Tempe, Arizona, United States of America
- Connor R Sanderford
  - School for Biological and Health Systems Engineering, Arizona State University, Tempe, Arizona, United States of America
- Narendiran Raghu
  - School for Biological and Health Systems Engineering, Arizona State University, Tempe, Arizona, United States of America
- Mirjeta Pasha
  - Department of Mathematics, Tufts University, Medford, Massachusetts, United States of America
- Benjamin B Bartelle
  - School for Biological and Health Systems Engineering, Arizona State University, Tempe, Arizona, United States of America
2
Bayesian nonnegative matrix factorization in an incremental manner for data representation. Appl Intell 2022. [DOI: 10.1007/s10489-022-03522-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2023]
3
Zheng X, Zhang C. Gene selection for microarray data classification via dual latent representation learning. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.07.047] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/26/2022]
4
Peng S, Yang Y, Liu W, Li F, Liao X. Discriminant Projection Shared Dictionary Learning for Classification of Tumors Using Gene Expression Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:1464-1473. [PMID: 31675339 DOI: 10.1109/tcbb.2019.2950209] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/10/2023]
Abstract
With a variety of tumor subtypes, personalized treatments need to identify the subtype of a tumor as accurately as possible. The development of DNA microarrays provides an opportunity to predict tumor classification. One strategy is to use gene expression profiling to extend current biological insights into the disease. However, overfitting problems exist in most machine learning methods when classifying tumor gene expression profile data, which are characterized by high dimensionality, small sample sizes, and nonlinearities. As a new machine learning method, dictionary learning has become a more effective algorithm for gene expression profile classification. Here, a new method called discriminant projection shared dictionary learning (DPSDL) is proposed for classifying tumor subtypes using LINCS gene expression profile data. The method trains a shared dictionary and embeds Fisher discriminant criteria to obtain a class-specific sub-dictionary and coding coefficients. At the same time, a projection matrix is trained to widen the distance between different classes of samples. Experimental results show that our method achieves better classification of gene expression profiles than other dictionary learning methods and machine learning methods.
5
Yang X, Tian L, Chen Y, Yang L, Xu S, Wu W. Inverse Projection Representation and Category Contribution Rate for Robust Tumor Recognition. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:1262-1275. [PMID: 30575544 DOI: 10.1109/tcbb.2018.2886334] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
Sparse representation-based classification (SRC) methods have achieved remarkable results. SRC methods, however, still suffer from the need for sufficient training samples, insufficient use of test samples, and instability of representation. In this paper, a stable inverse projection representation-based classification (IPRC) is presented to tackle these problems by effectively using test samples. An IPR is first proposed, and its feasibility and stability are analyzed. A classification criterion named category contribution rate is constructed to match the IPR and complete classification. Moreover, a statistical measure is introduced to quantify the stability of representation-based classification methods. Based on the IPRC technique, a robust tumor recognition framework is presented for interpreting microarray gene expression data, in which a two-stage hybrid gene selection method is introduced to select informative genes. Finally, a functional analysis of candidate pathogenicity-related genes is given. Extensive experiments on six public tumor microarray gene expression datasets demonstrate that the proposed technique is competitive with state-of-the-art methods.
6
SGL-SVM: A novel method for tumor classification via support vector machine with sparse group Lasso. J Theor Biol 2020; 486:110098. [DOI: 10.1016/j.jtbi.2019.110098] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 05/15/2019] [Revised: 11/27/2019] [Accepted: 11/28/2019] [Indexed: 02/07/2023]
7
Chang CC, Chen SH. Developing a Novel Machine Learning-Based Classification Scheme for Predicting SPCs in Breast Cancer Survivors. Front Genet 2019; 10:848. [PMID: 31620166 PMCID: PMC6759630 DOI: 10.3389/fgene.2019.00848] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 07/05/2019] [Accepted: 08/14/2019] [Indexed: 11/13/2022]
Abstract
Due to the high effectiveness of cancer screening and therapies, the diagnosis of second primary cancers (SPCs) has increased in women with breast cancer. The present study was conducted to develop a novel machine learning-based classification scheme for predicting the risk factors of SPCs in breast cancer survivors. The proposed scheme was based on the XGBoost classifier with the following four comparable strategies: transformation, resampling, clustering, and ensemble learning, to improve the training balanced accuracy. Results suggested that the best prediction accuracy for an empirical case was obtained with XGBoost combined with the resampling and clustering strategies. The experimental results showed that age, sequence of radiotherapy and surgery, surgical margins of the primary site, human epidermal growth factor, high-dose clinical target volume, and estrogen receptors are relatively more important risk factors associated with SPCs in patients with breast cancer. These risk factors should be monitored for the early detection of breast cancer. In conclusion, the proposed scheme can support the important influence of personality and clinical symptom representations in all phases of the primary treatment trajectory. Our results further suggested that adaptive machine learning techniques require the incorporation of significant variables for optimal predictions.
Affiliation(s)
- Chi-Chang Chang
  - School of Medical Informatics, Chung Shan Medical University, Taichung, Taiwan
  - IT Office, Chung Shan Medical University Hospital, Taichung, Taiwan
- Ssu-Han Chen
  - Department of Industrial Engineering and Management, Ming Chi University of Technology, New Taipei City, Taiwan
8
Gene selection for microarray data classification via subspace learning and manifold regularization. Med Biol Eng Comput 2017; 56:1271-1284. [PMID: 29256006 DOI: 10.1007/s11517-017-1751-6] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 05/30/2017] [Accepted: 11/03/2017] [Indexed: 10/18/2022]
Abstract
With the rapid development of DNA microarray technology, a large amount of genomic data has been generated. Classification of these microarray data is a challenging task, since gene expression data often comprise thousands of genes but only a small number of samples. In this paper, an effective gene selection method is proposed to select the best subset of genes for microarray data, with irrelevant and redundant genes removed. Compared with the original data, the selected gene subset can benefit the classification task. We formulate the gene selection task as a manifold-regularized subspace learning problem. In detail, a projection matrix is used to project the original high-dimensional microarray data into a lower-dimensional subspace, with the constraint that the original genes can be well represented by the selected genes. Meanwhile, the local manifold structure of the original data is preserved by a Laplacian graph regularization term on the low-dimensional data space. The projection matrix can serve as an importance indicator of different genes. An iterative update algorithm is developed for solving the problem. Experimental results on six publicly available microarray datasets and one clinical dataset demonstrate that the proposed method performs better than other state-of-the-art methods in terms of microarray data classification.
9
Lossless medical image compression using geometry-adaptive partitioning and least square-based prediction. Med Biol Eng Comput 2017; 56:957-966. [PMID: 29105018 DOI: 10.1007/s11517-017-1741-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 01/19/2017] [Accepted: 10/20/2017] [Indexed: 10/18/2022]
Abstract
To improve the compression rates for lossless compression of medical images, an efficient algorithm, based on irregular segmentation and region-based prediction, is proposed in this paper. Considering that the first step of a region-based compression algorithm is segmentation, this paper proposes a hybrid method that combines geometry-adaptive partitioning and quadtree partitioning to achieve adaptive irregular segmentation for medical images. Then, least square (LS)-based predictors are adaptively designed for each region (regular subblock or irregular subregion). The proposed adaptive algorithm not only exploits spatial correlation between pixels but also utilizes local structure similarity, resulting in efficient compression performance. Experimental results show that the average compression performance of the proposed algorithm is 10.48, 4.86, 3.58, and 0.10% better than that of JPEG 2000, CALIC, EDP, and JPEG-LS, respectively.
10
Wang X, Zheng Y, Gan L, Wang X, Sang X, Kong X, Zhao J. Liver segmentation from CT images using a sparse priori statistical shape model (SP-SSM). PLoS One 2017; 12:e0185249. [PMID: 28981530 PMCID: PMC5628825 DOI: 10.1371/journal.pone.0185249] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 10/18/2016] [Accepted: 09/09/2017] [Indexed: 11/19/2022]
Abstract
This study proposes a new liver segmentation method based on a sparse a priori statistical shape model (SP-SSM). First, mark points are selected in the liver a priori model and the original image. Then, the a priori shape and its mark points are used to obtain a dictionary for the liver boundary information. Second, the sparse coefficient is calculated based on the correspondence between mark points in the original image and those in the a priori model, and then the sparse statistical model is established by combining the sparse coefficients and the dictionary. Finally, the intensity energy and boundary energy models are built based on the intensity information and the specific boundary information of the original image. Then, the sparse matching constraint model is established based on the sparse coding theory. These models jointly drive the iterative deformation of the sparse statistical model to approximate and accurately extract the liver boundaries. This method can solve the problems of deformation model initialization and a priori method accuracy using the sparse dictionary. The SP-SSM can achieve a mean overlap error of 4.8% and a mean volume difference of 1.8%, whereas the average symmetric surface distance and the root mean square symmetric surface distance can reach 0.8 mm and 1.4 mm, respectively.
Affiliation(s)
- Xuehu Wang
  - School of Electronic and Information Engineering, Hebei University, Baoding, China
  - Key Laboratory of Digital Medical Engineering of Hebei Province, Baoding, China
- Yongchang Zheng
  - Department of Liver Surgery, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Lan Gan
  - School of Information Engineering, East China Jiaotong University, Nanchang, China
- Xuan Wang
  - Department of Liver Surgery, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Xinting Sang
  - Department of Liver Surgery, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Xiangfeng Kong
  - Department of Liver Surgery, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Jie Zhao
  - School of Electronic and Information Engineering, Hebei University, Baoding, China
  - Key Laboratory of Digital Medical Engineering of Hebei Province, Baoding, China
11
Li J, Song Y, Zhu Z, Zhao J. Highly undersampled MR image reconstruction using an improved dual-dictionary learning method with self-adaptive dictionaries. Med Biol Eng Comput 2016; 55:807-822. [PMID: 27538399 DOI: 10.1007/s11517-016-1556-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 12/24/2015] [Accepted: 07/30/2016] [Indexed: 02/05/2023]
Abstract
The dual-dictionary learning (Dual-DL) method utilizes both a low-resolution dictionary and a high-resolution dictionary, which are co-trained for sparse coding and image updating, respectively. It can effectively exploit a priori knowledge regarding the typical structures, specific features, and local details of training-set images. This prior knowledge greatly improves the reconstruction quality. The method has been successfully applied in magnetic resonance (MR) image reconstruction. However, it relies heavily on the training sets, and the dictionaries are fixed and nonadaptive. In this research, we improve Dual-DL by using self-adaptive dictionaries. The low- and high-resolution dictionaries are updated correspondingly along with the image updating stage to ensure their self-adaptivity. The updated dictionaries incorporate both the prior information of the training sets and the test image directly. Both dictionaries feature improved adaptability. Experimental results demonstrate that the proposed method can efficiently and significantly improve the quality and robustness of MR image reconstruction.
Affiliation(s)
- Jiansen Li
  - School of Biomedical Engineering, Shanghai Jiao Tong University, 800 Dongchuan Rd., Minhang, Shanghai, 200240, China
- Ying Song
  - Department of Radiation Oncology, West China Hospital, Sichuan University, Chengdu, 610041, Sichuan, China
- Zhen Zhu
  - Department of Radiology, Children's Hospital of Shanghai, Shanghai Jiao Tong University, Shanghai, China
- Jun Zhao
  - School of Biomedical Engineering, Shanghai Jiao Tong University, 800 Dongchuan Rd., Minhang, Shanghai, 200240, China