1
Subramanian V, Syeda-Mahmood T, Do MN. Modelling-based joint embedding of histology and genomics using canonical correlation analysis for breast cancer survival prediction. Artif Intell Med 2024; 149:102787. [PMID: 38462287 DOI: 10.1016/j.artmed.2024.102787] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2023] [Revised: 01/23/2024] [Accepted: 01/24/2024] [Indexed: 03/12/2024]
Abstract
Traditional approaches to predicting breast cancer patients' survival outcomes were based on clinical subgroups, the PAM50 genes, or the evaluation of histological tissue. With the growth of multi-modality datasets capturing diverse information (such as genomics, histology, radiology and clinical data) about the same cancer, this information can be integrated using advanced tools to improve survival prediction. These methods implicitly exploit the key observation that different modalities originate from the same cancer source and jointly provide a complete picture of the cancer. In this work, we investigate the benefits of explicitly modelling multi-modality data as originating from the same cancer under a probabilistic framework. Specifically, we consider histology and genomics as two modalities originating from the same breast cancer under a probabilistic graphical model (PGM). We construct maximum likelihood estimates of the PGM parameters based on canonical correlation analysis (CCA) and then infer the underlying properties of the cancer patient, such as survival. Equivalently, we construct CCA-based joint embeddings of the two modalities and input them to a learnable predictor. Penalized variants of CCA (pCCA) capture the real-world properties of sparsity and graph structure and are better suited for cancer applications. To generate richer multi-dimensional embeddings with pCCA, we introduce two novel embedding schemes that encourage orthogonality. The efficacy of our proposed prediction pipeline is first demonstrated via low prediction errors of the hidden variable and the generation of informative embeddings on simulated data. When applied to breast cancer histology and RNA-sequencing expression data from The Cancer Genome Atlas (TCGA), our model can provide survival predictions with average concordance-indices of up to 68.32% along with interpretability.
We also illustrate how the pCCA embeddings can be used for survival analysis through Kaplan-Meier curves.
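As a rough illustration of the CCA-based joint embedding described in this abstract, here is a minimal NumPy sketch. It is not the authors' implementation: it uses plain ridge-regularized CCA rather than the penalized pCCA variants or the proposed orthogonality-encouraging schemes. The function name `cca_embed`, the `reg` parameter, the synthetic data, and the choice to concatenate the two sets of canonical variates into a joint embedding are all assumptions made for illustration.

```python
import numpy as np

def cca_embed(X, Y, k=2, reg=1e-3):
    """Hypothetical helper: k canonical variates per view plus their correlations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Ridge term (an assumption) keeps the covariance estimates invertible.
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    Wx, Wy = inv_sqrt(Sxx), inv_sqrt(Syy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, s, Vt = np.linalg.svd(Wx @ Sxy @ Wy)
    A = Wx @ U[:, :k]
    B = Wy @ Vt.T[:, :k]
    return X @ A, Y @ B, s[:k]

# Synthetic two-view data sharing one hidden variable z (illustrative only).
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 1))
X = z @ rng.normal(size=(1, 5)) + 0.1 * rng.normal(size=(200, 5))
Y = z @ rng.normal(size=(1, 4)) + 0.1 * rng.normal(size=(200, 4))

Zx, Zy, corrs = cca_embed(X, Y, k=2)
joint = np.hstack([Zx, Zy])  # one possible joint embedding for a downstream predictor
```

In this sketch the `joint` matrix would be the input to whatever survival predictor is trained downstream; the abstract does not specify that predictor, so none is shown.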
Affiliation(s)
- Vaishnavi Subramanian
  - Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, 61801, IL, USA.
- Minh N Do
  - Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, 61801, IL, USA.
2
Mandal A, Maji P. Multiview Regularized Discriminant Canonical Correlation Analysis: Sequential Extraction of Relevant Features From Multiblock Data. IEEE Transactions on Cybernetics 2023; 53:5497-5509. [PMID: 35417362 DOI: 10.1109/tcyb.2022.3155875] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
One of the important issues associated with real-life high-dimensional data analysis is how to extract significant and relevant features from multiview data. Multiset canonical correlation analysis (MCCA) is a well-known statistical method for multiview data integration. It finds a linear subspace that maximizes the correlations among different views. However, the existing methods to find the multiset canonical variables are computationally very expensive, which restricts the application of MCCA in real-life big data analysis. The covariance matrix of each high-dimensional view may also suffer from the singularity problem due to the limited number of samples. Moreover, existing MCCA-based feature extraction algorithms are, in general, unsupervised in nature. In this regard, a new supervised feature extraction algorithm is proposed, which integrates multimodal multidimensional data sets by solving the maximal correlation problem of MCCA. A new block matrix representation is introduced to reduce the computational complexity of computing the canonical variables of MCCA. The analytical formulation enables efficient computation of the multiset canonical variables under a supervised ridge regression optimization technique. It deals with the "curse of dimensionality" problem associated with high-dimensional data and facilitates the sequential generation of relevant features with significantly lower computational cost. The effectiveness of the proposed multiblock data integration algorithm, along with a comparison with other existing methods, is demonstrated on several benchmark and real-life cancer data.
3
Chen X, Xie H, Li Z, Cheng G, Leng M, Wang FL. Information fusion and artificial intelligence for smart healthcare: a bibliometric study. Inf Process Manag 2023. [DOI: 10.1016/j.ipm.2022.103113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
4
Neonatal encephalopathy prediction of poor outcome with diffusion-weighted imaging connectome and fixel-based analysis. Pediatr Res 2022; 91:1505-1515. [PMID: 33966055 PMCID: PMC9053106 DOI: 10.1038/s41390-021-01550-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/10/2020] [Revised: 04/01/2021] [Accepted: 04/08/2021] [Indexed: 02/03/2023]
Abstract
BACKGROUND Better biomarkers of eventual outcome are needed for neonatal encephalopathy. To identify the most potent neonatal imaging marker associated with 2-year outcomes, we retrospectively performed diffusion-weighted imaging connectome (DWIC) and fixel-based analysis (FBA) on magnetic resonance imaging (MRI) obtained in the first 4 weeks of life in term newborns with neonatal encephalopathy. METHODS Diffusion tractography was available in 15 out of 24 babies with MRI, five each with normal outcome, abnormal motor outcome, or death. All 15 except one underwent hypothermia as initial treatment. In abnormal motor and death groups, DWIC found 19 white matter pathways with severely disrupted fiber orientation distributions. RESULTS Using random forest classification, these disruptions predicted the follow-up outcomes with 89-99% accuracy. These pathways showed reduced integrity in abnormal motor and death vs. normal tone groups (p < 10⁻⁶). Using ranked supervised multi-view canonical correlation and depicting just three of the five dimensions of the analysis, the abnormal motor and death groups were clearly differentiated from each other and from the normal tone group. CONCLUSIONS This study suggests that a machine-learning model for prediction using early DWIC and FBA could be a possible way of developing biomarkers in large MRI datasets having clinical outcomes. IMPACT Early connectome and FBA of clinically acquired DWI provide a new noninvasive imaging tool to predict the long-term motor outcomes after birth, based on the severity of white matter injury. Disrupted white matter connectivity as a novel neonatal marker achieves high accuracy of 89-99% to predict 2-year motor outcomes using conventional machine-learning classification. The proposed neonatal marker may allow better prognostication that is important to elucidate neural repair mechanisms and evaluate treatment modalities in neonatal encephalopathy.
5
Momenzadeh N, Hafezalseheh H, Nayebpour M, Fathian M, Noorossana R. A hybrid machine learning approach for predicting survival of patients with prostate cancer: A SEER-based population study. Informatics in Medicine Unlocked 2021. [DOI: 10.1016/j.imu.2021.100763] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 12/14/2022] Open
6
Mandal A, Maji P. CanSuR: a robust method for staining pattern recognition of HEp-2 cell IIF images. Neural Comput Appl 2020. [DOI: 10.1007/s00521-019-04108-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/30/2022]
7
Carleton NM, Lee G, Madabhushi A, Veltri RW. Advances in the computational and molecular understanding of the prostate cancer cell nucleus. J Cell Biochem 2018; 119:7127-7142. [PMID: 29923622 PMCID: PMC6150831 DOI: 10.1002/jcb.27156] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 04/11/2018] [Accepted: 05/18/2018] [Indexed: 12/17/2022]
Abstract
Nuclear alterations are a hallmark of many types of cancers, including prostate cancer (PCa). Recent evidence shows that subvisual changes, ones that may not be visually perceptible to a pathologist, to the nucleus and its ultrastructural components can precede visual histopathological recognition of cancer. Alterations to nuclear features, such as nuclear size and shape, texture, and spatial architecture, reflect the complex molecular-level changes that occur during oncogenesis. Quantitative nuclear morphometry, a field that uses computational approaches to identify and quantify malignancy-induced nuclear changes, can enable a detailed and objective analysis of the PCa cell nucleus. Recent advances in machine learning-based approaches can now automatically mine data related to these changes to aid in the diagnosis, decision making, and prediction of PCa prognoses. In this review, we use PCa as a case study to connect the molecular-level mechanisms that underlie these nuclear changes to the machine learning computational approaches, bridging the gap between the clinical and computational understanding of PCa. First, we will discuss recent developments to our understanding of the molecular events that drive nuclear alterations in the context of PCa: the role of the nuclear matrix and lamina in size and shape changes, the role of 3-dimensional chromatin organization and epigenetic modifications in textural changes, and the role of the tumor microenvironment in altering nuclear spatial topology. We will then discuss the advances in the applications of machine learning algorithms to automatically segment nuclei in prostate histopathological images, extract nuclear features to aid in diagnostic decision making, and predict potential outcomes, such as biochemical recurrence and survival. Finally, we will discuss the challenges and opportunities associated with translation of the quantitative nuclear morphometry methodology into the clinical space. 
Ultimately, accurate identification and quantification of nuclear alterations can contribute to the field of nucleomics and has applications for computationally driven precision oncologic patient care.
Affiliation(s)
- Neil M. Carleton
  - Department of Biomedical Engineering, Carnegie Mellon University, Pittsburgh, PA 15213
- George Lee
  - Department of Biomedical Engineering, Case Western Reserve University, Cleveland, OH 44106
- Anant Madabhushi
  - Department of Biomedical Engineering, Case Western Reserve University, Cleveland, OH 44106
- Robert W. Veltri
  - The James Buchanan Brady Urological Institute, Department of Urology, The Johns Hopkins University School of Medicine, Baltimore, MD 21287
8
Mandal A, Maji P. FaRoC: Fast and Robust Supervised Canonical Correlation Analysis for Multimodal Omics Data. IEEE Transactions on Cybernetics 2018; 48:1229-1241. [PMID: 28391216 DOI: 10.1109/tcyb.2017.2685625] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 06/07/2023]
Abstract
One of the main problems associated with high dimensional multimodal real life data sets is how to extract relevant and significant features. In this regard, a fast and robust feature extraction algorithm, termed as FaRoC, is proposed, integrating judiciously the merits of canonical correlation analysis (CCA) and rough sets. The proposed method extracts new features sequentially from two multidimensional data sets by maximizing their relevance with respect to class label and significance with respect to already-extracted features. To generate canonical variables sequentially, an analytical formulation is introduced to establish the relation between regularization parameters and CCA. The formulation enables the proposed method to extract required number of correlated features sequentially with lesser computational cost as compared to existing methods. To compute both significance and relevance measures of a feature, the concept of hypercuboid equivalence partition matrix of rough hypercuboid approach is used. It also provides an efficient way to find optimum regularization parameters employed in CCA. The efficacy of the proposed FaRoC algorithm, along with a comparison with other existing methods, is extensively established on several real life data sets.
9
Bhargava R, Madabhushi A. Emerging Themes in Image Informatics and Molecular Analysis for Digital Pathology. Annu Rev Biomed Eng 2017; 18:387-412. [PMID: 27420575 DOI: 10.1146/annurev-bioeng-112415-114722] [Citation(s) in RCA: 86] [Impact Index Per Article: 10.8] [Indexed: 12/17/2022]
Abstract
Pathology is essential for research in disease and development, as well as for clinical decision making. For more than 100 years, pathology practice has involved analyzing images of stained, thin tissue sections by a trained human using an optical microscope. Technological advances are now driving major changes in this paradigm toward digital pathology (DP). The digital transformation of pathology goes beyond recording, archiving, and retrieving images, providing new computational tools to inform better decision making for precision medicine. First, we discuss some emerging innovations in both computational image analytics and imaging instrumentation in DP. Second, we discuss molecular contrast in pathology. Molecular DP has traditionally been an extension of pathology with molecularly specific dyes. Label-free, spectroscopic images are rapidly emerging as another important information source, and we describe the benefits and potential of this evolution. Third, we describe multimodal DP, which is enabled by computational algorithms and combines the best characteristics of structural and molecular pathology. Finally, we provide examples of application areas in telepathology, education, and precision medicine. We conclude by discussing challenges and emerging opportunities in this area.
Affiliation(s)
- Rohit Bhargava
  - Departments of Bioengineering, Chemical and Biomolecular Engineering, Electrical and Computer Engineering, Mechanical Science and Engineering, and Chemistry, and Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801
- Anant Madabhushi
  - Center for Computational Imaging and Personalized Diagnostics; Departments of Biomedical Engineering, Urology, Pathology, Radiology, Radiation Oncology, General Medical Sciences, Electrical Engineering, and Computer Science; and Case Comprehensive Cancer Center, Case Western Reserve University, Cleveland, Ohio 44106
10
Singanamalli A, Wang H, Madabhushi A. Cascaded Multi-view Canonical Correlation (CaMCCo) for Early Diagnosis of Alzheimer's Disease via Fusion of Clinical, Imaging and Omic Features. Sci Rep 2017; 7:8137. [PMID: 28811553 PMCID: PMC5558022 DOI: 10.1038/s41598-017-03925-0] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 08/09/2016] [Accepted: 04/24/2017] [Indexed: 12/14/2022] Open
Abstract
The introduction of mild cognitive impairment (MCI) as a diagnostic category adds to the challenges of diagnosing Alzheimer’s Disease (AD). No single marker has been proven to accurately categorize patients into their respective diagnostic groups. Thus, previous studies have attempted to develop fused predictors of AD and MCI. These studies have two main limitations. Most do not simultaneously consider all diagnostic categories and provide suboptimal fused representations using the same set of modalities for prediction of all classes. In this work, we present a combined framework, cascaded multiview canonical correlation (CaMCCo), for fusion and cascaded classification that incorporates all diagnostic categories and optimizes classification by selectively combining a subset of modalities at each level of the cascade. CaMCCo is evaluated on a data cohort comprising 149 patients for whom neurophysiological, neuroimaging, proteomic and genomic data were available. Results suggest that fusion of select modalities for each classification task outperforms (mean AUC = 0.92) fusion of all modalities (mean AUC = 0.54) and individual modalities (mean AUC = 0.90, 0.53, 0.71, 0.73, 0.62, 0.68). In addition, CaMCCo outperforms all other multi-class classification methods for MCI prediction (PPV: 0.80 vs. 0.67, 0.63).
Affiliation(s)
- Asha Singanamalli
  - Department of Biomedical Engineering, Case Western Reserve University, Cleveland, OH, 44106, USA.
- Haibo Wang
  - Department of Biomedical Engineering, Case Western Reserve University, Cleveland, OH, 44106, USA.
- Anant Madabhushi
  - Department of Biomedical Engineering, Case Western Reserve University, Cleveland, OH, 44106, USA.
11
Viswanath SE, Tiwari P, Lee G, Madabhushi A. Dimensionality reduction-based fusion approaches for imaging and non-imaging biomedical data: concepts, workflow, and use-cases. BMC Med Imaging 2017; 17:2. [PMID: 28056889 PMCID: PMC5217665 DOI: 10.1186/s12880-016-0172-6] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 06/21/2016] [Accepted: 12/09/2016] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND With a wide array of multi-modal, multi-protocol, and multi-scale biomedical data being routinely acquired for disease characterization, there is a pressing need for quantitative tools to combine these varied channels of information. The goal of these integrated predictors is to combine these varied sources of information, while improving on the predictive ability of any individual modality. A number of application-specific data fusion methods have been previously proposed in the literature which have attempted to reconcile the differences in dimensionalities and length scales across different modalities. Our objective in this paper was to help identify methodological choices that need to be made in order to build a data fusion technique, as it is not always clear which strategy is optimal for a particular problem. As a comprehensive review of all possible data fusion methods was outside the scope of this paper, we have focused on fusion approaches that employ dimensionality reduction (DR). METHODS In this work, we quantitatively evaluate 4 non-overlapping existing instantiations of DR-based data fusion, within 3 different biomedical applications comprising over 100 studies. These instantiations utilized different knowledge representation and knowledge fusion methods, allowing us to examine the interplay of these modules in the context of data fusion. The use cases considered in this work involve the integration of (a) radiomics features from T2w MRI with peak area features from MR spectroscopy for identification of prostate cancer in vivo, (b) histomorphometric features (quantitative features extracted from histopathology) with protein mass spectrometry features for predicting 5 year biochemical recurrence in prostate cancer patients, and (c) volumetric measurements on T1w MRI with protein expression features to discriminate between patients with and without Alzheimer's disease.
RESULTS AND CONCLUSIONS Our preliminary results in these specific use cases indicated that the use of kernel representations in conjunction with DR-based fusion may be most effective, as a weighted multi-kernel-based DR approach resulted in the highest area under the ROC curve of over 0.8. By contrast non-optimized DR-based representation and fusion methods yielded the worst predictive performance across all 3 applications. Our results suggest that when the individual modalities demonstrate relatively poor discriminability, many of the data fusion methods may not yield accurate, discriminatory representations either. In summary, to outperform the predictive ability of individual modalities, methodological choices for data fusion must explicitly account for the sparsity of and noise in the feature space.
Affiliation(s)
- Satish E Viswanath
  - Department of Biomedical Engineering, Case Western Reserve University, 10900 Euclid Ave, Wickenden 523, Cleveland, OH, USA.
- Pallavi Tiwari
  - Department of Biomedical Engineering, Case Western Reserve University, 10900 Euclid Ave, Wickenden 523, Cleveland, OH, USA.
- George Lee
  - Department of Biomedical Engineering, Case Western Reserve University, 10900 Euclid Ave, Wickenden 523, Cleveland, OH, USA.
- Anant Madabhushi
  - Department of Biomedical Engineering, Case Western Reserve University, 10900 Euclid Ave, Wickenden 523, Cleveland, OH, USA.
12
Maji P, Mandal A. Multimodal Omics Data Integration Using Max Relevance--Max Significance Criterion. IEEE Trans Biomed Eng 2016; 64:1841-1851. [PMID: 27834637 DOI: 10.1109/tbme.2016.2624823] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 11/06/2022]
Abstract
OBJECTIVE This paper presents a novel supervised regularized canonical correlation analysis, termed as CuRSaR, to extract relevant and significant features from multimodal high dimensional omics datasets. METHODS The proposed method extracts a new set of features from two multidimensional datasets by maximizing the relevance of extracted features with respect to sample categories and significance among them. It integrates judiciously the merits of regularized canonical correlation analysis (RCCA) and rough hypercuboid approach. An analytical formulation, based on spectral decomposition, is introduced to establish the relation between canonical correlation analysis (CCA) and RCCA. The concept of hypercuboid equivalence partition matrix of rough hypercuboid is used to compute both relevance and significance of a feature. SIGNIFICANCE The analytical formulation makes the computational complexity of the proposed algorithm significantly lower than existing methods. The equivalence partition matrix offers an efficient way to find optimum regularization parameters employed in CCA. RESULTS The superiority of the proposed algorithm over other existing methods, in terms of computational complexity and classification accuracy, is established extensively on real life data.
13
Leo P, Lee G, Shih NNC, Elliott R, Feldman MD, Madabhushi A. Evaluating stability of histomorphometric features across scanner and staining variations: prostate cancer diagnosis from whole slide images. J Med Imaging (Bellingham) 2016; 3:047502. [PMID: 27803941 DOI: 10.1117/1.jmi.3.4.047502] [Citation(s) in RCA: 42] [Impact Index Per Article: 4.7] [Received: 04/03/2016] [Accepted: 09/16/2016] [Indexed: 01/04/2023] Open
Abstract
Quantitative histomorphometry (QH) is the process of computerized feature extraction from digitized tissue slide images to predict disease presence, behavior, and outcome. Feature stability between sites may be compromised by laboratory-specific variables including dye batch, slice thickness, and the whole slide scanner used. We present two new measures, preparation-induced instability score and latent instability score, to quantify feature instability across and within datasets. In a use case involving prostate cancer, we examined QH features which may detect cancer on whole slide images. Using our method, we found that five feature families (graph, shape, co-occurring gland tensor, sub-graph, and texture) were different between datasets in 19.7% to 48.6% of comparisons while the values expected without site variation were 4.2% to 4.6%. Color normalizing all images to a template did not reduce instability. Scanning the same 34 slides on three scanners demonstrated that Haralick features were most substantively affected by scanner variation, being unstable in 62% of comparisons. We found that unstable feature families performed significantly worse in inter- than intrasite classification. Our results appear to suggest QH features should be evaluated across sites to assess robustness, and class discriminability alone should not represent the benchmark for digital pathology feature selection.
Affiliation(s)
- Patrick Leo
  - Case Western Reserve University, Department of Biomedical Engineering, 2071 Martin Luther King Jr. Drive, Cleveland, Ohio 44106, United States
- George Lee
  - Case Western Reserve University, Department of Biomedical Engineering, 2071 Martin Luther King Jr. Drive, Cleveland, Ohio 44106, United States
- Natalie N C Shih
  - University of Pennsylvania, Department of Pathology, 3400 Spruce Street, Philadelphia, Pennsylvania 19104, United States
- Robin Elliott
  - Case Western Reserve University, Department of Pathology, 11100 Euclid Avenue, Cleveland, Ohio 44106, United States
- Michael D Feldman
  - University of Pennsylvania, Department of Pathology, 3400 Spruce Street, Philadelphia, Pennsylvania 19104, United States
- Anant Madabhushi
  - Case Western Reserve University, Department of Biomedical Engineering, 2071 Martin Luther King Jr. Drive, Cleveland, Ohio 44106, United States
14
Adaptive Dimensionality Reduction with Semi-Supervision (AdDReSS): Classifying Multi-Attribute Biomedical Data. PLoS One 2016; 11:e0159088. [PMID: 27421116 PMCID: PMC4946789 DOI: 10.1371/journal.pone.0159088] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 01/13/2016] [Accepted: 06/27/2016] [Indexed: 11/19/2022] Open
Abstract
Medical diagnostics is often a multi-attribute problem, necessitating sophisticated tools for analyzing high-dimensional biomedical data. Mining this data often results in two crucial bottlenecks: 1) high dimensionality of features used to represent rich biological data and 2) small amounts of labelled training data due to the expense of consulting highly specific medical expertise necessary to assess each study. Currently, no approach that we are aware of has attempted to use active learning in the context of dimensionality reduction approaches for improving the construction of low dimensional representations. We present our novel methodology, AdDReSS (Adaptive Dimensionality Reduction with Semi-Supervision), to demonstrate that fewer labeled instances identified via AL in embedding space are needed for creating a more discriminative embedding representation compared to randomly selected instances. We tested our methodology on a wide variety of domains ranging from prostate gene expression, ovarian proteomic spectra, brain magnetic resonance imaging, and breast histopathology. Across these various high dimensional biomedical datasets with 100+ observations each and all parameters considered, the median classification accuracy across all experiments showed AdDReSS (88.7%) to outperform SSAGE, a SSDR method using random sampling (85.5%), and Graph Embedding (81.5%). Furthermore, we found that embeddings generated via AdDReSS achieved a mean 35.95% improvement in Raghavan efficiency, a measure of learning rate, over SSAGE. Our results demonstrate the value of AdDReSS to provide low dimensional representations of high dimensional biomedical data while achieving higher classification rates with fewer labelled examples as compared to without active learning.
15
Ginsburg SB, Lee G, Ali S, Madabhushi A. Feature Importance in Nonlinear Embeddings (FINE): Applications in Digital Pathology. IEEE Transactions on Medical Imaging 2016; 35:76-88. [PMID: 26186772 DOI: 10.1109/tmi.2015.2456188] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Indexed: 06/04/2023]
Abstract
Quantitative histomorphometry (QH) refers to the process of computationally modeling disease appearance on digital pathology images by extracting hundreds of image features and using them to predict disease presence or outcome. Since constructing a robust and interpretable classifier is challenging in a high dimensional feature space, dimensionality reduction (DR) is often implemented prior to classifier construction. However, when DR is performed it can be challenging to quantify the contribution of each of the original features to the final classification result. We have previously presented a method for scoring features based on their importance for classification on an embedding derived via principal components analysis (PCA). However, nonlinear DR involves the eigen-decomposition of a kernel matrix rather than the data itself, compounding the issue of classifier interpretability. In this paper we present feature importance in nonlinear embeddings (FINE), an extension of our PCA-based feature scoring method to kernel PCA (KPCA), as well as several NLDR algorithms that can be cast as variants of KPCA. FINE is applied to four digital pathology datasets to identify key QH features for predicting the risk of breast and prostate cancer recurrence. Measures of nuclear and glandular architecture and clusteredness were found to play an important role in predicting the likelihood of recurrence of both breast and prostate cancers. Compared to the t-test, Fisher score, and Gini index, FINE was able to identify a stable set of features that provide good classification accuracy on four publicly available datasets from the NIPS 2003 Feature Selection Challenge.
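The abstract's remark that nonlinear DR eigen-decomposes a kernel matrix rather than the data itself can be made concrete with a small kernel PCA sketch in NumPy. This shows only standard KPCA with an RBF kernel, not the FINE feature-scoring method; the function name `rbf_kernel_pca`, the `gamma` value, and the random data are assumptions for illustration.

```python
import numpy as np

def rbf_kernel_pca(X, k=2, gamma=1.0):
    """Hypothetical helper: k-dimensional kernel PCA embedding with an RBF kernel."""
    # Pairwise squared Euclidean distances between samples.
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    K = np.exp(-gamma * D2)
    # Double-centre the kernel matrix (centring in feature space).
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J
    # The eigen-decomposition acts on the n x n kernel matrix, not on X.
    w, V = np.linalg.eigh(Kc)
    idx = np.argsort(w)[::-1][:k]
    w, V = w[idx], V[:, idx]
    # Scale eigenvectors by sqrt(eigenvalue) to obtain embedding coordinates.
    return V * np.sqrt(np.maximum(w, 0.0)), w

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
Z, eigvals = rbf_kernel_pca(X, k=2, gamma=0.5)
```

Because the embedding coordinates come from eigenvectors of the kernel matrix, there is no loading vector over the original features, which is the interpretability gap the abstract describes.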
16
Lee G, Singanamalli A, Wang H, Feldman MD, Master SR, Shih NNC, Spangler E, Rebbeck T, Tomaszewski JE, Madabhushi A. Supervised multi-view canonical correlation analysis (sMVCCA): integrating histologic and proteomic features for predicting recurrent prostate cancer. IEEE Transactions on Medical Imaging 2015; 34:284-297. [PMID: 25203987 DOI: 10.1109/tmi.2014.2355175] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.4] [Indexed: 06/03/2023]
Abstract
In this work, we present a new methodology to facilitate prediction of recurrent prostate cancer (CaP) following radical prostatectomy (RP) via the integration of quantitative image features and protein expression in the excised prostate. Creating a fused predictor from high-dimensional data streams is challenging because the classifier must 1) account for the "curse of dimensionality" problem, which hinders classifier performance when the number of features exceeds the number of patient studies and 2) balance potential mismatches in the number of features across different channels to avoid classifier bias towards channels with more features. Our new data integration methodology, supervised Multi-view Canonical Correlation Analysis (sMVCCA), aims to integrate infinite views of high-dimensional data to provide more amenable data representations for disease classification. Additionally, we demonstrate sMVCCA using Spearman's rank correlation which, unlike Pearson's correlation, can account for nonlinear correlations and outliers. Forty CaP patients with pathological Gleason scores 6-8 were considered for this study. Twenty-one of these men showed biochemical recurrence (BCR) following RP, while 19 did not. For each patient, 189 quantitative histomorphometric attributes and 650 protein expression levels were extracted from the primary tumor nodule. The fused histomorphometric/proteomic representation via sMVCCA combined with a random forest classifier predicted BCR with a mean AUC of 0.74 and a maximum AUC of 0.9286. We found sMVCCA to perform statistically significantly (p < 0.05) better than comparative state-of-the-art data fusion strategies for predicting BCR. Furthermore, Kaplan-Meier analysis demonstrated improved BCR-free survival prediction for the sMVCCA-fused classifier as compared to histology or proteomic features alone.
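To illustrate the contrast the abstract draws between Spearman's rank correlation and Pearson's correlation, here is a small NumPy sketch on a monotone but nonlinear relationship. It is a toy comparison only and does not reproduce sMVCCA; the helper names and the exponential test data are assumptions, and the rank computation assumes there are no tied values.

```python
import numpy as np

def pearson(x, y):
    """Pearson's linear correlation coefficient."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Spearman's rank correlation: Pearson applied to ranks (assumes no ties)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return pearson(rx, ry)

# A strictly increasing but strongly nonlinear relationship.
x = np.linspace(0.0, 5.0, 50)
y = np.exp(x)

r_pearson = pearson(x, y)    # noticeably below 1: the relation is not linear
r_spearman = spearman(x, y)  # essentially 1: the ranks agree exactly
```

Because ranks are unchanged by any strictly increasing transform, the Spearman value stays at 1 here while the Pearson value drops; the same rank transform also bounds the influence of a single extreme value.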
|
17
|
NCI Workshop Report: Clinical and Computational Requirements for Correlating Imaging Phenotypes with Genomics Signatures. Transl Oncol 2014; 7:556-69. [PMID: 25389451 PMCID: PMC4225695 DOI: 10.1016/j.tranon.2014.07.007] [Citation(s) in RCA: 60] [Impact Index Per Article: 5.5] [Received: 07/11/2014] [Revised: 07/25/2014] [Accepted: 07/29/2014] [Indexed: 12/21/2022] Open
Abstract
The National Cancer Institute (NCI) Cancer Imaging Program organized two related workshops on June 26–27, 2013, entitled “Correlating Imaging Phenotypes with Genomics Signatures Research” and “Scalable Computational Resources as Required for Imaging-Genomics Decision Support Systems.” The first workshop focused on clinical and scientific requirements, exploring our knowledge of phenotypic characteristics of cancer biological properties to determine whether the field is sufficiently advanced to correlate with imaging phenotypes that underpin genomics and clinical outcomes, and exploring new scientific methods to extract phenotypic features from medical images and relate them to genomics analyses. The second workshop focused on computational methods that explore informatics and computational requirements to extract phenotypic features from medical images and relate them to genomics analyses and improve the accessibility and speed of dissemination of existing NIH resources. These workshops linked clinical and scientific requirements of currently known phenotypic and genotypic cancer biology characteristics with imaging phenotypes that underpin genomics and clinical outcomes. The group generated a set of recommendations to NCI leadership and the research community that encourage and support development of the emerging radiogenomics research field to address short- and longer-term goals in cancer research.
|
18
|
Integration of high-volume molecular and imaging data for composite biomarker discovery in the study of melanoma. BioMed Research International 2014; 2014:145243. [PMID: 24527435 PMCID: PMC3914284 DOI: 10.1155/2014/145243] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 04/29/2013] [Revised: 09/28/2013] [Accepted: 10/12/2013] [Indexed: 12/19/2022]
Abstract
In this work, the effects of simple imputations are studied with regard to the integration of multimodal data originating from different patients. Two separate datasets of cutaneous melanoma are used: an image analysis (dermoscopy) dataset and a transcriptomic one, specifically DNA microarrays. Each modality is related to a different set of patients, and four imputation methods are applied to form a unified, integrative dataset. The application of backward selection together with ensemble classifiers (random forests), followed by principal components analysis and linear discriminant analysis, illustrates the implications of the imputations for feature selection and dimensionality reduction methods. The results suggest that the expansion of the feature space through data integration, achieved by the exploitation of imputation schemes in general, aids the classification task, imparting stability as regards the derivation of putative classifiers. In particular, although the biased imputation methods significantly increase the predictive performance and the class discrimination of the datasets, they still contribute to the study of prominent features and their relations. The fusion of separate datasets, which provide a multimodal description of the same pathology, represents an innovative, promising avenue, enhancing robust composite biomarker derivation and promoting the interpretation of the biomedical problem studied.
|
19
|
Sparks R, Madabhushi A. Statistical Shape Model for Manifold Regularization: Gleason grading of prostate histology. Computer Vision and Image Understanding 2013; 117:1138-1146. [PMID: 23888106 PMCID: PMC3718190 DOI: 10.1016/j.cviu.2012.11.011] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Indexed: 06/02/2023]
Abstract
Gleason patterns of prostate cancer histopathology, characterized primarily by morphological and architectural attributes of histological structures (glands and nuclei), have been found to be highly correlated with disease aggressiveness and patient outcome. Gleason patterns 4 and 5 are highly correlated with more aggressive disease and poorer patient outcome, while Gleason patterns 1-3 tend to reflect more favorable patient outcome. Because Gleason grading is done manually by a pathologist visually examining glass (or digital) slides, subtle morphologic and architectural differences of histological attributes, in addition to other factors, may result in grading errors and hence cause high inter-observer variability. Recently, some researchers have proposed computerized decision support systems to automatically grade Gleason patterns by using features pertaining to nuclear architecture, gland morphology, as well as tissue texture. Automated characterization of gland morphology has been shown to distinguish between intermediate Gleason patterns 3 and 4 with high accuracy. Manifold learning (ML) schemes attempt to generate a low-dimensional manifold representation of a higher-dimensional feature space while simultaneously preserving nonlinear relationships between object instances. Classification can then be performed in the low-dimensional space with high accuracy. However, ML is sensitive to the samples contained in the dataset; changes in the dataset may alter the manifold structure. In this paper we present a manifold regularization technique to constrain the low-dimensional manifold to a specific range of possible manifold shapes, the range being determined via a statistical shape model of manifolds (SSMM).
In this work we demonstrate applications of the SSMM in (1) identifying samples on the manifold which contain noise, defined as those samples which deviate from the SSMM, and (2) accurate out-of-sample extrapolation (OSE) of newly acquired samples onto a manifold constrained by the SSMM. We demonstrate these applications of the SSMM in the context of distinguishing between Gleason patterns 3 and 4 using glandular morphologic features in a prostate histopathology dataset of 58 patient studies. Identifying and eliminating noisy samples from the manifold via the SSMM results in a statistically significant improvement in area under the receiver operating characteristic curve (AUC): 0.832 ± 0.048 with removal of noisy samples compared to an AUC of 0.779 ± 0.075 without removal of samples. The use of the SSMM for OSE of newly acquired glands also shows a statistically significant improvement in AUC: 0.834 ± 0.051 with the SSMM compared to 0.779 ± 0.054 without the SSMM. Similar results were observed for the synthetic Swiss Roll and Helix datasets.
Affiliation(s)
- Rachel Sparks
- Department of Biomedical Engineering, Rutgers University, Piscataway, NJ, 08854
- Department of Biomedical Engineering, Case Western Reserve University, Cleveland, OH, 44106
- Anant Madabhushi
- Department of Biomedical Engineering, Case Western Reserve University, Cleveland, OH, 44106
|
20
|
Ginsburg S, Ali S, Lee G, Basavanhally A, Madabhushi A. Variable importance in nonlinear kernels (VINK): classification of digitized histopathology. Medical Image Computing and Computer-Assisted Intervention : MICCAI ... International Conference on Medical Image Computing and Computer-Assisted Intervention 2013; 16:238-45. [PMID: 24579146 DOI: 10.1007/978-3-642-40763-5_30] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 12/24/2022]
Abstract
Quantitative histomorphometry is the process of modeling the appearance of disease morphology on digitized histopathology images via image-based features (e.g., texture, graphs). Due to the curse of dimensionality, building classifiers with large numbers of features requires feature selection (which may require a large training set) or dimensionality reduction (DR). DR methods map the original high-dimensional features in terms of eigenvectors and eigenvalues, which limits the potential for feature transparency or interpretability. Although methods exist for variable selection and ranking on embeddings obtained via linear DR schemes (e.g., principal components analysis (PCA)), similar methods do not yet exist for nonlinear DR (NLDR) methods. In this work we present a simple yet elegant method for approximating the mapping between the data in the original feature space and the transformed data in the kernel PCA (KPCA) embedding space; this mapping provides the basis for quantification of variable importance in nonlinear kernels (VINK). We show how VINK can be implemented in conjunction with the popular Isomap and Laplacian eigenmap algorithms. VINK is evaluated in the contexts of three different problems in digital pathology: (1) predicting five-year PSA failure following radical prostatectomy, (2) predicting Oncotype DX recurrence risk scores for ER+ breast cancers, and (3) distinguishing good and poor outcome p16+ oropharyngeal tumors. We demonstrate that subsets of features identified by VINK provide similar or better classification or regression performance compared to the original high-dimensional feature sets.
Affiliation(s)
- Shoshana Ginsburg
- Department of Biomedical Engineering, Case Western Reserve University, USA
- Sahirzeeshan Ali
- Department of Biomedical Engineering, Case Western Reserve University, USA
- George Lee
- Department of Biomedical Engineering, Rutgers University, USA
- Anant Madabhushi
- Department of Biomedical Engineering, Case Western Reserve University, USA
|