1
Bernaola N, Michiels M, Larrañaga P, Bielza C. Learning massive interpretable gene regulatory networks of the human brain by merging Bayesian networks. PLoS Comput Biol 2023; 19:e1011443. [PMID: 38039337 PMCID: PMC10745139 DOI: 10.1371/journal.pcbi.1011443] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2022] [Revised: 12/22/2023] [Accepted: 08/19/2023] [Indexed: 12/03/2023]
Abstract
We present the Fast Greedy Equivalence Search (FGES)-Merge, a new method for learning the structure of gene regulatory networks via merging locally learned Bayesian networks, based on the fast greedy equivalence search algorithm. The method is competitive with the state of the art in terms of the Matthews correlation coefficient, which takes into account both precision and recall, while also improving upon it in terms of speed, scaling up to tens of thousands of variables and being able to use empirical knowledge about the topological structure of gene regulatory networks. To showcase the ability of our method to scale to massive networks, we apply it to learning the gene regulatory network for the full human genome using data from samples of different brain structures (from the Allen Human Brain Atlas). Furthermore, this Bayesian network model should predict interactions between genes in a way that is clear to experts, following the current trends in explainable artificial intelligence. To achieve this, we also present a new open-access visualization tool that facilitates the exploration of massive networks and can aid in finding nodes of interest for experimental tests.
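As a rough illustration of the evaluation metric named in this abstract, the sketch below scores a predicted network against a reference network with the Matthews correlation coefficient. It is a minimal example under stated assumptions, not the authors' evaluation code: edges are compared as undirected pairs, and the function names and the toy node labels are hypothetical.

```python
import math
from itertools import combinations


def edge_confusion(true_edges, predicted_edges, nodes):
    """Count TP, FP, FN, TN over all unordered node pairs (edge direction ignored)."""
    truth = {frozenset(e) for e in true_edges}
    pred = {frozenset(e) for e in predicted_edges}
    tp = fp = fn = tn = 0
    for pair in combinations(nodes, 2):
        key = frozenset(pair)
        in_truth, in_pred = key in truth, key in pred
        if in_truth and in_pred:
            tp += 1
        elif in_pred:
            fp += 1
        elif in_truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def matthews_corrcoef(tp, fp, fn, tn):
    """MCC from the four confusion-matrix cells; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy example with hypothetical node labels (not real data).
nodes = ["A", "B", "C", "D"]
reference = [("A", "B"), ("B", "C")]
predicted = [("B", "A"), ("C", "D")]
print(matthews_corrcoef(*edge_confusion(reference, predicted, nodes)))
```

Because true negatives enter the formula, the coefficient stays informative for sparse networks, where most node pairs have no edge.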
Affiliation(s)
- Niko Bernaola
  - Computational Intelligence Group, Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid, Madrid, Spain
- Mario Michiels
  - Centro Integral de Neurociencias Abarca Campal, Hospital Universitario HM Puerta del Sur, Madrid, Spain
- Pedro Larrañaga
  - Computational Intelligence Group, Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid, Madrid, Spain
- Concha Bielza
  - Computational Intelligence Group, Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid, Madrid, Spain
2
Deng WQ, Craiu RV. Exploring dimension learning via a penalized probabilistic principal component analysis. J STAT COMPUT SIM 2022. [DOI: 10.1080/00949655.2022.2100890] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/16/2022]
Affiliation(s)
- Wei Q. Deng
  - Department of Psychiatry and Behavioural Neurosciences, McMaster University, Hamilton, Canada
  - Peter Boris Centre for Addictions Research, St. Joseph's Healthcare Hamilton, Hamilton, Canada
- Radu V. Craiu
  - Department of Statistical Sciences, University of Toronto, Toronto, Canada
3
Tang W, Zhou H, Quan T, Chen X, Zhang H, Lin Y, Wu R. XGboost Prediction Model Based on 3.0T Diffusion Kurtosis Imaging Improves the Diagnostic Accuracy of MRI BiRADS 4 Masses. Front Oncol 2022; 12:833680. [PMID: 35372060 PMCID: PMC8968064 DOI: 10.3389/fonc.2022.833680] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/12/2021] [Accepted: 02/21/2022] [Indexed: 02/05/2023]
Abstract
BACKGROUND The malignant probability of MRI BiRADS 4 breast lesions ranges from 2% to 95%, leading to unnecessary biopsies. The purpose of this study was to construct an optimal XGboost prediction model using diffusion kurtosis imaging (DKI), independently or jointly with other MR imaging features and clinical characteristics, which was expected to reduce the false positive rate of MRI BiRADS 4 masses and improve the diagnostic efficiency for breast cancer. METHODS 120 patients with 158 breast lesions were enrolled. DKI, Diffusion-weighted Imaging (DWI), Proton Magnetic Resonance Spectroscopy (1H-MRS) and Dynamic Contrast-Enhanced MRI (DCE-MRI) were performed on a 3.0-T scanner. The Wilcoxon signed-rank test and χ2 test were used to compare patients' clinical characteristics, mean kurtosis (MK), mean diffusivity (MD), apparent diffusion coefficient (ADC), total choline (tCho) peak, extravascular extracellular volume fraction (Ve), flux rate constant (Kep) and volume transfer constant (Ktrans). ROC curve analysis was used to analyze the diagnostic performance of the imaging parameters. Spearman correlation analysis was performed to evaluate the associations of imaging parameters with prognostic factors and breast cancer molecular subtypes. The least absolute shrinkage and selection operator (LASSO) and the area under the curve (AUC) of imaging parameters were used to select discriminative features for differentiating benign breast lesions from malignant ones. Finally, an XGboost prediction model was constructed based on the discriminative features and its diagnostic efficiency was verified in BiRADS 4 masses. RESULTS MK derived from DKI performed better for differentiating between malignant and benign lesions than ADC, MD, tCho, Kep and Ktrans (p < 0.05). Also, MK was shown to be more strongly correlated with histological grade, Ki-67 expression and lymph node status.
MD, MK, age, shape and menstrual status were selected to be the optimized feature subsets to construct an XGboost model, which exhibited superior diagnostic ability for breast cancer characterization and an improved evaluation of suspicious breast tumors in MRI BiRADS 4. CONCLUSIONS DKI is promising for breast cancer diagnosis and prognostic factor assessment. An optimized XGboost model that included DKI, age, shape and menstrual status is effective in improving the diagnostic accuracy of BiRADS 4 masses.
Affiliation(s)
- Wan Tang
  - Radiology Department, Second Affiliated Hospital of Shantou University Medical College, Shantou, China
  - Institute of Health Monitoring, Inspection and Protection, Hubei Provincial Center for Disease Control and Prevention, Wuhan, China
- Han Zhou
  - Radiology Department, Second Affiliated Hospital of Shantou University Medical College, Shantou, China
- Tianhong Quan
  - Department of Electronic and Information Engineering, College of Engineering, Shantou University, Shantou, China
- Xiaoyan Chen
  - Radiology Department, Second Affiliated Hospital of Shantou University Medical College, Shantou, China
- Huanian Zhang
  - Radiology Department, Second Affiliated Hospital of Shantou University Medical College, Shantou, China
- Yan Lin
  - Radiology Department, Second Affiliated Hospital of Shantou University Medical College, Shantou, China
  - Guangdong Provincial Key Laboratory for Breast Cancer Diagnosis and Treatment, Cancer Hospital of Shantou University Medical College, Shantou, China
- Renhua Wu
  - Radiology Department, Second Affiliated Hospital of Shantou University Medical College, Shantou, China
  - Guangdong Provincial Key Laboratory for Breast Cancer Diagnosis and Treatment, Cancer Hospital of Shantou University Medical College, Shantou, China
4
Cisneros-Villanueva M, Hidalgo-Pérez L, Cedro-Tanda A, Peña-Luna M, Mancera-Rodríguez MA, Hurtado-Cordova E, Rivera-Salgado I, Martínez-Aguirre A, Jiménez-Morales S, Alfaro-Ruiz LA, Arellano-Llamas R, Tenorio-Torres A, Domínguez-Reyes C, Villegas-Carlos F, Ríos-Romero M, Hidalgo-Miranda A. LINC00460 Is a Dual Biomarker That Acts as a Predictor for Increased Prognosis in Basal-Like Breast Cancer and Potentially Regulates Immunogenic and Differentiation-Related Genes. Front Oncol 2021; 11:628027. [PMID: 33912452 PMCID: PMC8074675 DOI: 10.3389/fonc.2021.628027] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 11/10/2020] [Accepted: 03/10/2021] [Indexed: 12/23/2022]
Abstract
Breast cancer (BRCA) is a serious public health problem, as it is the most frequent malignant tumor in women worldwide. BRCA is a molecularly heterogeneous disease, particularly at the gene expression (mRNA) level. Recent evidence shows that coding RNAs represent only 34% of the total transcriptome in a human cell. The remaining 66% of RNAs are non-coding, so we might be missing relevant biological, clinical or regulatory information. In this report, we identified two novel tumor types from TCGA with LINC00460 deregulation. We used survival analysis to demonstrate that LINC00460 expression is a marker for poor overall (OS), relapse-free (RFS) and distant metastasis-free survival (DMFS) in basal-like BRCA patients. LINC00460 expression is a potential marker for aggressive phenotypes in distinct tumors, including HPV-negative HNSC, stage IV KIRC, locally advanced lung cancer and basal-like BRCA. We show that the prognostic effect of LINC00460 expression is tissue-specific, since its upregulation can predict poor OS in some tumors, but also predicts an improved clinical course in BRCA patients. We found that LINC00460 expression is significantly enriched in the Basal-like 2 (BL2) TNBC subtype and potentially regulates the WNT differentiation pathway. LINC00460 can also modulate a plethora of immunogenic-related genes in BRCA, such as SFRP5, FOSL1, IFNK, CSF2, DUSP7 and IL1A, and interacts in silico with miR-103-a-1, which, in turn, can no longer target WNT7A. Finally, the LINC00460:WNT7A ratio constitutes a composite marker for decreased OS and DMFS in Basal-like BRCA, and can predict anthracycline therapy response in ER-BRCA patients. This evidence confirms that LINC00460 is a master regulator in BRCA molecular circuits and influences clinical outcome.
Affiliation(s)
- Mireya Cisneros-Villanueva
  - Laboratorio de Genómica del Cáncer, Instituto Nacional de Medicina Genómica (INMEGEN), Ciudad de México, México
  - Laboratorio de Epigenética del Cáncer, Facultad de Ciencias Químico Biológicas, Universidad Autónoma de Guerrero, Chilpancingo de los Bravo, Mexico
- Lizbett Hidalgo-Pérez
  - Laboratorio de Genómica del Cáncer, Instituto Nacional de Medicina Genómica (INMEGEN), Ciudad de México, México
  - Programa de Doctorado en Ciencias Biomédicas, Facultad de Medicina, Universidad Nacional Autónoma de México (UNAM), Ciudad de México, Mexico
- Alberto Cedro-Tanda
  - Laboratorio de Genómica del Cáncer, Instituto Nacional de Medicina Genómica (INMEGEN), Ciudad de México, México
- Mónica Peña-Luna
  - Laboratorio de Genómica del Cáncer, Instituto Nacional de Medicina Genómica (INMEGEN), Ciudad de México, México
- Eduardo Hurtado-Cordova
  - Laboratorio de Genómica del Cáncer, Instituto Nacional de Medicina Genómica (INMEGEN), Ciudad de México, México
- Irene Rivera-Salgado
  - Departamento de Anatomía Patológica, Hospital Central Sur de Alta Especialidad, Petróleos Mexicanos, Ciudad de México, México
- Alejandro Martínez-Aguirre
  - Departamento de Anatomía Patológica, Hospital Central Sur de Alta Especialidad, Petróleos Mexicanos, Ciudad de México, México
- Silvia Jiménez-Morales
  - Laboratorio de Genómica del Cáncer, Instituto Nacional de Medicina Genómica (INMEGEN), Ciudad de México, México
- Luis Alberto Alfaro-Ruiz
  - Laboratorio de Genómica del Cáncer, Instituto Nacional de Medicina Genómica (INMEGEN), Ciudad de México, México
- Rocío Arellano-Llamas
  - Laboratorio de Genómica del Cáncer, Instituto Nacional de Medicina Genómica (INMEGEN), Ciudad de México, México
- Magdalena Ríos-Romero
  - Laboratorio de Genómica del Cáncer, Instituto Nacional de Medicina Genómica (INMEGEN), Ciudad de México, México
  - Posgrado en Ciencias Biológicas, Unidad de Posgrado, Universidad Nacional Autónoma de México (UNAM), Ciudad de México, México
- Alfredo Hidalgo-Miranda
  - Laboratorio de Genómica del Cáncer, Instituto Nacional de Medicina Genómica (INMEGEN), Ciudad de México, México
5
A prognostic model for overall survival of patients with early-stage non-small cell lung cancer: a multicentre, retrospective study. LANCET DIGITAL HEALTH 2020; 2:e594-e606. [PMID: 33163952 PMCID: PMC7646741 DOI: 10.1016/s2589-7500(20)30225-9] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Indexed: 12/21/2022]
Abstract
Background Intratumoural heterogeneity has been previously shown to be related to clonal evolution and genetic instability and associated with tumour progression. Phenotypically, it is reflected in the diversity of appearance and morphology within cell populations. Computer-extracted features relating to tumour cellular diversity on routine tissue images might correlate with outcome. This study investigated the prognostic ability of computer-extracted features of tumour cellular diversity (CellDiv) from haematoxylin and eosin (H&E)-stained histology images of non-small cell lung carcinomas (NSCLCs). Methods In this multicentre, retrospective study, we included 1057 patients with early-stage NSCLC with corresponding diagnostic histology slides and overall survival information from four different centres. CellDiv features quantifying local cellular morphological diversity from H&E-stained histology images were extracted from the tumour epithelium region. A Cox proportional hazards model based on CellDiv was used to construct risk scores for lung adenocarcinoma (LUAD; 270 patients) and lung squamous cell carcinoma (LUSC; 216 patients) separately using data from two of the cohorts, and was validated in the two remaining independent cohorts (comprising 236 patients with LUAD and 335 patients with LUSC). We used multivariable Cox regression analysis to examine the predictive ability of CellDiv features for 5-year overall survival, controlling for the effects of clinical and pathological parameters. We did a gene set enrichment and Gene Ontology analysis on 405 patients to identify associations with differentially expressed biological pathways implicated in lung cancer pathogenesis. Findings For prognosis of patients with early-stage LUSC, the CellDiv LUSC model included 11 discriminative CellDiv features, whereas for patients with early-stage LUAD, the model included 23 features. 
In the independent validation cohorts, patients predicted to be at a higher risk by the univariable CellDiv model had significantly worse 5-year overall survival (hazard ratio 1·48 [95% CI 1·06–2·08]; p=0·022 for The Cancer Genome Atlas [TCGA] LUSC group, 2·24 [1·04–4·80]; p=0·039 for the University of Bern LUSC group, and 1·62 [1·15–2·30]; p=0·0058 for the TCGA LUAD group). The identified CellDiv features were also found to be strongly associated with apoptotic signalling and cell differentiation pathways. Interpretation CellDiv features were strongly prognostic of 5-year overall survival in patients with early-stage NSCLC and also associated with apoptotic signalling and cell differentiation pathways. The CellDiv-based risk stratification model could potentially help to determine which patients with early-stage NSCLC might receive added benefit from adjuvant therapy. Funding National Institutes of Health and US Department of Defense.
6
Abstract
We consider data-analysis settings where data are missing not at random. In these cases, the two basic modeling approaches are 1) pattern-mixture models, with separate distributions for missing data and observed data, and 2) selection models, with a distribution for the data before observation and a missing-data mechanism that selects which data are observed. These two modeling approaches lead to distinct factorizations of the joint distribution of the data and the missing-data indicators. In this paper, we explore a third approach, apparently originally proposed by J. W. Tukey as a remark in a discussion between Rubin and Hartigan, and reported by Holland in a two-page note, which has so far been neglected. Data analyses typically rely upon assumptions about the missingness mechanisms that lead to observed versus missing data, assumptions that are typically unassessable. We explore an approach where the joint distribution of observed data and missing data is specified in a nonstandard way. In this formulation, which traces back to a representation of the joint distribution of the data and missingness mechanism, apparently first proposed by J. W. Tukey, the modeling assumptions about the distributions are either assessable or are designed to allow relatively easy incorporation of substantive knowledge about the problem at hand, thereby offering a possibly realistic portrayal of the data, both observed and missing. We develop Tukey’s representation for exponential-family models, propose a computationally tractable approach to inference in this class of models, and offer some general theoretical comments. We then illustrate the utility of this approach with an example in systems biology.
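The three factorizations contrasted in this abstract can be written out as follows. The notation is an assumption introduced here for illustration (it is not taken from the paper): $y$ denotes the data, $r$ the missingness indicator, with $r = 1$ for observed and $r = 0$ for missing. The third line follows from Bayes' rule applied to the first.

```latex
\begin{align*}
\text{Selection model:} \quad
  & p(y, r) = p(y)\, p(r \mid y) \\
\text{Pattern-mixture model:} \quad
  & p(y, r) = p(r)\, p(y \mid r) \\
\text{Tukey's representation:} \quad
  & p(y, r) = p(y \mid r = 1)\, p(r = 1)\,
    \frac{p(r \mid y)}{p(r = 1 \mid y)} \\
\text{so that} \quad
  & p(y \mid r = 0) = p(y \mid r = 1)\,
    \frac{p(r = 0 \mid y)}{p(r = 1 \mid y)}\,
    \frac{p(r = 1)}{p(r = 0)}
\end{align*}
```

In the last form, the density of the observed data $p(y \mid r = 1)$ can be checked against the data, while the missingness mechanism $p(r \mid y)$ is where substantive knowledge would enter.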
7
Pan T, Yin Y. Improving the accuracy of identifying the lognormal curve in the Johnson system. COMMUN STAT-SIMUL C 2020. [DOI: 10.1080/03610918.2018.1494834] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/28/2022]
Affiliation(s)
- Yue Yin
  - University of Illinois at Chicago, Chicago, Illinois, USA
8
Guvakova MA. Improving patient classification and biomarker assessment using Gaussian Mixture Models and Bayes' rule. Oncoscience 2020; 6:383-385. [PMID: 31984216 PMCID: PMC6959929 DOI: 10.18632/oncoscience.494] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 08/26/2019] [Accepted: 10/01/2019] [Indexed: 11/25/2022]
Abstract
In clinical research, determining cutoff values for continuous variables in test results remains challenging, particularly when considering candidate biomarkers or therapeutic targets for disease. Distribution of a continuous variable into two populations is known as dichotomization and has been commonly used in clinical studies. We recently reported a new method for determining multiple cutoffs for continuous variables. The development of this original approach was based on fitting Gaussian Mixture Models (GMM) onto real-world clinical data. We also explored how to leverage Bayesian probability to minimize uncertainty while classifying individual patients into respective subpopulations. In addition, we investigated the performance of the proposed method for the distribution of classical prognostic markers in breast cancer. Finally, we applied the proposed method to analyze a candidate marker and a target for cancer therapy. Here, we present an overview of this method and our prospects for its implementation in biomedical and clinical research.
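The classification step described in this abstract, assigning a patient to a subpopulation with Bayes' rule once a Gaussian mixture has been fitted, can be sketched as below. This is only an illustration under stated assumptions: the mixture parameters are made up rather than taken from the paper, the fitting step (typically expectation-maximization) is omitted, and the function names are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_probabilities(x, components):
    """Bayes' rule: P(component k | x) for components given as (weight, mean, sd)."""
    joint = [w * normal_pdf(x, m, s) for (w, m, s) in components]
    total = sum(joint)
    return [j / total for j in joint]


def classify(x, components):
    """Return the index of the most probable component and its posterior."""
    post = posterior_probabilities(x, components)
    k = max(range(len(post)), key=post.__getitem__)
    return k, post[k]


# Hypothetical two-component mixture (illustrative values only):
# a "low expression" and a "high expression" subpopulation.
mixture = [(0.5, 0.0, 1.0), (0.5, 2.0, 1.0)]
print(classify(0.0, mixture))
print(posterior_probabilities(1.0, mixture))
```

Reporting the posterior probability alongside the assigned class, rather than a hard cutoff alone, is what lets the uncertainty of each individual classification be stated explicitly.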
Affiliation(s)
- Marina A Guvakova
  - Department of Surgery, Division of Endocrine & Oncologic Surgery, Harrison Department of Surgical Research, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA
9
Church BV, Williams HT, Mar JC. Investigating skewness to understand gene expression heterogeneity in large patient cohorts. BMC Bioinformatics 2019; 20:668. [PMID: 31861976 PMCID: PMC6923883 DOI: 10.1186/s12859-019-3252-0] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 12/19/2022]
Abstract
BACKGROUND Skewness is an under-utilized statistical measure that captures the degree of asymmetry in the distribution of any dataset. This study applied a new metric based on skewness to identify regulators or genes that have outlier expression in large patient cohorts. RESULTS We investigated whether specific patterns of skewed expression were related to the enrichment of biological pathways or genomic properties like DNA methylation status. Our study used publicly available datasets that were generated using both RNA-sequencing and microarray technology platforms. For comparison, the datasets selected for this study also included different samples derived from control donors and cancer patients. When comparing the shift in expression skewness between cancer and control datasets, we observed an enrichment of pathways related to immune function that reflects an increase towards positive skewness in the cancer relative to control datasets. A significant correlation was also detected between expression skewness and the top 500 genes corresponding to the most significant differential DNA methylation occurring in the promoter regions for four Cancer Genome Atlas cancer cohorts. CONCLUSIONS Our results indicate that expression skewness can reveal new insights into transcription based on outlier and asymmetrical behaviour in large patient cohorts.
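For readers unfamiliar with the underlying statistic, the sketch below computes the standard moment-based sample skewness for one gene across a cohort. It is an assumption-laden illustration: it uses the textbook coefficient (third central moment divided by the 1.5 power of the second), not the new skewness-based metric introduced by the study, and the expression values are invented.

```python
import math


def skewness(values):
    """Moment-based sample skewness g1 = m3 / m2**1.5; returns 0.0 for constant data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0
    return m3 / m2 ** 1.5


# Hypothetical expression values for one gene across patients (not real data).
symmetric = [1, 2, 3, 4, 5]
outlier_high = [1, 1, 1, 10]   # one patient with unusually high expression
print(skewness(symmetric))     # zero for a symmetric sample
print(skewness(outlier_high))  # positive: long right tail
```

A positive value indicates a long right tail, which is the pattern a gene shows when a small subset of patients has outlier high expression.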
Affiliation(s)
- Benjamin V. Church
  - Department of Systems and Computational Biology, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, 10461 NY USA
  - Department of Mathematics, Columbia University, 2990 Broadway, New York, 10027 NY USA
- Henry T. Williams
  - Department of Systems and Computational Biology, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, 10461 NY USA
  - Department of Mathematics, Columbia University, 2990 Broadway, New York, 10027 NY USA
- Jessica C. Mar
  - Department of Systems and Computational Biology, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, 10461 NY USA
  - Department of Epidemiology and Population Health, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, 10461 NY USA
  - Australian Institute for Bioengineering and Nanotechnology, The University of Queensland, Brisbane, 4072 QLD Australia
10
Prabakaran I, Wu Z, Lee C, Tong B, Steeman S, Koo G, Zhang PJ, Guvakova MA. Gaussian Mixture Models for Probabilistic Classification of Breast Cancer. Cancer Res 2019; 79:3492-3502. [PMID: 31113820 DOI: 10.1158/0008-5472.can-19-0573] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 02/25/2019] [Revised: 04/12/2019] [Accepted: 05/17/2019] [Indexed: 11/16/2022]
Abstract
In the era of omics-driven research, it remains a common dilemma to stratify individual patients based on the molecular characteristics of their tumors. To improve molecular stratification of patients with breast cancer, we developed the Gaussian mixture model (GMM)-based classifier. This probabilistic classifier was built on mRNA expression data from more than 300 clinical samples of breast cancer and healthy tissue and was validated on datasets of ESR1, PGR, and ERBB2, which encode standard clinical markers and therapeutic targets. To demonstrate how a GMM approach could be exploited for multiclass classification using data from a candidate marker, we analyzed the insulin-like growth factor I receptor (IGF1R), a promising target, but a marker of uncertain importance in breast cancer. The GMM defined subclasses with downregulated (40%), unchanged (39%), upregulated (19%), and overexpressed (2%) IGF1R levels; inter- and intrapatient analyses of IGF1R transcript and protein levels supported these predictions. Overexpressed IGF1R was observed in a small percentage of tumors. Samples with unchanged and upregulated IGF1R were differentiated tumors, and downregulation of IGF1R correlated with poorly differentiated, high-risk hormone receptor-negative and HER2-positive tumors. A similar correlation was found in the independent cohort of carcinoma in situ, suggesting that loss or low expression of IGF1R is a marker of aggressiveness in subsets of preinvasive and invasive breast cancer. These results demonstrate the importance of probabilistic modeling that delves deeper into molecular data and aims to improve diagnostic classification, prognostic assessment, and treatment selection. SIGNIFICANCE: A GMM classifier demonstrates potential use for clinical validation of markers and determination of target populations, particularly when availability of specimens for marker development is low.
MESH Headings
- Biomarkers, Tumor/genetics
- Biomarkers, Tumor/metabolism
- Breast Neoplasms/classification
- Breast Neoplasms/genetics
- Breast Neoplasms/metabolism
- Breast Neoplasms/pathology
- Case-Control Studies
- Cohort Studies
- Female
- Humans
- Models, Statistical
- Neoplasm Invasiveness
- Prognosis
- Receptor, ErbB-2/genetics
- Receptor, ErbB-2/metabolism
- Receptor, IGF Type 1/genetics
- Receptor, IGF Type 1/metabolism
- Receptors, Estrogen/genetics
- Receptors, Estrogen/metabolism
- Receptors, Progesterone/genetics
- Receptors, Progesterone/metabolism
Affiliation(s)
- Indira Prabakaran
  - Department of Surgery, Division of Endocrine & Oncologic Surgery, Harrison Department of Surgical Research, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania
- Zhengdong Wu
  - Department of Materials Science and Engineering, School of Engineering and Applied Science, Philadelphia, Pennsylvania
- Changgun Lee
  - Finance Department, Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania
- Brian Tong
  - Department of Surgery, Division of Endocrine & Oncologic Surgery, Harrison Department of Surgical Research, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania
- Samantha Steeman
  - Department of Surgery, Division of Endocrine & Oncologic Surgery, Harrison Department of Surgical Research, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania
- Gabriel Koo
  - Department of Surgery, Division of Endocrine & Oncologic Surgery, Harrison Department of Surgical Research, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania
- Paul J Zhang
  - Department of Pathology and Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania
- Marina A Guvakova
  - Department of Surgery, Division of Endocrine & Oncologic Surgery, Harrison Department of Surgical Research, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania.
11
Mar JC. The rise of the distributions: why non-normality is important for understanding the transcriptome and beyond. Biophys Rev 2019; 11:89-94. [PMID: 30617454 PMCID: PMC6381358 DOI: 10.1007/s12551-018-0494-4] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 10/30/2018] [Accepted: 12/17/2018] [Indexed: 01/08/2023]
Abstract
The application of statistics has been instrumental in clarifying our understanding of the genome. While insights have been derived for almost all levels of genome function, most importantly, statistics has had the greatest impact on improving our knowledge of transcriptional regulation. But the drive to extract the most meaningful inferences from big data can often force us to overlook the fundamental role that statistics plays, and specifically, the basic assumptions that we make about big data. Normality is a statistical property that is often swept up into an assumption that we may or may not be consciously aware of making. This review highlights the inherent value of non-normal distributions to big data analysis by discussing use cases of non-normality that focus on gene expression data. Collectively, these examples help to motivate the premise of why at this stage, now more than ever, non-normality is important for learning about gene regulation, transcriptomics, and more.
Affiliation(s)
- Jessica C Mar
  - Australian Institute for Bioengineering and Nanotechnology, University of Queensland, QLD, Brisbane, 4072, Australia.
12
Bhadra A, Rao A, Baladandayuthapani V. Inferring network structure in non-normal and mixed discrete-continuous genomic data. Biometrics 2018; 74:185-195. [PMID: 28437848 PMCID: PMC5654714 DOI: 10.1111/biom.12711] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 12/01/2015] [Revised: 02/01/2017] [Accepted: 03/01/2017] [Indexed: 11/28/2022]
Abstract
Inferring dependence structure through undirected graphs is crucial for uncovering the major modes of multivariate interaction among high-dimensional genomic markers that are potentially associated with cancer. Traditionally, conditional independence has been studied using sparse Gaussian graphical models for continuous data and sparse Ising models for discrete data. However, there are two clear situations when these approaches are inadequate. The first occurs when the data are continuous but display non-normal marginal behavior such as heavy tails or skewness, rendering an assumption of normality inappropriate. The second occurs when a part of the data is ordinal or discrete (e.g., presence or absence of a mutation) and the other part is continuous (e.g., expression levels of genes or proteins). In this case, the existing Bayesian approaches typically employ a latent variable framework for the discrete part that precludes inferring conditional independence among the data that are actually observed. The current article overcomes these two challenges in a unified framework using Gaussian scale mixtures. Our framework is able to handle continuous data that are not normal and data that are of mixed continuous and discrete nature, while still being able to infer a sparse conditional sign independence structure among the observed data. Extensive performance comparison in simulations with alternative techniques and an analysis of a real cancer genomics data set demonstrate the effectiveness of the proposed approach.
Affiliation(s)
- Anindya Bhadra
  - Department of Statistics, Purdue University, 250 N. University Street, West Lafayette, Indiana 47907, U.S.A
- Arvind Rao
  - Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, 1400 Pressler Dr., Houston, Texas 77030, U.S.A
- Veerabhadran Baladandayuthapani
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, 1400 Pressler Dr., Houston, Texas 77030, U.S.A
13
Wang Y, Qian W, Yuan B. A Graphical Model of Smoking-Induced Global Instability in Lung Cancer. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:1-14. [PMID: 27542180 DOI: 10.1109/tcbb.2016.2599867] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 06/06/2023]
Abstract
Smoking is the major cause of lung cancer and the leading cause of cancer-related death in the world. The current view of lung cancer is no longer limited to individual genes being mutated by carcinogenic insults from smoking. Instead, tumorigenesis is a phenotype conferred by many systematic and global alterations, leading to extensive heterogeneity and variation in both the genotypes and phenotypes of individual cancer cells. Thus, it is strategically important to develop a methodology that captures consistent and global alterations presumably shared by most of the cancerous cells in a given population. This is particularly true because almost all of the data collected from solid cancers (including lung cancers) are usually far apart in temporal or even spatial context. Here, we report a multiple non-Gaussian graphical model to reconstruct the gene interaction network using two previously published gene expression datasets. Our graphical model aims to selectively detect gross structural changes at the level of gene interaction networks. Our methodology is extensively validated, demonstrating good robustness, as well as the selectivity and specificity expected based on our biological insights. In summary, gene regulatory networks are still relatively stable during what is presumably the early stage of neoplastic transformation. But drastic structural differences can be found between lung cancer and its normal control, including the gain of functional modules for cellular proliferation, such as EGFR and PDGFRA, as well as the loss of the important IL6 module, supporting their roles as potential drug targets. Interestingly, our method can also detect early modular changes, with ALDH3A1 and its associated interactions being strongly implicated as a potential early marker, whose activation appears to alter the LCN2 module as well as its interactions with the important TP53-MDM2 circuitry. Our strategy of using the graphical model to reconstruct gene interaction networks with biologically-inspired constraints exemplifies the importance and beauty of biology in developing any bio-computational approach.
|
14
|
Saberkari H, Shamsi M, Joroughi M, Golabi F, Sedaaghi MH. Cancer Classification in Microarray Data using a Hybrid Selective Independent Component Analysis and υ-Support Vector Machine Algorithm. JOURNAL OF MEDICAL SIGNALS & SENSORS 2014; 4:291-8. [PMID: 25426433 PMCID: PMC4236808] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2014] [Accepted: 07/31/2014] [Indexed: 11/20/2022]
Abstract
Microarray data play an important role in the identification and classification of cancer tissues. The small number of microarray samples available in cancer research is a persistent concern that complicates classifier design. For this reason, gene selection should be applied as a preprocessing step before classification to remove noninformative genes from the microarray data. An appropriate gene selection method can significantly improve the performance of cancer classification. In this paper, we use selective independent component analysis (SICA) to reduce the dimension of microarray data. This selective algorithm solves the instability problem that occurs when conventional independent component analysis (ICA) methods are employed. First, the reconstruction error is analyzed and a selective set is formed from the independent components of each gene that contribute only a small part of the error when reconstructing a new sample. Then, several sub-classifiers based on the modified support vector machine (υ-SVM) algorithm are trained simultaneously. Finally, the sub-classifier with the highest recognition rate is selected. The proposed algorithm is applied to three cancer datasets (leukemia, breast cancer, and lung cancer), and its results are compared with those of other existing methods. The results show that the proposed algorithm (SICA + υ-SVM) achieves higher classification accuracy and validity. In particular, it exhibits a relative improvement of 3.3% in correctness rate over the ICA + SVM and SVM algorithms on the lung cancer dataset.
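The final step described in this abstract, training several sub-classifiers and keeping the one with the highest recognition rate, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a nearest-centroid classifier stands in for υ-SVM, candidate feature subsets stand in for the selected independent components, and every function name and all of the toy data are hypothetical.

```python
from statistics import mean


def fit_centroids(rows, labels, features):
    """Compute one mean vector per class over the chosen feature indices."""
    centroids = {}
    for label in set(labels):
        members = [r for r, y in zip(rows, labels) if y == label]
        centroids[label] = [mean(r[f] for r in members) for f in features]
    return centroids


def predict(centroids, row, features):
    """Return the class whose centroid is nearest (squared Euclidean distance)."""
    point = [row[f] for f in features]

    def dist(label):
        return sum((p - c) ** 2 for p, c in zip(point, centroids[label]))

    # sorted() makes tie-breaking deterministic
    return min(sorted(centroids), key=dist)


def recognition_rate(centroids, rows, labels, features):
    """Fraction of rows whose predicted class matches the true label."""
    hits = sum(predict(centroids, r, features) == y for r, y in zip(rows, labels))
    return hits / len(rows)


def select_best_subclassifier(train, train_y, valid, valid_y, candidate_sets):
    """Train one sub-classifier per feature subset; keep the best on validation data."""
    scored = []
    for features in candidate_sets:
        model = fit_centroids(train, train_y, features)
        scored.append((recognition_rate(model, valid, valid_y, features), features))
    best_rate, best_features = max(scored, key=lambda s: s[0])
    return best_features, best_rate


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is misleading.
train = [[0.1, 3.0], [0.2, 1.0], [0.9, 5.0], [1.0, 3.0]]
train_y = ["normal", "normal", "tumor", "tumor"]
valid = [[0.15, 4.5], [0.95, 1.5]]
valid_y = ["normal", "tumor"]
candidate_sets = [(0,), (1,), (0, 1)]

best_features, best_rate = select_best_subclassifier(
    train, train_y, valid, valid_y, candidate_sets
)
print(best_features, best_rate)
```

Scoring on held-out samples rather than on the training samples matters here because, as the abstract notes, microarray studies have few samples, so training accuracy alone would favor overfitted sub-classifiers.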
Affiliation(s)
- Hamidreza Saberkari
- Department of Electrical Engineering, Genomic Signal Processing Laboratory, Sahand University of Technology, Tabriz, Iran
- Mousa Shamsi
- Department of Electrical Engineering, Genomic Signal Processing Laboratory, Sahand University of Technology, Tabriz, Iran
- Mahsa Joroughi
- Department of Electrical Engineering, Genomic Signal Processing Laboratory, Sahand University of Technology, Tabriz, Iran
- Faegheh Golabi
- Department of Electrical Engineering, Genomic Signal Processing Laboratory, Sahand University of Technology, Tabriz, Iran
- Mohammad Hossein Sedaaghi
- Department of Electrical Engineering, Genomic Signal Processing Laboratory, Sahand University of Technology, Tabriz, Iran
|