1
Winnicki MJ, Brown CA, Porter HL, Giles CB, Wren JD. BioVDB: biological vector database for high-throughput gene expression meta-analysis. Front Artif Intell 2024; 7:1366273. [PMID: 38525301 PMCID: PMC10957786 DOI: 10.3389/frai.2024.1366273] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/06/2024] [Accepted: 02/26/2024] [Indexed: 03/26/2024] Open
Abstract
High-throughput sequencing has created an exponential increase in the amount of gene expression data, much of which is freely, publicly available in repositories such as NCBI's Gene Expression Omnibus (GEO). Querying this data for patterns such as similarity and distance, however, becomes increasingly challenging as the total amount of data increases. Furthermore, vectorization of the data is commonly required in Artificial Intelligence and Machine Learning (AI/ML) approaches. We present BioVDB, a vector database for storage and analysis of gene expression data, which enhances the potential for integrating biological studies with AI/ML tools. We used a previously developed approach called Automatic Label Extraction (ALE) to extract sample labels from metadata, including age, sex, and tissue/cell-line. BioVDB stores 438,562 samples from eight microarray GEO platforms. We show that it allows for efficient querying of data using similarity search, which can also be useful for identifying and inferring missing labels of samples, and for rapid similarity analysis.
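The similarity search and label inference described in this abstract can be illustrated with a minimal sketch. The abstract does not state which distance metric or database engine BioVDB uses, so cosine similarity, the brute-force scan, the majority vote, and all names below (`cosine_similarity`, `top_k_similar`, `infer_label`, the toy sample IDs and labels) are assumptions made for illustration only.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length expression vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def top_k_similar(query, samples, k=3):
    """Return the k (sample_id, similarity) pairs most similar to `query`.

    Brute-force scan over every stored vector; a real vector database
    would use an index instead (assumption: not shown here).
    """
    scored = [(sid, cosine_similarity(query, vec)) for sid, vec in samples.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def infer_label(query, samples, labels, k=3):
    """Guess a missing label by majority vote over the k nearest labeled samples."""
    labeled = {sid: vec for sid, vec in samples.items() if sid in labels}
    neighbors = top_k_similar(query, labeled, k)
    votes = Counter(labels[sid] for sid, _ in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: three "genes" per sample, two tissues.
toy_samples = {
    "s1": [5.0, 1.0, 0.2],
    "s2": [4.8, 1.1, 0.3],
    "s3": [0.1, 3.0, 6.0],
    "s4": [0.2, 2.8, 5.5],
}
toy_labels = {"s1": "liver", "s2": "liver", "s3": "brain", "s4": "brain"}
unlabeled_query = [5.1, 0.9, 0.25]
predicted_tissue = infer_label(unlabeled_query, toy_samples, toy_labels, k=3)
```

The vote is deliberately simple; weighting neighbors by similarity would be a natural variation, but nothing in the abstract indicates which scheme the authors use.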
Affiliation(s)
- Michał J. Winnicki
  - Genes and Human Disease Research Program, Oklahoma Medical Research Foundation, Oklahoma City, OK, United States
- Chase A. Brown
  - Genes and Human Disease Research Program, Oklahoma Medical Research Foundation, Oklahoma City, OK, United States
  - Oklahoma Center for Neuroscience, University of Oklahoma Health Sciences Center, Oklahoma City, OK, United States
- Hunter L. Porter
  - Genes and Human Disease Research Program, Oklahoma Medical Research Foundation, Oklahoma City, OK, United States
- Cory B. Giles
  - Genes and Human Disease Research Program, Oklahoma Medical Research Foundation, Oklahoma City, OK, United States
- Jonathan D. Wren
  - Genes and Human Disease Research Program, Oklahoma Medical Research Foundation, Oklahoma City, OK, United States
  - Oklahoma Center for Neuroscience, University of Oklahoma Health Sciences Center, Oklahoma City, OK, United States
  - Department of Biochemistry and Molecular Biology, University of Oklahoma Health Sciences Center, Oklahoma City, OK, United States
  - Oklahoma Nathan Shock Center, Oklahoma City, OK, United States
2
Zhu A, Liu Y, Liu Y. Identification of key genes and regulatory mechanisms in adult degenerative scoliosis. J Clin Neurosci 2024; 119:170-179. [PMID: 38103507 DOI: 10.1016/j.jocn.2023.12.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2023] [Revised: 12/05/2023] [Accepted: 12/07/2023] [Indexed: 12/19/2023]
Abstract
BACKGROUND: Adult degenerative scoliosis (ADS) is a spinal disorder, but its pathogenesis remains unclear. Therefore, in this study, we utilized data from the GEO database and explored the key genes and regulatory mechanisms involved in ADS. METHODS: We performed bioinformatics analysis on the GSE209825 dataset from the GEO database. Weighted gene co-expression network analysis (WGCNA) was used to identify ADS-related gene modules, and we performed gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses. We constructed a protein-protein interaction (PPI) network using the STRING database. We validated the specificity of hub genes in ADS using the GSE34095 dataset and plotted ROC curves for the identification of different degenerative spinal diseases based on hub gene expression. RESULTS: We identified 113 differentially expressed lncRNAs. WGCNA identified the MEblack module as having the strongest correlation with ADS. GO and KEGG analyses of the lncRNA target genes revealed their involvement in immune responses, inflammation, cellular processes, and metabolic pathways. Through PPI and ROC analysis, 10 hub genes linked to ADS with a certain degree of specificity were found: ELANE, LTF, DEFA1B, SLC2A4, DEFA1, FAXDC2, LCN2, CTSB, FDFT1, and AURKA. CONCLUSIONS: We identified 10 potential hub genes associated with ADS and constructed a transcription factors (TFs)-lncRNAs-hub genes regulatory network. These findings provide a new direction and research basis for the targeted treatment and mechanism research of ADS.
Affiliation(s)
- Aoran Zhu
  - Department of Spinal Surgery, The First Hospital of Jilin University, Changchun 130021, China
- Ying Liu
  - Department of Spinal Surgery, The First Hospital of Jilin University, Changchun 130021, China
- Yan Liu
  - Department of Spinal Surgery, The First Hospital of Jilin University, Changchun 130021, China
3
Thomaidis GV, Papadimitriou K, Michos S, Chartampilas E, Tsamardinos I. A characteristic cerebellar biosignature for bipolar disorder, identified with fully automatic machine learning. IBRO Neurosci Rep 2023; 15:77-89. [PMID: 38025660 PMCID: PMC10668096 DOI: 10.1016/j.ibneur.2023.06.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/06/2023] [Revised: 05/19/2023] [Accepted: 06/29/2023] [Indexed: 12/01/2023] Open
Abstract
Background: Transcriptomic profile differences between patients with bipolar disorder and healthy controls can be identified using machine learning and can provide information about the potential role of the cerebellum in the pathogenesis of bipolar disorder. With this aim, user-friendly, fully automated machine learning algorithms can achieve extremely high classification scores and disease-related predictive biosignature identification, in short time frames and scaled down to small datasets. Method: A fully automated machine learning platform, based on the most suitable algorithm selection and relevant set of hyper-parameter values, was applied to a preprocessed transcriptomics dataset, in order to produce a model for biosignature selection and to classify subjects into groups of patients and controls. The parent GEO datasets were originally produced from the cerebellar and parietal lobe tissue of deceased bipolar patients and healthy controls, using the Affymetrix Human Gene 1.0 ST Array. Results: Patients and controls were classified into two separate groups, with no close-to-the-boundary cases, and this classification was based on the cerebellar transcriptomic biosignature of 25 features (genes), with Area Under Curve 0.929 and Average Precision 0.955. The biosignature includes both genes previously connected to bipolar disorder, depression, psychosis or epilepsy, and genes not previously linked to any psychiatric disease. Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis revealed participation of 4 identified features in 6 pathways which have also been associated with bipolar disorder. Conclusion: Automated machine learning (AutoML) accurately identified 25 genes that can jointly, in a multivariate fashion, separate bipolar patients from healthy controls with high predictive power. The discovered features lead to new biological insights.
Machine Learning (ML) analysis considers the features in combination (in contrast to standard differential expression analysis), removing both irrelevant and redundant markers, and thus focusing on biological interpretation.
Affiliation(s)
- Georgios V. Thomaidis
  - Greek National Health System, Psychiatric Department, Katerini General Hospital, Katerini, Greece
- Konstantinos Papadimitriou
  - Greek National Health System, G. Papanikolaou General Hospital, Organizational Unit - Psychiatric Hospital of Thessaloniki, Thessaloniki, Greece
- Evangelos Chartampilas
  - Laboratory of Radiology, AHEPA General Hospital, University of Thessaloniki, Thessaloniki, Greece
4
Bhandari N, Walambe R, Kotecha K, Khare SP. A comprehensive survey on computational learning methods for analysis of gene expression data. Front Mol Biosci 2022; 9:907150. [PMID: 36458095 PMCID: PMC9706412 DOI: 10.3389/fmolb.2022.907150] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 03/29/2022] [Accepted: 09/28/2022] [Indexed: 09/19/2023] Open
Abstract
Computational analysis methods including machine learning have a significant impact in the fields of genomics and medicine. High-throughput gene expression analysis methods such as microarray technology and RNA sequencing produce enormous amounts of data. Traditionally, statistical methods are used for comparative analysis of gene expression data. However, more complex analysis for classification of sample observations, or discovery of feature genes requires sophisticated computational approaches. In this review, we compile various statistical and computational tools used in analysis of expression microarray data. Even though the methods are discussed in the context of expression microarrays, they can also be applied for the analysis of RNA sequencing and quantitative proteomics datasets. We discuss the types of missing values, and the methods and approaches usually employed in their imputation. We also discuss methods of data normalization, feature selection, and feature extraction. Lastly, methods of classification and class discovery along with their evaluation parameters are described in detail. We believe that this detailed review will help the users to select appropriate methods for preprocessing and analysis of their data based on the expected outcome.
Affiliation(s)
- Nikita Bhandari
  - Computer Science Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
- Rahee Walambe
  - Electronics and Telecommunication Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
  - Symbiosis Center for Applied AI (SCAAI), Symbiosis International (Deemed University), Pune, India
- Ketan Kotecha
  - Computer Science Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
  - Symbiosis Center for Applied AI (SCAAI), Symbiosis International (Deemed University), Pune, India
- Satyajeet P. Khare
  - Symbiosis School of Biological Sciences, Symbiosis International (Deemed University), Pune, India
5
Hephzibah Cathryn R, Udhaya Kumar S, Younes S, Zayed H, George Priya Doss C. A review of bioinformatics tools and web servers in different microarray platforms used in cancer research. Advances in Protein Chemistry and Structural Biology 2022; 131:85-164. [PMID: 35871897 DOI: 10.1016/bs.apcsb.2022.05.002] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Indexed: 01/26/2023]
Abstract
Over the past decade, conventional lab work strategies have gradually shifted from being limited to a laboratory setting towards a bioinformatics era to help manage and process the vast amounts of data generated by omics technologies. The present work outlines the latest contributions of bioinformatics in analyzing microarray data and their application to cancer. We dissect different microarray platforms and their use in gene expression in cancer models. We highlight how computational advances empowered the microarray technology in gene expression analysis. The study on protein-protein interaction databases classified into primary, derived, meta-database, and prediction databases describes the strategies to curate and predict novel interaction networks in silico. In addition, we summarize the areas of bioinformatics where neural graph networks are currently being used, such as protein functions, protein interaction prediction, and in silico drug discovery and development. We also discuss the role of deep learning as a potential tool in the prognosis, diagnosis, and treatment of cancer. Integrating these resources efficiently, practically, and ethically is likely to be the most challenging task for the healthcare industry over the next decade; however, we believe that it is achievable in the long term.
Affiliation(s)
- R Hephzibah Cathryn
  - Laboratory of Integrative Genomics, Department of Integrative Biology, School of Biosciences and Technology, Vellore Institute of Technology, Vellore, India
- S Udhaya Kumar
  - Laboratory of Integrative Genomics, Department of Integrative Biology, School of Biosciences and Technology, Vellore Institute of Technology, Vellore, India
- Salma Younes
  - Department of Biomedical Sciences, College of Health and Sciences, Qatar University, QU Health, Doha, Qatar
- Hatem Zayed
  - Department of Biomedical Sciences, College of Health and Sciences, Qatar University, QU Health, Doha, Qatar
- C George Priya Doss
  - Laboratory of Integrative Genomics, Department of Integrative Biology, School of Biosciences and Technology, Vellore Institute of Technology, Vellore, India
6
Just Add Data: automated predictive modeling for knowledge discovery and feature selection. NPJ Precis Oncol 2022; 6:38. [PMID: 35710826 PMCID: PMC9203777 DOI: 10.1038/s41698-022-00274-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/04/2020] [Accepted: 04/13/2022] [Indexed: 01/20/2023] Open
Abstract
Fully automated machine learning (AutoML) for predictive modeling is becoming a reality, giving rise to a whole new field. We present the basic ideas and principles of Just Add Data Bio (JADBio), an AutoML platform applicable to the low-sample, high-dimensional omics data that arise in translational medicine and bioinformatics applications. In addition to predictive and diagnostic models ready for clinical use, JADBio focuses on knowledge discovery by performing feature selection and identifying the corresponding biosignatures, i.e., minimal-size subsets of biomarkers that are jointly predictive of the outcome or phenotype of interest. It also returns a palette of useful information for interpretation, clinical use of the models, and decision making. JADBio is qualitatively and quantitatively compared against Hyper-Parameter Optimization Machine Learning libraries. Results show that in typical omics dataset analysis, JADBio manages to identify signatures comprising just a handful of features while maintaining competitive predictive performance and accurate out-of-sample performance estimation.
7
Karagiannaki I, Gourlia K, Lagani V, Pantazis Y, Tsamardinos I. Learning biologically-interpretable latent representations for gene expression data: Pathway Activity Score Learning Algorithm. Mach Learn 2022; 112:4257-4287. [PMID: 37900054 PMCID: PMC10600308 DOI: 10.1007/s10994-022-06158-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/09/2021] [Revised: 11/12/2021] [Accepted: 02/19/2022] [Indexed: 11/24/2022]
Abstract
Molecular gene-expression datasets consist of samples with tens of thousands of measured quantities (i.e., high dimensional data). However, lower-dimensional representations that retain the useful biological information do exist. We present a novel algorithm for such dimensionality reduction called Pathway Activity Score Learning (PASL). The major novelty of PASL is that the constructed features directly correspond to known molecular pathways (genesets in general) and can be interpreted as pathway activity scores. Hence, unlike PCA and similar methods, PASL's latent space has a fairly straightforward biological interpretation. PASL is shown to outperform the state-of-the-art method (PLIER) in predictive performance on two collections of breast cancer and leukemia gene expression datasets. PASL is also trained on a large corpus of 50000 gene expression samples to construct a universal dictionary of features across different tissues and pathologies. The dictionary is validated on 35643 held-out samples for reconstruction error. It is then applied to 165 held-out datasets spanning a diverse range of diseases. The AutoML tool JADBio is employed to show that the predictive information in the PASL-created feature space is retained after the transformation. The code is available at https://github.com/mensxmachina/PASL.
Affiliation(s)
- Ioulia Karagiannaki
  - Institute of Electronic Structure and Laser, Foundation for Research and Technology-Hellas (IESL-FORTH), Heraklion, Greece
- Vincenzo Lagani
  - Institute of Chemical Biology, Ilia State University, Tbilisi, 0162 Georgia
  - JADBio, Gnosis Data Analysis PC, Heraklion, Crete, Greece
- Yannis Pantazis
  - Institute of Applied and Computational Mathematics, Foundation for Research and Technology - Hellas, Heraklion, Greece
- Ioannis Tsamardinos
  - Department of Computer Science, University of Crete, Heraklion, Greece
  - JADBio, Gnosis Data Analysis PC, Heraklion, Crete, Greece
  - Institute of Applied and Computational Mathematics, Foundation for Research and Technology - Hellas, Heraklion, Greece
8
Tsagris M, Papadovasilakis Z, Lakiotaki K, Tsamardinos I. The γ-OMP Algorithm for Feature Selection With Application to Gene Expression Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:1214-1224. [PMID: 33035156 DOI: 10.1109/tcbb.2020.3029952] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Feature selection for predictive analytics is the problem of identifying a minimal-size subset of features that is maximally predictive of an outcome of interest. To apply to molecular data, feature selection algorithms need to be scalable to tens of thousands of features. In this paper, we propose γ-OMP, a generalisation of the highly-scalable Orthogonal Matching Pursuit feature selection algorithm. γ-OMP can handle (a) various types of outcomes, such as continuous, binary, nominal, and time-to-event, (b) discrete (categorical) features, (c) different statistical-based stopping criteria, (d) several predictive models (e.g., linear or logistic regression), (e) various types of residuals, and (f) different types of association. We compare γ-OMP against LASSO, a prototypical, widely used algorithm for high-dimensional data. On both simulated data and several real gene expression datasets, γ-OMP is on par with, or outperforms, LASSO in binary classification (case-control data), regression (quantified outcomes), and time-to-event data (censored survival times). γ-OMP is based on simple statistical ideas; it is easy to implement and to extend, and our extensive evaluation shows that it is also effective in bioinformatics analysis settings.
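For orientation, the classic Orthogonal Matching Pursuit procedure that γ-OMP generalises can be sketched as below. This is not γ-OMP itself: it assumes a continuous outcome, a linear model without intercept, and a simple residual-tolerance stopping rule in place of the statistical stopping criteria mentioned in the abstract. All function names are hypothetical.

```python
import math


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def least_squares(columns, y):
    """Fit y on the given feature columns via the normal equations."""
    gram = [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]
    rhs = [sum(a * b for a, b in zip(ci, y)) for ci in columns]
    return solve_linear_system(gram, rhs)


def omp_select(feature_columns, y, max_features, tol=1e-8):
    """Greedy OMP: repeatedly add the feature most associated with the residual.

    feature_columns: list of columns, each a list with one value per sample.
    Returns the indices of the selected features, in selection order.
    """
    selected = []
    residual = list(y)
    for _ in range(max_features):
        if sum(r * r for r in residual) <= tol:
            break  # outcome already explained by the selected features
        best, best_score = None, 0.0
        for j, col in enumerate(feature_columns):
            if j in selected:
                continue
            norm = math.sqrt(sum(v * v for v in col))
            if norm == 0.0:
                continue
            score = abs(sum(v * r for v, r in zip(col, residual))) / norm
            if score > best_score:
                best, best_score = j, score
        if best is None:
            break
        selected.append(best)
        # Refit on all selected features jointly, then update the residual.
        cols = [feature_columns[j] for j in selected]
        coefs = least_squares(cols, y)
        residual = [
            yi - sum(c * col[i] for c, col in zip(coefs, cols))
            for i, yi in enumerate(y)
        ]
    return selected
```

The joint refit after each selection is what distinguishes OMP from plain matching pursuit; the normal-equations solver is used only to keep the sketch dependency-free and assumes the selected columns are linearly independent.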
9
Xiang J, Zhang J, Zhao Y, Wu FX, Li M. Biomedical data, computational methods and tools for evaluating disease-disease associations. Brief Bioinform 2022; 23:6522999. [PMID: 35136949 DOI: 10.1093/bib/bbac006] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/19/2021] [Revised: 01/04/2022] [Accepted: 01/05/2022] [Indexed: 12/12/2022] Open
Abstract
In recent decades, exploring potential relationships between diseases has been an active research field. With the rapid accumulation of disease-related biomedical data, a lot of computational methods and tools/platforms have been developed to reveal intrinsic relationship between diseases, which can provide useful insights to the study of complex diseases, e.g. understanding molecular mechanisms of diseases and discovering new treatment of diseases. Human complex diseases involve both external phenotypic abnormalities and complex internal molecular mechanisms in organisms. Computational methods with different types of biomedical data from phenotype to genotype can evaluate disease-disease associations at different levels, providing a comprehensive perspective for understanding diseases. In this review, available biomedical data and databases for evaluating disease-disease associations are first summarized. Then, existing computational methods for disease-disease associations are reviewed and classified into five groups in terms of the usages of biomedical data, including disease semantic-based, phenotype-based, function-based, representation learning-based and text mining-based methods. Further, we summarize software tools/platforms for computation and analysis of disease-disease associations. Finally, we give a discussion and summary on the research of disease-disease associations. This review provides a systematic overview for current disease association research, which could promote the development and applications of computational methods and tools/platforms for disease-disease associations.
Affiliation(s)
- Ju Xiang
  - School of Computer Science and Engineering, Central South University, China
- Jiashuai Zhang
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, China
- Yichao Zhao
  - School of Computer Science and Engineering, Central South University, China
- Fang-Xiang Wu
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, China
- Min Li
  - Division of Biomedical Engineering and Department of Mechanical Engineering at University of Saskatchewan, Saskatoon, Canada
10
Wang LR, Wong L, Goh WWB. How doppelgänger effects in biomedical data confound machine learning. Drug Discov Today 2021; 27:678-685. [PMID: 34743902 DOI: 10.1016/j.drudis.2021.10.017] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 05/15/2021] [Revised: 09/22/2021] [Accepted: 10/22/2021] [Indexed: 12/26/2022]
Abstract
Machine learning (ML) models have been increasingly adopted in drug development for faster identification of potential targets. Cross-validation techniques are commonly used to evaluate these models. However, the reliability of such validation methods can be affected by the presence of data doppelgängers. Data doppelgängers occur when independently derived data are very similar to each other, causing models to perform well regardless of how they are trained (i.e., the doppelgänger effect). Despite the abundance of data doppelgängers in biomedical data and their inflationary effects, they remain uncharacterized. We show their prevalence in biomedical data, demonstrate how doppelgängers arise, and provide proof of their confounding effects. To mitigate the doppelgänger effect, we recommend identifying data doppelgängers before the training-validation split.
Affiliation(s)
- Li Rong Wang
  - School of Computer Science and Engineering, Nanyang Technological University, Singapore
- Limsoon Wong
  - Department of Computer Science, National University of Singapore, Singapore; Department of Pathology, National University of Singapore, Singapore
- Wilson Wen Bin Goh
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore; School of Biological Sciences, Nanyang Technological University, Singapore
11
Karaglani M, Gourlia K, Tsamardinos I, Chatzaki E. Accurate Blood-Based Diagnostic Biosignatures for Alzheimer's Disease via Automated Machine Learning. J Clin Med 2020; 9:E3016. [PMID: 32962113 PMCID: PMC7563988 DOI: 10.3390/jcm9093016] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 08/03/2020] [Revised: 09/04/2020] [Accepted: 09/14/2020] [Indexed: 12/17/2022] Open
Abstract
Alzheimer's disease (AD) is the most common form of neurodegenerative dementia and its timely diagnosis remains a major challenge in biomarker discovery. In the present study, we analyzed publicly available high-throughput low-sample -omics datasets from studies in AD blood, by the AutoML technology Just Add Data Bio (JADBIO), to construct accurate predictive models for use as diagnostic biosignatures. Considering data from AD patients and age-sex matched cognitively healthy individuals, we produced three best performing diagnostic biosignatures specific for the presence of AD: A. A 506-feature transcriptomic dataset from 48 AD and 22 controls led to a miRNA-based biosignature via Support Vector Machines with three miRNA predictors (AUC 0.975 (0.906, 1.000)), B. A 38,327-feature transcriptomic dataset from 134 AD and 100 controls led to six mRNA-based statistically equivalent signatures via Classification Random Forests with 25 mRNA predictors (AUC 0.846 (0.778, 0.905)) and C. A 9483-feature proteomic dataset from 25 AD and 37 controls led to a protein-based biosignature via Ridge Logistic Regression with seven protein predictors (AUC 0.921 (0.849, 0.972)). These performance metrics were also validated through the JADBIO pipeline confirming stability. In conclusion, using the automated machine learning tool JADBIO, we produced accurate predictive biosignatures extrapolating available low sample -omics data. These results offer options for minimally invasive blood-based diagnostic tests for AD, awaiting clinical validation based on respective laboratory assays. They also highlight the value of AutoML in biomarker discovery.
Affiliation(s)
- Makrina Karaglani
  - Laboratory of Pharmacology, Medical School, Democritus University of Thrace, 68100 Alexandroupolis, Greece
  - Gnosis Data Analysis PC, Science and Technology Park of Crete, N. Plastira 100, GR-700 13 Vassilika Vouton, Greece
- Krystallia Gourlia
  - Department of Computer Science, University of Crete, GR-700 13 Vassilika Vouton, Greece
- Ioannis Tsamardinos
  - Gnosis Data Analysis PC, Science and Technology Park of Crete, N. Plastira 100, GR-700 13 Vassilika Vouton, Greece
  - Department of Computer Science, University of Crete, GR-700 13 Vassilika Vouton, Greece
  - Institute of Applied and Computational Mathematics, Foundation for Research and Technology Hellas, GR-700 13 Vassilika Vouton, Greece
- Ekaterini Chatzaki
  - Laboratory of Pharmacology, Medical School, Democritus University of Thrace, 68100 Alexandroupolis, Greece
  - Institute of Agri-Food and Life Sciences, University Research Centre, Hellenic Mediterranean University, GR-71410 Heraklion, Greece
12
Chatzipantsiou C, Dimitriadis M, Papadakis M, Tsagris M. JMASM 52: Extremely Efficient Permutation and Bootstrap Hypothesis Tests Using R. Journal of Modern Applied Statistical Methods 2020. [DOI: 10.22237/jmasm/1604189940] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
Abstract
Re-sampling based statistical tests are known to be computationally heavy but reliable when only small sample sizes are available. Despite their nice theoretical properties, not much effort has been put into making them efficient. A computationally efficient method for calculating permutation-based p-values for the Pearson correlation coefficient and the two independent samples t-test is proposed. The method is general and can be applied to other similar two-sample mean or two mean vector cases.
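As a point of reference, a permutation-based p-value for the Pearson correlation coefficient can be computed with the straightforward loop sketched below. This is the naive baseline, not the efficient method the abstract proposes, and it is written in Python rather than R. The function names, the two-sided test, the default of 999 permutations, and the fixed seed are assumptions chosen for illustration; both inputs are assumed to have non-zero variance.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_pvalue(x, y, n_permutations=999, seed=0):
    """Two-sided permutation p-value for the Pearson correlation.

    Shuffles y to break any association with x, recomputes |r| each time,
    and counts how often the permuted statistic is at least as extreme as
    the observed one. The +1 terms count the observed data as one
    permutation, so the p-value is never exactly zero.
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    y_perm = list(y)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(y_perm)
        if abs(pearson_r(x, y_perm)) >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (n_permutations + 1)
```

Each permutation here recomputes the full correlation from scratch, which is exactly the cost that efficient implementations aim to avoid.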
13
Zhao H, Chen M, Wang J, Cao G, Chen W, Xu J. PCNA-associated factor KIAA0101 transcriptionally induced by ELK1 controls cell proliferation and apoptosis in nasopharyngeal carcinoma: an integrated bioinformatics and experimental study. Aging (Albany NY) 2020; 12:5992-6017. [PMID: 32275642 PMCID: PMC7185143 DOI: 10.18632/aging.102991] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 07/22/2019] [Accepted: 03/09/2020] [Indexed: 12/16/2022]
Abstract
KIAA0101, previously identified as PCNA-associated factor, is overexpressed in the majority of human cancers and has emerged as an important regulator of cancer progression; however, its function in human nasopharyngeal carcinoma (NPC) remains unknown. Integrated bioinformatics approaches were employed to determine KIAA0101 expression in NPC samples. Lentiviral vectors carrying KIAA0101 shRNA were constructed and stably transfected cells were validated by qRT-PCR and western blot. Cellular functions were then evaluated by MTT, colony formation, BrdU staining, and flow cytometry. Mechanistic studies were systematically performed using UCSC Genome Browser, GEO, UALCAN, QIAGEN, PROMO and JASPAR, ChIP, and the cBioPortal, among others. The results showed that KIAA0101 ranked at the top of the overexpressed gene list in the GSE6631 dataset. KIAA0101 was highly expressed in NPC tissues and cell lines. Furthermore, knockdown of KIAA0101 significantly inhibited cell proliferation and DNA replication, and promoted apoptosis and cell cycle arrest in vitro. Meanwhile, the mechanistic study revealed that MAP kinase phosphorylation-dependent activation of ELK1 may enhance expression of the neighboring genes KIAA0101 and TRIP4 by binding both promoter regions in NPC cells. Taken together, our findings indicate that overexpression of KIAA0101 activated by MAP kinase phosphorylation-dependent activation of ELK1 may play an important role in NPC progression.
Affiliation(s)
- Hu Zhao
  - Fujian Provincial Key Laboratory of Transplant Biology, Department of Urology, 900 Hospital of the Joint Logistics Team, Xiamen University, Fuzhou 350025, Fujian, P.R. China; Office of Science Education, 900 Hospital of the Joint Logistics Team, Xiamen University, Fuzhou 350025, Fujian, P.R. China
- Miaosheng Chen
  - Pathology Department, Longyan First Hospital Affiliated to Fujian Medical University, Longyan 364000, Fujian, P.R. China
- Jie Wang
  - Fujian Provincial Key Laboratory of Transplant Biology, Department of Urology, 900 Hospital of the Joint Logistics Team, Xiamen University, Fuzhou 350025, Fujian, P.R. China
- Gang Cao
  - Department of Oral and Maxillofacial Surgery, Medical School of Nanjing University, Nanjing 210002, Jiangsu, P.R. China
- Wei Chen
  - Department of Oral and Maxillofacial Surgery, Medical School of Nanjing University, Nanjing 210002, Jiangsu, P.R. China
- Jinke Xu
  - Department of Oral and Maxillofacial Surgery, Medical School of Nanjing University, Nanjing 210002, Jiangsu, P.R. China
14
Malliaraki N, Lakiotaki K, Vamvoukaki R, Notas G, Tsamardinos I, Kampa M, Castanas E. Translating vitamin D transcriptomics to clinical evidence: Analysis of data in asthma and chronic obstructive pulmonary disease, followed by clinical data meta-analysis. J Steroid Biochem Mol Biol 2020; 197:105505. [PMID: 31669573 DOI: 10.1016/j.jsbmb.2019.105505] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/01/2019] [Revised: 09/29/2019] [Accepted: 10/22/2019] [Indexed: 12/29/2022]
Abstract
Vitamin D (VitD) continues to trigger intense scientific controversy regarding both its biological targets and its supplementation doses and regimens. In an effort to resolve this dispute, we mapped VitD transcriptome-wide events in humans, using a recently developed bioinformatics methodology, in order to unveil patterns or mechanisms shared with diverse pathologies and tissue profiles and to reveal causal effects between VitD actions and specific human diseases. Using the similarities in analyzed transcriptome data (c-SKL method), we validated our methodology with osteoporosis as an example and further analyzed two other strong hits, specifically chronic obstructive pulmonary disease (COPD) and asthma. The latter revealed no impact of VitD on known molecular pathways. In accordance with this finding, a review and meta-analysis of published data, based on an objective measure (Forced Expiratory Volume in one second, FEV1%), did not reveal any significant effect of VitD on the objective amelioration of either condition. This study may, therefore, be regarded as the first one to explore, in an objective, unbiased and unsupervised manner, the impact of VitD levels and/or interventions in a number of human pathologies.
Affiliation(s)
- Niki Malliaraki
- Laboratory of Experimental Endocrinology, University of Crete, School of Medicine, Heraklion, Greece; Laboratory of Clinical Chemistry/Biochemistry, University Hospital, Heraklion, Greece
| | - Kleanthi Lakiotaki
- Department of Computer Science, University of Crete, School of Sciences, Heraklion, Greece
| | - Rodanthi Vamvoukaki
- Laboratory of Experimental Endocrinology, University of Crete, School of Medicine, Heraklion, Greece
| | - George Notas
- Laboratory of Experimental Endocrinology, University of Crete, School of Medicine, Heraklion, Greece
| | - Ioannis Tsamardinos
- Department of Computer Science, University of Crete, School of Sciences, Heraklion, Greece; Gnosis Data Analysis PC, Heraklion, Greece
| | - Marilena Kampa
- Laboratory of Experimental Endocrinology, University of Crete, School of Medicine, Heraklion, Greece
| | - Elias Castanas
- Laboratory of Experimental Endocrinology, University of Crete, School of Medicine, Heraklion, Greece
| |
|
15
|
Appice A, Tsoumakas G, Manolopoulos Y, Matwin S. Pathway Activity Score Learning for Dimensionality Reduction of Gene Expression Data. DISCOVERY SCIENCE 2020. [PMCID: PMC7556388 DOI: 10.1007/978-3-030-61527-7_17] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/01/2022]
Abstract
Molecular gene-expression datasets consist of samples with tens of thousands of measured quantities (i.e., high-dimensional data). However, there exist lower-dimensional representations that retain the useful information. We present a novel algorithm for such dimensionality reduction called Pathway Activity Score Learning (PASL). The major novelty of PASL is that the constructed features directly correspond to known molecular pathways and can be interpreted as pathway activity scores. Hence, unlike PCA and similar methods, PASL’s latent space has a relatively straightforward biological interpretation. As a use case, PASL is applied to two collections of breast cancer and leukemia gene expression datasets. We show that PASL retains the predictive information for disease classification on new, unseen datasets and outperforms PLIER, a recently proposed competing method. We also show that differential activation pathway analysis provides complementary information to standard gene set enrichment analysis. The code is available at https://github.com/mensxmachina/PASL.
|
16
|
Lakiotaki K, Georgakopoulos G, Castanas E, Røe OD, Borboudakis G, Tsamardinos I. A data driven approach reveals disease similarity on a molecular level. NPJ Syst Biol Appl 2019; 5:39. [PMID: 31666984 PMCID: PMC6814739 DOI: 10.1038/s41540-019-0117-0] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 03/18/2019] [Accepted: 09/26/2019] [Indexed: 02/06/2023] Open
Abstract
Could there be unexpected similarities between different studies, diseases, or treatments on a molecular level, due to the common biological mechanisms involved? To answer this question, we develop a method for computing similarities between empirical, statistical distributions of high-dimensional, low-sample datasets, and apply it to hundreds of -omics studies. The similarities lead to dataset-to-dataset networks visualizing the landscape of a large portion of biological data. Potentially interesting similarities connecting studies of different diseases are assembled in a disease-to-disease network. Exploring it, we discover numerous non-trivial connections between Alzheimer's disease and schizophrenia, asthma and psoriasis, or liver cancer and obesity, to name a few. We then present a method that identifies the molecular quantities and pathways that contribute the most to the identified similarities and could point to novel drug targets or provide biological insights. The proposed method acts as a "statistical telescope" providing a global view of the constellation of biological data; readers can peek through it at: http://datascope.csd.uoc.gr:25000/.
Affiliation(s)
| | | | - Elias Castanas
- Laboratory of Experimental Endocrinology, School of Medicine, University of Crete, Heraklion, Greece
| | - Oluf Dimitri Røe
- Norwegian University of Science and Technology, Department of Clinical Research and Molecular Medicine, Trondheim, Norway
- Levanger Hospital, Nord-Trøndelag Hospital Trust, Cancer Clinic, Norway
- Clinical Cancer Research Center, Department of Clinical Medicine, Aalborg, Denmark
| | | | - Ioannis Tsamardinos
- Computer Science Department, University of Crete, Heraklion, Greece
- Gnosis Data Analysis PC, Heraklion Crete, Greece
- Institute of Computational and Applied Mathematics, Foundation for Research and Technology, Heraklion, Greece
| |
|
17
|
Tsagris M, Alenazi A, Verrou KM, Pandis N. Hypothesis testing for two population means: parametric or non-parametric test? J STAT COMPUT SIM 2019. [DOI: 10.1080/00949655.2019.1677659] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 10/25/2022]
Affiliation(s)
- Michail Tsagris
- Department of Economics, University of Crete, Rethymnon, Greece
- Statistical Learning Lab, Institute of Applied and Computational Mathematics, Foundation of Research and Technology Hellas, Herakleion, Greece
| | - Abdulaziz Alenazi
- Department of Mathematics, Northern Border University, Arar, Saudi Arabia
| | | | - Nikolaos Pandis
- Department of Orthodontics and Dentofacial Orthopedics, University of Bern, Bern, Switzerland
| |
|