1
Zelli V, Manno A, Compagnoni C, Ibraheem RO, Zazzeroni F, Alesse E, Rossi F, Arbib C, Tessitore A. Classification of tumor types using XGBoost machine learning model: a vector space transformation of genomic alterations. J Transl Med 2023; 21:836. [PMID: 37990214 PMCID: PMC10664515 DOI: 10.1186/s12967-023-04720-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2023] [Accepted: 11/10/2023] [Indexed: 11/23/2023] Open
Abstract
BACKGROUND Machine learning (ML) represents a powerful tool to capture relationships between molecular alterations and cancer types and to extract biological information. Here, we developed a plain ML model aimed at distinguishing cancer types based on genetic lesions, providing an additional tool to improve cancer diagnosis, particularly for tumors of unknown origin. METHODS TCGA data from 9,927 samples spanning 32 different cancer types were downloaded from cBioportal. A vector space model type data transformation technique was designed to build consistently homogeneous new datasets containing, as predictive features, calls for somatic point mutations and copy number variations at chromosome arm-level, thus allowing the use of XGBoost classifier models. Considering the imbalance in the dataset, due to large differences in the number of cases for each tumor, two preprocessing strategies were considered: i) setting a percentage cut-off threshold to remove less represented cancer types, ii) dividing cancer types into different groups based on biological criteria and training a specific XGBoost model for each of them. The performance of all trained models was mainly assessed by the out-of-sample balanced accuracy (BACC) and the AUC scores. RESULTS The XGBoost classifier achieved the best performance (BACC 77%; AUC 97%) on a dataset containing the 10 most represented tumor types. Moreover, when the 18 most represented cancers were divided into three different groups (endocrine-related carcinomas, other carcinomas and other cancers), the group-specific models achieved 78%, 71% and 86% BACC, respectively, with AUC scores greater than 96%. In addition, the model capable of linking each group to a specific cancer type reached 81% BACC and 94% AUC. Overall, the diagnostic potential of our model was comparable to or higher than that of others already described in the literature and based on similar molecular data and ML approaches.
CONCLUSIONS A boosted ML approach able to accurately discriminate different cancer types was developed. The methodology builds datasets simpler and more interpretable than the original data, while keeping enough information to accurately train standard ML models without resorting to sophisticated Deep Learning architectures. In combination with histopathological examinations, this approach could improve cancer diagnosis by using specific DNA alterations, processed by a replicable and easy-to-use automated technology. The study encourages new investigations which could further increase the classifier's performance, for example by considering more features and dividing tumors into their main molecular subtypes.
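As a rough illustration of two ideas named in this abstract, the sketch below turns per-sample alteration calls into fixed-length 0/1 vectors and computes balanced accuracy as the mean of per-class recall. It is only one plausible reading of a "vector space" encoding, not the authors' actual transformation; the function names, feature names, and toy labels are all hypothetical.

```python
from collections import defaultdict


def to_feature_vectors(samples, vocabulary=None):
    """Map each sample's set of alteration calls to a fixed-length 0/1 vector."""
    if vocabulary is None:
        # Assumption: the feature space is the sorted union of all observed calls.
        vocabulary = sorted({call for calls in samples for call in calls})
    index = {call: i for i, call in enumerate(vocabulary)}
    vectors = []
    for calls in samples:
        row = [0] * len(vocabulary)
        for call in calls:
            if call in index:  # calls outside the vocabulary are ignored
                row[index[call]] = 1
        vectors.append(row)
    return vocabulary, vectors


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare classes weigh as much as common ones."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical toy data: feature names and tumor labels are illustrative only.
samples = [{"TP53_mut", "8q_gain"}, {"KRAS_mut"}, {"TP53_mut", "KRAS_mut"}]
vocabulary, vectors = to_feature_vectors(samples)

# A classifier that always predicts the majority class: plain accuracy is 0.75,
# but balanced accuracy averages recall of 1.0 (BRCA) and 0.0 (LUAD).
y_true = ["BRCA", "BRCA", "BRCA", "LUAD"]
y_pred = ["BRCA", "BRCA", "BRCA", "BRCA"]
bacc = balanced_accuracy(y_true, y_pred)
```

The toy labels show why a balanced metric suits an imbalanced dataset: a model that ignores the rare class still scores well on plain accuracy but not on balanced accuracy.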
Affiliation(s)
- Veronica Zelli
  - Department of Biotechnological and Applied Clinical Sciences, University of L'Aquila, 67100, L'Aquila, Italy
  - Center for Molecular Diagnostics and Advanced Therapies, University of L'Aquila, Via Petrini, 67100, L'Aquila, Italy
- Andrea Manno
  - Department of Information Engineering, Computer Science and Mathematics, Center of Excellence DEWS, University of L'Aquila, 67100, L'Aquila, Italy
- Chiara Compagnoni
  - Department of Biotechnological and Applied Clinical Sciences, University of L'Aquila, 67100, L'Aquila, Italy
- Rasheed Oyewole Ibraheem
  - Department of Information Engineering, Computer Science and Mathematics, Center of Excellence DEWS, University of L'Aquila, 67100, L'Aquila, Italy
- Francesca Zazzeroni
  - Department of Biotechnological and Applied Clinical Sciences, University of L'Aquila, 67100, L'Aquila, Italy
- Edoardo Alesse
  - Department of Biotechnological and Applied Clinical Sciences, University of L'Aquila, 67100, L'Aquila, Italy
- Fabrizio Rossi
  - Department of Information Engineering, Computer Science and Mathematics, Center of Excellence DEWS, University of L'Aquila, 67100, L'Aquila, Italy
- Claudio Arbib
  - Department of Information Engineering, Computer Science and Mathematics, Center of Excellence DEWS, University of L'Aquila, 67100, L'Aquila, Italy
- Alessandra Tessitore
  - Department of Biotechnological and Applied Clinical Sciences, University of L'Aquila, 67100, L'Aquila, Italy
  - Center for Molecular Diagnostics and Advanced Therapies, University of L'Aquila, Via Petrini, 67100, L'Aquila, Italy
2
Gonzales EL, Jeon SJ, Han KM, Yang SJ, Kim Y, Remonde CG, Ahn TJ, Ham BJ, Shin CY. Correlation between immune-related genes and depression-like features in an animal model and in humans. Brain Behav Immun 2023; 113:29-43. [PMID: 37379963 DOI: 10.1016/j.bbi.2023.06.017] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 12/08/2022] [Revised: 06/01/2023] [Accepted: 06/22/2023] [Indexed: 06/30/2023] Open
Abstract
A growing body of evidence suggests that immune-related genes play pivotal roles in the pathophysiology of depression. In the present study, we investigated a plausible connection between gene expression, DNA methylation, and brain structural changes in the pathophysiology of depression using a combined approach of murine and human studies. We ranked the immobility behaviors of 30 outbred Crl:CD1 (ICR) mice in the forced swim test (FST) and harvested their prefrontal cortices for RNA sequencing. Of the 24,532 analyzed genes, 141 showed significant correlations with FST immobility time, as determined through linear regression analysis with p ≤ 0.01. The identified genes were mostly involved in immune responses, especially interferon signaling pathways. Moreover, induction of virus-like neuroinflammation in the brains of two separate mouse cohorts (n = 30 each) using intracerebroventricular polyinosinic:polycytidylic acid injection resulted in increased immobility during FST and similar expression of top immobility-correlated genes. In human blood samples, candidate gene (top 5%) expression profiling using DNA methylation analysis found the interferon-related USP18 (cg25484698, p = 7.04 × 10⁻¹¹, Δβ = 1.57 × 10⁻²; cg02518889, p = 2.92 × 10⁻³, Δβ = −8.20 × 10⁻³) and IFI44 (cg07107453, p = 3.76 × 10⁻³, Δβ = −4.94 × 10⁻³) genes to be differentially methylated between patients with major depressive disorder (n = 350) and healthy controls (n = 161). Furthermore, cortical thickness analyses using T1-weighted images revealed that the DNA methylation scores for USP18 were negatively correlated with the thicknesses of several cortical regions, including the prefrontal cortex. Our results reveal the important role of the interferon pathway in depression and suggest USP18 as a potential candidate target.
The results of the correlation analysis between transcriptomic data and animal behavior carried out in this study provide insights that could enhance our understanding of depression in humans.
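The per-gene screening step described in this abstract (relating each gene's expression to FST immobility time by linear regression) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' pipeline: the direction of the regression, the function names, and the toy numbers are hypothetical, and the conversion of the t statistic to a p-value (which needs a Student t distribution with n − 2 degrees of freedom) is deliberately left out.

```python
import math


def linear_fit(x, y):
    """Ordinary least squares fit of y on x; returns (slope, intercept, r)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r


def t_statistic(r, n):
    """t statistic for the null hypothesis of zero slope (n - 2 degrees of freedom).

    Undefined for a perfect correlation (|r| == 1), where the denominator is zero.
    """
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical toy values for one gene across five animals.
immobility_time = [10, 20, 30, 40, 50]    # illustrative units
expression = [1.0, 2.1, 2.9, 4.2, 4.8]    # illustrative normalized expression

slope, intercept, r = linear_fit(immobility_time, expression)
t_value = t_statistic(r, len(immobility_time))
```

In a genome-wide screen, this fit would be repeated for every gene and the resulting p-values compared against the chosen threshold (p ≤ 0.01 in the abstract).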
Affiliation(s)
- Edson Luck Gonzales
  - School of Medicine and Center for Neuroscience Research, Konkuk University, Seoul 05029, Republic of Korea
- Se Jin Jeon
  - School of Medicine and Center for Neuroscience Research, Konkuk University, Seoul 05029, Republic of Korea
  - Department of Integrative Biotechnology, College of Science and Technology, Sahmyook University, Seoul 01795, Republic of Korea
- Kyu-Man Han
  - Department of Psychiatry, Korea University Anam Hospital, Korea University College of Medicine, Seoul 02841, Republic of Korea
- Seung Jin Yang
  - Department of Life Science, Handong Global University, Pohang 37554, Republic of Korea
- Yujeong Kim
  - School of Medicine and Center for Neuroscience Research, Konkuk University, Seoul 05029, Republic of Korea
- Chilly Gay Remonde
  - School of Medicine and Center for Neuroscience Research, Konkuk University, Seoul 05029, Republic of Korea
- Tae Jin Ahn
  - Department of Life Science, Handong Global University, Pohang 37554, Republic of Korea
- Byung-Joo Ham
  - Department of Psychiatry, Korea University Anam Hospital, Korea University College of Medicine, Seoul 02841, Republic of Korea
- Chan Young Shin
  - School of Medicine and Center for Neuroscience Research, Konkuk University, Seoul 05029, Republic of Korea
3
Chicco D, Cumbo F, Angione C. Ten quick tips for avoiding pitfalls in multi-omics data integration analyses. PLoS Comput Biol 2023; 19:e1011224. [PMID: 37410704 DOI: 10.1371/journal.pcbi.1011224] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 07/08/2023] Open
Abstract
Data are the most important elements of bioinformatics: Computational analysis of bioinformatics data, in fact, can help researchers infer new knowledge about biology, chemistry, biophysics, and sometimes even medicine, influencing treatments and therapies for patients. Bioinformatics and high-throughput biological data coming from different sources can even be more helpful, because each of these different data chunks can provide alternative, complementary information about a specific biological phenomenon, similar to multiple photos of the same subject taken from different angles. In this context, the integration of bioinformatics and high-throughput biological data gets a pivotal role in running a successful bioinformatics study. In the last decades, data originating from proteomics, metabolomics, metagenomics, phenomics, transcriptomics, and epigenomics have been labelled -omics data, as a unique name to refer to them, and the integration of these omics data has gained importance in all biological areas. Even if this omics data integration is useful and relevant, due to its heterogeneity, it is not uncommon to make mistakes during the integration phases. We therefore decided to present these ten quick tips to perform an omics data integration correctly, avoiding common mistakes we experienced or noticed in published studies in the past. Even if we designed our ten guidelines for beginners, by using a simple language that (we hope) can be understood by anyone, we believe our ten recommendations should be taken into account by all the bioinformaticians performing omics data integration, including experts.
Affiliation(s)
- Davide Chicco
  - Institute of Health Policy Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
- Fabio Cumbo
  - Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, Ohio, United States of America
- Claudio Angione
  - School of Computing Engineering and Digital Technologies, Teesside University, Middlesbrough, United Kingdom
4
Effects and Mechanism of Particulate Matter on Tendon Healing Based on Integrated Analysis of DNA Methylation and RNA Sequencing Data in a Rat Model. Int J Mol Sci 2022; 23:ijms23158170. [PMID: 35897746 PMCID: PMC9332732 DOI: 10.3390/ijms23158170] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2022] [Revised: 07/22/2022] [Accepted: 07/22/2022] [Indexed: 02/01/2023] Open
Abstract
Exposure to particulate matter (PM) has been linked with the severity of various diseases. To date, there is no study on the relationship between PM exposure and tendon healing. Open Achilles tenotomy of 20 rats was performed. The animals were divided into two groups according to exposure to PM: a PM group and a non-PM group. After 6 weeks of PM exposure, the harvest and investigations of lungs, blood samples, and Achilles tendons were performed. Compared to the non-PM group, the white blood cell count and tumor necrosis factor-alpha expression in the PM group were significantly higher. The Achilles tendons in PM group showed significantly increased inflammatory outcomes. A TEM analysis showed reduced collagen fibrils in the PM group. A biomechanical analysis demonstrated that the load to failure value was lower in the PM group. An upregulation of the gene encoding cyclic AMP response element-binding protein (CREB) was detected in the PM group by an integrated analysis of DNA methylation and RNA sequencing data, as confirmed via a Western blot analysis showing significantly elevated levels of phosphorylated CREB. In summary, PM exposure caused a deleterious effect on tendon healing. The molecular data indicate that the action mechanism of PM may be associated with upregulated CREB signaling.
5
Crawford J, Christensen BC, Chikina M, Greene CS. Widespread redundancy in -omics profiles of cancer mutation states. Genome Biol 2022; 23:137. [PMID: 35761387 PMCID: PMC9238138 DOI: 10.1186/s13059-022-02705-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 11/15/2021] [Accepted: 06/14/2022] [Indexed: 02/04/2023] Open
Abstract
BACKGROUND In studies of cellular function in cancer, researchers are increasingly able to choose from many -omics assays as functional readouts. Choosing the correct readout for a given study can be difficult, and which layer of cellular function is most suitable to capture the relevant signal remains unclear. RESULTS We consider prediction of cancer mutation status (presence or absence) from functional -omics data as a representative problem that presents an opportunity to quantify and compare the ability of different -omics readouts to capture signals of dysregulation in cancer. From the TCGA Pan-Cancer Atlas that contains genetic alteration data, we focus on RNA sequencing, DNA methylation arrays, reverse phase protein arrays (RPPA), microRNA, and somatic mutational signatures as -omics readouts. Across a collection of genes recurrently mutated in cancer, RNA sequencing tends to be the most effective predictor of mutation state. We find that one or more other data types for many of the genes are approximately equally effective predictors. Performance is more variable between mutations than that between data types for the same mutation, and there is little difference between the top data types. We also find that combining data types into a single multi-omics model provides little or no improvement in predictive ability over the best individual data type. CONCLUSIONS Based on our results, for the design of studies focused on the functional outcomes of cancer mutations, there are often multiple -omics types that can serve as effective readouts, although gene expression seems to be a reasonable default option.
Affiliation(s)
- Jake Crawford
  - Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Brock C Christensen
  - Department of Epidemiology, Geisel School of Medicine, Dartmouth College, Lebanon, NH, USA
  - Department of Molecular and Systems Biology, Geisel School of Medicine, Dartmouth College, Lebanon, NH, USA
- Maria Chikina
  - Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Casey S Greene
  - Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, CO, USA
  - Center for Health AI, University of Colorado School of Medicine, Aurora, CO, USA
6
Ofusa K, Chijimatsu R, Ishii H. Techniques to detect epitranscriptomic marks. Am J Physiol Cell Physiol 2022; 322:C787-C793. [PMID: 35294846 DOI: 10.1152/ajpcell.00460.2021] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/22/2022]
Abstract
Similar to epigenetic DNA modification, RNA can be methylated and altered for stability and processing. RNA modifications, i.e., epitranscriptomes, involve three functions: writing, erasing, and reading of marks. Methods for measurement and position detection are useful for the assessment of cellular function and human disease biomarkers. Since the first detection of the pyrimidine 5-methylcytosine a hundred years ago, numerous techniques have been developed to study the modifications of nucleotides, including those in RNA. Recent studies have focused on high-throughput and direct measurements to investigate the precise function of epitranscriptomes, including the characterization of SARS-CoV-2. The current work presents an overview of the development of detection techniques for epitranscriptomic marks and updates recent progress in the related field.
Affiliation(s)
- Ken Ofusa
  - Prophoenix Division, Food and Life-Science Laboratory, Idea Consultants, Inc., Osaka-city, Osaka, Japan
  - Center of Medical Innovation and Translational Research, Osaka University Graduate School of Medicine, Suita, Osaka, Japan
- Ryota Chijimatsu
  - Center of Medical Innovation and Translational Research, Osaka University Graduate School of Medicine, Suita, Osaka, Japan
- Hideshi Ishii
  - Center of Medical Innovation and Translational Research, Osaka University Graduate School of Medicine, Suita, Osaka, Japan
7
Pidò S, Crovari P, Garzotto F. Modelling the bioinformatics tertiary analysis research process. BMC Bioinformatics 2021; 22:452. [PMID: 34592928 PMCID: PMC8482564 DOI: 10.1186/s12859-021-04310-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2021] [Accepted: 07/29/2021] [Indexed: 11/13/2022] Open
Abstract
Background With the advancements of Next Generation Techniques, a tremendous amount of genomic information has been made available to be analyzed by means of computational methods. Bioinformatics Tertiary Analysis is a complex multidisciplinary process that represents the final step of the whole bioinformatics analysis pipeline. Despite the popularity of the subject, the Bioinformatics Tertiary Analysis process has not yet been specified in a systematic way. The lack of a reference model results in a plethora of technological tools that are designed mostly around the data and not around the human process involved in Tertiary Analysis, making such systems difficult to use and to integrate. Methods To address this problem, we propose a conceptual model that captures the salient characteristics of the research methods and human tasks involved in Bioinformatics Tertiary Analysis. The model is grounded on a user study that involved bioinformatics specialists for the elicitation of a hierarchical task tree representing the Tertiary Analysis process. The outcome was refined and validated using the results of a vast survey of the literature reporting examples of Bioinformatics Tertiary Analysis activities. Results The final hierarchical task tree was then converted into an ontological representation using an ontology standard formalism. The results of our research provide a reference process model for Tertiary Analysis that can be used to analyze and compare existing tools, or to design new tools. Conclusions To highlight the potential of our approach and to exemplify its concrete applications, we describe a new bioinformatics tool and how the proposed process model informed its design. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04310-5.
Affiliation(s)
- Sara Pidò
  - Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy
- Pietro Crovari
  - Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy
- Franca Garzotto
  - Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy
8
Liñares-Blanco J, Pazos A, Fernandez-Lozano C. Machine learning analysis of TCGA cancer data. PeerJ Comput Sci 2021; 7:e584. [PMID: 34322589 PMCID: PMC8293929 DOI: 10.7717/peerj-cs.584] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 03/03/2021] [Accepted: 05/17/2021] [Indexed: 06/13/2023]
Abstract
In recent years, machine learning (ML) researchers have changed their focus towards biological problems that are difficult to analyse with standard approaches. Large initiatives such as The Cancer Genome Atlas (TCGA) have allowed the use of omic data for the training of these algorithms. In order to study the state of the art, this review is provided to cover the main works that have used ML with TCGA data. Firstly, the principal discoveries made by the TCGA consortium are presented. Once these bases have been established, we begin with the main objective of this study, the identification and discussion of those works that have used the TCGA data for the training of different ML approaches. After a review of more than 100 different papers, it has been possible to make a classification according to the following three pillars: the type of tumour, the type of algorithm and the predicted biological problem. One of the conclusions drawn in this work shows a high density of studies based on two major algorithms: Random Forest and Support Vector Machines. We also observe the rise in the use of deep artificial neural networks. The increase in integrative models of multi-omic data analysis is also worth emphasizing. The different biological conditions are a consequence of molecular homeostasis, driven by protein coding regions, regulatory elements and the surrounding environment. It is notable that a large number of works make use of genetic expression data, which has been found to be the preferred method by researchers when training the different models. The biological problems addressed have been classified into five types: prognosis prediction, tumour subtypes, microsatellite instability (MSI), immunological aspects and certain pathways of interest. A clear trend was detected in the prediction of these conditions according to the type of tumour.
That is the reason for which a greater number of works have focused on the BRCA cohort, while specific works for survival, for example, were centred on the GBM cohort, due to its large number of events. Throughout this review, it will be possible to go in depth into the works and the methodologies used to study TCGA cancer data. Finally, it is intended that this work will serve as a basis for future research in this field of study.
Affiliation(s)
- Jose Liñares-Blanco
  - CITIC-Research Center of Information and Communication Technologies, University of A Coruna, A Coruña, Spain
  - Department of Computer Science and Information Technologies, Faculty of Computer Science, University of A Coruna, A Coruña, Spain
- Alejandro Pazos
  - CITIC-Research Center of Information and Communication Technologies, University of A Coruna, A Coruña, Spain
  - Department of Computer Science and Information Technologies, Faculty of Computer Science, University of A Coruna, A Coruña, Spain
  - Grupo de Redes de Neuronas Artificiales y Sistemas Adaptativos. Imagen Médica y Diagnóstico Radiológico (RNASA-IMEDIR). Complexo Hospitalario Universitario de A Coruña (CHUAC), SERGAS, Universidade da Coruña, Instituto de Investigación Biomédica de A Coruña (INIBIC), A Coruña, Spain
- Carlos Fernandez-Lozano
  - CITIC-Research Center of Information and Communication Technologies, University of A Coruna, A Coruña, Spain
  - Department of Computer Science and Information Technologies, Faculty of Computer Science, University of A Coruna, A Coruña, Spain
  - Grupo de Redes de Neuronas Artificiales y Sistemas Adaptativos. Imagen Médica y Diagnóstico Radiológico (RNASA-IMEDIR). Complexo Hospitalario Universitario de A Coruña (CHUAC), SERGAS, Universidade da Coruña, Instituto de Investigación Biomédica de A Coruña (INIBIC), A Coruña, Spain
9
Haghshenas S, Bhai P, Aref-Eshghi E, Sadikovic B. Diagnostic Utility of Genome-Wide DNA Methylation Analysis in Mendelian Neurodevelopmental Disorders. Int J Mol Sci 2020; 21:ijms21239303. [PMID: 33291301 PMCID: PMC7730976 DOI: 10.3390/ijms21239303] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 11/22/2020] [Revised: 12/03/2020] [Accepted: 12/04/2020] [Indexed: 12/14/2022] Open
Abstract
Mendelian neurodevelopmental disorders customarily present with complex and overlapping symptoms, complicating the clinical diagnosis. Individuals with a growing number of the so-called rare disorders exhibit unique, disorder-specific DNA methylation patterns, consequent to the underlying gene defects. Besides providing insights to the pathophysiology and molecular biology of these disorders, we can use these epigenetic patterns as functional biomarkers for the screening and diagnosis of these conditions. This review summarizes our current understanding of DNA methylation episignatures in rare disorders and describes the underlying technology and analytical approaches. We discuss the computational parameters, including statistical and machine learning methods, used for the screening and classification of genetic variants of uncertain clinical significance. Describing the rationale and principles applied to the specific computational models that are used to develop and adapt the DNA methylation episignatures for the diagnosis of rare disorders, we highlight the opportunities and challenges in this emerging branch of diagnostic medicine.
Affiliation(s)
- Sadegheh Haghshenas
  - Department of Pathology and Laboratory Medicine, Western University, London, ON N6A 3K7, Canada
  - Molecular Genetics Laboratory, Molecular Diagnostics Division, London Health Sciences Centre, London, ON N6A 5W9, Canada
- Pratibha Bhai
  - Molecular Genetics Laboratory, Molecular Diagnostics Division, London Health Sciences Centre, London, ON N6A 5W9, Canada
- Erfan Aref-Eshghi
  - Division of Genomic Diagnostics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Bekim Sadikovic
  - Department of Pathology and Laboratory Medicine, Western University, London, ON N6A 3K7, Canada
  - Molecular Genetics Laboratory, Molecular Diagnostics Division, London Health Sciences Centre, London, ON N6A 5W9, Canada
  - Schulich School of Medicine and Dentistry, Western University, London, ON N6A 5C1, Canada
10
A pattern recognition model to distinguish cancerous DNA sequences via signal processing methods. Soft comput 2020. [DOI: 10.1007/s00500-020-04942-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
11
Autoencoded DNA methylation data to predict breast cancer recurrence: Machine learning models and gene-weight significance. Artif Intell Med 2020; 110:101976. [PMID: 33250148 DOI: 10.1016/j.artmed.2020.101976] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 09/09/2019] [Revised: 08/05/2020] [Accepted: 10/18/2020] [Indexed: 12/29/2022]
Abstract
Breast cancer is the most frequent cancer in women and the second most frequent overall after lung cancer. Although the 5-year survival rate of breast cancer is relatively high, recurrence is also common which often involves metastasis with its consequent threat for patients. DNA methylation-derived databases have become an interesting primary source for supervised knowledge extraction regarding breast cancer. Unfortunately, the study of DNA methylation involves the processing of hundreds of thousands of features for every patient. DNA methylation is featured by High Dimension Low Sample Size which has shown well-known issues regarding feature selection and generation. Autoencoders (AEs) appear as a specific technique for conducting nonlinear feature fusion. Our main objective in this work is to design a procedure to summarize DNA methylation by taking advantage of AEs. Our proposal is able to generate new features from the values of CpG sites of patients with and without recurrence. Then, a limited set of relevant genes to characterize breast cancer recurrence is proposed by the application of survival analysis and a pondered ranking of genes according to the distribution of their CpG sites. To test our proposal we have selected a dataset from The Cancer Genome Atlas data portal and an AE with a single-hidden layer. The literature and enrichment analysis (based on genomic context and functional annotation) conducted regarding the genes obtained with our experiment confirmed that all of these genes were related to breast cancer recurrence.
12
Stuckel AJ, Zhang W, Zhang X, Zeng S, Dougherty U, Mustafi R, Zhang Q, Perreand E, Khare T, Joshi T, West-Szymanski DC, Bissonnette M, Khare S. Enhanced CXCR4 Expression Associates with Increased Gene Body 5-Hydroxymethylcytosine Modification but not Decreased Promoter Methylation in Colorectal Cancer. Cancers (Basel) 2020; 12:cancers12030539. [PMID: 32110952 PMCID: PMC7139960 DOI: 10.3390/cancers12030539] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 01/07/2020] [Revised: 02/19/2020] [Accepted: 02/24/2020] [Indexed: 12/19/2022] Open
Abstract
In colorectal cancer (CRC), upregulation of the C-X-C motif chemokine receptor 4 (CXCR4) is correlated with metastasis and poor prognosis, highlighting the need to further elucidate CXCR4’s regulation in CRC. For the first time, DNA methylation and 5-hydroxymethylcytosine aberrations were investigated to better understand the epigenetic regulation of CXCR4 in CRC. CXCR4 expression levels were measured using qPCR and immunoblotting in normal colon tissues, primary colon cancer tissues and CRC cell lines. Publicly available RNA-seq and methylation data from The Cancer Genome Atlas (TCGA) were extracted from tumors from CRC patients. The DNA methylation status spanning CXCR4 gene was evaluated using combined bisulfite restriction analysis (COBRA). The methylation status in the CXCR4 gene body was analyzed using previously performed nano-hmC-seal data from colon cancers and adjacent normal colonic mucosa. CXCR4 expression levels were significantly increased in tumor stromal cells and in tumor colonocytes, compared to matched cell types from adjacent normal-appearing mucosa. CXCR4 promoter methylation was detected in a minority of colorectal tumors in the TCGA. The CpG island of the CXCR4 promoter showed increased methylation in three of four CRC cell lines. CXCR4 protein expression differences were also notable between microsatellite stable (MSS) and microsatellite instable (MSI) tumor cell lines. While differential methylation was not detected in CXCR4, enrichment of 5-hydroxymethylcytosine (5hmC) in CXCR4 gene bodies in CRC was observed compared to adjacent mucosa.
Affiliation(s)
- Alexei J. Stuckel
  - Department of Medicine, Division of Gastroenterology and Hepatology, University of Missouri, Columbia, MO 65212, USA
- Wei Zhang
  - Department of Preventive Medicine and The Robert H. Lurie Comprehensive Cancer Center, Northwestern University Feinberg School of Medicine, Chicago, IL 60611, USA
- Xu Zhang
  - Department of Medicine, University of Illinois, Chicago, IL 60607, USA
- Shuai Zeng
  - Bond Life Sciences Center, University of Missouri, Columbia, MO 65201, USA
  - Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65201, USA
- Urszula Dougherty
  - Department of Medicine, Section of Gastroenterology, Hepatology and Nutrition, The University of Chicago, Chicago, IL 60637, USA
- Reba Mustafi
  - Department of Medicine, Section of Gastroenterology, Hepatology and Nutrition, The University of Chicago, Chicago, IL 60637, USA
- Qiong Zhang
  - Department of Medicine, Division of Gastroenterology and Hepatology, University of Missouri, Columbia, MO 65212, USA
- Elsa Perreand
  - Department of Medicine, Division of Gastroenterology and Hepatology, University of Missouri, Columbia, MO 65212, USA
- Tripti Khare
  - Department of Medicine, Division of Gastroenterology and Hepatology, University of Missouri, Columbia, MO 65212, USA
- Trupti Joshi
  - Bond Life Sciences Center, University of Missouri, Columbia, MO 65201, USA
  - Institute for Data Science and Informatics, University of Missouri, Columbia, MO 65211, USA
  - Department of Health Management and Informatics, School of Medicine, University of Missouri, Columbia, MO 65212, USA
- Diana C. West-Szymanski
  - Department of Medicine, Section of Gastroenterology, Hepatology and Nutrition, The University of Chicago, Chicago, IL 60637, USA
- Marc Bissonnette
  - Department of Medicine, Section of Gastroenterology, Hepatology and Nutrition, The University of Chicago, Chicago, IL 60637, USA
- Sharad Khare
  - Department of Medicine, Division of Gastroenterology and Hepatology, University of Missouri, Columbia, MO 65212, USA
  - Harry S. Truman Memorial Veterans’ Hospital, Columbia, MO 65201, USA
|
13
|
Scala G, Federico A, Fortino V, Greco D, Majello B. Knowledge Generation with Rule Induction in Cancer Omics. Int J Mol Sci 2019; 21:E18. [PMID: 31861438 PMCID: PMC6981587 DOI: 10.3390/ijms21010018] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 10/17/2019] [Revised: 11/26/2019] [Accepted: 12/13/2019] [Indexed: 12/21/2022]
Abstract
The explosion of omics data availability in cancer research has boosted knowledge of the molecular basis of cancer, although strategies for its definitive resolution are still not well established. The complexity of cancer biology, arising from the high heterogeneity of cancer cells, leads to the development of pharmacoresistance in many patients, hampering the efficacy of therapeutic approaches. Machine learning techniques have been implemented to extract knowledge from cancer omics data in order to address fundamental issues in cancer research, such as the classification of clinically relevant sub-groups of patients and the identification of biomarkers for disease risk and prognosis. Rule induction algorithms are a group of pattern discovery approaches that represent discovered relationships in the form of human-readable associative rules. The application of such techniques to the modern plethora of collected cancer omics data can effectively boost our understanding of cancer-related mechanisms. In fact, the capability of these methods to extract a huge amount of human-readable knowledge will eventually help to uncover unknown relationships between molecular attributes and the malignant phenotype. In this review, we describe applications and strategies for the use of rule induction approaches in cancer omics data analysis. In particular, we explore the canonical applications and the future challenges and opportunities posed by multi-omics integration problems.
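To make the idea of human-readable associative rules concrete, the following is a minimal sketch of support/confidence rule mining on toy data. It is not taken from the review: the sample attributes (`TP53_mut`, `MYC_amp`), the labels, the thresholds, and the function names are all hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical toy data: each sample is a set of binary molecular attributes
# plus one phenotype label. None of these values come from the review.
samples = [
    {"TP53_mut", "MYC_amp", "malignant"},
    {"TP53_mut", "malignant"},
    {"TP53_mut", "MYC_amp", "malignant"},
    {"MYC_amp", "benign"},
    {"benign"},
    {"TP53_mut", "benign"},
]


def support(itemset, data):
    """Fraction of samples that contain every item in `itemset`."""
    return sum(itemset <= s for s in data) / len(data)


def induce_rules(data, target, labels, min_support=0.3,
                 min_confidence=0.7, max_len=2):
    """Return rules 'antecedent -> target' as (antecedent, target, support, confidence)."""
    # Candidate antecedent items are all attributes that are not labels.
    items = sorted(set().union(*data) - set(labels))
    rules = []
    for k in range(1, max_len + 1):
        for antecedent in combinations(items, k):
            a = frozenset(antecedent)
            s_antecedent = support(a, data)
            s_rule = support(a | {target}, data)
            if s_antecedent == 0 or s_rule < min_support:
                continue
            confidence = s_rule / s_antecedent
            if confidence >= min_confidence:
                rules.append((antecedent, target, s_rule, confidence))
    return rules


rules = induce_rules(samples, "malignant", labels={"malignant", "benign"})
for antecedent, target, s, c in rules:
    print(f"IF {' AND '.join(antecedent)} THEN {target} "
          f"(support={s:.2f}, confidence={c:.2f})")
```

Each printed line reads as an IF/THEN statement, which is the property that makes rule-based output easy to inspect. Practical rule induction tools use far more efficient candidate generation than this exhaustive enumeration.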
Affiliation(s)
- Giovanni Scala
  - Department of Biology, University of Naples Federico II, 80126 Naples, Italy
- Antonio Federico
  - Faculty of Medicine and Health Technology, Tampere University, 33014 Tampere, Finland
- Vittorio Fortino
  - Institute of Biomedicine, University of Eastern Finland, 70210 Kuopio, Finland
- Dario Greco
  - Faculty of Medicine and Health Technology, Tampere University, 33014 Tampere, Finland
  - Institute of Biotechnology, University of Helsinki, 00014 Helsinki, Finland
- Barbara Majello
  - Department of Biology, University of Naples Federico II, 80126 Naples, Italy
|
14
|
Abstract
Background DNA methylation is an epigenetic event that may regulate gene expression. Because of this regulatory role, aberrant DNA methylation is often associated with disease. Within-sample DNA co-methylation is the similarity of methylation at nearby cytosine sites of a chromosome. Co-methylation patterns are important but not yet well studied, and it is unclear what co-methylation patterns normal DNA samples have. Do the co-methylation patterns of the same tissue differ across samples? Do the co-methylation patterns of different tissues from the same sample differ? To answer these questions, we conduct analyses using two sets of data: 3-sample-1-tissue (3S1T) and 1-sample-8-tissue (1S8T). Results To study the co-methylation patterns of the two datasets, we investigate the following questions: how often does one methylation state change to another, and how is this change associated with chromosomal distance? Based on the 3S1T data, we find no significant co-methylation difference among the spleen tissues of three different samples. However, the 1S8T data show significant differences among the eight tissues of one sample. For both 3S1T and 1S8T data, we find that the no/low methylation state A and the high/full methylation state D tend to remain the same along a chromosome region. We also find that the low/partial methylation state B and the partial/high methylation state C tend to change to higher methylation states along a chromosome. Finally, we find that most co-methylation regions are very short, spanning only a few hundred base pairs; only a small proportion of methylated regions are longer than 1000 base pairs. Conclusions In this paper, we have addressed several questions regarding within-sample co-methylation patterns in normal tissues. Our statistical analysis results may help researchers to better understand the biological process of DNA methylation and may pave the way to better analysis methods for future methylation research. Electronic supplementary material The online version of this article (10.1186/s13040-019-0198-8) contains supplementary material, which is available to authorized users.
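The question of how often one methylation state changes to another between nearby sites can be illustrated with a small transition-frequency sketch. This is an illustration only, not the authors' analysis: the abstract names states A to D but gives no cut-offs, so the quartile-style thresholds, the `max_gap` distance limit, the function names, and the toy site list below are all assumptions.

```python
from collections import Counter

# Assumed cut-offs for binning a methylation level in [0, 1] into states A-D.
# The abstract names the four states but does not give their thresholds.
ASSUMED_BOUNDS = [(0.25, "A"), (0.50, "B"), (0.75, "C"), (1.00, "D")]


def to_state(level):
    """Map a methylation level to the first state whose upper bound covers it."""
    for upper, state in ASSUMED_BOUNDS:
        if level <= upper:
            return state
    raise ValueError(f"methylation level out of range: {level}")


def transition_probabilities(sites, max_gap=1000):
    """Estimate P(next state | current state) for adjacent sites.

    `sites` is a list of (position_bp, methylation_level) tuples sorted by
    position on one chromosome. Pairs further apart than `max_gap` base
    pairs are skipped (hypothetical distance limit).
    """
    counts = Counter()
    for (pos1, lvl1), (pos2, lvl2) in zip(sites, sites[1:]):
        if pos2 - pos1 <= max_gap:
            counts[(to_state(lvl1), to_state(lvl2))] += 1
    totals = Counter()
    for (src, _), n in counts.items():
        totals[src] += n
    return {pair: n / totals[pair[0]] for pair, n in counts.items()}


# Hypothetical toy input, not real data.
sites = [(100, 0.05), (180, 0.10), (260, 0.02), (400, 0.40),
         (470, 0.80), (5000, 0.95), (5100, 0.90)]
probs = transition_probabilities(sites)
for (src, dst), p in sorted(probs.items()):
    print(f"{src} -> {dst}: {p:.2f}")
```

Repeating the estimate for several values of `max_gap` would give a rough view of how state persistence depends on chromosomal distance, which is the relationship the abstract describes.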
|
15
|
Abstract
Multiclass classification in cancer diagnostics using DNA or gene expression signatures, as well as classification of bacterial species fingerprints in MALDI-TOF mass spectrometry data, is challenging because of imbalanced data and the high number of dimensions relative to the number of instances. In this study, a new oversampling technique called LICIC is presented as a valuable instrument for countering both class imbalance and the well-known “curse of dimensionality” problem. The method preserves non-linearities within the dataset while creating new instances without adding noise. The method is compared with other oversampling methods, such as Random Oversampling, SMOTE, Borderline-SMOTE, and ADASYN. F1 scores show the validity of this new technique when used with imbalanced, multiclass, and high-dimensional datasets.
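As a point of reference for the comparison above, here is a minimal sketch of Random Oversampling, the simplest of the baseline methods the abstract lists. It is not LICIC, whose algorithm the abstract does not describe; the function name and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for features, label in zip(X, y):
        by_class.setdefault(label, []).append(features)
    target = max(len(rows) for rows in by_class.values())

    X_out, y_out = [], []
    for label, rows in by_class.items():
        # Sample with replacement to fill the gap up to the target size.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for features in rows + extra:
            X_out.append(features)
            y_out.append(label)
    return X_out, y_out


# Hypothetical imbalanced toy data: four "a" samples, one "b", one "c".
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.4, 0.2], [0.9, 0.8], [0.5, 0.9]]
y = ["a", "a", "a", "a", "b", "c"]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Because this baseline only duplicates existing rows, it adds no new information about the minority classes. That limitation is what synthetic-instance methods such as SMOTE, and by the abstract's account LICIC, aim to address.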
|