1
Alkhanbouli R, Al-Aamri A, Maalouf M, Taha K, Henschel A, Homouz D. Analysis of Cancer-Associated Mutations of POLB Using Machine Learning and Bioinformatics. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024; 21:1436-1444. [PMID: 38691429 DOI: 10.1109/tcbb.2024.3395777] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/03/2024]
Abstract
DNA damage is a critical factor in the onset and progression of cancer. When DNA is damaged, the number of genetic mutations increases, making it necessary to activate DNA repair mechanisms. A crucial factor in the base excision repair process, which helps maintain the stability of the genome, is an enzyme called DNA polymerase β (Pol β) encoded by the POLB gene. It plays a vital role in the repair of damaged DNA. Additionally, variations known as Single Nucleotide Polymorphisms (SNPs) in the POLB gene can potentially affect the ability to repair DNA. This study uses bioinformatics tools that extract important features from SNPs to construct a feature matrix, which is then used in combination with machine learning algorithms to predict the likelihood of developing cancer associated with a specific mutation. Eight different machine learning algorithms were used to investigate the relationship between POLB gene variations and their potential role in cancer onset. This study not only highlights the complex link between POLB gene SNPs and cancer, but also underscores the effectiveness of machine learning approaches in genomic studies, paving the way for advanced predictive models in genetic and cancer research.
2
Brigante G, Lazzaretti C, Paradiso E, Nuzzo F, Sitti M, Tüttelmann F, Moretti G, Silvestri R, Gemignani F, Försti A, Hemminki K, Elisei R, Romei C, Zizzi EA, Deriu MA, Simoni M, Landi S, Casarini L. Genetic signature of differentiated thyroid carcinoma susceptibility: a machine learning approach. Eur Thyroid J 2022; 11:e220058. [PMID: 35976137 PMCID: PMC9513665 DOI: 10.1530/etj-22-0058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2022] [Accepted: 08/17/2022] [Indexed: 11/30/2022]
Abstract
To identify a peculiar genetic combination predisposing to differentiated thyroid carcinoma (DTC), we selected a set of single nucleotide polymorphisms (SNPs) associated with DTC risk, considering polygenic risk score (PRS), Bayesian statistics and a machine learning (ML) classifier to describe cases and controls in three different datasets. Dataset 1 (649 DTC, 431 controls) has been previously genotyped in a genome-wide association study (GWAS) on Italian DTC. Dataset 2 (234 DTC, 101 controls) and dataset 3 (404 DTC, 392 controls) were genotyped. Associations of 171 SNPs reported to predispose to DTC in candidate studies were extracted from the GWAS of dataset 1, followed by replication of SNPs associated with DTC risk (P < 0.05) in dataset 2. The reliability of the identified SNPs was confirmed by PRS and Bayesian statistics after merging the three datasets. SNPs were used to describe the case/control state of individuals by an ML classifier. Starting from 171 SNPs associated with DTC, 15 were positive in both datasets 1 and 2. Using these markers, PRS revealed that individuals in the fifth quintile had a seven-fold higher risk of DTC than those in the first. Bayesian inference confirmed that the selected 15 SNPs differentiate cases from controls. Results were corroborated by ML, finding a maximum AUC of about 0.7. A restricted selection of only 15 DTC-associated SNPs is able to describe the inner genetic structure of Italian individuals, and ML allows a fair prediction of case or control status based solely on the individual genetic background.
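The abstract above compares DTC risk across PRS quintiles. The following is a minimal sketch of how such a comparison is commonly set up, not the authors' pipeline: it assumes a PRS defined as a weighted sum of risk-allele dosages, quintile cut points estimated from control scores, and an odds ratio between two quintiles. All function names, weights, scores and counts are hypothetical and chosen only for illustration.

```python
import math
from statistics import quantiles


def polygenic_risk_score(dosages, log_odds_weights):
    """Sum of risk-allele dosages (0, 1 or 2), each weighted by a per-allele log odds ratio."""
    if len(dosages) != len(log_odds_weights):
        raise ValueError("one weight is needed per SNP")
    return sum(d * w for d, w in zip(dosages, log_odds_weights))


def assign_quintile(score, cut_points):
    """Map a score to quintile 1..5 given four ascending cut points."""
    return 1 + sum(score > c for c in cut_points)


def odds_ratio(cases_a, controls_a, cases_b, controls_b):
    """Odds of being a case in group A relative to group B."""
    return (cases_a / controls_a) / (cases_b / controls_b)


# Hypothetical weights for three SNPs (not taken from the study).
weights = [math.log(1.3), math.log(1.1), math.log(1.5)]
score = polygenic_risk_score([2, 0, 1], weights)

# Cut points estimated from made-up control scores (an assumption of this sketch).
control_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
cut_points = quantiles(control_scores, n=5)
print(assign_quintile(score, cut_points))

# Toy case/control counts for the top and bottom quintiles, for illustration only.
print(odds_ratio(70, 30, 25, 75))
```

Estimating the cut points from controls is one common convention; pooling cases and controls is another, and the abstract does not say which was used.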
Affiliation(s)
- Giulia Brigante
- Unit of Endocrinology, Department of Biomedical, Metabolic and Neural Sciences, University of Modena and Reggio Emilia, Modena, Italy
- Unit of Endocrinology, Department of Medical Specialties, Azienda Ospedaliero-Universitaria, Modena, Italy
- Clara Lazzaretti
- Unit of Endocrinology, Department of Biomedical, Metabolic and Neural Sciences, University of Modena and Reggio Emilia, Modena, Italy
- Elia Paradiso
- Unit of Endocrinology, Department of Biomedical, Metabolic and Neural Sciences, University of Modena and Reggio Emilia, Modena, Italy
- Federico Nuzzo
- Unit of Endocrinology, Department of Biomedical, Metabolic and Neural Sciences, University of Modena and Reggio Emilia, Modena, Italy
- Martina Sitti
- Unit of Endocrinology, Department of Biomedical, Metabolic and Neural Sciences, University of Modena and Reggio Emilia, Modena, Italy
- Frank Tüttelmann
- Institute of Reproductive Genetics, University of Münster, Münster, Germany
- Asta Försti
- Hopp Children’s Cancer Center (KiTZ), Heidelberg, Germany
- Division of Pediatric Neurooncology, German Cancer Research Center (DKFZ), German Cancer Consortium (DKTK), Heidelberg, Germany
- Kari Hemminki
- Biomedical Center, Faculty of Medicine and Biomedical Center in Pilsen, Charles University in Prague, Pilsen, Czech Republic
- Division of Cancer Epidemiology, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Rossella Elisei
- Department of Endocrinology, University Hospital, Pisa, Italy
- Cristina Romei
- Department of Endocrinology, University Hospital, Pisa, Italy
- Eric Adriano Zizzi
- Polito Med Lab, Department of Mechanical and Aerospace Engineering, Politecnico di Torino, Italy
- Marco Agostino Deriu
- Polito Med Lab, Department of Mechanical and Aerospace Engineering, Politecnico di Torino, Italy
- Manuela Simoni
- Unit of Endocrinology, Department of Biomedical, Metabolic and Neural Sciences, University of Modena and Reggio Emilia, Modena, Italy
- Unit of Endocrinology, Department of Medical Specialties, Azienda Ospedaliero-Universitaria, Modena, Italy
- Center for Genomic Research, University of Modena and Reggio Emilia, Modena, Italy
- Stefano Landi
- Department of Biology, University of Pisa, Pisa, Italy
- Livio Casarini
- Unit of Endocrinology, Department of Biomedical, Metabolic and Neural Sciences, University of Modena and Reggio Emilia, Modena, Italy
- Center for Genomic Research, University of Modena and Reggio Emilia, Modena, Italy
3
Demetci P, Cheng W, Darnell G, Zhou X, Ramachandran S, Crawford L. Multi-scale inference of genetic trait architecture using biologically annotated neural networks. PLoS Genet 2021; 17:e1009754. [PMID: 34411094 PMCID: PMC8407593 DOI: 10.1371/journal.pgen.1009754] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 11/23/2020] [Revised: 08/31/2021] [Accepted: 07/31/2021] [Indexed: 01/01/2023]
Abstract
In this article, we present Biologically Annotated Neural Networks (BANNs), a nonlinear probabilistic framework for association mapping in genome-wide association (GWA) studies. BANNs are feedforward models with partially connected architectures that are based on biological annotations. This setup yields a fully interpretable neural network where the input layer encodes SNP-level effects, and the hidden layer models the aggregated effects among SNP-sets. We treat the weights and connections of the network as random variables with prior distributions that reflect how genetic effects manifest at different genomic scales. The BANNs software uses variational inference to provide posterior summaries which allow researchers to simultaneously perform (i) mapping with SNPs and (ii) enrichment analyses with SNP-sets on complex traits. Through simulations, we show that our method improves upon state-of-the-art association mapping and enrichment approaches across a wide range of genetic architectures. We then further illustrate the benefits of BANNs by analyzing real GWA data assayed in approximately 2,000 heterogeneous stock mice from the Wellcome Trust Centre for Human Genetics and approximately 7,000 individuals from the Framingham Heart Study. Lastly, using a random subset of individuals of European ancestry from the UK Biobank, we show that BANNs is able to replicate known associations in high and low-density lipoprotein cholesterol content.
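The partially connected architecture described above can be illustrated with a small forward pass in which an annotation mask decides which SNPs feed which SNP-set. This is a deterministic sketch only: it uses fixed point weights and a ReLU activation chosen for illustration, and it leaves out the priors and variational inference that the abstract describes. The function names, mask and numbers are hypothetical and are not taken from the BANNs software.

```python
def relu(z):
    """Rectified linear unit, used here purely as an example activation."""
    return max(0.0, z)


def annotated_forward(genotypes, snp_weights, annotation_mask, set_weights):
    """Forward pass of a partially connected two-layer network.

    genotypes:        list of J allele counts for one individual
    snp_weights:      J x G matrix of input-layer weights
    annotation_mask:  J x G matrix, 1 if SNP j is annotated to SNP-set g, else 0
    set_weights:      list of G output-layer weights, one per SNP-set
    """
    hidden = []
    for g in range(len(set_weights)):
        # Only SNPs annotated to set g contribute; masked weights are ignored.
        total = sum(
            x * w_row[g] * m_row[g]
            for x, w_row, m_row in zip(genotypes, snp_weights, annotation_mask)
        )
        hidden.append(relu(total))
    return sum(h * w for h, w in zip(hidden, set_weights))


# Three SNPs, two SNP-sets: SNPs 0 and 1 belong to set 0, SNP 2 to set 1.
mask = [[1, 0], [1, 0], [0, 1]]
weights = [[0.5, 9.0], [0.25, 9.0], [9.0, 1.0]]  # the 9.0 entries are masked out
prediction = annotated_forward([1, 2, 0], weights, mask, [2.0, 3.0])
print(prediction)
```

Because each hidden unit sees only the SNPs annotated to its set, the input weights can be read as SNP-level effects and the hidden units as SNP-set effects, which is the interpretability the abstract points to.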
Affiliation(s)
- Pinar Demetci
- Department of Computer Science, Brown University, Providence, Rhode Island, United States of America
- Center for Computational Molecular Biology, Brown University, Providence, Rhode Island, United States of America
- Wei Cheng
- Center for Computational Molecular Biology, Brown University, Providence, Rhode Island, United States of America
- Department of Ecology and Evolutionary Biology, Brown University, Providence, Rhode Island, United States of America
- Gregory Darnell
- Center for Computational Molecular Biology, Brown University, Providence, Rhode Island, United States of America
- Xiang Zhou
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, United States of America
- Center for Statistical Genetics, University of Michigan, Ann Arbor, Michigan, United States of America
- Sohini Ramachandran
- Department of Computer Science, Brown University, Providence, Rhode Island, United States of America
- Center for Computational Molecular Biology, Brown University, Providence, Rhode Island, United States of America
- Department of Ecology and Evolutionary Biology, Brown University, Providence, Rhode Island, United States of America
- Lorin Crawford
- Center for Computational Molecular Biology, Brown University, Providence, Rhode Island, United States of America
- Microsoft Research New England, Cambridge, Massachusetts, United States of America
- Department of Biostatistics, Brown University, Providence, Rhode Island, United States of America
4
Lin PC, Chen HO, Lee CJ, Yeh YM, Shen MR, Chiang JH. Comprehensive assessments of germline deletion structural variants reveal the association between prognostic MUC4 and CEP72 deletions and immune response gene expression in colorectal cancer patients. Hum Genomics 2021; 15:3. [PMID: 33431054 PMCID: PMC7802320 DOI: 10.1186/s40246-020-00302-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/17/2020] [Accepted: 12/22/2020] [Indexed: 12/30/2022]
Abstract
Background: Functional disruptions by large germline genomic structural variants in susceptible genes are known risks for cancer. We used deletion structural variants (DSVs) generated from germline whole-genome sequencing (WGS) and DSV immune-related association tumor microenvironment (TME) to predict cancer risk and prognosis. Methods: We investigated the contribution of germline DSVs to cancer susceptibility and prognosis using in silico and causal inference models. DSVs in germline WGS data were generated from the blood samples of 192 cancer and 499 non-cancer subjects. Clinical information, including family cancer history (FCH), was obtained from the National Cheng Kung University Hospital and Taiwan Biobank. Ninety-nine colorectal cancer (CRC) patients had immune response gene expression data. We used joint calling tools and an attention-weighted model to build the cancer risk predictive model and identify DSVs in familial cancer. The survival support vector machine (survival-SVM) was used to select prognostic DSVs. Results: We identified 671 DSVs that could predict cancer risk. The area under the curve (AUC) of the receiver operating characteristic curve (ROC) of the attention-weighted model was 0.71. The 3 most frequent DSV genes observed in cancer patients were identified as ADCY9, AURKAPS1, and RAB3GAP2 (p < 0.05). The DSVs in SGSM2 and LHFPL3 were relevant to colorectal cancer. We found a higher incidence of FCH in cancer patients than in non-cancer subjects (p < 0.05). SMYD3 and NKD2 DSV genes were associated with cancer patients with FCH (p < 0.05). We identified 65 immune-associated DSV markers for assessing cancer prognosis (p < 0.05). The functional protein of MUC4 DSV gene interacted with MAGE1 expression, according to the STRING database. The causal inference model showed that deleting the CEP72 DSV gene affects the recurrence-free survival (RFS) of IFIT1 expression.
Conclusions: We established an explainable attention-weighted model for cancer risk prediction and used the survival-SVM for prognostic stratification by using germline DSVs and immune gene expression datasets. Comprehensive assessments of germline DSVs can predict the cancer risk and clinical outcome of colon cancer patients. Supplementary Information: The online version contains supplementary material available at 10.1186/s40246-020-00302-3.
Affiliation(s)
- Peng-Chan Lin
- Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science, National Cheng Kung University, Tainan, Taiwan
- Institute of Medical Informatics, National Cheng Kung University, Tainan, Taiwan
- Department of Oncology, National Cheng Kung University Hospital, College of Medicine, National Cheng Kung University, Tainan, Taiwan
- Department of Internal Medicine, National Cheng Kung University Hospital, College of Medicine, National Cheng Kung University, Tainan, Taiwan
- Hui-O Chen
- Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science, National Cheng Kung University, Tainan, Taiwan
- Chih-Jung Lee
- Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science, National Cheng Kung University, Tainan, Taiwan
- Yu-Min Yeh
- Department of Oncology, National Cheng Kung University Hospital, College of Medicine, National Cheng Kung University, Tainan, Taiwan
- Department of Internal Medicine, National Cheng Kung University Hospital, College of Medicine, National Cheng Kung University, Tainan, Taiwan
- Meng-Ru Shen
- Graduate Institute of Clinical Medicine, College of Medicine, National Cheng Kung University, Tainan, Taiwan
- Department of Obstetrics and Gynecology, National Cheng Kung University Hospital, College of Medicine, National Cheng Kung University, Tainan, Taiwan
- Department of Pharmacology, National Cheng Kung University Hospital, College of Medicine, National Cheng Kung University, Tainan, Taiwan
- Jung-Hsien Chiang
- Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science, National Cheng Kung University, Tainan, Taiwan
- Institute of Medical Informatics, National Cheng Kung University, Tainan, Taiwan
5
Li X, Li S, Wang Y, Zhang S, Wong KC. Identification of pan-cancer Ras pathway activation with deep learning. Brief Bioinform 2020; 22:5943785. [PMID: 33126245 DOI: 10.1093/bib/bbaa258] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 06/23/2020] [Revised: 08/27/2020] [Accepted: 09/11/2020] [Indexed: 01/06/2023]
Abstract
The identification of hidden responders is often an essential challenge in precision oncology. A recent attempt based on machine learning has been proposed for classifying aberrant pathway activity from multiomic cancer data. However, we note several critical limitations there, such as high-dimensionality, data sparsity and model performance. Given the central importance and broad impact of precision oncology, we propose nature-inspired deep Ras activation pan-cancer (NatDRAP), a deep neural network (DNN) model, to address those restrictions for the identification of hidden responders. In this study, we develop the nature-inspired deep learning model that integrates bulk RNA sequencing, copy number and mutation data from PanCanAtlas to detect pan-cancer Ras pathway activation. In NatDRAP, we propose to synergize the nature-inspired artificial bee colony algorithm with different gradient-based optimizers in one framework for optimizing DNNs in a collaborative manner. Multiple experiments were conducted on 33 different cancer types across PanCanAtlas. The experimental results demonstrate that the proposed NatDRAP can provide superior performance over other benchmark methods with strong robustness towards diagnosing RAS aberrant pathway activity across different cancer types. In addition, gene ontology enrichment and pathological analysis are conducted to reveal novel insights into the RAS aberrant pathway activity identification and characterization. NatDRAP is written in Python and available at https://github.com/lixt314/NatDRAP1.
Affiliation(s)
- Xiangtao Li
- School of Artificial Intelligence, Jilin University
- Shaochuan Li
- School of Computer Science, Northeast Normal University
- Yunhe Wang
- School of Computer Science, Northeast Normal University
- Shixiong Zhang
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR
- Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR
6
Bertsimas D, Wiberg H. Machine Learning in Oncology: Methods, Applications, and Challenges. JCO Clin Cancer Inform 2020; 4:885-894. [PMID: 33058693 PMCID: PMC7608565 DOI: 10.1200/cci.20.00072] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Accepted: 08/26/2020] [Indexed: 01/16/2023]
Affiliation(s)
- Dimitris Bertsimas
- Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA
- Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA
- Holly Wiberg
- Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA
7
Kim BH, Yu K, Lee PCW. Cancer classification of single-cell gene expression data by neural network. Bioinformatics 2020; 36:1360-1366. [PMID: 31603465 DOI: 10.1093/bioinformatics/btz772] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 04/28/2019] [Revised: 08/13/2019] [Accepted: 10/08/2019] [Indexed: 01/16/2023]
Abstract
MOTIVATION: Cancer classification based on gene expression profiles has provided insight on the causes of cancer and cancer treatment. Recently, machine learning-based approaches have been attempted in downstream cancer analysis to address the large differences in gene expression values, as determined by single-cell RNA sequencing (scRNA-seq). RESULTS: We designed cancer classifiers that can identify 21 types of cancers and normal tissues based on bulk RNA-seq as well as scRNA-seq data. Training was performed with 7398 cancer samples and 640 normal samples from 21 tumors and normal tissues in TCGA based on the 300 most significant genes expressed in each cancer. Then, we compared neural network (NN), support vector machine (SVM), k-nearest neighbors (kNN) and random forest (RF) methods. The NN performed consistently better than other methods. We further applied our approach to scRNA-seq transformed by kNN smoothing and found that our model successfully classified cancer types and normal samples. AVAILABILITY AND IMPLEMENTATION: Cancer classification by neural network. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Bong-Hyun Kim
- Department of Biomedical Sciences, University of Ulsan College of Medicine, ASAN Medical Center, Seoul 05505, Korea
- Advanced Bio Computing Center, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA
- Kijin Yu
- Department of Biomedical Sciences, University of Ulsan College of Medicine, ASAN Medical Center, Seoul 05505, Korea
- Peter C W Lee
- Department of Biomedical Sciences, University of Ulsan College of Medicine, ASAN Medical Center, Seoul 05505, Korea
8
Sanchez-Ibarra HE, Jiang X, Gallegos-Gonzalez EY, Cavazos-González AC, Chen Y, Morcos F, Barrera-Saldaña HA. KRAS, NRAS, and BRAF mutation prevalence, clinicopathological association, and their application in a predictive model in Mexican patients with metastatic colorectal cancer: A retrospective cohort study. PLoS One 2020; 15:e0235490. [PMID: 32628708 PMCID: PMC7337295 DOI: 10.1371/journal.pone.0235490] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 02/06/2020] [Accepted: 06/16/2020] [Indexed: 01/10/2023]
Abstract
Mutations in KRAS, NRAS, and BRAF (RAS/BRAF) genes are the main predictive biomarkers for the response to anti-EGFR monoclonal antibodies (MAbs) targeted therapy in metastatic colorectal cancer (mCRC). This retrospective study aimed to report the mutational status prevalence of these genes, explore their possible associations with clinicopathological features, and build and validate a predictive model. To achieve these objectives, 500 mCRC Mexican patients were screened for clinically relevant mutations in RAS/BRAF genes. Fifty-two percent of these specimens harbored clinically relevant mutations in at least one screened gene. Among these, 86% had a mutation in KRAS, 7% in NRAS, 6% in BRAF, and 2% in both NRAS and BRAF. Only tumor location in the proximal colon exhibited a significant correlation with KRAS and BRAF mutational status (p-value = 0.0414 and 0.0065, respectively). Further t-SNE analyses were performed on 191 specimens to reveal patterns among patients with clinical parameters and KRAS mutational status. Then, directed by the results from classical statistical tests and t-SNE analysis, neural network models utilized entity embeddings to learn patterns and build predictive models using a minimal number of trainable parameters. This study could be the first step in the prediction of RAS/BRAF mutational status from tumoral features and could lead the way to a more detailed and more diverse dataset that could benefit from machine learning methods.
Affiliation(s)
- Xianli Jiang
- Evolutionary Information Laboratory, Department of Biological Sciences, the University of Texas at Dallas, Richardson, Texas, United States of America
- Yenho Chen
- Evolutionary Information Laboratory, Department of Biological Sciences, the University of Texas at Dallas, Richardson, Texas, United States of America
- Faruck Morcos
- Evolutionary Information Laboratory, Department of Biological Sciences, the University of Texas at Dallas, Richardson, Texas, United States of America
9
Classification of Kidney Cancer Data Using Cost-Sensitive Hybrid Deep Learning Approach. Symmetry (Basel) 2020. [DOI: 10.3390/sym12010154] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 12/18/2022]
Abstract
Recently, large-scale bioinformatics and genomic data have been generated using advanced biotechnology methods, thus increasing the importance of analyzing such data. Numerous data mining methods have been developed to process genomic data in the field of bioinformatics. We extracted significant genes for the prognosis prediction of 1157 patients using gene expression data from patients with kidney cancer. We then proposed an end-to-end, cost-sensitive hybrid deep learning (COST-HDL) approach with a cost-sensitive loss function for classification tasks on imbalanced kidney cancer data. Here, we combined a deep symmetric autoencoder (the decoder mirrors the encoder in layer structure), trained with reconstruction loss for non-linear feature extraction, with a neural network trained with balanced classification loss for prognosis prediction, in order to address data imbalance problems. Combined clinical data from patients with kidney cancer and gene data were used to determine the optimal classification model and estimate classification accuracy by sample type, primary diagnosis, tumor stage, and vital status as risk factors representing the state of patients. Experimental results showed that the COST-HDL approach was more efficient with gene expression data for kidney cancer prognosis than other conventional machine learning and data mining techniques. These results could be applied to extract features from gene biomarkers for prognosis prediction of kidney cancer and prevention and early diagnosis.
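The abstract above does not give the exact form of its cost-sensitive loss. As a hedged illustration of the general idea, the sketch below uses inverse-frequency class weights, a common "balanced" heuristic, inside a weighted cross-entropy, so that errors on the rare class cost more. The function names and toy labels are hypothetical, and this should not be read as the COST-HDL loss itself.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def weighted_cross_entropy(probs, labels, class_weights):
    """Weighted mean of -log p(true class), normalised by the sum of the weights."""
    total, weight_sum = 0.0, 0.0
    for p, y in zip(probs, labels):
        w = class_weights[y]
        total += -w * math.log(p[y])
        weight_sum += w
    return total / weight_sum


# Toy imbalanced labels: three samples of class 0, one of class 1.
labels = [0, 0, 0, 1]
class_weights = balanced_class_weights(labels)

# Predicted class probabilities for each sample (made up for illustration).
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
print(weighted_cross_entropy(probs, labels, class_weights))
```

With these weights the single minority sample carries as much total weight as the three majority samples together, which is the effect a balanced classification loss is meant to have on imbalanced data.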
10
Bi Q, Goodman KE, Kaminsky J, Lessler J. What is Machine Learning? A Primer for the Epidemiologist. Am J Epidemiol 2019; 188:2222-2239. [PMID: 31509183 DOI: 10.1093/aje/kwz189] [Citation(s) in RCA: 94] [Impact Index Per Article: 18.8] [Received: 11/16/2018] [Revised: 07/29/2019] [Accepted: 08/14/2019] [Indexed: 12/22/2022]
Abstract
Machine learning is a branch of computer science that has the potential to transform epidemiologic sciences. Amid a growing focus on "Big Data," it offers epidemiologists new tools to tackle problems for which classical methods are not well-suited. In order to critically evaluate the value of integrating machine learning algorithms and existing methods, however, it is essential to address language and technical barriers between the two fields that can make it difficult for epidemiologists to read and assess machine learning studies. Here, we provide an overview of the concepts and terminology used in machine learning literature, which encompasses a diverse set of tools with goals ranging from prediction to classification to clustering. We provide a brief introduction to 5 common machine learning algorithms and 4 ensemble-based approaches. We then summarize epidemiologic applications of machine learning techniques in the published literature. We recommend approaches to incorporate machine learning in epidemiologic research and discuss opportunities and challenges for integrating machine learning and existing epidemiologic research methods.
Affiliation(s)
- Qifang Bi
- Department of Epidemiology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland
- Katherine E Goodman
- Department of Epidemiology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland
- Joshua Kaminsky
- Department of Epidemiology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland
- Justin Lessler
- Department of Epidemiology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland
11
The impact of artificial intelligence on the current and future practice of clinical cancer genomics. Genet Res (Camb) 2019; 101:e9. [PMID: 31668155 PMCID: PMC7044964 DOI: 10.1017/s0016672319000089] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 01/22/2023]
Abstract
Artificial intelligence (AI) is one of the most significant fields of development in the current digital age. Rapid advancements have raised speculation as to its potential benefits in a wide range of fields, with healthcare often at the forefront. However, amidst this optimism, apprehension and opposition continue to strongly persist. Oft-cited concerns include the threat of unemployment, harm to the doctor–patient relationship and questions of safety and accuracy. In this article, we review both the current and future medical applications of AI within the sub-speciality of cancer genomics.
12
Long GS, Hussen M, Dench J, Aris-Brosou S. Identifying genetic determinants of complex phenotypes from whole genome sequence data. BMC Genomics 2019; 20:470. [PMID: 31182025 PMCID: PMC6558885 DOI: 10.1186/s12864-019-5820-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 11/19/2018] [Accepted: 05/21/2019] [Indexed: 02/08/2023]
Abstract
BACKGROUND: A critical goal in biology is to relate the phenotype to the genotype, that is, to find the genetic determinants of various traits. However, while simple monofactorial determinants are relatively easy to identify, the underpinnings of complex phenotypes are harder to predict. While traditional approaches rely on genome-wide association studies based on Single Nucleotide Polymorphism data, the ability of machine learning algorithms to find these determinants in whole proteome data is still not well known. RESULTS: To better understand the applicability of machine learning in this case, we implemented two such algorithms, adaptive boosting (AB) and repeated random forest (RRF), and developed a chunking layer that facilitates the analysis of whole proteome data. We first assessed the performance of these algorithms and tuned them on an influenza data set, for which the determinants of three complex phenotypes (infectivity, transmissibility, and pathogenicity) are known based on experimental evidence. This allowed us to show that chunking improves runtimes by an order of magnitude. Based on simulations, we showed that chunking also increases sensitivity of the predictions, reaching 100% with as few as 20 sequences in a small proteome as in the influenza case (5k sites), but may require at least 30 sequences to reach 90% on larger alignments (500k sites). While RRF has less specificity than random forest, it was never <50%, and RRF sensitivity was significantly higher at smaller chunk sizes. We then used these algorithms to predict the determinants of three types of drug resistance (to Ciprofloxacin, Ceftazidime, and Gentamicin) in a bacterium, Pseudomonas aeruginosa. While both algorithms performed well in the case of the influenza data, results were more nuanced in the bacterial case, with RRF making more sensible predictions, with smaller error rates, than AB.
CONCLUSIONS: Altogether, we demonstrated that ML algorithms can be used to identify genetic determinants in small proteomes (viruses), even when trained on small numbers of individuals. We further showed that our RRF algorithm may deserve more scrutiny, which should be facilitated by the decreasing costs of both sequencing and phenotyping of large cohorts of individuals.
Affiliation(s)
- George S Long
- Department of Biology, University of Ottawa, Ottawa, Ontario, Canada
- Mohammed Hussen
- Department of Biology, University of Ottawa, Ottawa, Ontario, Canada
- Jonathan Dench
- Department of Biology, University of Ottawa, Ottawa, Ontario, Canada
- Stéphane Aris-Brosou
- Department of Biology, University of Ottawa, Ottawa, Ontario, Canada
- Department of Mathematics and Statistics, University of Ottawa, Ottawa, Ontario, Canada
13
McIntosh AM, Sullivan PF, Lewis CM. Uncovering the Genetic Architecture of Major Depression. Neuron 2019; 102:91-103. [PMID: 30946830 PMCID: PMC6482287 DOI: 10.1016/j.neuron.2019.03.022] [Citation(s) in RCA: 95] [Impact Index Per Article: 19.0] [Received: 01/14/2019] [Revised: 03/05/2019] [Accepted: 03/14/2019] [Indexed: 12/21/2022]
Abstract
There have been several recent studies addressing the genetic architecture of depression. This review takes stock of what is now known about the genetics of depression, how it has increased our knowledge and understanding of its mechanisms, and how the information and knowledge can be leveraged to improve the care of people affected. We identify four priorities for how the field of major depression (MD) genetics research may move forward in future years, namely by increasing the sample sizes available for genome-wide association studies (GWASs), greater inclusion of diverse ancestries and low-income countries, the closer integration of psychiatric genetics with electronic medical records, and the development of the neuroscience toolkit for polygenic disorders.
Affiliation(s)
- Andrew M McIntosh
  - Division of Psychiatry, Centre for Clinical Brain Sciences, University of Edinburgh, Edinburgh, UK; Centre for Cognitive Ageing and Cognitive Epidemiology, University of Edinburgh, Edinburgh, UK
- Patrick F Sullivan
  - Departments of Genetics and Psychiatry, University of North Carolina, Chapel Hill, NC, USA; Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
- Cathryn M Lewis
  - Social, Genetic and Developmental Psychiatry Centre, King's College London, London, UK; Department of Medical and Molecular Genetics, King's College London, London, UK
|
14
|
Mochida K, Koda S, Inoue K, Hirayama T, Tanaka S, Nishii R, Melgani F. Computer vision-based phenotyping for improvement of plant productivity: a machine learning perspective. Gigascience 2019; 8:5232233. [PMID: 30520975 PMCID: PMC6312910 DOI: 10.1093/gigascience/giy153] [Citation(s) in RCA: 45] [Impact Index Per Article: 9.0] [Received: 06/12/2018] [Revised: 09/06/2018] [Accepted: 11/24/2018] [Indexed: 11/29/2022] Open
Abstract
Employing computer vision to extract useful information from images and videos is becoming a key technique for identifying phenotypic changes in plants. Here, we review the emerging aspects of computer vision for automated plant phenotyping. Recent advances in image analysis empowered by machine learning-based techniques, including convolutional neural network-based modeling, have expanded their application to assist high-throughput plant phenotyping. Combinatorial use of multiple sensors to acquire various spectra has allowed us to noninvasively obtain a series of datasets, including those related to the development and physiological responses of plants throughout their life. Automated phenotyping platforms accelerate the elucidation of gene functions associated with traits in model plants under controlled conditions. Remote sensing techniques with image collection platforms, such as unmanned vehicles and tractors, are also emerging for large-scale field phenotyping for crop breeding and precision agriculture. Computer vision-based phenotyping will play significant roles in both the nowcasting and forecasting of plant traits through modeling of genotype/phenotype relationships.
Affiliation(s)
- Keiichi Mochida
  - Bioproductivity Informatics Research Team, RIKEN Center for Sustainable Resource Science, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan
  - Microalgae Production Control Technology Laboratory, RIKEN Baton Zone Program, RIKEN Cluster for Science, Technology and Innovation Hub, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan
  - Institute of Plant Science and Resources, Okayama University, 2-20-1 Chuo, Kurashiki, Okayama 710-0046, Japan
  - Kihara Institute for Biological Research, Yokohama City University, 641-12 Maioka-cho, Totsuka-ku, Yokohama, Kanagawa 244-0813, Japan
  - Graduate School of Nanobioscience, Yokohama City University, 22-2 Seto, Kanazawa-ku, Yokohama, Kanagawa 236-0027, Japan
- Satoru Koda
  - Graduate School of Mathematics, Kyushu University, 744 Motooka, Nishi-ku, Fukuoka 819-0395, Japan
- Komaki Inoue
  - Bioproductivity Informatics Research Team, RIKEN Center for Sustainable Resource Science, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan
- Takashi Hirayama
  - Institute of Plant Science and Resources, Okayama University, 2-20-1 Chuo, Kurashiki, Okayama 710-0046, Japan
- Shojiro Tanaka
  - Hiroshima University of Economics, 5-37-1, Gion, Asaminami, Hiroshima-shi, Hiroshima 731-0138, Japan
- Ryuei Nishii
  - Institute of Mathematics for Industry, Kyushu University, 744 Motooka, Nishi-ku, Fukuoka 819-0395, Japan
- Farid Melgani
  - Department of Information Engineering and Computer Science, University of Trento, Via Sommarive 9, 38123 Trento, Italy
|
15
|
Mochida K, Koda S, Inoue K, Hirayama T, Tanaka S, Nishii R, Melgani F. Computer vision-based phenotyping for improvement of plant productivity: a machine learning perspective. Gigascience 2019. [PMID: 30520975 DOI: 10.1093/gigascience/giy153/5232233] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 04/04/2023] Open
|