1
Lai J, Yang H, Huang J, He L. Investigating the impact of Wnt pathway-related genes on biomarker and diagnostic model development for osteoporosis in postmenopausal females. Sci Rep 2024; 14:2880. [PMID: 38311613 PMCID: PMC10838932 DOI: 10.1038/s41598-024-52429-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2023] [Accepted: 01/18/2024] [Indexed: 02/06/2024]
Abstract
The Wnt signaling pathway is essential for bone development and for maintaining skeletal homeostasis, making it particularly relevant in osteoporosis. Our study aimed to identify distinct molecular clusters associated with the Wnt pathway and to develop a diagnostic model for osteoporosis in postmenopausal Caucasian women. We downloaded three osteoporosis-related datasets (GSE56814, GSE56815 and GSE2208) from the GEO database. Our analysis identified a total of 371 differentially expressed genes (DEGs) between the low and high bone mineral density (BMD) groups, 12 of which were associated with the Wnt signaling pathway and are referred to as osteoporosis-associated Wnt pathway-related genes. Employing four independent machine learning models, we established a diagnostic model using these 12 genes in the training set. The XGB model showed the most promising discriminative potential. We further validated the predictive capability of our diagnostic model by applying it to three external osteoporosis datasets. Subsequently, we constructed a diagnostic nomogram based on the five crucial genes identified from the XGB model. In addition, using DGIdb, we identified a total of 30 molecular compounds or medications that show potential as therapeutic agents for osteoporosis. In summary, our comprehensive analysis provides valuable insights into the relationship between osteoporosis and the Wnt signaling pathway.
Affiliation(s)
- Jinzhi Lai
- Department of Oncology, The Second Affiliated Hospital of Fujian Medical University, Quanzhou, 362000, Fujian, China
- Hainan Yang
- Department of Ultrasound, First Affiliated Hospital of Xiamen University, Xiamen, 361003, Fujian, China
- Jingshan Huang
- Department of General Surgery, The Second Affiliated Hospital of Fujian Medical University, Quanzhou, 362000, Fujian, China
- Lijiang He
- Department of Orthopaedic Surgery, The Second Affiliated Hospital of Fujian Medical University, Quanzhou, 362000, Fujian, China
2
Karabekmez ME, Yarıcı M. Parameterization of asymmetric sigmoid functions in weighted gene co-expression network analysis. Comput Biol Chem 2024; 108:107998. [PMID: 38071762 DOI: 10.1016/j.compbiolchem.2023.107998] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2023] [Revised: 11/22/2023] [Accepted: 12/03/2023] [Indexed: 01/22/2024]
Abstract
In most biological contexts, examining gene expression at the genomic level gives more accurate results than examining genes individually. It can improve understanding of the molecular mechanisms that cause molecular alterations. Weighted gene co-expression network analysis (WGCNA), which has recently been widely used to cluster transcriptomic datasets, implements a soft-thresholding procedure using a power function. However, power functions may sometimes exaggerate minor differences in expression correlations. We have previously proposed using asymmetric sigmoid functions in soft thresholding as an alternative solution. However, the number of variables in asymmetric sigmoid functions may vary, and parameterization can be problematic. In this study, we introduce a systematic procedure for parameterizing the asymmetric sigmoid function to ease its use as an alternative soft-thresholding solution in WGCNA. The effectiveness of the procedure was demonstrated on four different COVID-19 datasets, a yeast dataset, and an E. coli dataset. The results indicate that this approach provides biologically plausible associations for the resulting modules.
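To make the soft-thresholding step concrete, the sketch below contrasts a power-function adjacency, as used in WGCNA, with a sigmoid adjacency. It is a minimal illustration under stated assumptions, not the authors' parameterization: the plain (symmetric) logistic form, the function names, and the default values of `beta`, `alpha`, and `tau` are all chosen for the example only.

```python
import numpy as np


def _abs_correlation(expr):
    """Absolute Pearson correlation between genes (columns of expr)."""
    corr = np.corrcoef(expr, rowvar=False)
    # Guard against tiny floating-point overshoot outside [-1, 1].
    return np.abs(np.clip(corr, -1.0, 1.0))


def power_adjacency(expr, beta=6):
    """Power-function soft threshold: a_ij = |cor(x_i, x_j)| ** beta."""
    adj = _abs_correlation(expr) ** beta
    np.fill_diagonal(adj, 0.0)  # no self-connections
    return adj


def sigmoid_adjacency(expr, alpha=10.0, tau=0.5):
    """Logistic soft threshold (assumed form, for illustration only):
    a_ij = 1 / (1 + exp(-alpha * (|cor| - tau)))."""
    s = _abs_correlation(expr)
    adj = 1.0 / (1.0 + np.exp(-alpha * (s - tau)))
    np.fill_diagonal(adj, 0.0)
    return adj


# Toy expression matrix: 20 samples (rows) x 5 genes (columns).
rng = np.random.default_rng(0)
expr = rng.normal(size=(20, 5))
adj_power = power_adjacency(expr)
adj_sigmoid = sigmoid_adjacency(expr)
```

Both functions map correlations into [0, 1]; they differ in how sharply weak correlations are suppressed, which is the behaviour the choice of parameters controls.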
Affiliation(s)
- Merve Yarıcı
- Istanbul Medeniyet University, Department of Bioengineering, Istanbul, Turkey
3
Kuang J, Scoglio C, Michel K. Feature learning and network structure from noisy node activity data. Phys Rev E 2022; 106:064301. [PMID: 36671154 PMCID: PMC9869472 DOI: 10.1103/physreve.106.064301] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2021] [Accepted: 11/17/2022] [Indexed: 06/17/2023]
Abstract
In studies of network structure, much attention has been devoted to developing approaches that reconstruct networks and predict missing links when edge-related information is given. However, such approaches are not applicable when only noisy node activity data with missing values are available. This work presents an unsupervised learning framework to learn node vectors and construct networks from such node activity data. First, we design a scheme to generate random node sequences from node context sets, which are generated from node activity data. Then, a three-layer neural network is trained on the node sequences to obtain node vectors, which allow us to construct networks and capture nodes with synergistic roles. Furthermore, we present an entropy-based approach to select the most meaningful neighbors for each node in the resulting network. Finally, the effectiveness of the method is validated on both synthetic and real data.
Affiliation(s)
- Junyao Kuang
- Department of Electrical and Computer Engineering
4
Abreu MT, Davies JM, Quintero MA, Delmas A, Diaz S, Martinez CD, Venables T, Reich A, Crynen G, Deshpande AR, Kerman DH, Damas OM, Fernandez I, Santander AM, Pignac-Kobinger J, Burgueno JF, Sundrud MS. Transcriptional Behavior of Regulatory T Cells Predicts IBD Patient Responses to Vedolizumab Therapy. Inflamm Bowel Dis 2022; 28:1800-1812. [PMID: 35993552 DOI: 10.1093/ibd/izac151] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 12/11/2021] [Indexed: 12/07/2022]
Abstract
BACKGROUND Inflammatory bowel disease (IBD) involves chronic T cell-mediated inflammatory responses. Vedolizumab (VDZ), a monoclonal antibody against α4β7 integrin, inhibits lymphocyte extravasation into intestinal mucosae and is effective in ulcerative colitis (UC) and Crohn's disease (CD).
AIM We sought to identify immune cell phenotypic and gene expression signatures related to response to VDZ.
METHODS Peripheral blood mononuclear cells (PBMCs) and lamina propria mononuclear cells (LPMCs) were analyzed by flow cytometry and Cytofkit. Sorted CD4+ memory (Tmem) or regulatory T (Treg) cells from PBMCs and LPMCs were analyzed by RNA sequencing (RNA-seq). Clinical response (≥2-point drop in partial Mayo scores [UC] or Harvey-Bradshaw index [CD]) was assessed 14 to 22 weeks after VDZ initiation. Machine-learning models were used to infer combinatorial traits that predicted response to VDZ.
RESULTS Seventy-one patients were enrolled: 37 received VDZ and 21 patients remained on VDZ >2 years. Fourteen of 37 patients (38%; 8 UC, 6 CD) responded to VDZ. Immune cell phenotypes and CD4+ Tmem and Treg transcriptional behaviors were most divergent between the ileum and colon, irrespective of IBD subtype or inflammation status. Vedolizumab treatment had the greatest impact on Treg metabolic pathways, and response was associated with increased expression of genes involved in oxidative phosphorylation. The strongest clinical predictor of VDZ efficacy was concurrent use of thiopurines. Mucosal tissues offered the greatest number of response-predictive biomarkers, whereas PBMC Treg-expressed genes were the best predictors in combinatorial models of response.
CONCLUSIONS Mucosal and peripheral blood immune cell phenotypes and transcriptional profiles can inform VDZ efficacy and point to new opportunities for combination therapies.
Affiliation(s)
- Maria T Abreu
- Division of Gastroenterology and Hepatology, Department of Medicine, University of Miami Miller School of Medicine, Miami, Florida, USA
- Julie M Davies
- Division of Gastroenterology and Hepatology, Department of Medicine, University of Miami Miller School of Medicine, Miami, Florida, USA
- Maria A Quintero
- Division of Gastroenterology and Hepatology, Department of Medicine, University of Miami Miller School of Medicine, Miami, Florida, USA
- Amber Delmas
- Department of Immunology and Microbiology, The Scripps Research Institute, Jupiter, Florida, USA
- Sophia Diaz
- Division of Gastroenterology and Hepatology, Department of Medicine, University of Miami Miller School of Medicine, Miami, Florida, USA
- Catherine D Martinez
- Department of Immunology and Microbiology, The Scripps Research Institute, Jupiter, Florida, USA
- Thomas Venables
- Department of Immunology and Microbiology, The Scripps Research Institute, Jupiter, Florida, USA
- Adrian Reich
- Center for Computational Biology and Bioinformatics, The Scripps Research Institute, Jupiter, Florida, USA
- Gogce Crynen
- Center for Computational Biology and Bioinformatics, The Scripps Research Institute, Jupiter, Florida, USA
- Amar R Deshpande
- Division of Gastroenterology and Hepatology, Department of Medicine, University of Miami Miller School of Medicine, Miami, Florida, USA
- David H Kerman
- Division of Gastroenterology and Hepatology, Department of Medicine, University of Miami Miller School of Medicine, Miami, Florida, USA
- Oriana M Damas
- Division of Gastroenterology and Hepatology, Department of Medicine, University of Miami Miller School of Medicine, Miami, Florida, USA
- Irina Fernandez
- Division of Gastroenterology and Hepatology, Department of Medicine, University of Miami Miller School of Medicine, Miami, Florida, USA
- Ana M Santander
- Division of Gastroenterology and Hepatology, Department of Medicine, University of Miami Miller School of Medicine, Miami, Florida, USA
- Judith Pignac-Kobinger
- Division of Gastroenterology and Hepatology, Department of Medicine, University of Miami Miller School of Medicine, Miami, Florida, USA
- Juan F Burgueno
- Division of Gastroenterology and Hepatology, Department of Medicine, University of Miami Miller School of Medicine, Miami, Florida, USA
- Mark S Sundrud
- Department of Immunology and Microbiology, The Scripps Research Institute, Jupiter, Florida, USA
5
Li J, Wu QQ, Zhu RH, Lv X, Wang WQ, Wang JL, Liang BY, Huang ZY, Zhang EL. Machine learning predicts portal vein thrombosis after splenectomy in patients with portal hypertension: Comparative analysis of three practical models. World J Gastroenterol 2022; 28:4681-4697. [PMID: 36157936 PMCID: PMC9476873 DOI: 10.3748/wjg.v28.i32.4681] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/16/2022] [Revised: 05/25/2022] [Accepted: 08/01/2022] [Indexed: 02/06/2023]
Abstract
BACKGROUND For patients with portal hypertension (PH), portal vein thrombosis (PVT) is a fatal complication after splenectomy. Postoperative platelet elevation is considered the foremost reason for PVT. However, the value of postoperative platelet elevation rate (PPER) in predicting PVT has never been studied.
AIM To investigate the predictive value of PPER for PVT and to establish PPER-based prediction models for early identification of individuals at high risk of PVT after splenectomy.
METHODS We retrospectively reviewed 483 patients with PH related to hepatitis B virus who underwent splenectomy between July 2011 and September 2018; they were randomly assigned to either a training (n = 338) or a validation (n = 145) cohort. The generalized linear (GL) method, least absolute shrinkage and selection operator (LASSO), and random forest (RF) were used to construct models. Receiver operating characteristic (ROC) curves, calibration curves, decision curve analysis (DCA), and clinical impact curves (CIC) were used to evaluate the robustness and clinical practicability of the GL model (GLM), LASSO model (LSM), and RF model (RFM).
RESULTS Multivariate analysis showed that PPER on the first and third postoperative days (PPER1, PPER3) was strongly associated with PVT [odds ratio (OR): 1.78, 95% confidence interval (CI): 1.24-2.62, P = 0.002; OR: 1.43, 95%CI: 1.16-1.77, P < 0.001, respectively]. The areas under the ROC curves of the GLM, LSM, and RFM in the training cohort were 0.83 (95%CI: 0.79-0.88), 0.84 (95%CI: 0.79-0.88), and 0.84 (95%CI: 0.79-0.88), respectively; in the validation cohort they were 0.77 (95%CI: 0.69-0.85), 0.83 (95%CI: 0.76-0.90), and 0.78 (95%CI: 0.70-0.85), respectively. The calibration curves showed satisfactory agreement between model predictions and actual observations. DCA and CIC indicated that all models conferred high clinical net benefits.
CONCLUSION PPER1 and PPER3 are effective indicators for postoperative prediction of PVT. We have successfully developed PPER-based practical models to accurately predict PVT, which would conveniently help clinicians rapidly differentiate individuals at high risk of PVT, and thus guide the adoption of timely interventions.
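The net benefit that decision curve analysis reports can be illustrated with a short sketch. It uses the standard definition, net benefit = TP/n − (FP/n) × p_t/(1 − p_t) at threshold probability p_t; the function names and the toy data are assumptions for illustration and are not taken from the study.

```python
import numpy as np


def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone with predicted risk >= threshold.

    threshold must lie strictly between 0 and 1.
    """
    y_true = np.asarray(y_true).astype(bool)
    y_prob = np.asarray(y_prob, dtype=float)
    n = y_true.size
    treat = y_prob >= threshold
    tp = np.sum(treat & y_true)    # treated and truly positive
    fp = np.sum(treat & ~y_true)   # treated but truly negative
    weight = threshold / (1.0 - threshold)  # harm-to-benefit odds
    return tp / n - fp / n * weight


def decision_curve(y_true, y_prob, thresholds):
    """Net benefit evaluated over a grid of threshold probabilities."""
    return [net_benefit(y_true, y_prob, t) for t in thresholds]


# Hypothetical outcomes and predicted risks for four patients.
y = [1, 0, 1, 0]
p = [0.9, 0.2, 0.6, 0.7]
nb_half = net_benefit(y, p, 0.5)   # 2 TP, 1 FP, weight 1
curve = decision_curve(y, p, [0.1, 0.3, 0.5])
```

A model is usually judged useful at a given threshold when its net benefit exceeds that of the "treat all" and "treat none" (net benefit 0) strategies.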
Affiliation(s)
- Jian Li
- Hepatic Surgery Center, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, Hubei Province, China
- Qi-Qi Wu
- Department of Trauma Surgery, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, Hubei Province, China
- Rong-Hua Zhu
- Hepatic Surgery Center, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, Hubei Province, China
- Xing Lv
- Hepatic Surgery Center, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, Hubei Province, China
- Wen-Qiang Wang
- Hepatic Surgery Center, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, Hubei Province, China
- Jin-Lin Wang
- Hepatic Surgery Center, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, Hubei Province, China
- Bin-Yong Liang
- Hepatic Surgery Center, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, Hubei Province, China
- Zhi-Yong Huang
- Hepatic Surgery Center, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, Hubei Province, China
- Er-Lei Zhang
- Hepatic Surgery Center, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, Hubei Province, China
6
Chen T, Su P, Shen Y, Chen L, Mahmud M, Zhao Y, Antoniou G. A dominant set-informed interpretable fuzzy system for automated diagnosis of dementia. Front Neurosci 2022; 16:867664. [PMID: 35979331 PMCID: PMC9376621 DOI: 10.3389/fnins.2022.867664] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2022] [Accepted: 07/05/2022] [Indexed: 11/13/2022]
Abstract
Dementia is an incurable neurodegenerative disease primarily affecting the older population, and the World Health Organisation has set promoting early diagnosis and timely management as one of the primary goals for dementia care. While a range of popular machine learning algorithms and their variants have been applied to dementia diagnosis, fuzzy systems, which are known to be effective in dealing with uncertainty and can explicitly show how a diagnosis is inferred, appear only sporadically in recent literature. Given the advantages of a fuzzy rule-based model, which could potentially result in a clinical decision support system that offers understandable rules and a transparent inference process to support dementia diagnosis, this paper proposes a novel fuzzy inference system by adapting the concept of dominant sets that arises from the study of graph theory. A peeling-off strategy is used to iteratively extract a collection of dominant sets from the constructed edge-weighted graph. Each dominant set is then converted into a parameterized fuzzy rule, which is finally optimized in a supervised adaptive network-based fuzzy inference framework. An illustrative example demonstrates the interpretable rules and the transparent reasoning process of reaching a decision. Further systematic experiments conducted on data from the Open Access Series of Imaging Studies (OASIS) repository also validate its superior performance over alternative methods.
Affiliation(s)
- Tianhua Chen
- Department of Computer Science, School of Computing and Engineering, University of Huddersfield, Huddersfield, United Kingdom
- Pan Su
- School of Control and Computer Engineering, North China Electric Power University, Beijing, China
- Yinghua Shen
- School of Economics and Business Administration, Chongqing University, Chongqing, China
- Lu Chen
- Institute of Big Data Science and Industry, Shanxi University, Taiyuan, China
- Mufti Mahmud
- Department of Computer Science, Nottingham Trent University, Nottingham, United Kingdom
- Yitian Zhao
- Cixi Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo, China
- Grigoris Antoniou
- Department of Computer Science, School of Computing and Engineering, University of Huddersfield, Huddersfield, United Kingdom
7
Obesity-Associated Differentially Methylated Regions in Colon Cancer. J Pers Med 2022; 12:jpm12050660. [PMID: 35629083 PMCID: PMC9142939 DOI: 10.3390/jpm12050660] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/04/2022] [Revised: 04/11/2022] [Accepted: 04/18/2022] [Indexed: 02/01/2023]
Abstract
Obesity with adiposity is a common disorder in modern days, influenced by environmental factors such as eating and lifestyle habits and affecting the epigenetics of adipose-based gene regulations and metabolic pathways in colorectal cancer (CRC). We compared epigenetic changes of differentially methylated regions (DMR) of genes in colon tissues of 225 colon cancer cases (154 non-obese and 71 obese) and 15 healthy non-obese controls by accessing The Cancer Genome Atlas (TCGA) data. We applied machine-learning-based analytics including generalized regression (GR) as a confirmatory validation model to identify the factors that could contribute to DMRs impacting colon cancer to enhance prediction accuracy. We found that age was a significant predictor in obese cancer patients, both alone (p = 0.003) and interacting with hypomethylated DMRs of ZBTB46, a tumor suppressor gene (p = 0.008). DMRs of three additional genes: HIST1H3I (p = 0.001), an oncogene with a hypomethylated DMR in the promoter region; SRGAP2C (p = 0.006), a tumor suppressor gene with a hypermethylated DMR in the promoter region; and NFATC4 (p = 0.006), an adipocyte differentiating oncogene with a hypermethylated DMR in an intron region, are also significant predictors of cancer in obese patients, independent of age. The genes affected by these DMR could be potential novel biomarkers of colon cancer in obese patients for cancer prevention and progression.
8
Hulstaert E, Levanon K, Morlion A, Van Aelst S, Christidis AA, Zamar R, Anckaert J, Verniers K, Bahar-Shany K, Sapoznik S, Vandesompele J, Mestdagh P. RNA biomarkers from proximal liquid biopsy for diagnosis of ovarian cancer. Neoplasia 2022; 24:155-164. [PMID: 34998206 PMCID: PMC8740458 DOI: 10.1016/j.neo.2021.12.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2021] [Accepted: 12/20/2021] [Indexed: 10/29/2022]
Abstract
BACKGROUND Most ovarian cancer patients are diagnosed at an advanced stage and have a high mortality rate. Current screening strategies fail to improve prognosis because markers that are sensitive for early stage disease are lacking. This medical need justifies the search for novel approaches using utero-tubal lavage as a proximal liquid biopsy. METHODS In this study, we explore the extracellular transcriptome of utero-tubal lavage fluid obtained from 26 ovarian cancer patients and 48 controls using messenger RNA (mRNA) capture and small RNA sequencing. RESULTS We observed an enrichment of ovarian and fallopian tube specific messenger RNAs in utero-tubal lavage fluid compared to other human biofluids. Over 300 mRNAs and 41 miRNAs were upregulated in ovarian cancer samples compared with controls. Upregulated genes were enriched for genes involved in cell cycle activation and proliferation, hinting at a tumor-derived signal. CONCLUSION This is a proof-of-principle that mRNA capture sequencing of utero-tubal lavage fluid is technically feasible, and that the extracellular transcriptome of utero-tubal lavage should be further explored in larger cohorts to assess the diagnostic value of the biomarkers identified in this study. IMPACT Proximal liquid biopsy from the gynecologic tract is a promising source for mRNA and miRNA biomarkers for diagnosis of early-stage ovarian cancer.
Affiliation(s)
- Eva Hulstaert
- Department of Biomolecular Medicine, Ghent University, Corneel Heymanslaan 10, 9000 Ghent, Belgium; OncoRNALab, Cancer Research Institute Ghent (CRIG), Corneel Heymanslaan 10, 9000 Ghent, Belgium; Department of Dermatology, Ghent University Hospital, Corneel Heymanslaan 10, 9000 Ghent, Belgium
- Keren Levanon
- Sheba Cancer Research Center, Chaim Sheba Medical Center, Ramat Gan, Israel; Sackler Faculty of Medicine, Tel Aviv University, Ramat Aviv, Israel
- Annelien Morlion
- Department of Biomolecular Medicine, Ghent University, Corneel Heymanslaan 10, 9000 Ghent, Belgium; OncoRNALab, Cancer Research Institute Ghent (CRIG), Corneel Heymanslaan 10, 9000 Ghent, Belgium
- Ruben Zamar
- Department of Statistics, University of British Columbia, Vancouver, Canada
- Jasper Anckaert
- Department of Biomolecular Medicine, Ghent University, Corneel Heymanslaan 10, 9000 Ghent, Belgium; OncoRNALab, Cancer Research Institute Ghent (CRIG), Corneel Heymanslaan 10, 9000 Ghent, Belgium
- Kimberly Verniers
- Department of Biomolecular Medicine, Ghent University, Corneel Heymanslaan 10, 9000 Ghent, Belgium; OncoRNALab, Cancer Research Institute Ghent (CRIG), Corneel Heymanslaan 10, 9000 Ghent, Belgium
- Keren Bahar-Shany
- Sheba Cancer Research Center, Chaim Sheba Medical Center, Ramat Gan, Israel
- Stav Sapoznik
- Sheba Cancer Research Center, Chaim Sheba Medical Center, Ramat Gan, Israel
- Jo Vandesompele
- Department of Biomolecular Medicine, Ghent University, Corneel Heymanslaan 10, 9000 Ghent, Belgium; OncoRNALab, Cancer Research Institute Ghent (CRIG), Corneel Heymanslaan 10, 9000 Ghent, Belgium
- Pieter Mestdagh
- Department of Biomolecular Medicine, Ghent University, Corneel Heymanslaan 10, 9000 Ghent, Belgium; OncoRNALab, Cancer Research Institute Ghent (CRIG), Corneel Heymanslaan 10, 9000 Ghent, Belgium
9
Monowar Anjum M, Mohammed N, Li W, Jiang X. Privacy Preserving Collaborative Learning of Generalized Linear Mixed Model. J Biomed Inform 2022; 127:104008. [DOI: 10.1016/j.jbi.2022.104008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2021] [Revised: 12/08/2021] [Accepted: 01/30/2022] [Indexed: 12/01/2022]
10
Towards a More Reliable Interpretation of Machine Learning Outputs for Safety-Critical Systems Using Feature Importance Fusion. APPLIED SCIENCES-BASEL 2021. [DOI: 10.3390/app112411854] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 12/11/2022]
Abstract
When machine learning supports decision-making in safety-critical systems, it is important to verify and understand the reasons why a particular output is produced. Although feature importance calculation approaches assist in interpretation, there is a lack of consensus regarding how features’ importance is quantified, which makes the explanations offered for the outcomes mostly unreliable. A possible solution to address the lack of agreement is to combine the results from multiple feature importance quantifiers to reduce the variance in estimates and to improve the quality of explanations. Our hypothesis is that this leads to more robust and trustworthy explanations of the contribution of each feature to machine learning predictions. To test this hypothesis, we propose an extensible model-agnostic framework divided into four main parts: (i) traditional data pre-processing and preparation for predictive machine learning models, (ii) predictive machine learning, (iii) feature importance quantification, and (iv) feature importance decision fusion using an ensemble strategy. Our approach is tested on synthetic data, where the ground truth is known. We compare different fusion approaches and their results for both training and test sets. We also investigate how different characteristics within the datasets affect the quality of the feature importance ensembles studied. The results show that, overall, our feature importance ensemble framework produces 15% fewer feature importance errors compared with existing methods. Additionally, the results reveal that different levels of noise in the datasets do not affect the feature importance ensembles’ ability to accurately quantify feature importance, whereas the feature importance quantification error increases with the number of features and the number of orthogonal informative features. We also discuss the implications of our findings for the quality of explanations provided to safety-critical systems.
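The idea of fusing several feature importance estimates can be sketched as follows. This is a minimal illustration, assuming that each quantifier returns one score per feature and that fusion is a simple mean or median of normalized scores; the function name `fuse_importances` and the toy scores are hypothetical, and the abstract does not specify the fusion rules the framework actually uses.

```python
import numpy as np


def fuse_importances(importances, method="mean"):
    """Combine importance vectors from several quantifiers.

    importances: array-like of shape (n_quantifiers, n_features).
    Each row is rescaled to sum to 1 so that quantifiers reporting on
    different scales contribute equally. Assumes no row is all zeros.
    """
    mat = np.abs(np.asarray(importances, dtype=float))
    mat = mat / mat.sum(axis=1, keepdims=True)
    if method == "mean":
        fused = mat.mean(axis=0)
    elif method == "median":
        fused = np.median(mat, axis=0)
    else:
        raise ValueError(f"unknown fusion method: {method}")
    return fused / fused.sum()  # renormalize the fused vector


# Hypothetical scores from two quantifiers over three features.
scores = [[1.0, 1.0, 2.0],
          [2.0, 0.0, 2.0]]
fused_mean = fuse_importances(scores, method="mean")
fused_median = fuse_importances(scores, method="median")
```

Averaging reduces the variance of any single quantifier's estimate, while the median is less sensitive to one quantifier that disagrees strongly with the rest.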
11
Tarimo CS, Bhuyan SS, Li Q, Mahande MJJ, Wu J, Fu X. Validating machine learning models for the prediction of labour induction intervention using routine data: a registry-based retrospective cohort study at a tertiary hospital in northern Tanzania. BMJ Open 2021; 11:e051925. [PMID: 34857568 PMCID: PMC8647548 DOI: 10.1136/bmjopen-2021-051925] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/04/2022]
Abstract
OBJECTIVES We aimed at identifying the important variables for labour induction intervention and assessing the predictive performance of machine learning algorithms. SETTING We analysed the birth registry data from a referral hospital in northern Tanzania. Since July 2000, every birth at this facility has been recorded in a specific database. PARTICIPANTS 21 578 deliveries between 2000 and 2015 were included. Deliveries that lacked information regarding the labour induction status were excluded. PRIMARY OUTCOME Deliveries involving labour induction intervention. RESULTS Parity, maternal age, body mass index, gestational age and birth weight were all found to be important predictors of labour induction. Boosting method demonstrated the best discriminative performance (area under curve, AUC=0.75: 95% CI (0.73 to 0.76)) while logistic regression presented the least (AUC=0.71: 95% CI (0.70 to 0.73)). Random forest and boosting algorithms showed the highest net-benefits as per the decision curve analysis. CONCLUSION All of the machine learning algorithms performed well in predicting the likelihood of labour induction intervention. Further optimisation of these classifiers through hyperparameter tuning may result in an improved performance. Extensive research into the performance of other classifier algorithms is warranted.
Affiliation(s)
- Clifford Silver Tarimo
- College of Public Health, Zhengzhou University, Zhengzhou, China
- Science and Laboratory Technology, Dar es Salaam Institute of Technology, Dar es Salaam, Tanzania, United Republic of
- Soumitra S Bhuyan
- School of Planning and Public Policy, Rutgers University-New Brunswick, New York, New York, USA
- Quanman Li
- College of Public Health, Zhengzhou University, Zhengzhou, China
- Michael Johnson J Mahande
- Institute of Public Health, Kilimanjaro Christian Medical University College, Moshi, Tanzania, United Republic of
- Jian Wu
- College of Public Health, Zhengzhou University, Zhengzhou, China
- Xiaoli Fu
- College of Public Health, Zhengzhou University, Zhengzhou, China
12
Pathan SA, Thomas CE, Bhutta ZA, Qureshi I, Thomas SA, Moinudheen J, Thomas SH. Qatar Prediction Rule Using ED Indicators of COVID-19 at Triage. Qatar Med J 2021; 2021:18. [PMID: 34422577 PMCID: PMC8359675 DOI: 10.5339/qmj.2021.18] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/09/2020] [Accepted: 02/04/2021] [Indexed: 11/21/2022]
Abstract
INTRODUCTION The presence of Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2) and its associated disease, COVID-19, has had an enormous impact on the operations of the emergency department (ED), particularly the triage area. The aim of the study was to derive and validate a prediction rule applicable to Qatar's adult ED population to predict COVID-19-positive patients. METHODS This is a retrospective study including adult patients. The data were obtained from the electronic medical records (EMR) of the Hamad Medical Corporation (HMC) for three EDs. Data from the Hamad General Hospital ED were used to derive and internally validate a prediction rule (Q-PREDICT). The Al Wakra Hospital ED and Al Khor Hospital ED data formed an external validation set covering the same time frame. The variables in the model included the weekly ED COVID-19-positivity rate and the following patient characteristics: region (nationality), age, acuity, cough, fever, tachypnea, hypoxemia, and hypotension. All statistical analyses were executed with Stata 16.1 (Stata Corp). The study team obtained appropriate institutional approval. RESULTS The study included 45,663 adult patients who were tested for COVID-19. Of these, 47% (n = 21,461) were COVID-19 positive. The derivation-set model had very good discrimination (c = 0.855, 95% confidence interval (CI) 0.847-0.861). Cross-validation of the model demonstrated that the validation-set model (c = 0.857, 95% CI 0.849-0.863) retained high discrimination. A high Q-PREDICT score (≥ 13) is associated with a nearly 6-fold increase in the likelihood of being COVID-19 positive (likelihood ratio 5.9, 95% CI 5.6-6.2), with a sensitivity of 84.7% (95% CI 84.0%-85.4%). A low Q-PREDICT score (≤ 6) is associated with a nearly 20-fold increase in the likelihood of being COVID-19 negative (likelihood ratio 19.3, 95% CI 16.7-22.1), with a specificity of 98.7% (95% CI 98.5%-98.9%). CONCLUSION The Q-PREDICT is a simple scoring system based on information readily collected from patients at the front desk of the ED and helps to predict COVID-19 status at triage. The scoring system performed well in the internal and external validation on datasets obtained from the state of Qatar.
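The likelihood ratios reported above can be turned into post-test probabilities through the usual odds form of Bayes' rule. The sketch below is only an illustration: the function name `post_test_probability` is hypothetical, and using the cohort positivity rate (47%) as the pre-test probability is an assumption, not something the abstract prescribes.

```python
def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Convert a pre-test probability to a post-test probability via odds."""
    pre_odds = pre_test_prob / (1.0 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# Assumption: the cohort positivity rate (47%) stands in for the pre-test probability.
# High score (>= 13): probability of being COVID-19 positive, using LR 5.9.
p_positive_high_score = post_test_probability(0.47, 5.9)
# Low score (<= 6): probability of being COVID-19 negative, using LR 19.3
# applied to the pre-test probability of being negative (53%).
p_negative_low_score = post_test_probability(0.53, 19.3)

print(round(p_positive_high_score, 3), round(p_negative_low_score, 3))
```

At a different positivity rate the same likelihood ratios would give different post-test probabilities, which may be why the model includes the weekly ED positivity rate as a variable.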
Affiliation(s)
- Sarah A Thomas
  - Bachelor Candidate in Medical Biosciences, Faculty of Medicine, Imperial College London, UK
- Stephen H Thomas
  - Hamad Medical Corporation, Doha, Qatar
  - Blizard Institute of Barts & The London School of Medicine, Queen Mary Univ. of London, UK
|
13
|
Rauschenberger A, Glaab E, van de Wiel MA. Predictive and interpretable models via the stacked elastic net. Bioinformatics 2021; 37:2012-2016. [PMID: 32437519 PMCID: PMC8336997 DOI: 10.1093/bioinformatics/btaa535] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 12/10/2019] [Revised: 04/30/2020] [Accepted: 05/18/2020] [Indexed: 12/18/2022] Open
Abstract
Motivation Machine learning in the biomedical sciences should ideally provide predictive and interpretable models. When predicting outcomes from clinical or molecular features, applied researchers often want to know which features have effects, whether these effects are positive or negative and how strong these effects are. Regression analysis includes this information in the coefficients but typically renders less predictive models than more advanced machine learning techniques. Results Here, we propose an interpretable meta-learning approach for high-dimensional regression. The elastic net provides a compromise between estimating weak effects for many features and strong effects for some features. It has a mixing parameter to weight between ridge and lasso regularization. Instead of selecting one weighting by tuning, we combine multiple weightings by stacking. We do this in a way that increases predictivity without sacrificing interpretability. Availability and implementation The R package starnet is available on GitHub (https://github.com/rauschenberger/starnet) and CRAN (https://CRAN.R-project.org/package=starnet).
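As a hedged sketch of the idea, the elastic net penalty can be written in the common glmnet-style parametrization (an assumption here, since the abstract does not fix notation), with mixing parameter $\alpha$ between ridge ($\alpha = 0$) and lasso ($\alpha = 1$). If the base learners for weightings $\alpha_1, \dots, \alpha_K$ are combined by a linear meta-learner with weights $\omega_k$ (also an assumed form), the stacked prediction is again linear in the features, which is one way to see why interpretability need not be lost.

```latex
% Base learner for a given mixing parameter \alpha (glmnet-style form, assumed):
\hat{\beta}^{(\alpha)} = \arg\min_{\beta_0,\,\beta}\;
  \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
  + \lambda\Bigl[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\Bigr]

% Linear stacking over K weightings \alpha_1,\dots,\alpha_K (meta-weights \omega_k, assumed):
\hat{y}_i^{\mathrm{stack}}
  = \omega_0 + \sum_{k=1}^{K}\omega_k\bigl(\hat{\beta}_0^{(\alpha_k)} + x_i^{\top}\hat{\beta}^{(\alpha_k)}\bigr)
  = \beta_0^{*} + x_i^{\top}\beta^{*},
\qquad
\beta^{*} = \sum_{k=1}^{K}\omega_k\,\hat{\beta}^{(\alpha_k)},
\quad
\beta_0^{*} = \omega_0 + \sum_{k=1}^{K}\omega_k\,\hat{\beta}_0^{(\alpha_k)}
```

Under these assumptions the pooled coefficients $\beta^{*}$ can be read like ordinary regression coefficients; how starnet actually estimates the weights is documented with the package.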
Affiliation(s)
- Armin Rauschenberger
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, 4362 Esch-sur-Alzette, Luxembourg; Department of Epidemiology and Data Science, Amsterdam UMC, 1081 HV Amsterdam, The Netherlands
- Enrico Glaab
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, 4362 Esch-sur-Alzette, Luxembourg
- Mark A van de Wiel
  - Department of Epidemiology and Data Science, Amsterdam UMC, 1081 HV Amsterdam, The Netherlands; MRC Biostatistics Unit, University of Cambridge, CB2 0SR Cambridge, UK
|
14
|
Zhan M, Chen Z, Ding C, Qu Q, Wang G, Liu S, Wen F. Risk prediction for delayed clearance of high-dose methotrexate in pediatric hematological malignancies by machine learning. Int J Hematol 2021; 114:483-493. [PMID: 34170480 DOI: 10.1007/s12185-021-03184-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/29/2021] [Revised: 06/21/2021] [Accepted: 06/21/2021] [Indexed: 10/21/2022]
Abstract
This study aimed to establish a machine learning-based predictive model to identify children with hematologic malignancy at high risk for delayed clearance of high-dose methotrexate (HD-MTX). A total of 205 patients were recruited. Five variables (hematocrit, risk classification, dose, SLC19A1 rs2838958, sex) were statistically significant in univariable analysis, and three variables (SLC19A1 rs2838958, sex, dose) were statistically significant in multivariate logistic regression. The data were randomly split into a training cohort and a validation cohort. A nomogram for prediction of delayed HD-MTX clearance was constructed using the three variables in the training dataset and validated in the validation dataset. Five machine learning algorithms (classification and regression trees [CART], naïve Bayes, support vector machine, random forest, C5.0 decision tree) combined with different resampling methods were used for model building with five or three variables. When the developed machine learning models were evaluated in the validation dataset, the C5.0 decision tree combined with the synthetic minority oversampling technique (SMOTE) using five variables had the highest area under the receiver operating characteristic curve (AUC 0.807 [95% CI 0.724-0.889]), a better performance than the nomogram (AUC 0.69 [95% CI 0.594-0.787]). The results support potential clinical application of machine learning for patient risk classification.
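The core step of SMOTE, the resampling method named above, is interpolation between a minority-class sample and one of its nearest minority-class neighbours. The following is a minimal standard-library sketch of that step, not the authors' pipeline; the function name `smote`, the choice of `k`, and the example points are all hypothetical.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by neighbour interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority-class neighbours of x (excluding x itself)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        u = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, nb)))
    return synthetic


# Hypothetical two-feature minority class (e.g. scaled dose and hematocrit)
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote(minority, n_new=6)
print(new_points)
```

Oversampling of this kind is normally applied to the training cohort only, so that the validation cohort keeps the original class balance.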
Affiliation(s)
- Min Zhan
  - Department of Pharmacy, Shenzhen Children's Hospital, Shenzhen, 518036, People's Republic of China
- Zebin Chen
  - Department of Pharmacy, Shenzhen Children's Hospital, Shenzhen, 518036, People's Republic of China
- Changcai Ding
  - Department of Research and Development, Shenzhen Advanced Precision Medical CO., LTD, Shenzhen, 518000, People's Republic of China
- Qiang Qu
  - Department of Pharmacy, Xiangya Hospital Central South University, Changsha, 410008, People's Republic of China
- Guoqiang Wang
  - Department of Pharmacy, Shenzhen Children's Hospital, Shenzhen, 518036, People's Republic of China
- Sixi Liu
  - Department of Hematology/Oncology, Shenzhen Children's Hospital, Shenzhen, 518036, People's Republic of China
- Feiqiu Wen
  - Department of Hematology/Oncology, Shenzhen Children's Hospital, Shenzhen, 518036, People's Republic of China
|
15
|
Hu S, Luo M, Li Y. Machine Learning for the Prediction of Lymph Nodes Micrometastasis in Patients with Non-Small Cell Lung Cancer: A Comparative Analysis of Two Practical Prediction Models for Gross Target Volume Delineation. Cancer Manag Res 2021; 13:4811-4820. [PMID: 34168500 PMCID: PMC8217594 DOI: 10.2147/cmar.s313941] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/01/2021] [Accepted: 05/31/2021] [Indexed: 12/25/2022] Open
Abstract
Purpose The lymph node gross target volume (GTV) delineation in patients with non-small cell lung cancer (NSCLC) is crucial for prognosis. This study aimed to develop a predictive model that can be used to differentiate between lymph nodes micrometastasis (LNM) and non-lymph nodes micrometastasis (non-LNM). Patients and Methods Data were collected retrospectively from 1524 patients diagnosed with NSCLC at the First Hospital of Wuhan between January 1, 2017, and April 1, 2020. Duplicated and uninformative variables were excluded, and 16 candidate variables were selected for further analysis. The random forest (RF) algorithm and the generalized linear (GL) algorithm were each used to screen out the variables that most strongly affected the LNM prediction. The area under the curve (AUC) was compared between the RF model and the GL model. Results The RF model revealed that the variables pathology, degree of differentiation, maximum short diameter of lymph node, tumor diameter, pulmonary membrane invasion, clustered lymph nodes, and T stage were more significant for LNM prediction. Multifactorial logistic regression analysis for the GL model indicated that vascular invasion, tumor diameter, degree of differentiation, pulmonary membrane invasion, and maximum standard uptake value (SUVmax) were positively associated with LNM. The AUCs for the RF model and the GL model were 0.83 (95% CI: 0.75 to 0.90) and 0.64 (95% CI: 0.60 to 0.70), respectively. Conclusion We successfully established an accurate and optimized RF model that could be used to predict LNM in patients with NSCLC. This model can be used to evaluate the risk of an individual patient experiencing LNM and therefore facilitate the choice of treatment.
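The AUC used to compare the two models equals the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. A minimal sketch of that definition follows; the function name `auc` and the toy labels and scores are hypothetical and unrelated to the study data.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Each pair scores 1 if the positive is ranked higher, 0.5 on a tie, else 0
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical predictions: 1 = LNM, 0 = non-LNM
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auc(toy_labels, toy_scores))
```

The pairwise form is quadratic in sample size, which is fine for illustration; rank-based formulas give the same value more efficiently on large cohorts.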
Affiliation(s)
- Shuli Hu
  - Department of Intensive Care Unit, Wuhan No. 1 Hospital, Wuhan, 430022, People's Republic of China
- Man Luo
  - Department of Oncology, Wuhan No. 1 Hospital, Wuhan, 430022, People's Republic of China
- Yaling Li
  - Department of Intensive Care Unit, Wuhan No. 1 Hospital, Wuhan, 430022, People's Republic of China
|
16
|
Marconi S, Graves SJ, Weinstein BG, Bohlman S, White EP. Estimating individual-level plant traits at scale. Ecological Applications: A Publication of the Ecological Society of America 2021; 31:e02300. [PMID: 33480058 DOI: 10.1002/eap.2300] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 12/20/2019] [Revised: 07/22/2020] [Accepted: 08/16/2020] [Indexed: 06/12/2023]
Abstract
Functional ecology has increasingly focused on describing ecological communities based on their traits (measurable features affecting individuals' fitness and performance). Analyzing trait distributions within and among forests could significantly improve understanding of community composition and ecosystem function. Historically, data on trait distributions are generated by (1) collecting a small number of leaves from a small number of trees, which suffers from limited sampling but produces information at the fundamental ecological unit (the individual), or (2) using remote-sensing images to infer traits, producing information continuously across large regions, but as plots (containing multiple trees of different species) or pixels, not individuals. Remote-sensing methods that identify individual trees and estimate their traits would provide the benefits of both approaches, producing continuous large-scale data linked to biological individuals. We used data from the National Ecological Observatory Network (NEON) to develop a method to scale up functional traits from 160 trees to the millions of trees within the spatial extent of two NEON sites. The pipeline consists of three stages: (1) image segmentation, to identify individual trees and estimate structural traits; (2) an ensemble of models to infer leaf mass area (LMA), nitrogen, carbon, and phosphorus content using hyperspectral signatures, and DBH from allometry; and (3) predictions for segmented crowns for the full remote-sensing footprint at the NEON sites. The R² values on held-out test data ranged from 0.41 to 0.75. The ensemble approach performed better than single partial least-squares models. Carbon performed poorly compared to other traits (R² of 0.41). The crown segmentation step contributed the most uncertainty in the pipeline, due to over-segmentation. The pipeline produced good estimates of DBH (R² of 0.62 on held-out data). Trait predictions for crowns performed significantly better than comparable predictions on pixels, resulting in an improvement of R² on test data of between 0.07 and 0.26. We used the pipeline to produce individual-level trait data for ~5 million individual crowns, covering a total extent of ~360 km². This large data set allows testing ecological questions on landscape scales, revealing that foliar traits are correlated with structural traits and environmental conditions.
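The held-out R² values quoted above are presumably the coefficient of determination, one minus the ratio of residual to total sum of squares; that reading is an assumption, since the abstract does not define the statistic. A small sketch with hypothetical trait values:

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical held-out trait measurements and model predictions
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(r_squared(observed, predicted))
```

Computed this way on held-out data, R² can be negative when a model predicts worse than the mean of the observations, so it is not a squared correlation in general.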
Affiliation(s)
- Sergio Marconi
  - School of Natural Resources and Environment, University of Florida, Gainesville, Florida, 32611, USA
- Sarah J Graves
  - School of Forest Resources and Conservation, University of Florida, Gainesville, Florida, 32603, USA
  - Nelson Institute for Environmental Studies, University of Wisconsin-Madison, Madison, Wisconsin, 53706, USA
- Ben G Weinstein
  - Department of Wildlife Ecology and Conservation, University of Florida, Gainesville, Florida, 32603, USA
- Stephanie Bohlman
  - School of Forest Resources and Conservation, University of Florida, Gainesville, Florida, 32603, USA
- Ethan P White
  - Department of Wildlife Ecology and Conservation, University of Florida, Gainesville, Florida, 32603, USA
|
17
|
Zhou Z, Li Y, Ma Y, Zhang H, Deng Y, Zhu Z. Multi-biomarker is an early-stage predictor for progression of Coronavirus disease 2019 (COVID-19) infection. Int J Med Sci 2021; 18:2789-2798. [PMID: 34220307 PMCID: PMC8241766 DOI: 10.7150/ijms.58742] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/27/2021] [Accepted: 04/27/2021] [Indexed: 12/12/2022] Open
Abstract
Coronavirus disease 2019 (COVID-19) has spread widely in communities in many countries. Although most patients with mild disease recover on their own, many progress quickly to severe disease and require treatment in the intensive care unit (ICU). It is therefore very important to predict which patients with mild disease are more likely to progress to severe disease. A total of 72 patients hospitalized with COVID-19 in Shandong Provincial Public Health Clinical Center and 1141 patients from published papers were included in this study. We determined that the combination of interleukin-6 (IL-6), neutrophils (NEUT), and natural killer (NK) cells had the highest prediction accuracy (75% sensitivity and 95% specificity) for progression of COVID-19 infection. A binomial regression equation that yields a multiple risk score for the combination of IL-6, NEUT, and NK was also established. The multiple risk score is a good indicator for early stratification of patients with mild disease into risk categories, which is very important for adjusting the treatment plan and preventing death.
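Sensitivity and specificity, as reported above, are simple ratios from a confusion matrix. The sketch below uses hypothetical counts chosen only so that the ratios match the reported 75% and 95%; the abstract does not give the underlying counts, and the function name is likewise hypothetical.

```python
def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # progressors correctly flagged
    specificity = tn / (tn + fp)  # non-progressors correctly cleared
    return sensitivity, specificity


# Hypothetical counts, NOT taken from the study
sens, spec = sensitivity_specificity(tp=15, fn=5, tn=95, fp=5)
print(sens, spec)
```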
Affiliation(s)
- Zheng Zhou
  - Katharine Hsu International Research Institute of Infectious Disease, Shandong Provincial Public Health Clinical Center, Shandong University, Jinan 250013, China
- Ying Li
  - Medical Technology School of Xuzhou Medical University, Xuzhou 221004, China
- Yuanhui Ma
  - Department of Pathology, Shandong Provincial Public Health Clinical Center, Shandong University, Jinan 250013, China
- Heng Zhang
  - Department of Labor, Jining Psychiatric Hospital, Jining 272051, China
- Yunfeng Deng
  - Katharine Hsu International Research Institute of Infectious Disease, Shandong Provincial Public Health Clinical Center, Shandong University, Jinan 250013, China
- Zuobin Zhu
  - Department of Genetics, Xuzhou Medical University, Xuzhou 221004, China
|
18
|
Mao Y, Chen H, Xie S, Xu L. Acoustic Assessment of Tone Production of Prelingually-Deafened Mandarin-Speaking Children With Cochlear Implants. Front Neurosci 2020; 14:592954. [PMID: 33250708 PMCID: PMC7673231 DOI: 10.3389/fnins.2020.592954] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/08/2020] [Accepted: 10/12/2020] [Indexed: 11/23/2022] Open
Abstract
Objective The purpose of the present study was to investigate Mandarin tone production performance of prelingually deafened children with cochlear implants (CIs) using modified acoustic analyses and to evaluate the relationship between demographic factors of those CI children and their tone production ability. Methods Two hundred seventy-eight prelingually deafened children with CIs and 173 age-matched normal-hearing (NH) children participated in the study. Thirty-six monosyllabic Mandarin Chinese words were recorded from each subject. The fundamental frequencies (F0) were extracted from the tone tokens. Two acoustic measures (i.e., differentiability and hit rate) were computed based on the F0 onset and offset values (i.e., the tone ellipses of the two-dimensional [2D] method) or the F0 onset, midpoint, and offset values (i.e., the tone ellipsoids of the 3D method). The correlations between the acoustic measures as well as between the methods were examined. The relationship between demographic factors and acoustic measures was also explored. Results The children with CIs showed significantly poorer performance in tone differentiability and hit rate than the NH children. For both CI and NH groups, performance on the two acoustic measures was highly correlated (r values: 0.895–0.961). The performance between the two methods (i.e., 2D and 3D methods) was also highly correlated (r values: 0.774–0.914). Age at implantation and duration of CI use showed a weak correlation with the scores of acoustic measures under both methods. These two factors jointly accounted for 15.4–18.9% of the total variance of tone production performance. Conclusion There were significant deficits in tone production ability in most prelingually deafened children with CIs, even after prolonged use of the devices. The strong correlation between the two methods suggested that the simpler 2D method is sufficient for acoustic assessment of lexical tones in hearing-impaired children. Age at implantation and especially the duration of CI use were significant, although weak, predictors for tone development in pediatric CI users. Although a large part of tone production ability could not be attributed to these two factors, the results still encourage early implantation and continual CI use for better lexical tone development in Mandarin-speaking pediatric CI users.
Affiliation(s)
- Yitao Mao
  - Department of Radiology, Xiangya Hospital, Central South University, Changsha, China
- Hongsheng Chen
  - Department of Otolaryngology-Head and Neck Surgery, Xiangya Hospital, Central South University, Changsha, China
- Shumin Xie
  - Department of Otolaryngology-Head and Neck Surgery, Xiangya Hospital, Central South University, Changsha, China
- Li Xu
  - Communication Sciences and Disorders, Ohio University, Athens, OH, United States
|
19
|
Pal R, Villarreal P, Yu X, Qiu S, Vargas G. Multimodal widefield fluorescence imaging with nonlinear optical microscopy workflow for noninvasive oral epithelial neoplasia detection: a preclinical study. Journal of Biomedical Optics 2020; 25:JBO-200213R. [PMID: 33200597 PMCID: PMC7667429 DOI: 10.1117/1.jbo.25.11.116008] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/07/2020] [Accepted: 10/02/2020] [Indexed: 05/06/2023]
Abstract
SIGNIFICANCE Early detection of epithelial cancers and precancers/neoplasia in the presence of benign lesions is challenging due to the lack of robust in vivo imaging and biopsy guidance techniques. Label-free nonlinear optical microscopy (NLOM) has shown promise for optical biopsy through the detection of cellular and extracellular signatures of neoplasia. Although in vivo microscopy techniques continue to be developed, the surface area imaged in microscopy is limited by the field of view. FDA-approved widefield fluorescence (WF) imaging systems that capture autofluorescence signatures of neoplasia provide molecular information at large fields of view, which may complement the cytologic and architectural information provided by NLOM. AIM A multimodal imaging approach with high-sensitivity WF and high-resolution NLOM was investigated to identify and distinguish image-based features of neoplasia from normal and benign lesions. APPROACH In vivo label-free WF imaging and NLOM were performed in preclinical hamster models of oral neoplasia and inflammation. Analyses of WF imaging, NLOM imaging, and dual modality (WF combined with NLOM) were performed. RESULTS WF imaging showed an increased red-to-green autofluorescence ratio in neoplasia compared to inflammation and normal oral mucosa (p < 0.01). In vivo assessment of the mucosal tissue with NLOM revealed subsurface cytologic (nuclear pleomorphism) and architectural (remodeling of extracellular matrix) atypia in histologically confirmed neoplastic tissue, which were not observed in inflammation or normal mucosa. Univariate and multivariate statistical analysis of macroscopic and microscopic image-based features indicated improved performance (94% sensitivity and 97% specificity) of a multiscale approach over WF alone, even in the presence of benign lesions (inflammation), a common confounding factor in diagnostics. CONCLUSIONS A multimodal imaging approach integrating strengths from WF and NLOM may be beneficial in identifying oral neoplasia. Our study could guide future studies on human oral neoplasia to further evaluate merits and limitations of multimodal workflows and inform the development of multiscale clinical imaging systems.
Affiliation(s)
- Rahul Pal
  - Massachusetts General Hospital and Harvard Medical School, Athinoula A. Martinos Center for Biomedical Imaging, Charlestown, Massachusetts, United States
- Paula Villarreal
  - The University of Texas Medical Branch, Biomedical Engineering and Imaging Sciences Group, Galveston, Texas, United States
  - The University of Texas Medical Branch, Department of Neuroscience, Cell Biology, and Anatomy, Galveston, Texas, United States
- Xiaoying Yu
  - The University of Texas Medical Branch, Department of Preventive Medicine and Population Health, Galveston, Texas, United States
- Suimin Qiu
  - The University of Texas Medical Branch, Department of Pathology, Galveston, Texas, United States
- Gracie Vargas
  - The University of Texas Medical Branch, Biomedical Engineering and Imaging Sciences Group, Galveston, Texas, United States
  - The University of Texas Medical Branch, Department of Neuroscience, Cell Biology, and Anatomy, Galveston, Texas, United States
|
20
|
Abstract
Scientists in biomedical and psychosocial research routinely have to deal with skewed data. When comparing means from two groups, the log transformation is commonly used as a traditional technique to normalize skewed data before applying the two-group t-test. An alternative method that does not assume normality is the generalized linear model (GLM) combined with an appropriate link function. In this work, the two techniques are compared using Monte Carlo simulations; each consists of many iterations that simulate two groups of skewed data for three different sampling distributions: gamma, exponential, and beta. Both methods are then compared regarding Type I error rates, power, and the estimates of the mean differences. We conclude that the t-test with log transformation had superior performance over the GLM method for non-normal data that follow beta or gamma distributions. For exponentially distributed data, by contrast, the GLM method had superior performance over the t-test with log transformation.
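One arm of such a simulation can be sketched with the standard library alone: draw two groups from the same gamma distribution, log-transform, run a pooled-variance t-test, and count rejections to estimate the Type I error rate. Everything specific here is an assumption for illustration (group size 30, shape 2, scale 1, 2000 iterations, the function names, and the hard-coded critical value for 58 degrees of freedom); the GLM arm is omitted because it needs a fitting library.

```python
import math
import random
import statistics

# Two-sided 5% critical value of Student's t with 58 df (valid only for n = 30 per group)
T_CRIT_DF58 = 2.0017


def log_t_statistic(a, b):
    """Pooled-variance two-sample t statistic on log-transformed data."""
    la = [math.log(v) for v in a]
    lb = [math.log(v) for v in b]
    na, nb = len(la), len(lb)
    pooled_var = (
        (na - 1) * statistics.variance(la) + (nb - 1) * statistics.variance(lb)
    ) / (na + nb - 2)
    return (statistics.mean(la) - statistics.mean(lb)) / math.sqrt(
        pooled_var * (1 / na + 1 / nb)
    )


def type_one_error(n_iter=2000, n=30, shape=2.0, scale=1.0, seed=1):
    """Fraction of null simulations (identical gamma groups) that reject at 5%."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_iter):
        a = [rng.gammavariate(shape, scale) for _ in range(n)]
        b = [rng.gammavariate(shape, scale) for _ in range(n)]
        if abs(log_t_statistic(a, b)) > T_CRIT_DF58:
            rejections += 1
    return rejections / n_iter


rate = type_one_error()
print(rate)
```

Power would be estimated the same way by giving the two groups different parameters and counting rejections; changing `n` requires replacing the critical value accordingly.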
|
21
|
Rachid Zaim S, Kenost C, Berghout J, Chiu W, Wilson L, Zhang HH, Lussier YA. binomialRF: interpretable combinatoric efficiency of random forests to identify biomarker interactions. BMC Bioinformatics 2020; 21:374. [PMID: 32859146 PMCID: PMC7456085 DOI: 10.1186/s12859-020-03718-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/26/2020] [Accepted: 08/19/2020] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND In this era of data science-driven bioinformatics, machine learning research has focused on feature selection as users want more interpretation and post-hoc analyses for biomarker detection. However, when a study has more features (i.e., transcripts) than samples (i.e., mice or human samples), biomarker detection poses major statistical challenges because traditional statistical techniques are underpowered in high dimensions. Second- and third-order interactions of these features pose a substantial combinatoric dimensional challenge. In computational biology, random forest (RF) classifiers are widely used due to their flexibility, powerful performance, their ability to rank features, and their robustness to the "P >> N" high-dimensional limitation that many matrix regression algorithms face. We propose binomialRF, a feature selection technique in RFs that provides an alternative interpretation for features using a correlated binomial distribution and scales efficiently to analyze multiway interactions. RESULTS In both simulations and validation studies using datasets from the TCGA and UCI repositories, binomialRF showed computational gains (5 to 300 times faster) while maintaining competitive variable precision and recall in identifying biomarkers' main effects and interactions. In two clinical studies, the binomialRF algorithm prioritizes previously published relevant pathological molecular mechanisms (features) with high classification precision and recall using features alone, as well as with their statistical interactions alone. CONCLUSION binomialRF extends previous methods for identifying interpretable features in RFs and brings them together under a correlated binomial distribution to create an efficient hypothesis testing algorithm that identifies biomarkers' main effects and interactions. Preliminary results in simulations demonstrate computational gains while retaining competitive model selection and classification accuracies. Future work will extend this framework to incorporate ontologies that provide pathway-level feature selection from gene expression input data.
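The hypothesis-testing idea can be illustrated with a deliberately simplified version: treat each tree as a trial, count how often a feature is chosen for a key split, and compare that count with what chance selection would give. This sketch assumes independent trials and a chance probability of 1/P, whereas the abstract describes a correlated binomial distribution, so it is not the binomialRF algorithm itself; the function name, the choice of root splits as the counted event, and all counts are hypothetical.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical forest: 500 trees, 100 candidate features
n_trees = 500
n_features = 100
chance_prob = 1 / n_features      # assumed probability of selection by chance
times_selected_at_root = 15       # hypothetical observed count for one feature

p_value = binomial_upper_tail(times_selected_at_root, n_trees, chance_prob)
print(p_value)
```

In a real analysis the p-values for all features would also need a multiple-testing adjustment, and the dependence between trees grown on overlapping data is what the correlated binomial model is meant to address.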
Affiliation(s)
- Samir Rachid Zaim
  - Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences, 1230 N. Cherry Ave, Tucson, AZ, 85721, USA
  - The Graduate Interdisciplinary Program in Statistics, The University of Arizona, 617 N. Santa Rita Ave., Tucson, AZ, 85721, USA
  - College of Medicine, Tucson, 1501 N. Campbell Ave, Tucson, AZ, 85721, USA
- Colleen Kenost
  - Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences, 1230 N. Cherry Ave, Tucson, AZ, 85721, USA
  - College of Medicine, Tucson, 1501 N. Campbell Ave, Tucson, AZ, 85721, USA
- Joanne Berghout
  - Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences, 1230 N. Cherry Ave, Tucson, AZ, 85721, USA
  - College of Medicine, Tucson, 1501 N. Campbell Ave, Tucson, AZ, 85721, USA
- Wesley Chiu
  - Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences, 1230 N. Cherry Ave, Tucson, AZ, 85721, USA
  - College of Medicine, Tucson, 1501 N. Campbell Ave, Tucson, AZ, 85721, USA
- Liam Wilson
  - Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences, 1230 N. Cherry Ave, Tucson, AZ, 85721, USA
  - College of Medicine, Tucson, 1501 N. Campbell Ave, Tucson, AZ, 85721, USA
- Hao Helen Zhang
  - Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences, 1230 N. Cherry Ave, Tucson, AZ, 85721, USA
  - The Graduate Interdisciplinary Program in Statistics, The University of Arizona, 617 N. Santa Rita Ave., Tucson, AZ, 85721, USA
  - Department of Mathematics, College of Sciences, The University of Arizona, 617 N. Santa Rita Ave., Tucson, AZ, 85721, USA
- Yves A Lussier
  - Center for Biomedical Informatics and Biostatistics, University of Arizona Health Sciences, 1230 N. Cherry Ave, Tucson, AZ, 85721, USA
  - The Graduate Interdisciplinary Program in Statistics, The University of Arizona, 617 N. Santa Rita Ave., Tucson, AZ, 85721, USA
  - College of Medicine, Tucson, 1501 N. Campbell Ave, Tucson, AZ, 85721, USA
  - The Center for Applied Genetic and Genomic Medicine, 1295 N. Martin, Tucson, AZ, 85721, USA
  - The University of Arizona Cancer Center, 3838 N. Campbell Ave, Tucson, AZ, 85721, USA
  - The University of Arizona BIO5 Institute, 1657 E. Helen Street, Tucson, AZ, 85721, USA
|
22
|
A novel serum miRNA-pair classifier for diagnosis of sarcoma. PLoS One 2020; 15:e0236097. [PMID: 32673360 PMCID: PMC7365454 DOI: 10.1371/journal.pone.0236097] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/12/2019] [Accepted: 06/30/2020] [Indexed: 11/19/2022] Open
Abstract
Soft tissue sarcomas (STS) are a group of rare malignant tumors originating from the mesoderm. Early diagnosis is important for the prognosis of sarcoma; however, no mature, non-invasive diagnostic method currently exists. MicroRNAs (miRNAs) are a class of noncoding RNAs whose expression varies greatly, especially during tumor activity. The purpose of this study was to construct a predictive model for the diagnosis of sarcomas based on the relative expression level of miRNA in serum. miRNA array expression data from 677 samples, including 402 malignant sarcoma samples and 275 healthy samples, were used to construct the prediction model. Based on 6 gene pairs, a random generalized linear model (RGLM) was constructed. In predicting whether a serum sample was obtained from a sarcoma patient, it achieved an accuracy of 100% in the internal test dataset and 74.3% in the merged external dataset, with a specificity of 100% in the internal test dataset and 90.5% in the external dataset. In conclusion, our serum miRNA-pair classifier has the potential to be used for the screening of sarcoma with high accuracy and specificity.
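A pair-based classifier of this kind typically starts from within-sample comparisons: each pair becomes a binary feature recording which of the two miRNAs is more highly expressed in that sample. The sketch below shows only that feature construction, under the assumption that "relative expression level" means such pairwise ordering; the miRNA names and values are hypothetical placeholders, and the RGLM fitting step is not shown.

```python
def pair_features(expression: dict, pairs: list) -> list:
    """1 if the first miRNA of a pair is more highly expressed than the second, else 0."""
    return [1 if expression[a] > expression[b] else 0 for a, b in pairs]


# Hypothetical pairs and one hypothetical serum sample (log-scale expression)
pairs = [("miR-A", "miR-B"), ("miR-C", "miR-D")]
sample = {"miR-A": 7.2, "miR-B": 5.1, "miR-C": 3.3, "miR-D": 4.8}

features = pair_features(sample, pairs)
print(features)
```

Because only the ordering within a sample matters, adding the same offset to every value in a sample leaves the features unchanged, which is one reason such features tend to transfer across datasets measured on different scales.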
|
23
|
Verstockt B, Verstockt S, Veny M, Dehairs J, Arnauts K, Van Assche G, De Hertogh G, Vermeire S, Salas A, Ferrante M. Expression Levels of 4 Genes in Colon Tissue Might Be Used to Predict Which Patients Will Enter Endoscopic Remission After Vedolizumab Therapy for Inflammatory Bowel Diseases. Clin Gastroenterol Hepatol 2020; 18:1142-1151.e10. [PMID: 31446181 PMCID: PMC7196933 DOI: 10.1016/j.cgh.2019.08.030] [Citation(s) in RCA: 49] [Impact Index Per Article: 12.3] [Received: 07/20/2019] [Accepted: 08/09/2019] [Indexed: 02/06/2023]
Abstract
BACKGROUND & AIMS We aimed to identify biomarkers that might be used to predict responses of patients with inflammatory bowel diseases (IBD) to vedolizumab therapy. METHODS We obtained biopsies from inflamed colon of patients with IBD who began treatment with vedolizumab (n = 31) or tumor necrosis factor (TNF) antagonists (n = 20) and performed RNA-sequencing analyses. We compared gene expression patterns between patients who did and did not enter endoscopic remission (absence of ulcerations at month 6 for patients with Crohn's disease or Mayo endoscopic subscore ≤1 at week 14 for patients with ulcerative colitis) and performed pathway analysis and cell deconvolution for training (n = 20) and validation (n = 11) datasets. Colon biopsies were also analyzed by immunohistochemistry. We validated a baseline gene expression pattern associated with endoscopic remission after vedolizumab therapy using 3 independent datasets (n = 66). RESULTS We identified significant differences in expression levels of 44 genes between patients who entered remission after vedolizumab and those who did not; we found significant increases in leukocyte migration in colon tissues from patients who did not enter remission (P < .006). Deconvolution methods identified a significant enrichment of monocytes (P = .005), M1-macrophages (P = .05), and CD4+ T cells (P = .008) in colon tissues from patients who did not enter remission, whereas colon tissues from patients in remission had higher numbers of naïve B cells before treatment (P = .05). Baseline expression levels of PIWIL1, MAATS1, RGS13, and DCHS2 identified patients who did vs did not enter remission with 80% accuracy in the training set and 100% accuracy in validation dataset 1. We validated these findings in the 3 independent datasets by microarray, RNA sequencing and quantitative PCR analysis (P = .003). Expression levels of these 4 genes did not associate with response to anti-TNF agents. 
We confirmed the presence of proteins encoded by mRNAs using immunohistochemistry. CONCLUSIONS We identified 4 genes whose baseline expression levels in colon tissues of patients with IBD associate with endoscopic remission after vedolizumab, but not anti-TNF, treatment. We validated this signature in 4 independent datasets and also at the protein level. Studies of these genes might provide insights into the mechanisms of action of vedolizumab.
Affiliation(s)
- Bram Verstockt
- Department of Gastroenterology and Hepatology, University Hospitals Leuven, KU Leuven, Leuven, Belgium; Translational Research Center for Gastrointestinal Disorders, Department of Chronic Disease, Metabolism and Ageing, KU Leuven, Leuven, Belgium
- Sare Verstockt
- Laboratory for Complex Genetics, Department of Human Genetics, KU Leuven, Leuven, Belgium
- Marisol Veny
- Department of Gastroenterology, Institut d’Investigacions Biomèdiques August Pi i Sunyer, Hospital Clínic, Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas, Barcelona, Spain
- Jonas Dehairs
- Laboratory of Lipid Metabolism and Cancer, Department of Oncology, KU Leuven, Leuven, Belgium
- Kaline Arnauts
- Translational Research Center for Gastrointestinal Disorders, Department of Chronic Disease, Metabolism and Ageing, KU Leuven, Leuven, Belgium; Stem Cell Institute Leuven, Department of Development and Regeneration, KU Leuven, Leuven, Belgium
- Gert Van Assche
- Department of Gastroenterology and Hepatology, University Hospitals Leuven, KU Leuven, Leuven, Belgium; Translational Research Center for Gastrointestinal Disorders, Department of Chronic Disease, Metabolism and Ageing, KU Leuven, Leuven, Belgium
- Gert De Hertogh
- Translational Cell & Tissue Research Unit, Department of Imaging & Pathology, KU Leuven, Leuven, Belgium
- Séverine Vermeire
- Department of Gastroenterology and Hepatology, University Hospitals Leuven, KU Leuven, Leuven, Belgium; Translational Research Center for Gastrointestinal Disorders, Department of Chronic Disease, Metabolism and Ageing, KU Leuven, Leuven, Belgium
- Azucena Salas
- Department of Gastroenterology, Institut d’Investigacions Biomèdiques August Pi i Sunyer, Hospital Clínic, Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas, Barcelona, Spain
- Marc Ferrante
- Department of Gastroenterology and Hepatology, University Hospitals Leuven, KU Leuven, Leuven, Belgium; Translational Research Center for Gastrointestinal Disorders, Department of Chronic Disease, Metabolism and Ageing, KU Leuven, Leuven, Belgium.
|
24
|
Khamitova AF, Lakman IA, Akhmetvaleev RR, Tulbaev EL, Gareeva DF, Zagidullin SZ, Zagidullin NS. [Multifactor predictive model in patients with myocardial infarction based on modern biomarkers]. Kardiologiia 2020; 60:14-20. [PMID: 32375611 DOI: 10.18087/cardio.2020.3.2593] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/30/2020] [Accepted: 04/30/2020] [Indexed: 11/18/2022]
Abstract
Objective To study the prognostic role of current serum biomarkers in patients with myocardial infarction (MI) by constructing a multifactorial model for prediction of cardiovascular complications (CVC) in remote MI. Acute coronary syndrome is a major cause of death and disability in the Russian Federation. Introduction of current biomarkers, such as N-terminal pro-brain natriuretic peptide (NT-proBNP), stimulating growth factor (ST2), and pentraxin-3 (Ptx-3), provides more possibilities for diagnostics and calculation of risk for CVC. Materials and Methods Concentrations of biomarkers were measured in 180 patients with MI (mean age, 61.4±1.7 years) upon admission. At one year, specific and composite endpoints were determined (MI, acute cerebrovascular disease, admission for cardiovascular disease, and cardiovascular death). Based on this information, a prognostic model for subsequent events was developed. Results A mathematical model was created for computing the development of a composite endpoint. In this model, the biomarkers NT-proBNP, Ptx-3 and, to a lesser extent, ST2 demonstrated their prognostic significance in diagnosis of CVC with a sensitivity of 78.79% and specificity of 86.67% (area under the curve, AUC 0.73). Conclusion In patients with remote MI, the biomarkers NT-proBNP, ST2, and Ptx-3 improve prediction of CVC.
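The sensitivity and specificity quoted in the Results are ratios of confusion-matrix counts. A minimal sketch of that calculation follows. The function name and the counts are hypothetical: the abstract does not report the underlying counts, and the ones below were picked only so that the ratios round to the reported percentages.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of patients with an event that the model flags
    specificity = tn / (tn + fp)  # share of event-free patients that the model clears
    return sensitivity, specificity

# Hypothetical counts for illustration only (not taken from the study).
sens, spec = sensitivity_specificity(tp=26, fn=7, tn=13, fp=2)
print(f"sensitivity = {sens:.2%}, specificity = {spec:.2%}")
```

Both quantities depend on the decision threshold applied to the model output, which is why they are usually reported together with a threshold-free summary such as the AUC.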
Affiliation(s)
- E L Tulbaev
- Bashkir State Medical University; Municipal Clinical Hospital #21
- Sh Z Zagidullin
- Federal State Budgetary Educational Institution of Higher Education "Bashkir State Medical University" of the Ministry of Health of Russia
- N Sh Zagidullin
- Federal State Budgetary Educational Institution of Higher Education "Bashkir State Medical University" of the Ministry of Health of Russia; State Budgetary Healthcare Institution of the Republic of Bashkortostan "City Clinical Hospital No. 21"
|
25
|
Webb CA, Cohen ZD, Beard C, Forgeard M, Peckham AD, Björgvinsson T. Personalized prognostic prediction of treatment outcome for depressed patients in a naturalistic psychiatric hospital setting: A comparison of machine learning approaches. J Consult Clin Psychol 2020; 88:25-38. [PMID: 31841022 DOI: 10.1037/ccp0000451] [Citation(s) in RCA: 41] [Impact Index Per Article: 10.3] [Indexed: 12/19/2022]
Abstract
OBJECTIVE Research on predictors of treatment outcome in depression has largely derived from randomized clinical trials involving strict standardization of treatments, stringent patient exclusion criteria, and careful selection and supervision of study clinicians. The extent to which findings from such studies generalize to naturalistic psychiatric settings is unclear. This study sought to predict depression outcomes for patients seeking treatment within an intensive psychiatric hospital setting, while comparing the performance of a range of machine learning approaches. METHOD Depressed patients (N = 484; ages 18-72; 89% White) receiving treatment within a psychiatric partial hospital program delivering pharmacotherapy and cognitive behavioral therapy were split into a training sample and a holdout sample. First, within the training sample, 51 pretreatment variables were submitted to 13 machine learning algorithms to predict, via cross-validation, posttreatment Patient Health Questionnaire-9 depression scores. Second, the best performing modeling approach (lowest mean squared error; MSE) from the training sample was selected to predict outcome in the holdout sample. RESULTS The best performing model in the training sample was elastic net regularization (ENR; MSE = 20.49, R² = .28), which had comparable performance in the holdout sample (MSE = 11.26; R² = .38). There were 14 pretreatment variables that predicted outcome. To demonstrate the translation of an ENR model to personalized prediction of treatment outcome, a patient-specific prognosis calculator is presented. CONCLUSIONS Informed by pretreatment patient characteristics, such predictive models could be used to communicate prognosis to clinicians and to guide treatment planning. Identified predictors of poor prognosis may suggest important targets for intervention. (PsycINFO Database Record (c) 2019 APA, all rights reserved).
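The two-step procedure in the METHOD section (compare candidate models by cross-validated MSE on the training sample, then score the winner once on the holdout sample) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the data are synthetic, and the two toy candidates (`fit_mean`, `fit_line`) stand in for the 13 algorithms, which are not reproduced here.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SSE / SST."""
    mean_y = sum(y_true) / len(y_true)
    sst = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - mse(y_true, y_pred) * len(y_true) / sst

def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Second candidate: ordinary least squares on one predictor."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def cv_mse(fit, xs, ys, k=5):
    """Average held-out MSE over k interleaved folds."""
    errors = []
    for fold in range(k):
        held = list(range(fold, len(xs), k))
        kept = [i for i in range(len(xs)) if i % k != fold]
        model = fit([xs[i] for i in kept], [ys[i] for i in kept])
        errors.append(mse([ys[i] for i in held], [model(xs[i]) for i in held]))
    return sum(errors) / k

# Synthetic data (hypothetical, for illustration only).
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [2 + 1.5 * x + rng.gauss(0, 1) for x in xs]
train_x, train_y, hold_x, hold_y = xs[:45], ys[:45], xs[45:], ys[45:]

candidates = {"mean": fit_mean, "line": fit_line}
cv_scores = {name: cv_mse(fit, train_x, train_y) for name, fit in candidates.items()}
best = min(cv_scores, key=cv_scores.get)          # lowest cross-validated MSE wins
final_model = candidates[best](train_x, train_y)  # refit on the full training sample
hold_pred = [final_model(x) for x in hold_x]
print(best, round(mse(hold_y, hold_pred), 2), round(r_squared(hold_y, hold_pred), 2))
```

Keeping the holdout sample out of model selection is the point of the design: the holdout MSE and R² are then an estimate of performance on new patients rather than a by-product of the selection step.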
Affiliation(s)
- Christian A Webb
- Department of Psychiatry, Harvard Medical School/McLean Hospital
- Zachary D Cohen
- Department of Psychology, University of California, Los Angeles
- Courtney Beard
- Department of Psychiatry, Harvard Medical School/McLean Hospital
- Marie Forgeard
- Department of Clinical Psychology, William James College
- Andrew D Peckham
- Department of Psychiatry, Harvard Medical School/McLean Hospital
|
26
|
Hoque F, Hu B, Wang JG, Hall GB. Use of geospatial methods to characterize dispersion of the Emerald ash borer in southern Ontario, Canada. ECOL INFORM 2020. [DOI: 10.1016/j.ecoinf.2019.101037] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/10/2023]
|
27
|
Beyond metaphors and semantics: A framework for causal inference in neuroscience. Behav Brain Sci 2019; 42:e230. [PMID: 31775938 DOI: 10.1017/s0140525x19001389] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]
Abstract
The long-enduring coding metaphor is deemed problematic because it imbues correlational evidence with causal power. In neuroscience, most research is correlational or conditionally correlational; this research, in aggregate, informs causal inference. Rather than prescribing semantics used in correlational studies, it would be useful for neuroscientists to focus on a constructive syntax to guide principled causal inference.
|
28
|
Chow C, Andrášik R, Fischer B, Keiler M. Application of statistical techniques to proportional loss data: Evaluating the predictive accuracy of physical vulnerability to hazardous hydro-meteorological events. JOURNAL OF ENVIRONMENTAL MANAGEMENT 2019; 246:85-100. [PMID: 31176183 DOI: 10.1016/j.jenvman.2019.05.084] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 11/28/2018] [Revised: 05/09/2019] [Accepted: 05/21/2019] [Indexed: 06/09/2023]
Abstract
Knowledge about the cause of differential structural damages following the occurrence of hazardous hydro-meteorological events can inform more effective risk management and spatial planning solutions. While studies have been previously conducted to describe relationships between physical vulnerability and features about building properties, the immediate environment and event intensity proxies, several key challenges remain. In particular, observations, especially those associated with high magnitude events, and studies designed to evaluate a comprehensive range of predictive features are both limited. To build upon previous developments, we described a workflow to support the continued development and assessment of empirical, multivariate physical vulnerability functions based on predictive accuracy. Within this workflow, we evaluated several statistical approaches, namely generalized linear models and their more complex alternatives. A series of models were built 1) to explicitly consider the effects of dimension reduction, 2) to evaluate the inclusion of interaction effects between and among predictors, 3) to evaluate an ensemble prediction method for applications where data observations are sparse, 4) to describe how model results can inform about the relative importance of predictors to explain variance in expected damages and 5) to assess the predictive accuracy of the models based on prescribed metrics. The utility of the workflow was demonstrated on data with characteristics of what is commonly acquired in ex-post field assessments. The workflow and recommendations from this study aim to provide guidance to researchers and practitioners in the natural hazards community.
Affiliation(s)
- Candace Chow
- University of Bern, Geography Institute, Hallerstrasse 12, 3012, Bern, Switzerland.
- Richard Andrášik
- CDV Transport Research Centre, Líšeňská 33a, 63600, Brno, Czech Republic.
- Benjamin Fischer
- Geoformer Igp AG, Sebastiansplatz 1, 3900, Brig-Glis, Switzerland.
- Margreth Keiler
- University of Bern, Geography Institute, Hallerstrasse 12, 3012, Bern, Switzerland.
|
29
|
Hosni M, Abnane I, Idri A, Carrillo de Gea JM, Fernández Alemán JL. Reviewing ensemble classification methods in breast cancer. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2019; 177:89-112. [PMID: 31319964 DOI: 10.1016/j.cmpb.2019.05.019] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Received: 02/07/2019] [Revised: 05/16/2019] [Accepted: 05/18/2019] [Indexed: 05/09/2023]
Abstract
CONTEXT Ensemble methods consist of combining more than one single technique to solve the same task. This approach was designed to overcome the weaknesses of single techniques and consolidate their strengths. Ensemble methods are now widely used to carry out prediction tasks (e.g. classification and regression) in several fields, including that of bioinformatics. Researchers have particularly begun to employ ensemble techniques to improve research into breast cancer, as this is the most frequent type of cancer and accounts for most of the deaths among women. OBJECTIVE AND METHOD The goal of this study is to analyse the state of the art in ensemble classification methods when applied to breast cancer as regards 9 aspects: publication venues, medical tasks tackled, empirical and research types adopted, types of ensembles proposed, single techniques used to construct the ensembles, validation framework adopted to evaluate the proposed ensembles, tools used to build the ensembles, and optimization methods used for the single techniques. This paper was undertaken as a systematic mapping study. RESULTS A total of 193 papers that were published from the year 2000 onwards, were selected from four online databases: IEEE Xplore, ACM digital library, Scopus and PubMed. This study found that of the six medical tasks that exist, the diagnosis medical task was that most frequently researched, and that the experiment-based empirical type and evaluation-based research type were the most dominant approaches adopted in the selected studies. The homogeneous type was that most widely used to perform the classification task. With regard to single techniques, this mapping study found that decision trees, support vector machines and artificial neural networks were those most frequently adopted to build ensemble classifiers. 
In the case of the evaluation framework, the Wisconsin Breast Cancer dataset was the most frequently used by researchers to perform their experiments, while the most noticeable validation method was k-fold cross-validation. Several tools are available to perform experiments related to ensemble classification methods, such as Weka and R Software. Few researchers took into account the optimisation of the single technique of which their proposed ensemble was composed, while the grid search method was that most frequently adopted to tune the parameter settings of a single classifier. CONCLUSION This paper reports an in-depth study of the application of ensemble methods as regards breast cancer. Our results show that there are several gaps and issues and we, therefore, provide researchers in the field of breast cancer research with recommendations. Moreover, after analysing the papers found in this systematic mapping study, we discovered that the majority report positive results concerning the accuracy of ensemble classifiers when compared to the single classifiers. In order to aggregate the evidence reported in literature, it will, therefore, be necessary to perform a systematic literature review and meta-analysis in which an in-depth analysis could be conducted so as to confirm the superiority of ensemble classifiers over the classical techniques.
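As the CONTEXT section notes, an ensemble combines several single techniques on the same task. One simple combination rule is majority voting, sketched below. The three classifiers' outputs are hypothetical label lists invented for illustration; the abstract does not prescribe this particular rule, and the reviewed papers use many others (bagging, boosting, stacking).

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample."""
    combined = []
    for votes in zip(*predictions):  # votes = labels given to one sample
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical outputs of three single classifiers on four samples.
clf_a = ["malignant", "benign", "benign", "malignant"]
clf_b = ["malignant", "malignant", "benign", "benign"]
clf_c = ["benign", "benign", "benign", "malignant"]

ensemble = majority_vote([clf_a, clf_b, clf_c])
print(ensemble)
```

With an odd number of voters and two classes there are no ties; with more classes or an even number of voters a tie-breaking rule would have to be chosen explicitly.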
Affiliation(s)
- Mohamed Hosni
- Software Project Management Research Team, ENSIAS, University Mohammed V of Rabat, Morocco.
- Ibtissam Abnane
- Software Project Management Research Team, ENSIAS, University Mohammed V of Rabat, Morocco.
- Ali Idri
- Software Project Management Research Team, ENSIAS, University Mohammed V of Rabat, Morocco.
- Juan M Carrillo de Gea
- Department of Informatics and Systems, Faculty of Computer Science, University of Murcia, Spain.
|
30
|
Affiliation(s)
- Laks Lakshmanan
- Department of Computer Science, University of British Columbia, Vancouver, BC, Canada
- Ezequiel Smucler
- Department of Mathematics and Statistics, Universidad Torcuato Di Tella, Buenos Aires, Buenos Aires, Argentina
- Ruben Zamar
- Department of Statistics, University of British Columbia, Vancouver, BC, Canada
|
31
|
Development of predictive models for all individual questions of SRS-22R after adult spinal deformity surgery: a step toward individualized medicine. EUROPEAN SPINE JOURNAL : OFFICIAL PUBLICATION OF THE EUROPEAN SPINE SOCIETY, THE EUROPEAN SPINAL DEFORMITY SOCIETY, AND THE EUROPEAN SECTION OF THE CERVICAL SPINE RESEARCH SOCIETY 2019; 28:1998-2011. [DOI: 10.1007/s00586-019-06079-x] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 09/17/2018] [Revised: 06/10/2019] [Accepted: 07/14/2019] [Indexed: 10/26/2022]
|
32
|
Klales A, Ousley S, Vollner J. Response to multivariate ordinal probit analysis in the skeletal assessment of sex (Konigsberg and Frankenberg). AMERICAN JOURNAL OF PHYSICAL ANTHROPOLOGY 2019; 169:388-389. [DOI: 10.1002/ajpa.23830] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 03/07/2019] [Revised: 03/17/2019] [Accepted: 03/20/2019] [Indexed: 11/10/2022]
Affiliation(s)
- Stephen Ousley
- Department of Computing and Information Science, Mercyhurst University, Erie, Pennsylvania
|
33
|
An extensive experimental survey of regression methods. Neural Netw 2018; 111:11-34. [PMID: 30654138 DOI: 10.1016/j.neunet.2018.12.010] [Citation(s) in RCA: 62] [Impact Index Per Article: 10.3] [Received: 07/16/2018] [Revised: 11/21/2018] [Accepted: 12/11/2018] [Indexed: 01/22/2023]
Abstract
Regression is a very relevant problem in machine learning, with many different available approaches. The current work presents a comparison of a large collection composed of 77 popular regression models which belong to 19 families: linear and generalized linear models, generalized additive models, least squares, projection methods, LASSO and ridge regression, Bayesian models, Gaussian processes, quantile regression, nearest neighbors, regression trees and rules, random forests, bagging and boosting, neural networks, deep learning and support vector regression. These methods are evaluated using all the regression datasets of the UCI machine learning repository (83 datasets), with some exceptions due to technical reasons. The experimental work identifies several outstanding regression models: the M5 rule-based model with corrections based on nearest neighbors (cubist), the gradient boosted machine (gbm), the boosting ensemble of regression trees (bstTree) and the M5 regression tree. Cubist achieves the best squared correlation (R²) in 15.7% of datasets and is very near to it, with a difference below 0.2, for 89.1% of datasets; the median of these differences over the dataset collection is very low (0.0192), compared e.g. to classical linear regression (0.150). However, cubist is slow and fails on several large datasets, while other similar regression models such as M5 never fail, and M5's difference to the best R² is below 0.2 for 92.8% of datasets. Other well-performing regression models are the committee of neural networks (avNNet), extremely randomized regression trees (extraTrees, which achieves the best R² in 33.7% of datasets), random forest (rf) and ε-support vector regression (svr), but they are slower and fail on several datasets. The fastest regression model is least angle regression (lars), which is 70 and 2,115 times faster than M5 and cubist, respectively. The model which requires the least memory is non-negative least squares (nnls), about 2 GB, similar to cubist, while M5 requires about 8 GB. For 97.6% of datasets there is a regression model among the 10 best which is very near (difference below 0.1) to the best R², which increases to 100% when allowing differences of 0.2. Therefore, provided that our dataset and model collection are representative enough, the main conclusion of this study is that, for a new regression problem, some model in our top-10 should achieve an R² near the best attainable for that problem.
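The comparison metric used throughout this abstract (for how many datasets a model's R² lies within a tolerance of the best R² achieved by any model) can be sketched as below. The model names and R² values are hypothetical placeholders, not results from the survey, and the function name `share_within` is an assumption made for this example.

```python
def share_within(r2_by_model, model, tolerance):
    """Fraction of datasets where `model` is within `tolerance` of the best R2."""
    n_datasets = len(r2_by_model[model])
    hits = 0
    for d in range(n_datasets):
        best = max(scores[d] for scores in r2_by_model.values())
        if best - r2_by_model[model][d] < tolerance:
            hits += 1
    return hits / n_datasets

# Hypothetical R2 values: one list per model, one entry per dataset.
r2_by_model = {
    "model_a": [0.90, 0.40, 0.75, 0.60],
    "model_b": [0.85, 0.70, 0.80, 0.30],
    "model_c": [0.80, 0.65, 0.78, 0.62],
}

for name in r2_by_model:
    print(name, share_within(r2_by_model, name, tolerance=0.2))
```

In this toy table `model_c` is never the best on more than one dataset yet is always close to the best, which mirrors the abstract's distinction between winning outright and being reliably near the top.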
|
34
|
Urda D, Aragón F, Bautista R, Franco L, Veredas FJ, Claros MG, Jerez JM. BLASSO: integration of biological knowledge into a regularized linear model. BMC SYSTEMS BIOLOGY 2018; 12:94. [PMID: 30458775 PMCID: PMC6245593 DOI: 10.1186/s12918-018-0612-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/21/2022]
Abstract
Background In RNA-Seq gene expression analysis, a genetic signature or biomarker is defined as a subset of genes that is probably involved in a given complex human trait and usually provides predictive capabilities for that trait. The discovery of new genetic signatures is challenging, as it entails the analysis of complex-nature information encoded at gene level. Moreover, biomarker selection becomes unstable, since high correlation usually exists among the thousands of genes included in each sample, which leads to very low overlap between the genetic signatures proposed by different authors. In this sense, this paper proposes BLASSO, a simple and highly interpretable linear model with l1-regularization that incorporates prior biological knowledge into the prediction of breast cancer outcomes. Two different approaches to integrate biological knowledge in BLASSO, Gene-specific and Gene-disease, are proposed to test their predictive performance and biomarker stability on a public RNA-Seq gene expression dataset for breast cancer. The relevance of the genetic signature for the model is inspected by a functional analysis. Results BLASSO has been compared with a baseline LASSO model. Using 10-fold cross-validation with 100 repetitions for model assessment, average AUC values of 0.7 and 0.69 were obtained for the Gene-specific and the Gene-disease approaches, respectively. These efficacy rates outperform the average AUC of 0.65 obtained with the LASSO. With respect to the stability of the genetic signatures found, BLASSO outperformed the baseline model in terms of the robustness index (RI). The Gene-specific approach gave an RI of 0.15±0.03, compared to an RI of 0.09±0.03 given by LASSO, thus being about 66% more robust. The functional analysis performed on the genetic signature obtained with the Gene-disease approach showed a significant presence of genes related to cancer, as well as one gene (IFNK) and one pseudogene (PCNAP1) which a priori had not been described as related to cancer. Conclusions BLASSO has been shown to be a good choice in terms of both predictive efficacy and biomarker stability when compared to other similar approaches. Further functional analyses of the genetic signatures obtained with BLASSO have revealed not only genes with important roles in cancer, but also genes that may play an unknown or collateral role in the studied disease.
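The robustness comparison in the Results can be checked by hand from the two reported mean RI values. Taking the point estimates only (ignoring the ±0.03 spread is a simplifying assumption made here, not something the abstract states), the relative gain of the Gene-specific approach over LASSO is:

```latex
\[
\frac{\mathrm{RI}_{\text{BLASSO}} - \mathrm{RI}_{\text{LASSO}}}{\mathrm{RI}_{\text{LASSO}}}
  = \frac{0.15 - 0.09}{0.09}
  = \frac{0.06}{0.09}
  \approx 0.67
\]
```

That is roughly two thirds, in line with the figure of about 66% quoted in the abstract.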
Affiliation(s)
- Daniel Urda
- Universidad de Cádiz, Departamento de Ingeniería Informática, Avda. de la Universidad de Cádiz n°10, Puerto Real, Cádiz, 11519, Spain.
- Francisco Aragón
- Universidad de Málaga, Departamento de Lenguajes y Ciencias de la Computación, Bulevar Louis Pasteur, 35. Campus de Teatinos, Málaga, 29071, Spain
- Rocío Bautista
- Universidad de Málaga, Plataforma Andaluza de Bioinformática, Parque Tecnológico de Andalucía, Calle Severo Ochoa 34, Málaga, 29590, Spain
- Leonardo Franco
- Instituto de Investigación Biomédica de Málaga (IBIMA), Inteligencia Computacional en Biomedicina, Avda. Jorge Luis Borges n°15 Bl.3 Pl.3, Málaga, 29010, Spain; Universidad de Málaga, Departamento de Lenguajes y Ciencias de la Computación, Bulevar Louis Pasteur, 35. Campus de Teatinos, Málaga, 29071, Spain
- Francisco J Veredas
- Instituto de Investigación Biomédica de Málaga (IBIMA), Inteligencia Computacional en Biomedicina, Avda. Jorge Luis Borges n°15 Bl.3 Pl.3, Málaga, 29010, Spain; Universidad de Málaga, Departamento de Lenguajes y Ciencias de la Computación, Bulevar Louis Pasteur, 35. Campus de Teatinos, Málaga, 29071, Spain
- Manuel Gonzalo Claros
- Universidad de Málaga, Departamento de Biología Molecular y Bioquímica, Facultad de Ciencias, Campus Universitario de Teatinos, Málaga, 29071, Spain
- José Manuel Jerez
- Instituto de Investigación Biomédica de Málaga (IBIMA), Inteligencia Computacional en Biomedicina, Avda. Jorge Luis Borges n°15 Bl.3 Pl.3, Málaga, 29010, Spain; Universidad de Málaga, Departamento de Lenguajes y Ciencias de la Computación, Bulevar Louis Pasteur, 35. Campus de Teatinos, Málaga, 29071, Spain
|
35
|
Estimating Forest Canopy Cover in Black Locust (Robinia pseudoacacia L.) Plantations on the Loess Plateau Using Random Forest. FORESTS 2018. [DOI: 10.3390/f9100623] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 11/16/2022]
Abstract
The forest canopy is the medium for energy and mass exchange between forest ecosystems and the atmosphere. Remote sensing techniques are more efficient and appropriate for estimating forest canopy cover (CC) than traditional methods, especially at large scales. In this study, we evaluated the CC of black locust plantations on the Loess Plateau using random forest (RF) regression models. The models were established using the relationships between digital hemispherical photograph (DHP) field data and variables that were calculated from satellite images. Three types of variables were calculated from the satellite data: spectral variables calculated from a multispectral image, textural variables calculated from a panchromatic image (Tpan) with a 15 × 15 window size, and textural variables calculated from spectral variables (TB+VIs) with a 9 × 9 window size. We compared different mtry and ntree values to find the most suitable parameters for the RF models. The results indicated that the RF model of spectral variables explained 57% (root mean square error (RMSE) = 0.06) of the variability in the field CC data. The soil-adjusted vegetation index (SAVI) and enhanced vegetation index (EVI) were more important than other spectral variables. The RF model of Tpan obtained higher accuracy (R2 = 0.69, RMSE = 0.05) than the spectral variables, and the grey level co-occurrence matrix-based texture measure—Correlation (COR) was the most important variable for Tpan. The most accurate model was obtained from the TB+VIs (R2 = 0.79, RMSE = 0.05), which combined spectral and textural information, thus providing a significant improvement in estimating CC. This model provided an effective approach for detecting the CC of black locust plantations on the Loess Plateau.
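The two spectral variables singled out above, SAVI and EVI, are computed per pixel from band reflectances. The sketch below uses the commonly cited definitions with their usual default coefficients (soil factor L = 0.5 for SAVI; gain 2.5 and coefficients 6, 7.5 and 1 for EVI). Whether the study used exactly these coefficients is an assumption, since the abstract does not say, and the reflectance values are hypothetical.

```python
def savi(nir, red, soil_factor=0.5):
    """Soil-adjusted vegetation index from NIR and red reflectance."""
    return (nir - red) * (1 + soil_factor) / (nir + red + soil_factor)

def evi(nir, red, blue, gain=2.5, c1=6.0, c2=7.5, canopy=1.0):
    """Enhanced vegetation index from NIR, red and blue reflectance."""
    return gain * (nir - red) / (nir + c1 * red - c2 * blue + canopy)

# Hypothetical surface reflectances (0 to 1) for a single vegetated pixel.
pixel = {"nir": 0.5, "red": 0.1, "blue": 0.05}
print(round(savi(pixel["nir"], pixel["red"]), 3))
print(round(evi(pixel["nir"], pixel["red"], pixel["blue"]), 3))
```

Index values like these, computed over every pixel of a plot, would be the kind of spectral predictor fed to the random forest alongside the texture measures.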
|
36
|
Cardoso‐Silva J, Papadatos G, Papageorgiou LG, Tsoka S. Optimal Piecewise Linear Regression Algorithm for QSAR Modelling. Mol Inform 2018; 38:e1800028. [DOI: 10.1002/minf.201800028] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 03/06/2018] [Accepted: 08/02/2018] [Indexed: 12/20/2022]
Affiliation(s)
- Jonathan Cardoso‐Silva
- Department of Informatics, Faculty of Natural and Mathematical Sciences, King's College London, Bush House, London WC2B 4BG, UK
- George Papadatos
- European Molecular Biology Laboratory – European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- GlaxoSmithKline, Gunnels Wood Road, Stevenage, Hertfordshire SG1 2NY, UK
- Lazaros G. Papageorgiou
- Centre for Process Systems Engineering, Department of Chemical Engineering, University College London, Torrington Place, London WC1E 7JE, UK
- Sophia Tsoka
- Department of Informatics, Faculty of Natural and Mathematical Sciences, King's College London, Bush House, London WC2B 4BG, UK
|
37
|
Choi SW. Life is lognormal! What to do when your data does not follow a normal distribution. Anaesthesia 2018; 71:1363-1366. [PMID: 27734487 DOI: 10.1111/anae.13666] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Accepted: 08/11/2016] [Indexed: 11/27/2022]
Affiliation(s)
- S W Choi
- Department of Anaesthesiology, The University of Hong Kong, Hong Kong.
|
38
|
Shiao SPK, Grayson J, Yu CH. Gene-Metabolite Interaction in the One Carbon Metabolism Pathway: Predictors of Colorectal Cancer in Multi-Ethnic Families. J Pers Med 2018; 8:E26. [PMID: 30082654 PMCID: PMC6164460 DOI: 10.3390/jpm8030026] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/04/2018] [Revised: 07/14/2018] [Accepted: 08/01/2018] [Indexed: 02/07/2023] Open
Abstract
For personalized healthcare, the purpose of this study was to examine the key genes and metabolites in the one-carbon metabolism (OCM) pathway and their interactions as predictors of colorectal cancer (CRC) in multi-ethnic families. In this proof-of-concept study, we included a total of 30 participants, 15 CRC cases and 15 matched family/friends representing major ethnic groups in southern California. Analytics based on supervised machine learning were applied, with the target variable being specified as cancer, including the ensemble method and generalized regression (GR) prediction. Elastic Net with Akaike's Information Criterion with correction (AICc) and Leave-One-Out cross validation GR methods were used to validate the results for enhanced optimality, prediction, and reproducibility. The results revealed that despite some family members sharing genetic heritage, the CRC group had greater combined gene polymorphism-mutations than the family controls (p < 0.1) for five genes including MTHFR C677T, MTHFR A1298C, MTR A2756G, MTRR A66G, and DHFR 19bp. Blood metabolites including homocysteine (7 µmol/L), methyl-folate (40 nmol/L) with total gene mutations (≥4); age (51 years) and vegetable intake (2 cups), and interactions of gene mutations and methylmalonic acid (MMA) (400 nmol/L) were significant predictors (all p < 0.0001) using the AICc. The results were validated by a 3% misclassification rate, AICc of 26, and >99% area under the receiver operating characteristic curve. These results point to the important roles of blood metabolites as potential markers in the prevention of CRC. Future intervention studies can be designed to target the ways to mitigate the enzyme-metabolite deficiencies in the OCM pathway to prevent cancer.
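The abstract relies on Akaike's Information Criterion with correction (AICc) for model comparison. A minimal sketch of the standard small-sample formula follows. The log-likelihoods and parameter counts are invented for illustration; only the sample size of 30 echoes the number of participants stated above.

```python
def aicc(log_likelihood, n_params, n_obs):
    """Akaike's Information Criterion with small-sample correction."""
    if n_obs - n_params - 1 <= 0:
        raise ValueError("AICc needs n_obs > n_params + 1")
    aic = 2 * n_params - 2 * log_likelihood
    return aic + 2 * n_params * (n_params + 1) / (n_obs - n_params - 1)

# Hypothetical fits on n = 30 observations (values invented for illustration).
small_model = aicc(log_likelihood=-12.0, n_params=3, n_obs=30)
large_model = aicc(log_likelihood=-11.0, n_params=8, n_obs=30)
preferred = "small" if small_model < large_model else "large"
print(round(small_model, 3), round(large_model, 3), preferred)
```

The correction term grows quickly as the number of parameters approaches the sample size, so with only 30 observations a slightly better likelihood rarely justifies many extra predictors.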
Affiliation(s)
- S Pamela K Shiao
- Medical College of Georgia, Augusta University, Augusta, GA 30912, USA.
- James Grayson
- Hull College of Business, Augusta University, Augusta, GA 30912, USA.
- Chong Ho Yu
- Department of Psychology, Azusa Pacific University, Azusa, CA 91702, USA.
|
39
|
Gonzales MC, Grayson J, Lie A, Yu CH, Shiao SYPK. Gene-environment interactions and predictors of breast cancer in family-based multi-ethnic groups. Oncotarget 2018; 9:29019-29035. [PMID: 30018733 PMCID: PMC6044380 DOI: 10.18632/oncotarget.25520] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 04/11/2018] [Accepted: 05/08/2018] [Indexed: 12/30/2022] Open
Abstract
Breast cancer (BC) is the most common cancer in women worldwide and the second leading cause of cancer-related death. Understanding gene-environment interactions could play a critical role in the next stage of BC prevention efforts. Hence, the purpose of this study was to examine the key gene-environmental factors affecting the risks of BC in a diverse sample. Five genes in the one-carbon metabolism pathway, including MTHFR 677, MTHFR 1298, MTR 2756, MTRR 66, and DHFR 19bp, together with demographics, lifestyle, and dietary intake factors, were examined in association with BC risks. A total of 80 participants (40 BC cases and 40 family/friend controls) in southern California were interviewed and provided salivary samples for genotyping. We presented the first study utilizing both conventional and new analytics, including the ensemble method and predictive modeling based on smallest errors, to predict BC risks. Predictive modeling of Generalized Regression Elastic Net Leave-One-Out demonstrated alcohol use (p = 0.0126) and age (p < 0.0001) as significant predictors; significant interactions were noted between body mass index (BMI) and alcohol use (p = 0.0027), and between BMI and MTR 2756 polymorphisms (p = 0.0090). Our findings identified the modifiable lifestyle factors in gene-environment interactions that are valuable for BC prevention.
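The Leave-One-Out validation mentioned above refits the model once per participant, each time predicting the single participant that was held out. The sketch below shows the loop with a deliberately simple nearest-centroid classifier on one feature. Both the classifier and the data are hypothetical stand-ins, not the Generalized Regression Elastic Net or the data used in the study.

```python
def fit_centroids(xs, labels):
    """Toy model: mean feature value per class (one-dimensional feature)."""
    sums, counts = {}, {}
    for x, y in zip(xs, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_centroid(model, x):
    """Assign the class whose centroid is closest to x."""
    return min(model, key=lambda y: abs(model[y] - x))

def loo_accuracy(xs, labels, fit, predict):
    """Leave-one-out: hold out each sample once, train on the rest."""
    correct = 0
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], labels[:i] + labels[i + 1:])
        if predict(model, xs[i]) == labels[i]:
            correct += 1
    return correct / len(xs)

# Hypothetical ages with case (1) / control (0) labels, for illustration only.
ages = [35, 38, 41, 55, 52, 61, 63, 66]
status = [0, 0, 0, 0, 1, 1, 1, 1]
accuracy = loo_accuracy(ages, status, fit_centroids, predict_centroid)
print(accuracy)
```

Leave-one-out is a natural fit for small samples such as the 80 participants here, because each fit still trains on all but one observation.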
Affiliation(s)
- Mildred C Gonzales
- Los Angeles County College of Nursing and Allied Health, Los Angeles, CA, USA
- James Grayson
- Hull College of Business, Augusta University, Augusta, GA, USA
- Amanda Lie
- Citrus Valley Health Partners, Foothill Presbyterian Hospital, Glendora, CA, USA
40
Shiao SPK, Grayson J, Lie A, Yu CH. Personalized Nutrition-Genes, Diet, and Related Interactive Parameters as Predictors of Cancer in Multiethnic Colorectal Cancer Families. Nutrients 2018; 10:nu10060795. [PMID: 29925788 PMCID: PMC6024706 DOI: 10.3390/nu10060795] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 04/26/2018] [Revised: 06/13/2018] [Accepted: 06/19/2018] [Indexed: 01/04/2023] Open
Abstract
To personalize nutrition, the purpose of this study was to examine five key genes in the folate metabolism pathway, and dietary parameters and related interactive parameters as predictors of colorectal cancer (CRC) by measuring the healthy eating index (HEI) in multiethnic families. The five genes included methylenetetrahydrofolate reductase (MTHFR) 677 and 1298, methionine synthase (MTR) 2756, methionine synthase reductase (MTRR 66), and dihydrofolate reductase (DHFR) 19bp, and they were used to compute a total gene mutation score. We included 53 families, 53 CRC patients and 53 paired family friend members of diverse population groups in Southern California. We measured multidimensional data using the ensemble bootstrap forest method to identify variables of importance within domains of genetic, demographic, and dietary parameters to achieve dimension reduction. We then constructed predictive generalized regression (GR) modeling with a supervised machine learning validation procedure with the target variable (cancer status) being specified to validate the results to allow enhanced prediction and reproducibility. The results showed that the CRC group had increased total gene mutation scores compared to the family members (p < 0.05). Using the Akaike’s information criterion and Leave-One-Out cross validation GR methods, the HEI was interactive with thiamine (vitamin B1), which is a new finding for the literature. The natural food sources for thiamine include whole grains, legumes, and some meats and fish which HEI scoring included as part of healthy portions (versus limiting portions on salt, saturated fat and empty calories). Additional predictors included age, as well as gender and the interaction of MTHFR 677 with overweight status (measured by body mass index) in predicting CRC, with the cancer group having more men and overweight cases. 
The HEI score was significant when split at the median score of 77 into greater or less scores, confirmed through the machine-learning recursive tree method and predictive modeling, although an HEI score of greater than 80 is the US national standard set value for a good diet. The HEI and healthy eating are modifiable factors for healthy living in relation to dietary parameters and cancer prevention, and they can be used for personalized nutrition in the precision-based healthcare era.
Affiliation(s)
- S Pamela K Shiao
- College of Nursing and Medical College of Georgia, Augusta University, Augusta, GA 30912, USA.
- James Grayson
- Hull College of Business, Augusta University, Augusta, GA 30912, USA.
- Amanda Lie
- Citrus Valley Health Partners, Foothill Presbyterian Hospital, Glendora, CA 91741, USA.
- Chong Ho Yu
- School of Business, University of Phoenix, Pasadena, CA 91101, USA.
41
Predictors of the Healthy Eating Index and Glycemic Index in Multi-Ethnic Colorectal Cancer Families. Nutrients 2018; 10:nu10060674. [PMID: 29861441 PMCID: PMC6024360 DOI: 10.3390/nu10060674] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 05/03/2018] [Revised: 05/22/2018] [Accepted: 05/24/2018] [Indexed: 12/13/2022] Open
Abstract
For personalized nutrition in preparation for precision healthcare, we examined the predictors of healthy eating, using the healthy eating index (HEI) and glycemic index (GI), in family-based multi-ethnic colorectal cancer (CRC) families. A total of 106 participants, 53 CRC cases and 53 family members from multi-ethnic families participated in the study. Machine learning validation procedures, including the ensemble method and generalized regression prediction, Elastic Net with Akaike’s Information Criterion with correction and Leave-One-Out cross validation methods, were applied to validate the results for enhanced prediction and reproducibility. Models were compared based on HEI scales for the scores of 77 versus 80 as the status of healthy eating, predicted from individual dietary parameters and health outcomes. Gender and CRC status were interactive as additional predictors of HEI based on the HEI score of 77. Predictors of HEI 80 as the criterion score of a good diet included five significant dietary parameters (with intake amount): whole fruit (1 cup), milk or milk alternative such as soy drinks (6 oz), whole grain (1 oz), saturated fat (15 g), and oil and nuts (1 oz). Compared to the GI models, HEI models presented more accurate and fitted models. Milk or a milk alternative such as soy drink (6 oz) is the common significant parameter across HEI and GI predictive models. These results point to the importance of healthy eating, with the appropriate amount of healthy foods, as modifiable factors for cancer prevention.
42
hsa-miRNA-154-5p expression in plasma of endometriosis patients is a potential diagnostic marker for the disease. Reprod Biomed Online 2018; 37:449-466. [PMID: 29857988 DOI: 10.1016/j.rbmo.2018.05.007] [Citation(s) in RCA: 34] [Impact Index Per Article: 5.7] [Received: 07/19/2017] [Revised: 05/02/2018] [Accepted: 05/04/2018] [Indexed: 02/06/2023]
Abstract
RESEARCH QUESTION: As microRNA (miRNA) are stable in circulation, this study tested whether they could serve as putative non-invasive biomarkers for endometriosis, and whether their expression differs between endometriosis patients and controls. It also addressed whether the combination of differentially expressed miRNA together with clinical parameters in a statistical model could distinguish between endometriosis patients and controls. DESIGN: This prospective cohort study explored the possibility of using changes in extracellular miRNA spectra in plasma of 51 patients with endometriosis compared with 41 controls, combined with clinical data, as non-invasive biomarkers for the disease. The project was divided into three phases: biomarker screening, discovery and validation. The differences in expression levels of plasma miRNA obtained from women with and without endometriosis were analysed with quantitative PCR-based microarrays. The diagnostic performance of the selected individual and/or combined differentially expressed miRNA candidates and clinical parameters was assessed using in silico bioinformatics modelling and receiver operating characteristic curve analysis. RESULTS: Data showed that a specific plasma miRNA signature is associated with endometriosis and that hsa-miR-154-5p, alone or in combination with hsa-miR-196b-5p, hsa-miR-378a-3p, and hsa-miR-33a-5p and the clinical parameters of body mass index and age, is potentially applicable for non-invasive diagnosis of the disease. Changes in the levels of expression of certain circulating plasma miRNA also occurred within the phases of the menstrual cycle. CONCLUSIONS: miRNA seem to be promising candidates for the non-invasive diagnosis of endometriosis. Further, other clinical parameters may help in distinguishing women suffering from endometriosis from healthy individuals.
43
Shiao SPK, Grayson J, Yu CH, Wasek B, Bottiglieri T. Gene Environment Interactions and Predictors of Colorectal Cancer in Family-Based, Multi-Ethnic Groups. J Pers Med 2018; 8:E10. [PMID: 29462916 PMCID: PMC5872084 DOI: 10.3390/jpm8010010] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 12/27/2017] [Revised: 02/14/2018] [Accepted: 02/14/2018] [Indexed: 12/11/2022] Open
Abstract
For the personalization of polygenic/omics-based health care, the purpose of this study was to examine the gene-environment interactions and predictors of colorectal cancer (CRC) by including five key genes in the one-carbon metabolism pathways. In this proof-of-concept study, we included a total of 54 families and 108 participants, 54 CRC cases and 54 matched family friends representing four major racial ethnic groups in southern California (White, Asian, Hispanics, and Black). We used three phases of data analytics, including exploratory, family-based analyses adjusting for the dependence within the family for sharing genetic heritage, the ensemble method, and generalized regression models for predictive modeling with a machine learning validation procedure to validate the results for enhanced prediction and reproducibility. The results revealed that despite the family members sharing genetic heritage, the CRC group had greater combined gene polymorphism rates than the family controls (p < 0.05), on MTHFR C677T, MTR A2756G, MTRR A66G, and DHFR 19 bp except MTHFR A1298C. Four racial groups presented different polymorphism rates for four genes (all p < 0.05) except MTHFR A1298C. Following the ensemble method, the most influential factors were identified, and the best predictive models were generated by using the generalized regression models, with Akaike's information criterion and leave-one-out cross validation methods. Body mass index (BMI) and gender were consistent predictors of CRC for both models when individual genes versus total polymorphism counts were used, and alcohol use was interactive with BMI status. Body mass index status was also interactive with both gender and MTHFR C677T gene polymorphism, and the exposure to environmental pollutants was an additional predictor. These results point to the important roles of environmental and modifiable factors in relation to gene-environment interactions in the prevention of CRC.
Affiliation(s)
- S Pamela K Shiao
- College of Nursing and Medical College of Georgia, Augusta University, Augusta, GA 30912, USA.
- James Grayson
- College of Business, Augusta University, Augusta, GA 30912, USA.
- Chong Ho Yu
- University of Phoenix, Pasadena, CA 91101, USA.
- Brandi Wasek
- Center of Metabolomics, Institute of Metabolic Disease, Baylor Scott & White Research Institute, Dallas, TX 75226, USA.
- Teodoro Bottiglieri
- Center of Metabolomics, Institute of Metabolic Disease, Baylor Scott & White Research Institute, Dallas, TX 75226, USA.
44
Te Beest DE, Mes SW, Wilting SM, Brakenhoff RH, van de Wiel MA. Improved high-dimensional prediction with Random Forests by the use of co-data. BMC Bioinformatics 2017; 18:584. [PMID: 29281963 PMCID: PMC5745983 DOI: 10.1186/s12859-017-1993-1] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 07/03/2017] [Accepted: 12/06/2017] [Indexed: 12/13/2022] Open
Abstract
Background: Prediction in high-dimensional settings is difficult due to the large number of variables relative to the sample size. We demonstrate how auxiliary ‘co-data’ can be used to improve the performance of a Random Forest in such a setting. Results: Co-data are incorporated in the Random Forest by replacing the uniform sampling probabilities that are used to draw candidate variables with co-data moderated sampling probabilities. Co-data here are defined as any type of information that is available on the variables of the primary data, but does not use its response labels. These moderated sampling probabilities are, inspired by empirical Bayes, learned from the data at hand. We demonstrate the co-data moderated Random Forest (CoRF) with two examples. In the first example we aim to predict the presence of a lymph node metastasis with gene expression data. We demonstrate how a set of external p-values, a gene signature, and the correlation between gene expression and DNA copy number can improve the predictive performance. In the second example we demonstrate how the prediction of cervical (pre-)cancer with methylation data can be improved by including the location of the probe relative to the known CpG islands, the number of CpG sites targeted by a probe, and a set of p-values from a related study. Conclusion: The proposed method is able to utilize auxiliary co-data to improve the performance of a Random Forest.
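The central mechanism described here, drawing candidate variables with non-uniform probabilities, can be sketched as follows. This is only an illustration under stated assumptions, not the CoRF implementation: the weights are taken as given (the paper learns them from co-data, which is not shown), and the function name `sample_candidates` and parameter `mtry` are hypothetical labels.

```python
import random


def sample_candidates(weights, mtry, rng):
    """Draw up to `mtry` distinct variable indices, each draw made with
    probability proportional to the remaining variables' weights."""
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(min(mtry, len(remaining))):
        current = [float(weights[i]) for i in remaining]
        if sum(current) <= 0:
            break  # no variable with positive weight is left
        pick = rng.choices(remaining, weights=current, k=1)[0]
        remaining.remove(pick)
        chosen.append(pick)
    return chosen


# Hypothetical weights: variable 0 is favoured by the co-data.
rng = random.Random(0)
candidates = sample_candidates([10, 1, 1, 1], mtry=2, rng=rng)
```

Setting all weights equal recovers the uniform candidate sampling of a standard Random Forest, so the moderated version only changes which variables are offered to each split.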
Affiliation(s)
- Dennis E Te Beest
- Department of Epidemiology and Biostatistics, VU University Medical Center, Amsterdam, 1007 MB, The Netherlands
- Steven W Mes
- Department of Otolaryngology-Head and Neck Surgery, VU University Medical Center, Amsterdam, 1007 MB, The Netherlands
- Saskia M Wilting
- Department of Medical Oncology, Erasmus MC Cancer Institute, Erasmus University Medical Center, Rotterdam, 3015 CE, The Netherlands
- Ruud H Brakenhoff
- Department of Otolaryngology-Head and Neck Surgery, VU University Medical Center, Amsterdam, 1007 MB, The Netherlands
- Mark A van de Wiel
- Department of Epidemiology and Biostatistics, VU University Medical Center, Amsterdam, 1007 MB, The Netherlands; Department of Mathematics, VU University, Amsterdam, 1081 HV, The Netherlands.
45
Abstract
Eyewitness identifications play an important role in the investigation and prosecution of crimes, but it is well known that eyewitnesses make mistakes, often with serious consequences. In light of these concerns, the National Academy of Sciences recently convened a panel of experts to undertake a comprehensive study of current practice and use of eyewitness testimony, with an eye toward understanding why identification errors occur and what can be done to prevent them. The work of this committee led to key findings and recommendations for reform, detailed in a consensus report entitled Identifying the Culprit: Assessing Eyewitness Identification. In this review, I focus on the scientific issues that emerged from this study, along with brief discussions of how these issues led to specific recommendations for additional research, best practices for law enforcement, and use of eyewitness evidence by the courts.
46
Predicting Vascular Plant Diversity in Anthropogenic Peatlands: Comparison of Modeling Methods with Free Satellite Data. REMOTE SENSING 2017. [DOI: 10.3390/rs9070681] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Indexed: 11/17/2022]
47
Shuryak I. Advantages of Synthetic Noise and Machine Learning for Analyzing Radioecological Data Sets. PLoS One 2017; 12:e0170007. [PMID: 28068401 PMCID: PMC5222373 DOI: 10.1371/journal.pone.0170007] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 07/12/2016] [Accepted: 12/26/2016] [Indexed: 11/25/2022] Open
Abstract
The ecological effects of accidental or malicious radioactive contamination are insufficiently understood because of the hazards and difficulties associated with conducting studies in radioactively-polluted areas. Data sets from severely contaminated locations can therefore be small. Moreover, many potentially important factors, such as soil concentrations of toxic chemicals, pH, and temperature, can be correlated with radiation levels and with each other. In such situations, commonly-used statistical techniques like generalized linear models (GLMs) may not be able to provide useful information about how radiation and/or these other variables affect the outcome (e.g. abundance of the studied organisms). Ensemble machine learning methods such as random forests offer powerful alternatives. We propose that analysis of small radioecological data sets by GLMs and/or machine learning can be made more informative by using the following techniques: (1) adding synthetic noise variables to provide benchmarks for distinguishing the performances of valuable predictors from irrelevant ones; (2) adding noise directly to the predictors and/or to the outcome to test the robustness of analysis results against random data fluctuations; (3) adding artificial effects to selected predictors to test the sensitivity of the analysis methods in detecting predictor effects; (4) running a selected machine learning method multiple times (with different random-number seeds) to test the robustness of the detected “signal”; (5) using several machine learning methods to test the “signal’s” sensitivity to differences in analysis techniques. Here, we applied these approaches to simulated data, and to two published examples of small radioecological data sets: (I) counts of fungal taxa in samples of soil contaminated by the Chernobyl nuclear power plant accident (Ukraine), and (II) bacterial abundance in soil samples under a ruptured nuclear waste storage tank (USA). 
We show that the proposed techniques were advantageous compared with the methodology used in the original publications where the data sets were presented. Specifically, our approach identified a negative effect of radioactive contamination in data set I, and suggested that in data set II stable chromium could have been a stronger limiting factor for bacterial abundance than the radionuclides 137Cs and 99Tc. This new information, which was extracted from these data sets using the proposed techniques, can potentially enhance the design of radioactive waste bioremediation.
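Technique (1) in this abstract, using synthetic noise variables as a benchmark, might look roughly like the sketch below. It is a simplified stand-in rather than the paper's analysis: absolute Pearson correlation replaces the importance scores a random forest or GLM would supply, and the names `abs_corr` and `predictors_beating_noise` are hypothetical. A real predictor is kept only if it scores above every pure-noise column.

```python
import random


def abs_corr(x, y):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def predictors_beating_noise(columns, outcome, n_noise=20, seed=0):
    """Return {name: score} for predictors scoring above all synthetic noise columns."""
    rng = random.Random(seed)
    n = len(outcome)
    # Benchmark: the best score achieved by variables that are pure noise.
    benchmark = max(
        abs_corr([rng.gauss(0, 1) for _ in range(n)], outcome)
        for _ in range(n_noise)
    )
    kept = {}
    for name, values in columns.items():
        score = abs_corr(values, outcome)
        if score > benchmark:
            kept[name] = score
    return kept
```

Repeating the call with different `seed` values corresponds loosely to technique (4), checking that the retained predictors do not depend on one particular random draw.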
Affiliation(s)
- Igor Shuryak
- Center for Radiological Research, Columbia University, New York, New York, United States of America
48
Rodríguez J, Barrera-Animas AY, Trejo LA, Medina-Pérez MA, Monroy R. Ensemble of One-Class Classifiers for Personal Risk Detection Based on Wearable Sensor Data. SENSORS 2016; 16:s16101619. [PMID: 27690054 PMCID: PMC5087407 DOI: 10.3390/s16101619] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 06/30/2016] [Revised: 09/24/2016] [Accepted: 09/24/2016] [Indexed: 11/17/2022]
Abstract
This study introduces the One-Class K-means with Randomly-projected features Algorithm (OCKRA). OCKRA is an ensemble of one-class classifiers built over multiple projections of a dataset according to random feature subsets. Algorithms found in the literature spread over a wide range of applications where ensembles of one-class classifiers have been satisfactorily applied; however, none is oriented to the area under our study: personal risk detection. OCKRA has been designed with the aim of improving the detection performance in the problem posed by the Personal RIsk DEtection (PRIDE) dataset. PRIDE was built based on 23 test subjects, where the data for each user were captured using a set of sensors embedded in a wearable band. The performance of OCKRA was compared against a support vector machine and three versions of the Parzen window classifier. On average, experimental results show that OCKRA outperformed the other classifiers by at least 0.53% in area under the curve (AUC). In addition, OCKRA achieved an AUC above 90% for more than 57% of the users.
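The general idea of an ensemble of one-class k-means models over random feature subsets can be sketched as below. This is a loose illustration and not OCKRA itself: the subset size, the number of clusters, the plain averaging of distances, and the function names `kmeans`, `fit_ensemble` and `anomaly_score` are all assumptions made for the example. Each member is trained on normal data only, and a new observation is scored by its average distance to the nearest centroid across members.

```python
import math
import random


def kmeans(points, k, rng, iters=20):
    """Plain k-means; returns the list of centroids."""
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def fit_ensemble(train, n_members=10, k=2, seed=0):
    """Fit one k-means model per random feature subset of the normal data."""
    rng = random.Random(seed)
    n_features = len(train[0])
    subset_size = max(1, n_features // 2)  # assumed subset size
    members = []
    for _ in range(n_members):
        feats = sorted(rng.sample(range(n_features), subset_size))
        projected = [[row[f] for f in feats] for row in train]
        members.append((feats, kmeans(projected, k, rng)))
    return members


def anomaly_score(members, row):
    """Mean distance to the nearest centroid; larger means more unusual."""
    dists = []
    for feats, centroids in members:
        p = [row[f] for f in feats]
        dists.append(min(math.dist(p, c) for c in centroids))
    return sum(dists) / len(dists)
```

Thresholding `anomaly_score` at different values would trace out the kind of ROC curve from which an AUC such as the one reported in the abstract is computed.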
Affiliation(s)
- Jorge Rodríguez
- Escuela de Ingeniería y Ciencias, Tecnologico de Monterrey, Carretera al Lago de Guadalupe Km. 3.5, Atizapán, Edo. de México C.P. 52926, Mexico.
- Ari Y Barrera-Animas
- Escuela de Ingeniería y Ciencias, Tecnologico de Monterrey, Carretera al Lago de Guadalupe Km. 3.5, Atizapán, Edo. de México C.P. 52926, Mexico.
- Luis A Trejo
- Escuela de Ingeniería y Ciencias, Tecnologico de Monterrey, Carretera al Lago de Guadalupe Km. 3.5, Atizapán, Edo. de México C.P. 52926, Mexico.
- Miguel Angel Medina-Pérez
- Escuela de Ingeniería y Ciencias, Tecnologico de Monterrey, Carretera al Lago de Guadalupe Km. 3.5, Atizapán, Edo. de México C.P. 52926, Mexico.
- Raúl Monroy
- Escuela de Ingeniería y Ciencias, Tecnologico de Monterrey, Carretera al Lago de Guadalupe Km. 3.5, Atizapán, Edo. de México C.P. 52926, Mexico.
49
Gurard-Levin ZA, Wilson LOW, Pancaldi V, Postel-Vinay S, Sousa FG, Reyes C, Marangoni E, Gentien D, Valencia A, Pommier Y, Cottu P, Almouzni G. Chromatin Regulators as a Guide for Cancer Treatment Choice. Mol Cancer Ther 2016; 15:1768-77. [PMID: 27196757 DOI: 10.1158/1535-7163.mct-15-1008] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 12/24/2015] [Accepted: 04/26/2016] [Indexed: 12/22/2022]
Abstract
The limited capacity to predict a patient's response to distinct chemotherapeutic agents is a major hurdle in cancer management. The efficiency of a large fraction of current cancer therapeutics (radio- and chemotherapies) is influenced by chromatin structure. Reciprocally, alterations in chromatin organization may affect resistance mechanisms. Here, we explore how the misexpression of chromatin regulators (factors involved in the establishment and maintenance of functional chromatin domains) can inform about the extent of docetaxel response. We exploit Affymetrix and NanoString gene expression data for a set of chromatin regulators generated from breast cancer patient-derived xenograft models and patient samples treated with docetaxel. Random Forest classification reveals specific panels of chromatin regulators, including key components of the SWI/SNF chromatin remodeler, which readily distinguish docetaxel high-responders and poor-responders. Further exploration of SWI/SNF components in the comprehensive NCI-60 dataset reveals that the expression inversely correlates with docetaxel sensitivity. Finally, we show that loss of the SWI/SNF subunit BRG1 (SMARCA4) in a model cell line leads to enhanced docetaxel sensitivity. Altogether, our findings point toward chromatin regulators as biomarkers for drug response as well as therapeutic targets to sensitize patients toward docetaxel and combat drug resistance. Mol Cancer Ther; 15(7); 1768-77. ©2016 AACR.
Affiliation(s)
- Zachary A Gurard-Levin
- Institut Curie, PSL Research University, CNRS, UMR3664, Equipe Labellisée Ligue contre le Cancer, Paris, France. Sorbonne Universités, UPMC Universite Paris 06, CNRS, UMR3664, Paris, France.
- Laurence O W Wilson
- Institut Curie, PSL Research University, CNRS, UMR3664, Equipe Labellisée Ligue contre le Cancer, Paris, France. Sorbonne Universités, UPMC Universite Paris 06, CNRS, UMR3664, Paris, France
- Vera Pancaldi
- Spanish National Cancer Research Centre (CNIO), c/Melchor Fernandez, Almagro, Madrid, Spain
- Sophie Postel-Vinay
- DITEP (Département d'Innovations Thérapeutiques et Essais Précoces), Gustave Roussy, France. Inserm Unit U981, Gustave Roussy, Villejuif, France. Université Paris Saclay, Université Paris-Sud, Faculté de Médicine, Le Kremlin Bicêtre, France
- Fabricio G Sousa
- Developmental Therapeutics Branch and Laboratory of Molecular Pharmacology, Center for Cancer Research, National Cancer Institute, NIH, Bethesda, Maryland
- Cecile Reyes
- Institut Curie, PSL Research University, Translational Research Department, Genomics Platform, Paris, France
- Elisabetta Marangoni
- Institut Curie, PSL Research University, Translational Research Department, Genomics Platform, Paris, France
- David Gentien
- Institut Curie, PSL Research University, Translational Research Department, Genomics Platform, Paris, France
- Alfonso Valencia
- Spanish National Cancer Research Centre (CNIO), c/Melchor Fernandez, Almagro, Madrid, Spain
- Yves Pommier
- Developmental Therapeutics Branch and Laboratory of Molecular Pharmacology, Center for Cancer Research, National Cancer Institute, NIH, Bethesda, Maryland
- Paul Cottu
- Institut Curie, Medical Oncology, Paris, France
- Geneviève Almouzni
- Institut Curie, PSL Research University, CNRS, UMR3664, Equipe Labellisée Ligue contre le Cancer, Paris, France. Sorbonne Universités, UPMC Universite Paris 06, CNRS, UMR3664, Paris, France.
50
Franke SK, van Kesteren RE, Hofman S, Wubben JAM, Smit AB, Philippens IHCHM. Individual and Familial Susceptibility to MPTP in a Common Marmoset Model for Parkinson's Disease. NEURODEGENER DIS 2016; 16:293-303. [PMID: 26999593 DOI: 10.1159/000442574] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 08/04/2015] [Accepted: 11/11/2015] [Indexed: 11/19/2022] Open
Abstract
INTRODUCTION: Insight into susceptibility mechanisms underlying Parkinson's disease (PD) would aid the understanding of disease etiology, enable target finding and benefit the development of more refined disease-modifying strategies. METHODS: We used intermittent low-dose MPTP (0.5 mg/kg/week) injections in marmosets and measured multiple behavioral and neurochemical parameters. Genetically diverse monkeys from different breeding families were selected to investigate inter- and intrafamily differences in susceptibility to MPTP treatment. RESULTS: We show that such differences exist in clinical signs, in particular nonmotor PD-related behaviors, and that they are accompanied by differences in neurotransmitter levels. In line with the contribution of a genetic component, different susceptibility phenotypes could be traced back through genealogy to individuals of the different families. CONCLUSION: Our findings show that low-dose MPTP treatment in marmosets represents a clinically relevant PD model, with a window of opportunity to examine the onset of the disease, allowing the detection of individual variability in disease susceptibility, which may be of relevance for the diagnosis and treatment of PD in humans.
Affiliation(s)
- Sigrid K Franke
- Department of Molecular and Cellular Neurobiology, Center for Neurogenomics and Cognitive Research, Neuroscience Campus Amsterdam, VU University, Amsterdam, The Netherlands