1
Büyükakın F, Özyılmaz A, Işık E, Bayraktar Y, Olgun MF, Toprak M. Pandemics, Income Inequality, and Refugees: The Case of COVID-19. SOCIAL WORK IN PUBLIC HEALTH 2024; 39:78-92. [PMID: 38372287 DOI: 10.1080/19371918.2024.2318372] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/20/2024]
Abstract
Refugees are more vulnerable to COVID-19 due to factors such as a low standard of living, accommodation in crowded households, difficulty in receiving health care due to high treatment costs in some countries, and inability to access public health and social services. Increasing income inequality, anxiety about securing minimum living conditions, and fear of unemployment compel refugees to keep working, which affects the number of cases and case-related deaths. The aim of the study is to analyze the impact of refugees and income inequality on COVID-19 cases and deaths in 95 countries for the year 2021 using Poisson regression, negative binomial regression, and machine learning methods. According to the estimation results, refugees and income inequality both increase COVID-19 cases and deaths. However, the impact of income inequality on COVID-19 cases and deaths is stronger than that of refugees.
Affiliation(s)
- Figen Büyükakın
  - Department of Economics, University of Kocaeli, Kocaeli, Turkey
- Ayfer Özyılmaz
  - Department of Public Finance, University of Kırıkkale, Kırıkkale, Turkey
- Esme Işık
  - Department of Optician, Malatya Turgut Özal University, Malatya, Turkey
- Mehmet Firat Olgun
  - The Department of Technology Transfer, University of Kastamonu, Kastamonu, Turkey
- Metin Toprak
  - Department of Economics, Haliç University, Istanbul, Turkey
2
Nizeyimana P, Lee KE, Kim I. Bayesian pathway selection. J Korean Stat Soc 2023. [DOI: 10.1007/s42952-022-00201-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/26/2023]
3
Random Forests in Count Data Modelling: An Analysis of the Influence of Data Features and Overdispersion on Regression Performance. JOURNAL OF PROBABILITY AND STATISTICS 2022. [DOI: 10.1155/2022/2833537] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/04/2022]
Abstract
Machine learning algorithms, especially random forests (RFs), have become an integrated part of the modern scientific methodology and represent an efficient alternative to conventional parametric algorithms. This study aimed to assess the influence of data features and overdispersion on RF regression performance. We assessed the effect of types of predictors (100, 75, 50, and 20% continuous, and 100% categorical), the number of predictors (p = 8, 16, and 24), and the sample size (N = 50, 250, and 1250) on RF parameter settings. We also compared RF performance to that of classical generalized linear models (Poisson, negative binomial, and zero-inflated Poisson) and the linear model applied to log-transformed data. Two real datasets were analysed to demonstrate the usefulness of RF for overdispersed data modelling. Goodness-of-fit statistics such as root mean square error (RMSE) and biases were used to determine RF accuracy and validity. Results revealed that the number of variables to be randomly selected for each split, the proportion of samples to train the model, the minimal number of samples within each terminal node, and RF regression performance are not influenced by the sample size, number, and type of predictors. However, the ratio of observations to the number of predictors affects the stability of the best RF parameters. RF performs well for all types of covariates and different levels of dispersion. The magnitude of dispersion does not significantly influence RF predictive validity. In contrast, its predictive accuracy is significantly influenced by the magnitude of dispersion in the response variable, conditional on the explanatory variables. RF has performed almost as well as the models of the classical Poisson family in the presence of overdispersion. Given RF’s advantages, it is an appropriate statistical alternative for count data.
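As a rough illustration of the quantities named in this abstract, the sketch below computes a dispersion index (the variance-to-mean ratio, one common way to gauge overdispersion in counts) together with RMSE and mean bias for a set of predictions. The function names and the toy numbers are illustrative assumptions and are not taken from the paper.

```python
from math import sqrt
from statistics import mean, pvariance


def dispersion_index(counts):
    """Variance-to-mean ratio; values well above 1 suggest overdispersion relative to Poisson."""
    return pvariance(counts) / mean(counts)


def rmse(observed, predicted):
    """Root mean square error between observed and predicted counts."""
    return sqrt(mean((o - p) ** 2 for o, p in zip(observed, predicted)))


def bias(observed, predicted):
    """Mean signed error (predicted minus observed)."""
    return mean(p - o for o, p in zip(observed, predicted))


# Hypothetical toy data: observed counts and predictions from some model.
observed = [0, 1, 0, 7, 2, 0, 12, 1]
predicted = [1, 1, 1, 5, 2, 1, 9, 2]

print(dispersion_index(observed), rmse(observed, predicted), bias(observed, predicted))
```

A dispersion index near 1 is what a Poisson model would expect; the toy counts above are deliberately far from that.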
4
Zhao J, Jiang H, Zou G, Lin Q, Wang Q, Liu J, Ma L. CNNArginineMe: A CNN structure for training models for predicting arginine methylation sites based on the One-Hot encoding of peptide sequence. Front Genet 2022; 13:1036862. [PMID: 36324513 PMCID: PMC9618650 DOI: 10.3389/fgene.2022.1036862] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 09/06/2022] [Accepted: 10/04/2022] [Indexed: 11/30/2022]
Abstract
Protein arginine methylation (PRme), a post-translational modification, plays a critical role in numerous cellular processes and regulates critical cellular functions. Though several in silico models for predicting PRme sites have been reported, new models may need to be developed given the significant increase in identified PRme sites. In this study, we constructed multiple machine-learning and deep-learning models. The deep-learning CNN model combined with One-Hot coding, dubbed CNNArginineMe, showed the best performance. CNNArginineMe performed best on AUC in comparisons with several reported predictors. Additionally, we employed CNNArginineMe to predict the arginine methylation proteome and performed functional analysis. The arginine methylated proteome is significantly enriched in the amyotrophic lateral sclerosis (ALS) pathway. CNNArginineMe is freely available at https://github.com/guoyangzou/CNNArginineMe.
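The One-Hot encoding mentioned above can be sketched as follows. This is a minimal illustration, assuming a 20-letter amino acid alphabet and an all-zero row for unknown or padding symbols; the alphabet ordering, the padding convention, and the example window are assumptions rather than details from the CNNArginineMe repository.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (ordering is an assumption)
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(peptide):
    """Return a len(peptide) x 20 matrix; unknown symbols (e.g. padding 'X') become all-zero rows."""
    matrix = []
    for residue in peptide.upper():
        row = [0] * len(AMINO_ACIDS)
        if residue in INDEX:
            row[INDEX[residue]] = 1
        matrix.append(row)
    return matrix


# Hypothetical sequence window around a candidate arginine (R), padded with 'X'.
window = "GSRGRGAX"
encoded = one_hot_encode(window)
print(len(encoded), len(encoded[0]))
```

A matrix of this shape is the kind of input a convolutional network can scan along the sequence axis.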
Affiliation(s)
- Jiaojiao Zhao
  - Cancer Institute of the Affiliated Hospital of Qingdao University and Qingdao Cancer Institute, Qingdao University, Qingdao, China
  - School of Basic Medicine, Qingdao University, Qingdao, China
- Haoqiang Jiang
  - School of Basic Medicine, Qingdao University, Qingdao, China
- Guoyang Zou
  - School of Basic Medicine, Qingdao University, Qingdao, China
- Qian Lin
  - Cancer Institute of the Affiliated Hospital of Qingdao University and Qingdao Cancer Institute, Qingdao University, Qingdao, China
- Qiang Wang
  - Oncology Department, Shandong Second Provincial General Hospital, Jinan, China
- Jia Liu
  - Department of Pharmacology, School of Pharmacy, Qingdao University, Qingdao, China
- Leina Ma
  - Cancer Institute of the Affiliated Hospital of Qingdao University and Qingdao Cancer Institute, Qingdao University, Qingdao, China
  - *Correspondence: Leina Ma,
5
The advanced design of bioleaching process for metal recovery: A machine learning approach. Sep Purif Technol 2022. [DOI: 10.1016/j.seppur.2022.120919] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 12/12/2022]
6
Zhang X, Xuan J, Yao C, Gao Q, Wang L, Jin X, Li S. A deep learning approach for orphan gene identification in moso bamboo (Phyllostachys edulis) based on the CNN + Transformer model. BMC Bioinformatics 2022; 23:162. [PMID: 35513802 PMCID: PMC9069780 DOI: 10.1186/s12859-022-04702-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/14/2021] [Accepted: 04/28/2022] [Indexed: 12/02/2022]
Abstract
Background Orphan genes play an important role in the environmental stresses of many species, and their identification is a critical step in understanding biological functions. Moso bamboo has high ecological, economic and cultural value. Studies have shown that the growth of moso bamboo is influenced by various stresses. Several traditional methods are time-consuming and inefficient. Hence, the development of efficient and high-accuracy computational methods for predicting orphan genes is of great significance. Results In this paper, we propose a novel deep learning model (CNN + Transformer) for identifying orphan genes in moso bamboo. It uses a convolutional neural network in combination with a transformer neural network to capture k-mer amino acids and features between k-mer amino acids in protein sequences. The experimental results show that the average balanced accuracy of CNN + Transformer on the moso bamboo dataset can reach 0.875, and the average Matthews Correlation Coefficient (MCC) can reach 0.471. For the same testing set, the Balanced Accuracy (BA), Geometric Mean (GM), Bookmaker Informedness (BM), and MCC values of the recurrent neural network, long short-term memory, gated recurrent unit, and transformer models are all lower than those of CNN + Transformer, indicating that the model has a strong ability to identify orphan genes (OGs) in moso bamboo. Conclusions The CNN + Transformer model is feasible and obtains credible predictive results. It may also provide a valuable reference for other related research. To our knowledge, this is the first model to adopt deep learning techniques for identifying orphan genes in plants. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04702-1.
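The four evaluation metrics named in this abstract can be computed from a binary confusion matrix. The sketch below uses the standard textbook definitions (the paper's exact formulas are assumed, not confirmed, to match), and the confusion-matrix counts are hypothetical.

```python
from math import sqrt


def classification_metrics(tp, fn, tn, fp):
    """Return BA, GM, BM and MCC from binary confusion-matrix counts."""
    tpr = tp / (tp + fn)  # sensitivity (true positive rate)
    tnr = tn / (tn + fp)  # specificity (true negative rate)
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "BA": (tpr + tnr) / 2,                         # balanced accuracy
        "GM": sqrt(tpr * tnr),                         # geometric mean
        "BM": tpr + tnr - 1,                           # bookmaker informedness
        "MCC": (tp * tn - fp * fn) / mcc_denominator,  # Matthews correlation coefficient
    }


# Hypothetical counts for an imbalanced test set (50 positives, 200 negatives).
metrics = classification_metrics(tp=45, fn=5, tn=160, fp=40)
print(metrics)
```

With imbalanced classes, as in this toy example, MCC is typically noticeably lower than balanced accuracy, which is consistent with the gap between the two values reported above.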
Affiliation(s)
- Xiaodan Zhang
  - Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agriculture University, Hefei, 230001, China
  - College of Information and Computer Science, Anhui Agricultural University, Hefei, 230001, China
- Jinxiang Xuan
  - Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agriculture University, Hefei, 230001, China
  - College of Information and Computer Science, Anhui Agricultural University, Hefei, 230001, China
- Chensong Yao
  - Graduate School, Anhui Agricultural University, Hefei, 230036, China
- Qijuan Gao
  - Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agriculture University, Hefei, 230001, China
- Lianglong Wang
  - Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agriculture University, Hefei, 230001, China
  - College of Information and Computer Science, Anhui Agricultural University, Hefei, 230001, China
- Xiu Jin
  - Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agriculture University, Hefei, 230001, China
  - College of Information and Computer Science, Anhui Agricultural University, Hefei, 230001, China
- Shaowen Li
  - Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agriculture University, Hefei, 230001, China
  - College of Information and Computer Science, Anhui Agricultural University, Hefei, 230001, China
7
Canella Vieira C, Zhou J, Usovsky M, Vuong T, Howland AD, Lee D, Li Z, Zhou J, Shannon G, Nguyen HT, Chen P. Exploring Machine Learning Algorithms to Unveil Genomic Regions Associated With Resistance to Southern Root-Knot Nematode in Soybeans. FRONTIERS IN PLANT SCIENCE 2022; 13:883280. [PMID: 35592556 PMCID: PMC9111516 DOI: 10.3389/fpls.2022.883280] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2022] [Accepted: 04/08/2022] [Indexed: 06/15/2023]
Abstract
Southern root-knot nematode [SRKN, Meloidogyne incognita (Kofoid & White) Chitwood] is a plant-parasitic nematode that is challenging to control due to its short life cycle, wide range of hosts, and limited management options, of which genetic resistance is the main option to efficiently control the damage caused by SRKN. To date, a major quantitative trait locus (QTL) mapped on chromosome (Chr.) 10 plays an essential role in resistance to SRKN in soybean varieties. The confidence of trait-loci associations discovered by traditional methods is often limited by the assumptions that individual single nucleotide polymorphisms (SNPs) always act independently and that the phenotype follows a Gaussian distribution. Therefore, the objective of this study was to conduct machine learning (ML)-based genome-wide association studies (GWAS) utilizing Random Forest (RF) and Support Vector Machine (SVM) algorithms to unveil novel regions of the soybean genome associated with resistance to SRKN. A total of 717 breeding lines derived from 330 unique bi-parental populations were genotyped with the Illumina Infinium BARCSoySNP6K BeadChip and phenotyped for SRKN resistance in a greenhouse. A GWAS pipeline involving a supervised feature dimension reduction based on Variable Importance in Projection (VIP) and SNP detection based on classification accuracy was proposed. Minor-effect SNPs were detected by the proposed ML-GWAS methodology but not identified using Bayesian-information and linkage-disequilibrium Iteratively Nested Keyway (BLINK), Fixed and Random Model Circulating Probability Unification (FarmCPU), and Enriched Compressed Mixed Linear Model (ECMLM) models. Besides the genomic region on Chr. 10 that can explain most of the SRKN resistance variance, additional minor-effect SNPs were also identified on Chrs. 10 and 11. The findings in this study demonstrated that overfitting in GWAS may lead to lower prediction accuracy, and the detection of significant SNPs based on classification accuracy limited false-positive associations. The expansion of the basis of the genetic resistance to SRKN can potentially reduce the selection pressure over the major QTL on Chr. 10 and achieve higher levels of resistance.
Affiliation(s)
- Caio Canella Vieira
  - Fisher Delta Research, Extension, and Education Center, Division of Plant Science and Technology, University of Missouri, Portageville, MO, United States
- Jing Zhou
  - Biological Systems Engineering, University of Wisconsin–Madison, Madison, WI, United States
- Mariola Usovsky
  - Division of Plant Science and Technology, University of Missouri, Columbia, MO, United States
- Tri Vuong
  - Division of Plant Science and Technology, University of Missouri, Columbia, MO, United States
- Amanda D. Howland
  - Department of Entomology, College of Agriculture and Natural Resources, Michigan State University, East Lansing, MI, United States
- Dongho Lee
  - Fisher Delta Research, Extension, and Education Center, Division of Plant Science and Technology, University of Missouri, Portageville, MO, United States
- Zenglu Li
  - Institute of Plant Breeding, Genetics, and Genomics, College of Agricultural and Environmental Sciences, University of Georgia, Athens, GA, United States
- Jianfeng Zhou
  - Division of Plant Science and Technology, University of Missouri, Columbia, MO, United States
- Grover Shannon
  - Fisher Delta Research, Extension, and Education Center, Division of Plant Science and Technology, University of Missouri, Portageville, MO, United States
- Henry T. Nguyen
  - Division of Plant Science and Technology, University of Missouri, Columbia, MO, United States
- Pengyin Chen
  - Fisher Delta Research, Extension, and Education Center, Division of Plant Science and Technology, University of Missouri, Portageville, MO, United States
8
Shen J, Jin G, Zhang Z, Zhang J, Sun Y, Xie X, Ma T, Zhu Y, Du Y, Niu Y, Shi X. A multiple-dimension model for microbiota of patients with colorectal cancer from normal participants and other intestinal disorders. Appl Microbiol Biotechnol 2022; 106:2161-2173. [PMID: 35218389 DOI: 10.1007/s00253-022-11846-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2021] [Revised: 02/12/2022] [Accepted: 02/19/2022] [Indexed: 11/02/2022]
Abstract
Gut microbiota is a primary driver of inflammation in the colon and is linked to early colorectal cancer (CRC) development. Thus, a novel and noninvasive microbiome-based model could promote screening in patients at average risk for CRC. Nevertheless, the relevance and effectiveness of microbial biomarkers for noninvasive CRC screening remain unclear, and researchers lack the data to distinguish CRC-related gut microbiome biomarkers from those of other common gastrointestinal (GI) diseases. Microbiome-based classification distinguishes patients with CRC from normal participants and excludes other CRC-relevant diseases (e.g., GI bleed, adenoma, bowel diseases, and postoperative). The area under the receiver operator characteristic curve (AUC) was 92.2%. Known associations with oral pathogenic features, benefits-generated features, and functional features of CRC were confirmed using the model. Our optimised prediction model was established using large-scale experimental population-based data and other sequence-based faecal microbial community data. This model can be used to identify high-risk groups and has the potential to become a novel screening method for CRC biomarkers because of its low false-positive rate (FPR) and good stability.
KEY POINTS:
• A total of 5744 CRC and non-CRC large-scale faecal samples were sequenced, and a model was constructed for CRC discrimination on the basis of the relative abundance of taxonomic and functional features.
• This model could identify high-risk groups and become a novel screening method for CRC biomarkers because of its low FPR and good stability.
• The association of oral pathogenic features, benefits-generated features, and functional features in CRC was confirmed by the study.
Affiliation(s)
- Jian Shen
  - Department of Medical Administration, Zhejiang Provincial People's Hospital, Affiliated People's Hospital, Hangzhou Medical College, Hangzhou, Zhejiang, China
  - Laboratory Medicine Center, Department of Transfusion Medicine, Zhejiang Provincial People's Hospital, Affiliated People's Hospital, Hangzhou Medical College, Hangzhou, Zhejiang, China
- Gulei Jin
  - Hangzhou GUHE Information and Technology Company, Hangzhou, Zhejiang, China
  - Department of Clinical Laboratory, The Second Affiliated Hospital of Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
- Zhengliang Zhang
  - Department of Clinical Laboratory, The Second Affiliated Hospital of Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
- Jun Zhang
  - Department of Medical Administration, Zhejiang Provincial People's Hospital, Affiliated People's Hospital, Hangzhou Medical College, Hangzhou, Zhejiang, China
  - Cancer Center, Department of Gastroenterology, Zhejiang Provincial People's Hospital, Affiliated People's Hospital, Hangzhou Medical College, Hangzhou, Zhejiang, China
- Yan Sun
  - Cancer Center, Department of Gastroenterology, Zhejiang Provincial People's Hospital, Affiliated People's Hospital, Hangzhou Medical College, Hangzhou, Zhejiang, China
- Xiaoxiao Xie
  - Hangzhou GUHE Information and Technology Company, Hangzhou, Zhejiang, China
- Tingting Ma
  - Hangzhou GUHE Information and Technology Company, Hangzhou, Zhejiang, China
- Yongze Zhu
  - Laboratory Medicine Center, Department of Clinical Laboratory, Zhejiang Provincial People's Hospital, Affiliated People's Hospital, Hangzhou Medical College, Hangzhou, Zhejiang, China
- Yaoqiang Du
  - Laboratory Medicine Center, Department of Transfusion Medicine, Zhejiang Provincial People's Hospital, Affiliated People's Hospital, Hangzhou Medical College, Hangzhou, Zhejiang, China
- Yaofang Niu
  - Hangzhou GUHE Information and Technology Company, Hangzhou, Zhejiang, China
- Xinwei Shi
  - Department of Nursing, The Eye Hospital of Wenzhou Medical University (Zhejiang Eye Hospital), Hangzhou, Zhejiang, China
9
Jung SY, Sobel EM, Pellegrini M, Yu H, Papp JC. Synergistic Effects of Genetic Variants of Glucose Homeostasis and Lifelong Exposures to Cigarette Smoking, Female Hormones, and Dietary Fat Intake on Primary Colorectal Cancer Development in African and Hispanic/Latino American Women. Front Oncol 2021; 11:760243. [PMID: 34692549 PMCID: PMC8529283 DOI: 10.3389/fonc.2021.760243] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/17/2021] [Accepted: 09/22/2021] [Indexed: 12/24/2022]
Abstract
BACKGROUND Disparities in cancer genomic science exist among racial/ethnic minorities. In particular, African American (AA) and Hispanic/Latino American (HA) women, the 2 largest minorities, are underrepresented in genetic/genome-wide studies for cancers and their risk factors. We conducted a genomic study in AA and HA postmenopausal women of insulin resistance (IR), the main biologic mechanism underlying colorectal cancer (CRC) carcinogenesis owing to obesity. METHODS With 780 genome-wide IR-specific single-nucleotide polymorphisms (SNPs) among 4,692 AA and 1,986 HA women, we constructed a CRC-risk prediction model. Along with these SNPs, we incorporated CRC-associated lifestyles in the model of each group and detected the topmost influential genetic and lifestyle factors. Further, we estimated the attributable risk of the topmost risk factors shared by the groups to explore potential factors that differentiate CRC risk between these groups. RESULTS In both groups, we detected IR-SNPs in PCSK1 (in AA) and IFT172, GCKR, and NRBP1 (in HA) and risk lifestyles, including long lifetime exposures to cigarette smoking and endogenous female hormones and daily intake of polyunsaturated fatty acids (PFA), as the topmost predictive variables for CRC risk. Combinations of those top genetic and lifestyle markers synergistically increased CRC risk. Of those risk factors, dietary PFA intake and long lifetime exposure to female hormones may play a key role in mediating racial disparity of CRC incidence between AA and HA women. CONCLUSIONS Our results may improve CRC risk prediction performance in those medically/scientifically underrepresented groups and lead to the development of genetically informed interventions for cancer prevention and therapeutic effort, thus contributing to reduced cancer disparities in those minority subpopulations.
Affiliation(s)
- Su Yon Jung
  - Translational Sciences Section, Jonsson Comprehensive Cancer Center, School of Nursing, University of California, Los Angeles, Los Angeles, CA, United States
- Eric M. Sobel
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, United States
  - Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, United States
- Matteo Pellegrini
  - Department of Molecular, Cell and Developmental Biology, Life Sciences Division, University of California, Los Angeles, Los Angeles, CA, United States
- Herbert Yu
  - Cancer Epidemiology Program, University of Hawaii Cancer Center, Honolulu, HI, United States
- Jeanette C. Papp
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, United States
10
Jung SY. Genetic Signatures of Glucose Homeostasis: Synergistic Interplay With Long-Term Exposure to Cigarette Smoking in Development of Primary Colorectal Cancer Among African American Women. Clin Transl Gastroenterol 2021; 12:e00412. [PMID: 34608882 PMCID: PMC8500576 DOI: 10.14309/ctg.0000000000000412] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/25/2021] [Accepted: 08/22/2021] [Indexed: 11/17/2022]
Abstract
INTRODUCTION Insulin resistance (IR)/glucose intolerance is a critical biologic mechanism for the development of colorectal cancer (CRC) in postmenopausal women. Whereas IR and excessive adiposity are more prevalent in African American (AA) women than in White women, AA women are underrepresented in genome-wide studies for systemic regulation of IR and the association with CRC risk. METHODS With 780 genome-wide IR single-nucleotide polymorphisms (SNPs) among 4,692 AA women, we tested for a causal inference between genetically elevated IR and CRC risk. Furthermore, by incorporating CRC-associated lifestyle factors, we established a prediction model on the basis of gene-environment interactions to generate risk profiles for CRC with the most influential genetic and lifestyle factors. RESULTS In the pooled Mendelian randomization analysis, genetically elevated IR was associated with a 9-fold increased risk of CRC, but the analysis lacked analytic power. By addressing the variation of individual SNPs in CRC in the prediction model, we detected 4 fasting glucose-specific SNPs in GCK, PCSK1, and MTNR1B and 4 lifestyle factors, including smoking, aging, prolonged lifetime exposure to endogenous estrogen, and high fat intake, as the most predictive markers of CRC risk. Our joint test for those risk genotypes and lifestyles with smoking revealed synergistically increased CRC risk, more substantially in women with longer-term exposure to cigarette smoking. DISCUSSION Our findings may improve CRC prediction ability among medically underrepresented AA women and highlight genetically informed preventive interventions (e.g., smoking cessation; CRC screening for longer-term smokers) for those women at high risk with risk genotypes and behavioral patterns.
Affiliation(s)
- Su Yon Jung
  - Translational Sciences Section, School of Nursing, University of California, Los Angeles, Los Angeles, California, USA
  - Jonsson Comprehensive Cancer Center, University of California, Los Angeles, Los Angeles, California, USA
11
Montesinos-López OA, Montesinos-López A, Mosqueda-Gonzalez BA, Montesinos-López JC, Crossa J, Ramirez NL, Singh P, Valladares-Anguiano FA. A zero altered Poisson random forest model for genomic-enabled prediction. G3-GENES GENOMES GENETICS 2021; 11:6042695. [PMID: 33693599 PMCID: PMC8022945 DOI: 10.1093/g3journal/jkaa057] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 11/03/2020] [Accepted: 12/10/2020] [Indexed: 12/23/2022]
Abstract
In genomic selection, choosing the statistical machine learning model is of paramount importance. In this paper, we present an application of a zero altered random forest model with two versions (ZAP_RF and ZAPC_RF) to deal with excess zeros in count response variables. The proposed model was compared with the conventional random forest (RF) model and with the conventional Generalized Poisson Ridge regression (GPR) using two real datasets, and we found that, in terms of prediction performance, the proposed zero inflated random forest model outperformed the conventional RF and GPR models.
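To make the idea of a zero altered count distribution concrete, the sketch below implements a common textbook formulation of the zero-altered (hurdle) Poisson: a point mass at zero plus a zero-truncated Poisson for positive counts. The function names and parameter values are illustrative assumptions; how ZAP_RF and ZAPC_RF parameterise and fit the model may differ from this.

```python
from math import exp, factorial


def zap_pmf(y, pi_zero, lam):
    """P(Y = y) under a hurdle Poisson: mass pi_zero at 0, zero-truncated Poisson(lam) for y >= 1."""
    if y == 0:
        return pi_zero
    poisson = exp(-lam) * lam ** y / factorial(y)
    return (1 - pi_zero) * poisson / (1 - exp(-lam))


def zap_mean(pi_zero, lam):
    """Expected count: probability of clearing the hurdle times the truncated-Poisson mean."""
    return (1 - pi_zero) * lam / (1 - exp(-lam))


# Hypothetical parameters: 60% zeros, positive counts driven by lam = 2.5.
probs = [zap_pmf(y, pi_zero=0.6, lam=2.5) for y in range(60)]
print(sum(probs), zap_mean(0.6, 2.5))
```

Because the zero probability is modelled separately from the positive counts, the share of zeros can be larger or smaller than an ordinary Poisson with the same rate would allow.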
Affiliation(s)
- Abelardo Montesinos-López
  - Departamento de Matemáticas, Centro Universitario de Ciencias Exactas e Ingenierías (CUCEI), Universidad de Guadalajara, 44430 Guadalajara, Jalisco, México
- José Crossa
  - Colegio de Postgraduados, Montecillos, Edo. de México CP 56230, México
  - International Maize and Wheat Improvement Center (CIMMYT), Km 45, Carretera Mexico-Veracruz, CP 52640, Edo. de México, México
- Nerida Lozano Ramirez
  - International Maize and Wheat Improvement Center (CIMMYT), Km 45, Carretera Mexico-Veracruz, CP 52640, Edo. de México, México
- Pawan Singh
  - International Maize and Wheat Improvement Center (CIMMYT), Km 45, Carretera Mexico-Veracruz, CP 52640, Edo. de México, México
12
Smith PF, Zheng Y. Applications of Multivariate Statistical and Data Mining Analyses to the Search for Biomarkers of Sensorineural Hearing Loss, Tinnitus, and Vestibular Dysfunction. Front Neurol 2021; 12:627294. [PMID: 33746881 PMCID: PMC7966509 DOI: 10.3389/fneur.2021.627294] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/09/2020] [Accepted: 02/01/2021] [Indexed: 11/24/2022]
Abstract
Disorders of sensory systems, as with most disorders of the nervous system, usually involve the interaction of multiple variables to cause some change, and yet often basic sensory neuroscience data are analyzed using univariate statistical analyses only. The exclusive use of univariate statistical procedures, analyzing one variable at a time, may limit the potential of studies to determine how interactions between variables may, as a network, determine a particular result. The use of multivariate statistical and data mining methods provides the opportunity to analyse many variables together, in order to appreciate how they may function as a system of interacting variables, and how this system or network may change as a result of sensory disorders such as sensorineural hearing loss, tinnitus or different types of vestibular dysfunction. Here we provide an overview of the potential applications of multivariate statistical and data mining techniques, such as principal component and factor analysis, cluster analysis, multiple linear regression, random forest regression, linear discriminant analysis, support vector machines, random forest classification, Bayesian classification, and orthogonal partial least squares discriminant analysis, to the study of auditory and vestibular dysfunction, with an emphasis on classification analytic methods that may be used in the search for biomarkers of disease.
Affiliation(s)
- Paul F. Smith
  - Department of Pharmacology and Toxicology, Brain Health Research Centre, School of Biomedical Sciences, University of Otago, Dunedin, New Zealand
  - Brain Research New Zealand Centre of Research Excellence, University of Auckland, Auckland, New Zealand
  - The Eisdell Moore Centre for Hearing and Balance Research, University of Auckland, Auckland, New Zealand
- Yiwen Zheng
  - Department of Pharmacology and Toxicology, Brain Health Research Centre, School of Biomedical Sciences, University of Otago, Dunedin, New Zealand
  - Brain Research New Zealand Centre of Research Excellence, University of Auckland, Auckland, New Zealand
  - The Eisdell Moore Centre for Hearing and Balance Research, University of Auckland, Auckland, New Zealand
13
He S, Guo F, Zou Q, Ding H. MRMD2.0: A Python Tool for Machine Learning with Feature Ranking and Reduction. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200503030350] [Citation(s) in RCA: 101] [Impact Index Per Article: 33.7] [Indexed: 11/22/2022]
Abstract
Aims:
The study aims to find a way to reduce the dimensionality of a dataset.
Background:
Dimensionality reduction is a key issue in the machine learning process. It not only improves prediction performance but can also recommend intrinsic features and help to explore the biological expression of the machine learning “black box”.
Objective:
A variety of feature selection algorithms are used to select data features to achieve dimensionality reduction.
Methods:
First, MRMD2.0 integrates 7 popular feature ranking algorithms using a PageRank strategy. Second, the optimized dimensionality is detected with a forward adding strategy.
Result:
We have achieved good results in our experiments.
Conclusion:
Several works have been tested with MRMD2.0, and it performed well. It can also draw performance curves as a function of feature dimensionality. If users want to sacrifice accuracy for fewer features, they can select the dimensionality from the performance curves.
Other:
We developed user-friendly Python tools together with a web server. Users can upload their csv, arff or libsvm format files, and the web server then helps to rank the features and find the optimized dimensionality.
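The forward adding strategy named in the Methods can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the MRMD2.0 implementation: the names `forward_add` and `toy_score` are hypothetical, and `toy_score` merely stands in for whatever cross-validated performance measure a real tool would compute on each candidate feature subset.

```python
from typing import Callable, List, Sequence, Tuple


def forward_add(
    ranked_features: Sequence[str],
    score: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Add features in ranked order and keep the best-scoring prefix."""
    best_subset: List[str] = []
    best_score = float("-inf")
    current: List[str] = []
    for feature in ranked_features:
        current.append(feature)
        s = score(current)
        # Strict comparison: on ties, the smaller feature set is kept.
        if s > best_score:
            best_score = s
            best_subset = list(current)
    return best_subset, best_score


def toy_score(subset: List[str]) -> float:
    # Hypothetical stand-in for cross-validated accuracy:
    # informative features help, uninformative ones cost a little.
    informative = {"f1", "f2", "f3"}
    gain = sum(0.1 for f in subset if f in informative)
    penalty = sum(0.02 for f in subset if f not in informative)
    return 0.5 + gain - penalty


# Assumed output of a feature-ranking step, best feature first.
ranking = ["f1", "f2", "f3", "f4", "f5"]
subset, best = forward_add(ranking, toy_score)
```

Recording the score at every prefix length, rather than only the best one, would give the kind of performance curve the Conclusion mentions.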
Affiliation(s)
- Shida He
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Fei Guo
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Hui Ding
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| |
|
14
|
Seifert S, Gundlach S, Junge O, Szymczak S. Integrating biological knowledge and gene expression data using pathway-guided random forests: a benchmarking study. Bioinformatics 2021; 36:4301-4308. [PMID: 32399562 PMCID: PMC7520048 DOI: 10.1093/bioinformatics/btaa483] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 10/09/2019] [Revised: 03/13/2020] [Accepted: 05/05/2020] [Indexed: 12/12/2022] Open
Abstract
MOTIVATION High-throughput technologies allow comprehensive characterization of individuals on many molecular levels. However, training computational models to predict disease status based on omics data is challenging. A promising solution is the integration of external knowledge about structural and functional relationships into the modeling process. We compared four published random forest-based approaches using two simulation studies and nine experimental datasets. RESULTS The self-sufficient prediction error approach should be applied when large numbers of relevant pathways are expected. The competing methods hunting and learner of functional enrichment should be used when low numbers of relevant pathways are expected or the most strongly associated pathways are of interest. The hybrid approach synthetic features is not recommended because of its high false discovery rate. AVAILABILITY AND IMPLEMENTATION An R package providing functions for data analysis and simulation is available at GitHub (https://github.com/szymczak-lab/PathwayGuidedRF). An accompanying R data package (https://github.com/szymczak-lab/DataPathwayGuidedRF) stores the processed and quality controlled experimental datasets downloaded from Gene Expression Omnibus (GEO). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Stephan Seifert
- Institute of Medical Informatics and Statistics, Kiel University, University Hospital Schleswig-Holstein, Kiel 24105, Germany
| | - Sven Gundlach
- Institute of Medical Informatics and Statistics, Kiel University, University Hospital Schleswig-Holstein, Kiel 24105, Germany
| | - Olaf Junge
- Institute of Medical Informatics and Statistics, Kiel University, University Hospital Schleswig-Holstein, Kiel 24105, Germany
| | - Silke Szymczak
- Institute of Medical Informatics and Statistics, Kiel University, University Hospital Schleswig-Holstein, Kiel 24105, Germany
| |
|
15
|
Gut microbiome analysis as a predictive marker for the gastric cancer patients. Appl Microbiol Biotechnol 2021; 105:803-814. [PMID: 33404833 DOI: 10.1007/s00253-020-11043-7] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Received: 06/05/2020] [Revised: 11/24/2020] [Accepted: 12/01/2020] [Indexed: 02/06/2023]
Abstract
Gut microbiota have been implicated in the development of cancer. Colorectal and gastric cancers, the major gastrointestinal tract cancers, are closely connected with the gut microbiome. Nevertheless, the characteristics of gut microbiota composition that correlate with gastric cancer are unclear. In this study, we investigated gut microbiota alterations during the progression of gastric cancer to identify the most relevant taxa associated with gastric cancer and evaluated the potential of the microbiome as an indicator for the diagnosis of gastric cancer. Compared with the healthy group, gut microbiota composition and diversity shifted in patients with gastric cancer. Different bacteria were used to design a random forest model, which provided an area under the curve value of 0.91. Verification samples achieved a true positive rate of 0.83 in gastric cancer. Principal component analysis showed that gastritis shares some microbiome characteristics of gastric cancer. Chemotherapy reduced the elevated bacteria levels in gastric cancer by more than half. More importantly, we found that the genera Lactobacillus and Megasphaera were associated with gastric cancer.
Key Points
• Gut microbiota has high sensitivity and specificity in distinguishing patients with gastric cancer from healthy individuals, indicating that gut microbiota is a potential noninvasive tool for the diagnosis of gastric cancer.
• Gastritis shares some microbiota features with gastric cancer, and chemotherapy reduces the microbial abundance and diversity in gastric cancer patients.
• Two bacterial taxa, namely, Lactobacillus and Megasphaera, are predictive markers for gastric cancer.
|
16
|
Gao Q, Jin X, Xia E, Wu X, Gu L, Yan H, Xia Y, Li S. Identification of Orphan Genes in Unbalanced Datasets Based on Ensemble Learning. Front Genet 2020; 11:820. [PMID: 33133122 PMCID: PMC7567012 DOI: 10.3389/fgene.2020.00820] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 06/09/2020] [Accepted: 07/08/2020] [Indexed: 11/13/2022] Open
Abstract
Orphan genes are associated with regulatory patterns, but experimental methods for identifying orphan genes are both time-consuming and expensive. Designing an accurate and robust classification model to detect orphan and non-orphan genes in unbalanced distribution datasets poses a particularly large challenge. Synthetic minority over-sampling algorithms (SMOTE) are selected in a preliminary step to deal with unbalanced gene datasets. To identify orphan genes in balanced and unbalanced Arabidopsis thaliana gene datasets, SMOTE algorithms were then combined with traditional and advanced ensemble classification algorithms respectively, using Support Vector Machine, Random Forest (RF), AdaBoost (adaptive boosting), GBDT (gradient boosting decision tree), and XGBoost (extreme gradient boosting). After comparing the performance of these ensemble models, SMOTE algorithms with XGBoost achieved an F1 score of 0.94 with the balanced A. thaliana gene datasets, but a lower score with the unbalanced datasets. The proposed ensemble method combines different data balancing algorithms, including Borderline SMOTE (BSMOTE), Adaptive Synthetic Sampling (ADSYN), SMOTE-Tomek, and SMOTE-ENN, with the XGBoost model separately. The SMOTE-ENN-XGBoost model, which combines over-sampling and under-sampling algorithms with XGBoost, achieved higher predictive accuracy than the other balancing algorithms combined with XGBoost. Thus, SMOTE-ENN-XGBoost provides a theoretical basis for developing evaluation criteria for identifying orphan genes in unbalanced and biological datasets.
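The over-sampling idea behind SMOTE can be illustrated with a short sketch. This is an assumption-laden toy in pure Python, not the procedure used in the study and not a library implementation: the function name `smote_sketch`, the choice of `k`, and the toy minority points are all hypothetical. Each synthetic sample is placed at a random position on the line segment between a minority-class point and one of its nearest minority-class neighbours.

```python
import math
import random
from typing import List, Sequence

Point = Sequence[float]


def smote_sketch(
    minority: List[Point], n_new: int, k: int = 2, seed: int = 0
) -> List[List[float]]:
    """Create n_new synthetic minority samples by neighbour interpolation."""
    rng = random.Random(seed)
    synthetic: List[List[float]] = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical two-feature minority class.
minority_class = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority_class, n_new=6)
```

Variants such as SMOTE-ENN or SMOTE-Tomek, named in the abstract, follow this kind of over-sampling with a cleaning step that removes samples; that step is not shown here.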
Affiliation(s)
- Qijuan Gao
- Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agricultural University, Hefei, China
| | - Xiu Jin
- Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agricultural University, Hefei, China
| | - Enhua Xia
- State Key Laboratory of Tea Plant Biology and Utilization, Anhui Agricultural University, Hefei, China
| | - Xiangwei Wu
- School of Resources and Environment, Anhui Agricultural University, Hefei, China
| | - Lichuan Gu
- School of Information and Computer Science, Anhui Agricultural University, Hefei, China
| | - Hanwei Yan
- Key Laboratory of Crop Biology of Anhui Province, Anhui Agricultural University, Hefei, China
| | - Yingchun Xia
- School of Information and Computer Science, Anhui Agricultural University, Hefei, China
| | - Shaowen Li
- Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agricultural University, Hefei, China
| |
|
17
|
Yan KK, Wang X, Lam WWT, Vardhanabhuti V, Lee AWM, Pang HH. Radiomics analysis using stability selection supervised component analysis for right-censored survival data. Comput Biol Med 2020; 124:103959. [PMID: 32905923 PMCID: PMC7501167 DOI: 10.1016/j.compbiomed.2020.103959] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 02/19/2020] [Revised: 08/02/2020] [Accepted: 08/03/2020] [Indexed: 02/03/2023]
Abstract
Radiomics is a newly emerging field that involves the extraction of massive quantitative features from biomedical images by using data-characterization algorithms. Distinctive imaging features identified from biomedical images can be used for prognosis and therapeutic response prediction, and they can provide a noninvasive approach for personalized therapy. So far, many of the published radiomics studies utilize existing out of the box algorithms to identify the prognostic markers from biomedical images that are not specific to radiomics data. To better utilize biomedical images, we propose a novel machine learning approach, stability selection supervised principal component analysis (SSSuperPCA) that identifies stable features from radiomics big data coupled with dimension reduction for right-censored survival outcomes. The proposed approach allows us to identify a set of stable features that are highly associated with the survival outcomes in a simple yet meaningful manner, while controlling the per-family error rate. We evaluate the performance of SSSuperPCA using simulations and real data sets for non-small cell lung cancer and head and neck cancer, and compare it with other machine learning algorithms. The results demonstrate that our method has a competitive edge over other existing methods in identifying the prognostic markers from biomedical imaging data for the prediction of right-censored survival outcomes.
Affiliation(s)
- Kang K Yan
- School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
| | - Xiaofei Wang
- Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC, USA
| | - Wendy W T Lam
- School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China; Jockey Club Institute of Cancer Care, Li Ka Shing Faculty of Medicine, Hong Kong SAR, China
| | - Varut Vardhanabhuti
- Department of Diagnostic Radiology, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
| | - Anne W M Lee
- Department of Clinical Oncology, The University of Hong Kong-Shenzhen Hospital and The University of Hong Kong, Hong Kong SAR, China
| | - Herbert H Pang
- School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China.
| |
|
18
|
Gu X, Chen Z, Wang D. Prediction of G Protein-Coupled Receptors With CTDC Extraction and MRMD2.0 Dimension-Reduction Methods. Front Bioeng Biotechnol 2020; 8:635. [PMID: 32671038 PMCID: PMC7329982 DOI: 10.3389/fbioe.2020.00635] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 04/21/2020] [Accepted: 05/26/2020] [Indexed: 11/13/2022] Open
Abstract
The G Protein-Coupled Receptor (GPCR) family consists of more than 800 different members. In this article, we attempt to use the physicochemical properties of Composition, Transition, Distribution (CTD) to represent GPCRs. The MRMD2.0 dimensionality reduction method filters out redundant physicochemical properties of GPCRs. Matplotlib is used to plot the coordinates to distinguish GPCRs from other protein sequences. The plots show a clear separation, with a well-defined boundary between the two. The experimental results show that our method can predict GPCRs.
Affiliation(s)
- Xingyue Gu
- Institute of Computing Science and Technology, Guangzhou University, Guangzhou, China
| | - Zhihua Chen
- Institute of Computing Science and Technology, Guangzhou University, Guangzhou, China
| | - Donghua Wang
- Department of General Surgery, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
| |
|
19
|
Effect of the Abnormal Expression of BMP-4 in the Blood of Diabetic Patients on the Osteogenic Differentiation Potential of Alveolar BMSCs and the Rescue Effect of Metformin: A Bioinformatics-Based Study. BIOMED RESEARCH INTERNATIONAL 2020; 2020:7626215. [PMID: 32596370 PMCID: PMC7298258 DOI: 10.1155/2020/7626215] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 01/30/2020] [Accepted: 04/28/2020] [Indexed: 02/08/2023]
Abstract
The success rate of oral implants is lower in type 2 diabetes mellitus (T2DM) patients than in nondiabetic subjects; functional impairment of bone marrow-derived mesenchymal stem cells (BMSCs) is an important underlying cause. Many factors in the blood can act on BMSCs to regulate their biological functions and influence implant osseointegration, but which factors play important negative roles in T2DM patients is still unclear. This study is aimed at screening differentially expressed genes in the blood from T2DM and nondiabetic patients, identifying which genes impact the osteogenic differentiation potential of alveolar BMSCs in T2DM patients, exploring drug intervention regimens, and providing a basis for improving implant osseointegration. Thus, a whole-blood gene expression microarray dataset (GSE26168) of T2DM patients and nondiabetic controls was analyzed. Based on Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) results, differentially expressed genes and signaling pathways related to BMSC osteogenic differentiation were screened, and major risk genes were extracted based on the mean decrease Gini coefficient calculated using the random forest method. Bone morphogenetic protein-4 (BMP-4), with significantly low expression in T2DM blood, was identified as the most significant factor affecting BMSC osteogenic differentiation potential. Subsequently, metformin, a first-line clinical drug for T2DM treatment, was found to improve the osteogenic differentiation potential of BMSCs from T2DM patients via the BMP-4/Smad/Runx2 signaling pathway. These results demonstrate that low BMP-4 expression in the blood of T2DM patients significantly hinders the osteogenic function of BMSCs and that metformin is effective in counteracting the negative impact of BMP-4 deficiency.
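The quantity behind a "mean decrease Gini" ranking can be illustrated for a single split. This is a sketch of the general definition only, with hypothetical function names and toy class labels; it is not the study's pipeline. A random forest importance score averages such decreases over all splits that use a given feature across all trees.

```python
from collections import Counter
from typing import Sequence


def gini(labels: Sequence[str]) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(
    parent: Sequence[str], left: Sequence[str], right: Sequence[str]
) -> float:
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


# Toy node: 4 patients and 4 controls, split on a hypothetical expression threshold.
parent = ["T2DM"] * 4 + ["control"] * 4
left = ["T2DM"] * 3 + ["control"] * 1
right = ["T2DM"] * 1 + ["control"] * 3
decrease = gini_decrease(parent, left, right)
```

A split that separated the two classes perfectly would give the largest possible decrease for this node (0.5), which is why genes producing cleaner splits rank higher.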
|
20
|
Wang H, Sham P, Tong T, Pang H. Pathway-Based Single-Cell RNA-Seq Classification, Clustering, and Construction of Gene-Gene Interactions Networks Using Random Forests. IEEE J Biomed Health Inform 2020; 24:1814-1822. [DOI: 10.1109/jbhi.2019.2944865] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 11/09/2022]
|
21
|
Fernández-Martínez JL, Álvarez-Machancoses Ó, deAndrés-Galiana EJ, Bea G, Kloczkowski A. Robust Sampling of Defective Pathways in Alzheimer's Disease. Implications in Drug Repositioning. Int J Mol Sci 2020; 21:ijms21103594. [PMID: 32438758 PMCID: PMC7279419 DOI: 10.3390/ijms21103594] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 04/27/2020] [Revised: 05/09/2020] [Accepted: 05/13/2020] [Indexed: 12/21/2022] Open
Abstract
We present the analysis of the defective genetic pathways of Late-Onset Alzheimer’s Disease (LOAD) compared to Mild Cognitive Impairment (MCI) and Healthy Controls (HC) using different sampling methodologies. These algorithms sample the uncertainty space that is intrinsic to any kind of highly underdetermined phenotype prediction problem, by looking for the minimum-scale signatures (header genes) corresponding to different random holdouts. The biological pathways can be identified by performing posterior analysis of these signatures established via cross-validation holdouts and plugging the set of most frequently sampled genes into different ontological platforms. That way, the effect of helper genes, whose presence might be due to the high degree of underdeterminacy of these experiments and data noise, is reduced. Our results suggest that common pathways for Alzheimer’s disease and MCI are mainly related to viral mRNA translation, influenza viral RNA transcription and replication, gene expression, mitochondrial translation, and metabolism, with these results being highly consistent regardless of the comparative methods. The cross-validated predictive accuracies achieved for the LOAD and MCI discriminations were 84% and 81.5%, respectively. The difference between LOAD and MCI could not be clearly established (74% accuracy). The most discriminatory genes of the LOAD-MCI discrimination are associated with proteasome-mediated degradation and G-protein signaling. Based on these findings we have also performed drug repositioning using the Dr. Insight package, proposing the following different typologies of drugs: isoquinoline alkaloids, antitumor antibiotics, phosphoinositide 3-kinase PI3K, autophagy inhibitors, antagonists of the muscarinic acetylcholine receptor and histone deacetylase inhibitors. We believe that the potential clinical relevance of these findings should be further investigated and confirmed with other independent studies.
Affiliation(s)
- Juan Luis Fernández-Martínez
- Group of Inverse Problems, Optimization and Machine Learning, Department of Mathematics, University of Oviedo, C/Federico García Lorca, 18, 33007 Oviedo, Spain
- DeepBioInsights, C/Federico García Lorca, 18, 33007 Oviedo, Spain
| | - Óscar Álvarez-Machancoses
- Group of Inverse Problems, Optimization and Machine Learning, Department of Mathematics, University of Oviedo, C/Federico García Lorca, 18, 33007 Oviedo, Spain
- DeepBioInsights, C/Federico García Lorca, 18, 33007 Oviedo, Spain
| | - Enrique J. deAndrés-Galiana
- Group of Inverse Problems, Optimization and Machine Learning, Department of Mathematics, University of Oviedo, C/Federico García Lorca, 18, 33007 Oviedo, Spain
- Department of Informatics and Computer Science, University of Oviedo, C/Federico García Lorca, 18, 33007 Oviedo, Spain
| | - Guillermina Bea
- Group of Inverse Problems, Optimization and Machine Learning, Department of Mathematics, University of Oviedo, C/Federico García Lorca, 18, 33007 Oviedo, Spain
| | - Andrzej Kloczkowski
- Battelle Center for Mathematical Medicine, Nationwide Children’s Hospital, Columbus, OH 43205, USA
- Department of Pediatrics, The Ohio State University, Columbus, OH 43205, USA
| |
|
22
|
Seifert S. Application of random forest based approaches to surface-enhanced Raman scattering data. Sci Rep 2020; 10:5436. [PMID: 32214194 PMCID: PMC7096517 DOI: 10.1038/s41598-020-62338-8] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 11/12/2019] [Accepted: 02/26/2020] [Indexed: 01/08/2023] Open
Abstract
Surface-enhanced Raman scattering (SERS) is a valuable analytical technique for the analysis of biological samples. However, due to the nature of SERS it is often challenging to exploit the generated data to obtain the desired information when no reporter or label molecules are used. Here, the suitability of random forest based approaches is evaluated using SERS data generated by a simulation framework that is also presented. More specifically, it is demonstrated that important SERS signals can be identified, the relevance of predefined spectral groups can be evaluated, and the relations of different SERS signals can be analyzed. It is shown that for the selection of important SERS signals Boruta and surrogate minimal depth (SMD) and for the analysis of spectral groups the competing method Learner of Functional Enrichment (LeFE) should be applied. In general, this investigation demonstrates that the combination of random forest approaches and SERS data is very promising for sophisticated analysis of complex biological samples.
Affiliation(s)
- Stephan Seifert
- Kiel University, University Hospital Schleswig-Holstein, Institute of Medical Informatics and Statistics, Kiel, 24105, Germany.
- University of Hamburg, Hamburg School of Food Science, Institute of Food Chemistry, Hamburg, 20146, Germany.
| |
|
23
|
Holland CH, Tanevski J, Perales-Patón J, Gleixner J, Kumar MP, Mereu E, Joughin BA, Stegle O, Lauffenburger DA, Heyn H, Szalai B, Saez-Rodriguez J. Robustness and applicability of transcription factor and pathway analysis tools on single-cell RNA-seq data. Genome Biol 2020; 21:36. [PMID: 32051003 PMCID: PMC7017576 DOI: 10.1186/s13059-020-1949-z] [Citation(s) in RCA: 173] [Impact Index Per Article: 43.3] [Received: 09/03/2019] [Accepted: 01/29/2020] [Indexed: 12/31/2022] Open
Abstract
BACKGROUND Many functional analysis tools have been developed to extract functional and mechanistic insight from bulk transcriptome data. With the advent of single-cell RNA sequencing (scRNA-seq), it is in principle possible to do such an analysis for single cells. However, scRNA-seq data has characteristics such as drop-out events and low library sizes. It is thus not clear if functional TF and pathway analysis tools established for bulk sequencing can be applied to scRNA-seq in a meaningful way. RESULTS To address this question, we perform benchmark studies on simulated and real scRNA-seq data. We include the bulk-RNA tools PROGENy, GO enrichment, and DoRothEA that estimate pathway and transcription factor (TF) activities, respectively, and compare them against the tools SCENIC/AUCell and metaVIPER, designed for scRNA-seq. For the in silico study, we simulate single cells from TF/pathway perturbation bulk RNA-seq experiments. We complement the simulated data with real scRNA-seq data upon CRISPR-mediated knock-out. Our benchmarks on simulated and real data reveal comparable performance to the original bulk data. Additionally, we show that the TF and pathway activities preserve cell type-specific variability by analyzing a mixture sample sequenced with 13 scRNA-seq protocols. We also provide the benchmark data for further use by the community. CONCLUSIONS Our analyses suggest that bulk-based functional analysis tools that use manually curated footprint gene sets can be applied to scRNA-seq data, partially outperforming dedicated single-cell tools. Furthermore, we find that the performance of functional analysis tools is more sensitive to the gene sets than to the statistic used.
Affiliation(s)
- Christian H Holland
- Institute for Computational Biomedicine, Bioquant, Heidelberg University, Faculty of Medicine, and Heidelberg University Hospital, Heidelberg, Germany
- Joint Research Centre for Computational Biomedicine (JRC-COMBINE), RWTH Aachen University, Faculty of Medicine, Aachen, Germany
| | - Jovan Tanevski
- Institute for Computational Biomedicine, Bioquant, Heidelberg University, Faculty of Medicine, and Heidelberg University Hospital, Heidelberg, Germany
- Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
| | - Javier Perales-Patón
- Institute for Computational Biomedicine, Bioquant, Heidelberg University, Faculty of Medicine, and Heidelberg University Hospital, Heidelberg, Germany
| | - Jan Gleixner
- German Cancer Research Center (DKFZ), Heidelberg, Germany
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
| | - Manu P Kumar
- Department of Biological Engineering, MIT, Cambridge, MA, USA
| | - Elisabetta Mereu
- CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Barcelona, Spain
| | - Brian A Joughin
- Department of Biological Engineering, MIT, Cambridge, MA, USA
- Koch Institute for Integrative Cancer Biology, MIT, Cambridge, MA, USA
| | - Oliver Stegle
- German Cancer Research Center (DKFZ), Heidelberg, Germany
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Cambridge, UK
| | | | - Holger Heyn
- CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Bence Szalai
- Faculty of Medicine, Department of Physiology, Semmelweis University, Budapest, Hungary
| | - Julio Saez-Rodriguez
- Institute for Computational Biomedicine, Bioquant, Heidelberg University, Faculty of Medicine, and Heidelberg University Hospital, Heidelberg, Germany.
- Joint Research Centre for Computational Biomedicine (JRC-COMBINE), RWTH Aachen University, Faculty of Medicine, Aachen, Germany.
| |
|
24
|
Martey ONK, Greish K, Smith PF, Rosengren RJ. A multivariate statistical analysis of the effects of styrene maleic acid encapsulated RL71 in a xenograft model of triple negative breast cancer. J Biol Methods 2019; 6:e121. [PMID: 31976348 PMCID: PMC6974696 DOI: 10.14440/jbm.2019.306] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 06/03/2019] [Revised: 09/08/2016] [Accepted: 10/07/2019] [Indexed: 12/29/2022] Open
Abstract
We have previously shown that the curcumin derivative 3,5-bis(3,4,5-trimethoxybenzylidene)-1-methylpiperidine-4-one (RL71), when encapsulated in styrene maleic acid micelles (SMA-RL71), significantly suppressed the growth of MDA-MB-231 xenografts by 67%. Univariate statistical analysis showed that pEGFR/EGFR, pAkt/Akt, pmTOR/mTOR and p4EBP1/4EBP1 were all significantly decreased in tumors from treated mice compared to SMA controls. In this study, multivariate statistical analyses (MVAs) were performed to identify the molecular networks that worked together to drive tumor suppression, with the aim of determining whether this analysis could also be used to predict treatment outcome. Linear discriminant analysis correctly predicted, to 100% certainty, mice that received SMA-RL71 treatment. Additionally, results from multiple linear regression showed that the expression of Ki67, PKC-α, PP2AA-α, PP2AA-β and CaD1 networked together to drive tumor growth suppression. Overall, the MVAs provided evidence for a molecular network of signaling proteins that drives tumor suppression in response to SMA-RL71 treatment, which should be explored further in animal studies of cancer.
Affiliation(s)
- Orleans N K Martey
- Department of Pharmacology and Toxicology, School of Biomedical Sciences, University of Otago, Dunedin 9045, New Zealand
| | - Khaled Greish
- Department of Molecular Medicine, College of Medicine and Medical Sciences, Arabian Gulf University, Manama, Kingdom of Bahrain
| | - Paul F Smith
- Department of Pharmacology and Toxicology, School of Biomedical Sciences, University of Otago, Dunedin 9045, New Zealand
| | - Rhonda J Rosengren
- Department of Pharmacology and Toxicology, School of Biomedical Sciences, University of Otago, Dunedin 9045, New Zealand
| |
|
25
|
Network-based Biased Tree Ensembles (NetBiTE) for Drug Sensitivity Prediction and Drug Sensitivity Biomarker Identification in Cancer. Sci Rep 2019; 9:15918. [PMID: 31685861 PMCID: PMC6828742 DOI: 10.1038/s41598-019-52093-w] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 05/14/2019] [Accepted: 10/07/2019] [Indexed: 12/15/2022] Open
Abstract
We present the Network-based Biased Tree Ensembles (NetBiTE) method for drug sensitivity prediction and drug sensitivity biomarker identification in cancer using a combination of prior knowledge and gene expression data. Our devised method consists of a biased tree ensemble that is built according to a probabilistic bias weight distribution. The bias weight distribution is obtained by assigning high weights to the drug targets and propagating the assigned weights over a protein-protein interaction network such as STRING. The propagation of weights defines neighborhoods of influence around the drug targets and as such simulates the spread of perturbations within the cell following drug administration. Using a synthetic dataset, we showcase how application of biased tree ensembles (BiTE) results in significant accuracy gains at a much lower computational cost compared to the unbiased random forests (RF) algorithm. We then apply NetBiTE to the Genomics of Drug Sensitivity in Cancer (GDSC) dataset and demonstrate that NetBiTE outperforms RF in predicting IC50 drug sensitivity only for drugs that target membrane receptor pathways (MRPs): RTK, EGFR and IGFR signaling pathways. We propose, based on the NetBiTE results, that for drugs that inhibit MRPs, the expression of target genes prior to drug administration is a biomarker for IC50 drug sensitivity following drug administration. We further verify and reinforce this proposition through control studies on PI3K/MTOR signaling pathway inhibitors, a drug category that does not target MRPs, and through assignment of dummy targets to MRP-inhibiting drugs and investigating the variation in NetBiTE accuracy.
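The idea of seeding drug targets with high weights and spreading them over an interaction network can be sketched as below. The abstract does not specify the propagation scheme, so this example assumes a simple random-walk-with-restart style update; the function name `propagate`, the parameters `alpha` and `steps`, and the toy network are all hypothetical and are not taken from NetBiTE or STRING.

```python
from typing import Dict, List


def propagate(
    graph: Dict[str, List[str]],
    seeds: Dict[str, float],
    alpha: float = 0.5,
    steps: int = 20,
) -> Dict[str, float]:
    """Spread seed weights over an undirected graph given as adjacency lists."""
    weights = {node: seeds.get(node, 0.0) for node in graph}
    for _ in range(steps):
        updated = {}
        for node in graph:
            # Each neighbour shares its current weight equally among its own neighbours.
            incoming = sum(weights[nb] / len(graph[nb]) for nb in graph[node])
            # Mix the restart term (original seed weight) with the diffused weight.
            updated[node] = (1 - alpha) * seeds.get(node, 0.0) + alpha * incoming
        weights = updated
    return weights


# Hypothetical network: one drug target with direct and indirect neighbours.
toy_network = {
    "TARGET": ["A", "B"],
    "A": ["TARGET", "C"],
    "B": ["TARGET"],
    "C": ["A"],
}
weights = propagate(toy_network, {"TARGET": 1.0})
```

In this toy setting the weight falls off with distance from the target, which is one way to obtain the "neighborhoods of influence" the abstract describes; the resulting weights could then serve as sampling probabilities when choosing candidate features for each tree.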
|
26
|
Rahimi A, Gönen M. Discriminating early- and late-stage cancers using multiple kernel learning on gene sets. Bioinformatics 2019; 34:i412-i421. [PMID: 29949993 PMCID: PMC6022595 DOI: 10.1093/bioinformatics/bty239] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Indexed: 12/20/2022] Open
Abstract
Motivation Identifying molecular mechanisms that drive cancers from early to late stages is highly important to develop new preventive and therapeutic strategies. Standard machine learning algorithms could be used to discriminate early- and late-stage cancers from each other using their genomic characterizations. Even though these algorithms would get satisfactory predictive performance, their knowledge extraction capability would be quite restricted due to the highly correlated nature of genomic data. That is why we need algorithms that can also extract relevant information about these biological mechanisms using our prior knowledge about pathways/gene sets. Results In this study, we addressed the problem of separating early- and late-stage cancers from each other using their gene expression profiles. We proposed to use a multiple kernel learning (MKL) formulation that makes use of pathways/gene sets (i) to obtain satisfactory/improved predictive performance and (ii) to identify biological mechanisms that might have an effect on cancer progression. We extensively compared our proposed MKL on gene sets algorithm against two standard machine learning algorithms, namely, random forests and support vector machines, on 20 diseases from the Cancer Genome Atlas cohorts for two different sets of experiments. Our method obtained statistically significantly better or comparable predictive performance on most of the datasets using significantly fewer gene expression features. We also showed that our algorithm was able to extract meaningful and disease-specific information that gives clues about the progression mechanism. Availability and implementation Our implementations of support vector machine and multiple kernel learning algorithms in R are available at https://github.com/mehmetgonen/gsbc together with the scripts that replicate the reported experiments.
Affiliation(s)
- Arezou Rahimi
- Graduate School of Sciences and Engineering, Koç University, Istanbul, Turkey
| | - Mehmet Gönen
- Department of Industrial Engineering, College of Engineering, Koç University, İstanbul, Turkey.,School of Medicine, Koç University, İstanbul, Turkey.,Department of Biomedical Engineering, School of Medicine, Oregon Health & Science University, Portland, OR, USA
| |
|
27
|
Modelling the Spatial Distribution of Asbestos–Cement Products in Poland with the Use of the Random Forest Algorithm. SUSTAINABILITY 2019. [DOI: 10.3390/su11164355] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 11/17/2022]
Abstract
The unique set of physical and chemical properties of asbestos has led to its many industrial applications worldwide, of which roofing and facades constitute approximately 80% of currently used asbestos-containing products. Since asbestos-containing products are harmful to human health, their use and production have been banned in many countries. To date, no research has been undertaken to estimate the total amount of asbestos–cement products used at the country level in relation to regions or other administrative units. The objective of this paper is to present a possible new solution for developing the spatial distribution of asbestos–cement products used across the country by applying the supervised machine learning algorithm, i.e., Random Forest. Based on the results of a physical inventory taken on asbestos–cement products with the use of aerial imagery, and the application of selected features, considering the socio-economic situation of Poland, i.e., population, buildings, public finance, housing economy and municipal infrastructure, wages, salaries and social security benefits, agricultural census, entities of the national economy, labor market, environment protection, area of built-up surfaces, historical belonging to annexations, and data on asbestos manufacturing plants, the best Random Forest models were computed. The selection of important variables was made in the R v.3.1.0 program and supported by the Boruta algorithm. The prediction of the amount of asbestos–cement products used in communes was executed in the randomForest package. An algorithm explaining 75.85% of the variance was subsequently used to prepare the prediction map of the spatial distribution of the amount of asbestos–cement products used in Poland. The total amount was estimated at 710,278,645 m² (7.8 million tons).
Since the best model used data on built-up surfaces which are available for the whole of Europe, it is worth considering the use of the developed method in other European countries, as well as to assess the environmental risk of asbestos exposure to humans.
|
28
|
Jung SY, Zhang ZF. The effects of genetic variants related to insulin metabolism pathways and the interactions with lifestyles on colorectal cancer risk. Menopause 2019; 26:771-780. [PMID: 30649085 PMCID: PMC7035960 DOI: 10.1097/gme.0000000000001301] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 02/07/2023]
Abstract
OBJECTIVES: Genetic variants in metabolic signaling pathways may interact with lifestyle factors, such as dietary fatty acids, influencing postmenopausal colorectal cancer (CRC) risk, but these interrelated pathways are not fully understood. METHODS: In this study, we examined 54 single-nucleotide polymorphisms (SNPs) in genes related to insulin-like growth factor-I/insulin traits and their signaling pathways and lifestyle factors in relation to postmenopausal CRC, using data from 6,539 postmenopausal women in the Women's Health Initiative Harmonized and Imputed Genome-Wide Association Studies. By employing a two-stage random survival forest analysis, we evaluated the SNPs and lifestyle factors by ranking them according to their predictive value and accuracy for CRC. RESULTS: We identified four SNPs (IRS1 rs1801123, IRS1 rs1801278, AKT2 rs3730256, and AKT2 rs7247515) and two lifestyle factors (age and percentage calories from saturated fatty acids) as the top six most influential predictors for CRC risk. We further examined interactive effects of those factors on cancer risk. In the individual SNP analysis, no significant association was observed, but the combination of the four SNPs, age, and percentage calories from saturated fatty acid (≥11% per day) significantly increased the risk of CRC in a gene and lifestyle dose-dependent manner. CONCLUSIONS: Our findings provide insight into gene-lifestyle interactions and will enable researchers to focus on individuals with risk genotypes to promote intervention strategies. Our study suggests the careful use of data on potential genetic targets in clinical trials for cancer prevention to reduce the risk for CRC in postmenopausal women.
Affiliation(s)
- Su Yon Jung
- Translational Sciences Section, Jonsson Comprehensive Cancer Center, School of Nursing, University of California, Los Angeles, Los Angeles, CA, USA
| | - Zuo-Feng Zhang
- Department of Epidemiology, Fielding School of Public Health, University of California, Los Angeles, Los Angeles, CA, USA
| |
|
29
|
Xu Y, Kim I, Carroll RJ. A hybrid omnibus test for generalized semiparametric single-index models with high-dimensional covariate sets. Biometrics 2019; 75:757-767. [PMID: 30859553 DOI: 10.1111/biom.13054] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/14/2017] [Accepted: 02/26/2019] [Indexed: 11/27/2022]
Abstract
Numerous statistical methods have been developed for analyzing high-dimensional data. These methods often focus on variable selection approaches but are limited for the purpose of testing with high-dimensional data. They often require explicit likelihood functions. In this article, we propose a "hybrid omnibus test" for high-dimensional data testing with much weaker requirements. Our hybrid omnibus test is developed under a semiparametric framework where a likelihood function is no longer necessary. Our test is a version of a frequentist-Bayesian hybrid score-type test for a generalized partially linear single-index model, which has a link function being a function of a set of variables through a generalized partially linear single index. We propose an efficient score based on estimating equations, define local tests, and then construct our hybrid omnibus test using local tests. We compare our approach with an empirical-likelihood ratio test and Bayesian inference based on Bayes factors, using simulation studies. Our simulation results suggest that our approach outperforms the others, in terms of type I error, power, and computational cost in both the low- and high-dimensional cases. The advantage of our approach is demonstrated by applying it to genetic pathway data for type II diabetes mellitus.
Affiliation(s)
- Yangyi Xu
- Department of Statistics, Virginia Tech., Blacksburg, Virginia
| | - Inyoung Kim
- Department of Statistics, Virginia Tech., Blacksburg, Virginia
| | - Raymond J Carroll
- Department of Statistics, Texas A&M University, 3143 TAMU, College Station, Texas.,School of Mathematical and Physical Sciences, University of Technology, Sydney, Sydney, Broadway, NSW, Australia
| |
|
30
|
Paldino MJ, Golriz F, Zhang W, Chu ZD. Normalization enhances brain network features that predict individual intelligence in children with epilepsy. PLoS One 2019; 14:e0212901. [PMID: 30835738 PMCID: PMC6400436 DOI: 10.1371/journal.pone.0212901] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 06/05/2018] [Accepted: 02/12/2019] [Indexed: 12/18/2022] Open
Abstract
BACKGROUND AND PURPOSE: Architecture of the cerebral network has been shown to associate with IQ in children with epilepsy. However, subject-level prediction on this basis, a crucial step toward harnessing network analyses for the benefit of children with epilepsy, has yet to be achieved. We compared two network normalization strategies in terms of their ability to optimize subject-level inferences on the relationship between brain network architecture and brain function. MATERIALS AND METHODS: Patients with epilepsy and resting state fMRI were retrospectively identified. Brain network nodes were defined by anatomic parcellation, first in patient space (nodes defined for each patient) and again in template space (same nodes for all patients). Whole-brain weighted graphs were constructed according to pair-wise correlation of BOLD-signal time courses between nodes. The following metrics were then calculated: clustering coefficient, transitivity, modularity, path length, and global efficiency. Metrics computed on graphs in patient space were normalized to the same metric computed on a random network of identical size. A machine learning algorithm was used to predict patient IQ given access to only the network metrics. RESULTS: Twenty-seven patients (8-18 years) comprised the final study group. All brain networks demonstrated expected small world properties. Accounting for intrinsic population heterogeneity had a significant effect on prediction accuracy. Specifically, transformation of all patients into a common standard space as well as normalization of metrics to those computed on a random network both substantially outperformed the use of non-normalized metrics. CONCLUSION: Normalization contributed significantly to accurate subject-level prediction of cognitive function in children with epilepsy. These findings support the potential for quantitative network approaches to contribute clinically meaningful information in children with neurological disorders.
Affiliation(s)
- Michael J. Paldino
- Department of Radiology, Texas Children’s Hospital, Houston, TX, United States of America
| | - Farahnaz Golriz
- Department of Radiology, Texas Children’s Hospital, Houston, TX, United States of America
| | - Wei Zhang
- Department of Radiology, Texas Children’s Hospital, Houston, TX, United States of America
| | - Zili D. Chu
- Department of Radiology, Texas Children’s Hospital, Houston, TX, United States of America
| |
|
31
|
Lim S, Lee S, Jung I, Rhee S, Kim S. Comprehensive and critical evaluation of individualized pathway activity measurement tools on pan-cancer data. Brief Bioinform 2018; 21:36-46. [PMID: 30462155 DOI: 10.1093/bib/bby097] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 06/26/2018] [Revised: 08/20/2018] [Accepted: 09/09/2018] [Indexed: 12/11/2022] Open
Abstract
Motivation: Biological pathways are extensively used for the analysis of transcriptome data to characterize biological mechanisms underlying various phenotypes. There are a number of computational tools that summarize transcriptome data at the pathway level. However, there is no comparative study on how well these tools produce useful information at the cohort level, enabling comparison of many samples or patients. Results: In this study, we systematically compared and evaluated 13 different pathway activity inference tools based on 5 comparison criteria using a pan-cancer data set. This study has two major contributions. First, our study provides a comprehensive survey on computational techniques used by existing pathway activity inference tools. The tools use different strategies and assume different requirements on data: input transformation, use of labels, necessity of cohort-level input data, use of gene relations and scoring metric. Second, we performed extensive evaluations on the performance of these tools. Because different tools use different methods to map samples to the pathway dimension, the tools are evaluated at the pathway level using five comparison criteria. Starting from measuring how well a tool maintains the characteristics of original gene expression values, robustness was also investigated by adding noise into gene expression data. Classification tasks on three clinical variables (tumor versus normal, survival and cancer subtypes) were performed to evaluate the utility of tools for their clinical applications. In addition, the inferred activity values were compared between the tools to see how similar they are along with the scoring schemes they use.
Affiliation(s)
- Sangsoo Lim
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Korea
| | - Sangseon Lee
- Department of Computer Science and Engineering, Seoul National University, Seoul, Korea
| | - Inuk Jung
- Bioinformatics Institute, Seoul National University, Seoul, Korea
| | - Sungmin Rhee
- Department of Computer Science and Engineering, Seoul National University, Seoul, Korea
| | - Sun Kim
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Korea.,Department of Computer Science and Engineering, Seoul National University, Seoul, Korea.,Bioinformatics Institute, Seoul National University, Seoul, Korea
| |
|
32
|
Mutual Information Better Quantifies Brain Network Architecture in Children with Epilepsy. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2018; 2018:6142898. [PMID: 30425750 PMCID: PMC6217888 DOI: 10.1155/2018/6142898] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 06/07/2018] [Revised: 08/06/2018] [Accepted: 09/18/2018] [Indexed: 01/01/2023]
Abstract
Purpose: Metrics of the brain network architecture derived from resting-state fMRI have been shown to provide physiologically meaningful markers of IQ in children with epilepsy. However, traditional measures of functional connectivity (FC), specifically the Pearson correlation, assume a dominant linear relationship between BOLD time courses; this assumption may not be valid. Mutual information is an alternative measure of FC which has shown promise in the study of complex networks due to its ability to flexibly capture association of diverse forms. We aimed to compare network metrics derived from mutual information-defined FC to those derived from traditional correlation in terms of their capacity to predict patient-level IQ. Materials and Methods: Patients were retrospectively identified with the following: (1) focal epilepsy; (2) resting-state fMRI; and (3) full-scale IQ by a neuropsychologist. Brain network nodes were defined by anatomic parcellation. Parcellation was performed at the size threshold of 350 mm², resulting in networks containing 780 nodes. Whole-brain, weighted graphs were then constructed according to the pairwise connectivity between nodes. In the traditional condition, edges (connections) between each pair of nodes were defined as the absolute value of the Pearson correlation coefficient between their BOLD time courses. In the mutual information condition, edges were defined as the mutual information between time courses. The following metrics were then calculated for each weighted graph: clustering coefficient, modularity, characteristic path length, and global efficiency. A machine learning algorithm was used to predict the IQ of each individual based on their network metrics. Prediction accuracy was assessed as the fractional variation explained for each condition. Results: Twenty-four patients met the inclusion criteria (age: 8-18 years). All brain networks demonstrated expected small-world properties.
Network metrics derived from mutual information-defined FC significantly outperformed those derived from the Pearson correlation. Specifically, fractional variation explained was 49% (95% CI: 46%, 51%) for the mutual information method and 17% (95% CI: 13%, 19%) for the Pearson correlation. Conclusion: Mutual information-defined functional connectivity captures physiologically relevant features of the brain network better than correlation. Clinical Relevance: Optimizing the capacity to predict cognitive phenotypes at the patient level is a necessary step toward the clinical utility of network-based biomarkers.
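As a rough illustration of the two edge definitions compared in this abstract, the sketch below computes the absolute Pearson correlation and a simple histogram-based mutual information estimate for one pair of time courses. The function names, the four-bin estimator, and the toy series are assumptions made for illustration; the study's own estimator and preprocessing are not described in the abstract.

```python
import math
from collections import Counter


def pearson_abs(x, y):
    """Absolute Pearson correlation between two equal-length, non-constant series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return abs(cov) / math.sqrt(vx * vy)


def mutual_information(x, y, bins=4):
    """Plug-in (histogram) estimate of mutual information, in bits.

    The equal-width binning is an illustrative assumption, not the
    estimator used in the cited study.
    """
    def discretize(v):
        lo, hi = min(v), max(v)
        width = (hi - lo) / bins or 1.0  # guard against a constant series
        return [min(int((a - lo) / width), bins - 1) for a in v]

    dx, dy = discretize(x), discretize(y)
    n = len(x)
    joint = Counter(zip(dx, dy))
    px, py = Counter(dx), Counter(dy)
    # sum over observed cells of p(i,j) * log2( p(i,j) / (p(i) * p(j)) )
    return sum(
        (c / n) * math.log2(c * n / (px[i] * py[j]))
        for (i, j), c in joint.items()
    )


# Toy example: a purely quadratic (non-linear) dependence.
x = [i - 10 for i in range(21)]
y = [a * a for a in x]
linear_edge = pearson_abs(x, y)    # close to zero: no linear association
mi_edge = mutual_information(x, y)  # clearly positive: dependence is detected
```

On this toy pair the correlation-based edge weight is essentially zero while the mutual information edge weight is positive, which is the kind of non-linear association the abstract says correlation can miss.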
|
33
|
Zhang L, Kim I. Semiparametric Bayesian kernel survival model for evaluating pathway effects. Stat Methods Med Res 2018; 28:3301-3317. [PMID: 30289021 DOI: 10.1177/0962280218797360] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 01/07/2023]
Abstract
Massive amounts of high-dimensional data have been accumulated over the past two decades, which has fostered increasing interest in identifying gene pathways related to certain biological processes. In particular, since pathway-based analysis has the ability to detect subtle changes of differentially expressed genes that could be missed when using gene-based analysis, detecting the gene pathways that regulate certain diseases can provide new strategies for medical procedures and new targets for drug discovery. Limited work has been carried out, primarily in regression settings, to study the effects of pathways on survival outcomes. Motivated by a breast cancer gene-pathway data set, which exhibits the "small n, large p" characteristics, we propose a semiparametric Bayesian kernel survival model (s-BKSurv) to study the effects of both clinical covariates and gene expression levels within a pathway on survival time. We model the unknown high-dimensional functions of pathways via a Gaussian kernel machine to consider the possibility that genes within the same pathway interact with each other. To address the multiple comparisons problem under a full Bayesian setting, we propose a similarity-dependent procedure based on the Bayes factor to control the family-wise error rate. We demonstrate the superior performance of our approach under various simulation settings and pathways data.
Affiliation(s)
- Lin Zhang
- Department of Statistics, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA
| | - Inyoung Kim
- Department of Statistics, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA
| |
|
34
|
Li B, Zhang N, Wang YG, George AW, Reverter A, Li Y. Genomic Prediction of Breeding Values Using a Subset of SNPs Identified by Three Machine Learning Methods. Front Genet 2018; 9:237. [PMID: 30023001 PMCID: PMC6039760 DOI: 10.3389/fgene.2018.00237] [Citation(s) in RCA: 79] [Impact Index Per Article: 13.2] [Received: 03/25/2018] [Accepted: 06/14/2018] [Indexed: 12/22/2022] Open
Abstract
The analysis of large genomic data is hampered by issues such as a small number of observations and a large number of predictive variables (commonly known as “large P small N”), high dimensionality or highly correlated data structures. Machine learning methods are renowned for dealing with these problems. To date, machine learning methods have been applied in Genome-Wide Association Studies for identification of candidate genes, epistasis detection, gene network pathway analyses and genomic prediction of phenotypic values. However, the utility of two machine learning methods, Gradient Boosting Machine (GBM) and Extreme Gradient Boosting Method (XgBoost), in identifying a subset of SNP markers for genomic prediction of breeding values has never been explored before. In this study, using 38,082 SNP markers and body weight phenotypes from 2,093 Brahman cattle (1,097 bulls as a discovery population and 996 cows as a validation population), we examined the efficiency of three machine learning methods, namely Random Forests (RF), GBM and XgBoost, in (a) the identification of top 400, 1,000, and 3,000 ranked SNPs; (b) using the subsets of SNPs to construct genomic relationship matrices (GRMs) for the estimation of genomic breeding values (GEBVs). For comparison purposes, we also calculated the GEBVs from (1) 400, 1,000, and 3,000 SNPs that were randomly selected and evenly spaced across the genome, and (2) from all the SNPs. We found that RF and especially GBM are efficient methods in identifying a subset of SNPs with direct links to candidate genes affecting the growth trait. In comparison to the estimate of prediction accuracy of GEBVs from using all SNPs (0.43), the 3,000 top SNPs identified by RF (0.42) and GBM (0.46) had similar values to those of the whole SNP panel. The performance of the subsets of SNPs from RF and GBM was substantially better than that of evenly spaced subsets across the genome (0.18–0.29).
Of the three methods, RF and GBM consistently outperformed the XgBoost in genomic prediction accuracy.
Affiliation(s)
- Bo Li
- CSIRO Agriculture and Food, St Lucia, QLD, Australia.,Shandong Technology and Business University, School of Computer Science and Technology, YanTai, China.,Shandong Co-Innovation Centre of Future Intelligent Computing, YanTai, China
| | - Nanxi Zhang
- Centre for Applications in Natural Resource Mathematics, University of Queensland, St Lucia, QLD, Australia
| | - You-Gan Wang
- School of Mathematical Sciences, Queensland University of Technology, Brisbane, QLD, Australia
| | | | | | - Yutao Li
- CSIRO Agriculture and Food, St Lucia, QLD, Australia
| |
|
35
|
Wang J, Jain S, Chen D, Song W, Hu CT, Su YH. Development and Evaluation of Novel Statistical Methods in Urine Biomarker-Based Hepatocellular Carcinoma Screening. Sci Rep 2018; 8:3799. [PMID: 29491388 PMCID: PMC5830457 DOI: 10.1038/s41598-018-21922-9] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 09/29/2017] [Accepted: 02/13/2018] [Indexed: 02/07/2023] Open
Abstract
Hepatocellular carcinoma (HCC) is one of the fastest growing cancers in the US and has a low survival rate, partly due to difficulties in early detection. Because of HCC's high heterogeneity, it has been suggested that multiple biomarkers would be needed to develop a sensitive HCC screening test. This study applied random forest (RF), a machine learning technique, and proposed two novel models, fixed sequential (FS) and two-step (TS), for comparison with two commonly used statistical techniques, logistic regression (LR) and classification and regression trees (CART), in combining multiple urine DNA biomarkers for HCC screening using biomarker values obtained from 137 HCC and 431 non-HCC (224 hepatitis and 207 cirrhosis) subjects. The sensitivity, specificity, area under the receiver operating characteristic curve, and variability were estimated through repeated 10-fold cross-validation to compare the models' performances in accuracy and robustness. We show that RF and TS have higher accuracy and stability; specifically, they reach 90% specificity and 86%/87% sensitivity respectively along with 15% higher sensitivity and 10% higher specificity than LR in cross-validation. The potential of RF and TS to develop a panel of multiple biomarkers and the possibility for self-training, cloud-based models for HCC screening are discussed.
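Sensitivity and specificity of the kind reported above follow directly from confusion-matrix counts. A minimal sketch, assuming binary labels where 1 marks an HCC case and 0 a non-HCC case; the function name and the ten toy labels are hypothetical and are not data from the study:

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 1 = case, 0 = control."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # fraction of true cases that are detected
    specificity = tn / (tn + fp)  # fraction of controls correctly ruled out
    return sensitivity, specificity


# Hypothetical toy labels: 4 cases and 6 controls.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

In a repeated 10-fold cross-validation such as the one the abstract describes, these two quantities would be computed on each held-out fold and then averaged across folds and repeats.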
Affiliation(s)
- Jeremy Wang
- JBS Science, Inc., Doylestown, Pennsylvania, United States
| | - Surbhi Jain
- JBS Science, Inc., Doylestown, Pennsylvania, United States
| | - Dion Chen
- ClinPharma Consulting, Inc, Phoenixville, Pennsylvania, United States
| | - Wei Song
- JBS Science, Inc., Doylestown, Pennsylvania, United States
| | - Chi-Tan Hu
- Buddhist Tzu Chi General Hospital and Tzu Chi University, Hualien, 970, Taiwan R.O.C.
| | - Ying-Hsiu Su
- JBS Science, Inc., Doylestown, Pennsylvania, United States.
- The Baruch S. Blumberg Institute, Doylestown, Pennsylvania, United States.
| |
|
36
|
Jung SY, Papp JC, Sobel EM, Zhang ZF. Genetic Variants in Metabolic Signaling Pathways and Their Interaction with Lifestyle Factors on Breast Cancer Risk: A Random Survival Forest Analysis. Cancer Prev Res (Phila) 2018; 11:44-51. [PMID: 29074537 PMCID: PMC5754228 DOI: 10.1158/1940-6207.capr-17-0143] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 05/08/2017] [Revised: 09/06/2017] [Accepted: 10/18/2017] [Indexed: 12/18/2022]
Abstract
Genetic variants in the insulin-like growth factor-I (IGF-I)/insulin resistance axis may interact with lifestyle factors, influencing postmenopausal breast cancer risk, but these interrelated pathways are not fully understood. In this study, we examined 54 single-nucleotide polymorphisms (SNP) in genes related to IGF-I/insulin phenotypes and signaling pathways and lifestyle factors in relation to postmenopausal breast cancer, using data from 6,567 postmenopausal women in the Women's Health Initiative Harmonized and Imputed Genome-Wide Association Studies. We used a machine-learning method, two-stage random survival forest analysis. We identified three genetic variants (AKT1 rs2494740, AKT1 rs2494744, and AKT1 rs2498789) and two lifestyle factors [body mass index (BMI) and dietary alcohol intake] as the top five most influential predictors for breast cancer risk. The combination of the three SNPs, BMI, and alcohol consumption (≥1 g/day) significantly increased the risk of breast cancer in a gene and lifestyle dose-dependent manner. Our findings provide insight into gene-lifestyle interactions and will enable researchers to focus on individuals with risk genotypes to promote intervention strategies. These data also suggest potential genetic targets in future intervention/clinical trials for cancer prevention in order to reduce the risk for breast cancer in postmenopausal women. Cancer Prev Res; 11(1); 44-51. ©2017 AACR.
Affiliation(s)
- Su Yon Jung
- Translational Sciences Section, Jonsson Comprehensive Cancer Center, School of Nursing, University of California, Los Angeles, Los Angeles, California.
| | - Jeanette C Papp
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, California
| | - Eric M Sobel
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, California
| | - Zuo-Feng Zhang
- Department of Epidemiology, Fielding School of Public Health, University of California, Los Angeles, Los Angeles, California
| |
|
37
|
Cheng L, Shan L, Kim I. Multilevel Gaussian graphical model for multilevel networks. J Stat Plan Inference 2017. [DOI: 10.1016/j.jspi.2017.05.003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/23/2022]
|
38
|
Chiappini F, Coilly A, Kadar H, Gual P, Tran A, Desterke C, Samuel D, Duclos-Vallée JC, Touboul D, Bertrand-Michel J, Brunelle A, Guettier C, Le Naour F. Metabolism dysregulation induces a specific lipid signature of nonalcoholic steatohepatitis in patients. Sci Rep 2017; 7:46658. [PMID: 28436449 PMCID: PMC5402394 DOI: 10.1038/srep46658] [Citation(s) in RCA: 155] [Impact Index Per Article: 22.1] [Received: 11/15/2016] [Accepted: 03/28/2017] [Indexed: 02/07/2023] Open
Abstract
Nonalcoholic steatohepatitis (NASH) is a condition which can progress to cirrhosis and hepatocellular carcinoma. Markers for NASH diagnosis are still lacking. We performed a comprehensive lipidomic analysis on human liver biopsies including normal liver, nonalcoholic fatty liver and NASH. A random forest-based machine learning approach allowed us to characterize a signature of 32 lipids discriminating NASH with 100% sensitivity and specificity. Furthermore, we validated this signature in an independent group of NASH patients. Then, metabolism dysregulations were investigated in both patients and murine models. Alterations of elongase and desaturase activities were observed along the fatty acid synthesis pathway. The decreased activity of the desaturase FADS1 appeared as a bottleneck, leading upstream to an accumulation of fatty acids and downstream to a deficiency of long-chain fatty acids, resulting in impaired phospholipid synthesis. In NASH, mass spectrometry imaging on tissue sections revealed the spreading into the hepatic parenchyma of selectively accumulated fatty acids. Such lipids constituted a highly toxic mixture to human hepatocytes. In conclusion, this study characterized a specific and sensitive lipid signature of NASH and positioned FADS1 as a significant player in accumulating toxic lipids during NASH progression.
Affiliation(s)
- Franck Chiappini
- Inserm, Unité 1193, Villejuif, F-94800, France.,Univ Paris-Sud, UMR-S1193, Villejuif, F-94800, France.,DHU Hepatinov, Villejuif, F-94800, France
| | - Audrey Coilly
- Inserm, Unité 1193, Villejuif, F-94800, France.,Univ Paris-Sud, UMR-S1193, Villejuif, F-94800, France.,DHU Hepatinov, Villejuif, F-94800, France.,AP-HP, Hôpital Paul-Brousse, Centre Hépato-Biliaire, Villejuif, F-94800, France
| | - Hanane Kadar
- Institut de Chimie des Substances Naturelles, CNRS UPR 2301, Univ. Paris-Sud, Université Paris-Saclay, F-91198 Gif-Sur-Yvette, France
| | - Philippe Gual
- Inserm, Unité 1065, Nice, F-06204, France.,University of Nice-Sophia-Antipolis, Nice, F-06204, France.,Centre Hospitalier Universitaire de Nice, Hôpital L'Archet, Nice Cedex 3, F-06202, France
| | - Albert Tran
- Inserm, Unité 1065, Nice, F-06204, France.,University of Nice-Sophia-Antipolis, Nice, F-06204, France.,Centre Hospitalier Universitaire de Nice, Hôpital L'Archet, Nice Cedex 3, F-06202, France
| | - Christophe Desterke
- Inserm, US33, Villejuif, F-94800, France.,Univ Paris-Sud, US33, Villejuif, F-94800, France
| | - Didier Samuel
- Inserm, Unité 1193, Villejuif, F-94800, France.,Univ Paris-Sud, UMR-S1193, Villejuif, F-94800, France.,DHU Hepatinov, Villejuif, F-94800, France.,AP-HP, Hôpital Paul-Brousse, Centre Hépato-Biliaire, Villejuif, F-94800, France
| | - Jean-Charles Duclos-Vallée
- Inserm, Unité 1193, Villejuif, F-94800, France.,Univ Paris-Sud, UMR-S1193, Villejuif, F-94800, France.,DHU Hepatinov, Villejuif, F-94800, France.,AP-HP, Hôpital Paul-Brousse, Centre Hépato-Biliaire, Villejuif, F-94800, France
| | - David Touboul
- Institut de Chimie des Substances Naturelles, CNRS UPR 2301, Univ. Paris-Sud, Université Paris-Saclay, F-91198 Gif-Sur-Yvette, France
| | | | - Alain Brunelle
- Institut de Chimie des Substances Naturelles, CNRS UPR 2301, Univ. Paris-Sud, Université Paris-Saclay, F-91198 Gif-Sur-Yvette, France
| | - Catherine Guettier
- Inserm, Unité 1193, Villejuif, F-94800, France.,Univ Paris-Sud, UMR-S1193, Villejuif, F-94800, France.,DHU Hepatinov, Villejuif, F-94800, France.,AP-HP, Hôpital du Kremlin-Bicêtre, Service d'Anatomopathologie, Le Kremlin-Bicêtre, F-94275, France
| | - François Le Naour
- Inserm, Unité 1193, Villejuif, F-94800, France.,Univ Paris-Sud, UMR-S1193, Villejuif, F-94800, France.,DHU Hepatinov, Villejuif, F-94800, France.,Inserm, US33, Villejuif, F-94800, France.,Univ Paris-Sud, US33, Villejuif, F-94800, France
| |
|
39
|
Pang H, Wang X. Statistical aspect of translational and correlative studies in clinical trials. Chin Clin Oncol 2017; 5:11. [PMID: 26932435 DOI: 10.3978/j.issn.2304-3865.2014.07.04] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/05/2014] [Accepted: 06/18/2014] [Indexed: 01/07/2023]
Abstract
In this article, we describe statistical issues related to the conduct of translational and correlative studies in cancer clinical trials. In the era of personalized medicine, proper biomarker discovery and validation are crucial for producing groundbreaking research. Carrying out the framework outlined in this article requires a team effort between oncologists and statisticians.
Affiliation(s)
- Herbert Pang
- School of Public Health, Li Ka Shing Faculty of Medicine, Pok Fu Lam, Hong Kong SAR, China.
| | - Xiaofei Wang
- Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC, USA.
| |
|
40
|
Fabres PJ, Collins C, Cavagnaro TR, Rodríguez López CM. A Concise Review on Multi-Omics Data Integration for Terroir Analysis in Vitis vinifera. FRONTIERS IN PLANT SCIENCE 2017; 8:1065. [PMID: 28676813 PMCID: PMC5477006 DOI: 10.3389/fpls.2017.01065] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.9] [Received: 02/08/2017] [Accepted: 06/02/2017] [Indexed: 05/19/2023]
Abstract
Vitis vinifera (grapevine) is one of the most important fruit crops, both for fresh consumption and for wine and spirit production. The term terroir is frequently used in viticulture and the wine industry to relate a wine's sensory attributes to its geographic origin. Although grapevine can be cultivated in a wide range of environments, differences in growing conditions have a significant impact on fruit traits that ultimately affect wine quality. Understanding how fruit quality and yield are controlled at a molecular level in grapevine in response to environmental cues has been a major driver of research. Advances in genomics, epigenomics, transcriptomics, proteomics and metabolomics have significantly increased our knowledge of the abiotic regulation of yield and quality in many crop species, including V. vinifera. The integrated analysis of multiple 'omics' can give us the opportunity to better understand how plants modulate their response to different environments. However, 'omics' technologies produce large amounts of biological data whose interpretation is not always straightforward, especially when results from different 'omics' are combined. Here we examine the current strategies used to integrate multi-omics data and how these have been applied in V. vinifera. We also discuss the importance of including epigenomic data when integrating omics data, as epigenetic mechanisms could play a major role as an intermediary between the environment and the genome.
Affiliation(s)
- Pastor Jullian Fabres
- Environmental Epigenetics and Genetics Group, Plant Research Centre, School of Agriculture, Food and Wine, University of Adelaide, Glen Osmond, SA, Australia
| | - Cassandra Collins
- The Waite Research Institute, The School of Agriculture, Food and Wine, The University of Adelaide, Glen Osmond, SA, Australia
| | - Timothy R. Cavagnaro
- The Waite Research Institute, The School of Agriculture, Food and Wine, The University of Adelaide, Glen Osmond, SA, Australia
| | - Carlos M. Rodríguez López
- Environmental Epigenetics and Genetics Group, Plant Research Centre, School of Agriculture, Food and Wine, University of Adelaide, Glen Osmond, SA, Australia
- *Correspondence: Carlos M. Rodríguez López,
| |
|
41
|
Lim S, Park Y, Hur B, Kim M, Han W, Kim S. Protein interaction network (PIN)-based breast cancer subsystem identification and activation measurement for prognostic modeling. Methods 2016; 110:81-89. [DOI: 10.1016/j.ymeth.2016.06.015] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 03/23/2016] [Revised: 05/31/2016] [Accepted: 06/17/2016] [Indexed: 12/20/2022] Open
|
42
|
Estrada-Carmona N, Harper EB, DeClerck F, Fremier AK. Quantifying model uncertainty to improve watershed-level ecosystem service quantification: a global sensitivity analysis of the RUSLE. INTERNATIONAL JOURNAL OF BIODIVERSITY SCIENCE, ECOSYSTEM SERVICES & MANAGEMENT 2016. [DOI: 10.1080/21513732.2016.1237383] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Indexed: 10/20/2022] Open
Affiliation(s)
- Natalia Estrada-Carmona
- College of Natural Resources, University of Idaho, Moscow, ID, USA
- Division of Research and Development, CATIE, Turrialba, Costa Rica
- Landscape Management and Restoration, Bioversity International, Montpellier, France
| | - Elizabeth B. Harper
- Division of Natural and Social Sciences, New England College, Henniker, NH, USA
| | - Fabrice DeClerck
- Landscape Management and Restoration, Bioversity International, Montpellier, France
| | | |
|
43
|
Zheng B, Liu J, Gu J, Du J, Wang L, Gu S, Cheng J, Yang J, Lu H. Classification of Benign and Malignant Thyroid Nodules Using a Combined Clinical Information and Gene Expression Signatures. PLoS One 2016; 11:e0164570. [PMID: 27776138 PMCID: PMC5077123 DOI: 10.1371/journal.pone.0164570] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 02/27/2016] [Accepted: 09/27/2016] [Indexed: 01/08/2023] Open
Abstract
BACKGROUND: A key challenge in thyroid carcinoma is preoperatively diagnosing malignant thyroid nodules. A novel diagnostic test that measures the expression of a 3-gene signature (DPP4, SCG5 and CA12) has demonstrated promise in thyroid carcinoma assessment. However, more reliable prediction methods that combine clinical features with genomic signatures and offer high accuracy, good stability and low cost are needed. METHODOLOGY/PRINCIPAL FINDINGS: Twenty-five items of clinical information were recorded for 771 patients. Feature selection and validation were conducted using random forest. Thyroid samples and clinical data were obtained from 142 patients at two different hospitals, and expression of the 3-gene signature was measured using quantitative PCR. The predictive abilities of three models (based on the selected clinical variables, the gene expression profile, and integrated gene expression and clinical information) were compared. Seven clinical characteristics were selected based on a training set (539 patients) and tested in three test sets, yielding predictive accuracies of 82.3% (n = 232), 81.4% (n = 70), and 81.9% (n = 72). The predictive sensitivity, specificity, and accuracy were 72.3%, 80.5% and 76.8% for the model based on the gene expression signature, 66.2%, 81.8% and 74.6% for the model based on the clinical data, and 83.1%, 84.4% and 83.8% for the combined model in a 10-fold cross-validation (n = 142). CONCLUSIONS: These findings reveal that the integrated model, which combines clinical data with the 3-gene signature, is superior to models based on gene expression or clinical data alone. The integrated model appears to be a reliable tool for the preoperative diagnosis of thyroid tumors.
Affiliation(s)
- Bing Zheng
- Shanghai Institute of Medical Genetics, Shanghai Children’s Hospital, Shanghai Jiao Tong University, Shanghai, China
- Department of Laboratory Medicine, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
| | - Jun Liu
- Department of Otolaryngology, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
- Department of Otolaryngology-Head and Neck Surgery, Xinhua Hospital, School of Medicine, Shanghai Jiaotong University, Shanghai, China
- Ear Institute, Shanghai Jiaotong University, Shanghai, China
| | - Jianlei Gu
- Shanghai Institute of Medical Genetics, Shanghai Children’s Hospital, Shanghai Jiao Tong University, Shanghai, China
- Key Laboratory of Molecular Embryology, Ministry of Health and Shanghai Key Laboratory of Embryo and Reproduction Engineering, Shanghai, China
| | - Jing Du
- Department of Ultrasonography, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
| | - Lin Wang
- Department of Ultrasonography, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
| | - Shengli Gu
- Department of Ultrasonography, Xinhua Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
| | - Juan Cheng
- Department of Ultrasonography, Xinhua Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
| | - Jun Yang
- Department of Otolaryngology-Head and Neck Surgery, Xinhua Hospital, School of Medicine, Shanghai Jiaotong University, Shanghai, China
- Ear Institute, Shanghai Jiaotong University, Shanghai, China
| | - Hui Lu
- Shanghai Institute of Medical Genetics, Shanghai Children’s Hospital, Shanghai Jiao Tong University, Shanghai, China
- Key Laboratory of Molecular Embryology, Ministry of Health and Shanghai Key Laboratory of Embryo and Reproduction Engineering, Shanghai, China
- Department of Bioengineering, University of Illinois at Chicago, Chicago, Illinois, United States of America
| |
|
44
|
|
45
|
Bayesian Semiparametric Model for Pathway-Based Analysis with Zero-Inflated Clinical Outcomes. JOURNAL OF AGRICULTURAL BIOLOGICAL AND ENVIRONMENTAL STATISTICS 2016. [DOI: 10.1007/s13253-016-0264-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 10/21/2022]
|
46
|
Li A, Zang Q, Sun D, Wang M. A text feature-based approach for literature mining of lncRNA–protein interactions. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2015.11.110] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Indexed: 01/26/2023]
|
47
|
Chiappini F, Desterke C, Bertrand-Michel J, Guettier C, Le Naour F. Hepatic and serum lipid signatures specific to nonalcoholic steatohepatitis in murine models. Sci Rep 2016; 6:31587. [PMID: 27510159 PMCID: PMC4980672 DOI: 10.1038/srep31587] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Received: 05/12/2016] [Accepted: 07/19/2016] [Indexed: 01/01/2023] Open
Abstract
Nonalcoholic fatty liver (NAFL) is a precursor of nonalcoholic steatohepatitis (NASH), a condition that may progress to cirrhosis and hepatocellular carcinoma. Markers for the diagnosis of NASH are still lacking. We investigated lipid markers using mouse models that developed NAFL when fed a high-fat diet (HFD) or NASH when fed a methionine- and choline-deficient diet (MCDD). We performed a comprehensive lipidomic analysis of liver tissues and sera from mice fed HFD (n = 5), MCDD (n = 5) or a normal diet as controls (n = 10). A machine learning approach based on prediction analysis of microarrays followed by random forests identified 21 lipids out of 149 in the liver and 14 lipids out of 155 in the serum that discriminated mice fed MCDD from those fed HFD or controls. In conclusion, the global approach characterized lipid signatures specific to NASH in both liver and serum from animal models. This opens a new avenue for investigating early and non-invasive lipid markers for the diagnosis of NASH in humans.
Affiliation(s)
- Franck Chiappini
- Inserm, Unité 1193, Villejuif, F-94800, France
- Univ Paris-Sud, Institut André Lwoff, Villejuif, F-94800, France
- DHU Hepatinov, Villejuif, F-94800, France
| | - Christophe Desterke
- Univ Paris-Sud, Institut André Lwoff, Villejuif, F-94800, France
- Inserm, US33, Villejuif, F-94800, France
| | | | - Catherine Guettier
- Inserm, Unité 1193, Villejuif, F-94800, France
- Univ Paris-Sud, Institut André Lwoff, Villejuif, F-94800, France
- DHU Hepatinov, Villejuif, F-94800, France
- AP-HP Hôpital du Kremlin-Bicêtre, Service d'Anatomopathologie, Le Kremlin, F-94275, France
| | - François Le Naour
- Inserm, Unité 1193, Villejuif, F-94800, France
- Univ Paris-Sud, Institut André Lwoff, Villejuif, F-94800, France
- DHU Hepatinov, Villejuif, F-94800, France
- Inserm, US33, Villejuif, F-94800, France
| |
|
48
|
Chan WH, Mohamad MS, Deris S, Zaki N, Kasim S, Omatu S, Corchado JM, Al Ashwal H. Identification of informative genes and pathways using an improved penalized support vector machine with a weighting scheme. Comput Biol Med 2016; 77:102-15. [PMID: 27522238 DOI: 10.1016/j.compbiomed.2016.08.004] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 03/21/2016] [Revised: 08/03/2016] [Accepted: 08/03/2016] [Indexed: 01/03/2023]
Abstract
Incorporating pathway knowledge into microarray analysis has improved the biological interpretation of analysis outcomes. However, most pathway data are manually curated without a specific biological context, so non-informative genes may be included when pathway data are used to analyze context-specific data such as cancer microarray data. Efficient identification of informative genes is therefore essential. Embedded methods such as penalized classifiers have been used for microarray analysis because of their built-in gene selection. This paper proposes an improved penalized support vector machine with an absolute t-test weighting scheme to identify informative genes and pathways. Experiments were carried out on four microarray data sets. The results were compared with previous methods using 10-fold cross-validation in terms of accuracy, sensitivity, specificity and F-score. Our method shows consistent improvement over the previous methods, and biological validation was performed to elucidate the relation of the selected genes and pathways to the phenotype under study.
Affiliation(s)
- Weng Howe Chan
- Artificial Intelligence and Bioinformatics Research Group, Faculty of Computing, Universiti Teknologi Malaysia, 81310 Skudai, Johor, Malaysia
| | - Mohd Saberi Mohamad
- Artificial Intelligence and Bioinformatics Research Group, Faculty of Computing, Universiti Teknologi Malaysia, 81310 Skudai, Johor, Malaysia.
| | - Safaai Deris
- Faculty of Creative Technology & Heritage, Universiti Malaysia Kelantan, Locked Bag 01, Bachok, 16300 Kota Bharu, Kelantan, Malaysia
| | - Nazar Zaki
- College of Information Technology, United Arab Emirates University, Al Ain 15551, United Arab Emirates
| | - Shahreen Kasim
- Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, 86400 Batu Pahat, Malaysia
| | - Sigeru Omatu
- Department of Electronics, Information and Communication Engineering, Osaka Institute of Technology, Osaka 535-8585, Japan
| | - Juan Manuel Corchado
- Biomedical Research Institute of Salamanca/BISITE Research Group, University of Salamanca, Salamanca, Spain
| | - Hany Al Ashwal
- College of Information Technology, United Arab Emirates University, Al Ain 15551, United Arab Emirates
| |
|
49
|
Cabrera-Barona P, Blaschke T, Kienberger S. Explaining Accessibility and Satisfaction Related to Healthcare: A Mixed-Methods Approach. SOCIAL INDICATORS RESEARCH 2016; 133:719-739. [PMID: 28890596 PMCID: PMC5569143 DOI: 10.1007/s11205-016-1371-9] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Accepted: 05/23/2016] [Indexed: 05/09/2023]
Abstract
Accessibility and satisfaction related to healthcare services are conceived as multidimensional concepts. These concepts can be studied using objective and subjective measures. In this study, we created two indices: a composite healthcare accessibility index (CHCA) and a composite healthcare satisfaction index (CHCS). To calculate the CHCA index we used three indicators based on three components of multidimensional healthcare accessibility: availability, acceptability and accessibility. In the indicator based on the component of accessibility, we included an innovative perceived time-decay parameter. The three indicators of the CHCA index were weighted through the application of a principal components analysis. To calculate the CHCS index, we used three indicators: the waiting time after the patient arrives at the healthcare service, the quality of the healthcare, and the healthcare service supply. The three indicators making up the CHCS index were weighted by applying an analytical hierarchy process. Three kinds of regression were subsequently applied to explain the CHCA and CHCS indices: Linear Least Squares, Ordinal Logistic, and Random Forests. In these regressions, we used different independent social and health-related variables. These variables represented the predisposing, enabling, and need factors of people's behaviors related to healthcare. All the calculations were applied to a study area: the city of Quito, Ecuador. Results showed that there are health-related inequalities in healthcare accessibility and healthcare satisfaction in our study area. We also identified specific social factors that explained the indices developed. The present work is a mixed-methods approach to evaluating multidimensional healthcare accessibility and healthcare satisfaction, incorporating a pluralistic perspective as well as a multidisciplinary framework. The results obtained can also serve as tools for healthcare and urban planners, supporting more integrative social analyses that can improve the quality of life of urban residents.
Affiliation(s)
- Pablo Cabrera-Barona
- Interfaculty Department of Geoinformatics - Z_GIS, University of Salzburg, Schillerstraße 30, 5020 Salzburg, Austria
| | - Thomas Blaschke
- Interfaculty Department of Geoinformatics - Z_GIS, University of Salzburg, Schillerstraße 30, 5020 Salzburg, Austria
| | - Stefan Kienberger
- Interfaculty Department of Geoinformatics - Z_GIS, University of Salzburg, Schillerstraße 30, 5020 Salzburg, Austria
| |
|
50
|
Hua L, An L, Li L, Zhang Y, Wang C. A bioinformatics strategy for detecting the complexity of Chronic Obstructive Pulmonary Disease in Northern Chinese Han Population. Genes Genet Syst 2016; 87:197-209. [PMID: 22976395 DOI: 10.1266/ggs.87.197] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/23/2022] Open
Abstract
Chronic Obstructive Pulmonary Disease (COPD) is a complex human disease driven not only by genetic factors but also by other variables such as gender, age and smoking. There is therefore a need to investigate the complex relationships among the various risk factors involved in COPD. In this study, 44 tagging SNPs from EPHX1, GSTP1, SERPINE2 and TGFB1 were selected and genotyped in 310 COPD cases and 203 controls, all of whom were Han Chinese from North China. We integrated functional prediction algorithms for nonsynonymous SNPs (nsSNPs) into a Bayesian network to explore the complex regulatory relationships among disease traits and various risk factors. The results showed that three basic variables (age, sex and smoking) were risk factors for the COPD-related trait and phenotype. Besides these risk factors, deleterious nsSNPs were found to perform better than significant synonymous SNPs when used as variables for risk prediction of disease outcome. This study provides further evidence on the complexity of COPD in the Northern Chinese Han population.
Affiliation(s)
- Lin Hua
- Biomedical Engineering Institute of Capital Medical University, Beijing, China.
| | | | | | | | | |
|