1
Liebenberg L, L'Abbé EN, Stull KE. Exploring cranial macromorphoscopic variation and classification accuracy in a South African sample. Int J Legal Med 2024. [PMID: 38622313 DOI: 10.1007/s00414-024-03230-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2024] [Accepted: 04/03/2024] [Indexed: 04/17/2024]
Abstract
To date South African forensic anthropologists are only able to successfully apply a metric approach to estimate population affinity when constructing a biological profile from skeletal remains. While a non-metric, or macromorphoscopic approach exists, limited research has been conducted to explore its use in a South African population. This study aimed to explore 17 cranial macromorphoscopic traits to develop improved methodology for the estimation of population affinity among black, white and coloured South Africans and for the method to be compliant with standards of best practice. The trait frequency distributions revealed substantial group variation and overlap, and not a single trait can be considered characteristic of any one population group. Kruskal-Wallis and Dunn's tests demonstrated significant population differences for 13 of the 17 traits. Random forest modelling was used to develop classification models to assess the reliability and accuracy of the traits in identifying population affinity. Overall, the model including all traits obtained a classification accuracy of 79% when assessing population affinity, which is comparable to current craniometric methods. The variable importance indicates that all the traits contributed some information to the model, with the inferior nasal margin, nasal bone contour, and nasal aperture shape ranked the most useful for classification. Thus, this study validates the use of macromorphoscopic traits in a South African sample, and the population-specific data from this study can potentially be incorporated into forensic casework and skeletal analyses in South Africa to improve population affinity estimates.
Affiliation(s)
- Leandi Liebenberg
- Department of Anatomy, University of Pretoria, Private Bag x323, Arcadia, 0007, South Africa.
- Forensic Anthropology Research Centre, University of Pretoria, Arcadia, South Africa.
- Ericka N L'Abbé
- Department of Anatomy, University of Pretoria, Private Bag x323, Arcadia, 0007, South Africa
- Kyra E Stull
- Department of Anatomy, University of Pretoria, Private Bag x323, Arcadia, 0007, South Africa
- Department of Anthropology, University of Nevada, Reno, USA
2
Liu Z, Wei S, Xiao N, Liu Y, Sun Q, Zhang B, Ji H, Cao H, Liu S. Insight into the correlation of key taste substances and key volatile substances from shrimp heads at different temperatures. Food Chem 2024; 450:139150. [PMID: 38688226 DOI: 10.1016/j.foodchem.2024.139150] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/09/2023] [Revised: 03/23/2024] [Accepted: 03/24/2024] [Indexed: 05/02/2024]
Abstract
This study aimed to investigate taste substances of shrimp heads stored at 20 °C, 4 °C, -3 °C, and -18 °C, and the correlation between taste substances and 25 key volatile substances. Notably, samples stored at 20 °C showed significant changes in bitter amino acids and hypoxanthine, and quickly deteriorated. Samples stored at 4 °C for 14 d or -3 °C for 30 d facilitated the development of umami amino acids, sweet amino acids, and IMP. Furthermore, samples stored at -18 °C for 30 d demonstrated no significant changes in taste profile. Changes in taste substances through quantitative analysis were consistent with changes in taste profile through e-tongue analysis. Based on the results of O2PLS (VIP > 1), Cys, Arg, Glu, Ser, Val, Ala, Ile, ADP, and IMP were correlated with 25 key volatile substances. This study provides fundamental data for the storage, transportation, and value-added utilization of shrimp heads.
Affiliation(s)
- Zhenyang Liu
- College of Food Science and Technology, Guangdong Ocean University, Guangdong Provincial Key Laboratory of Aquatic Product Processing and Safety, Guangdong Province Engineering Laboratory for Marine Biological Products, Guangdong Provincial Engineering Technology Research Center of Seafood, Guangdong Provincial Engineering Technology Research Center of Prefabricated Seafood Processing and Quality Control, Zhanjiang 524088, China; Universidade de Vigo, Nutrition and Bromatology Group, Department of Analytical and Food Chemistry, Faculty of Sciences, Ourense 32004, Spain
- Shuai Wei
- College of Food Science and Technology, Guangdong Ocean University, Guangdong Provincial Key Laboratory of Aquatic Product Processing and Safety, Guangdong Province Engineering Laboratory for Marine Biological Products, Guangdong Provincial Engineering Technology Research Center of Seafood, Guangdong Provincial Engineering Technology Research Center of Prefabricated Seafood Processing and Quality Control, Zhanjiang 524088, China
- Naiyong Xiao
- College of Food Science and Technology, Guangdong Ocean University, Guangdong Provincial Key Laboratory of Aquatic Product Processing and Safety, Guangdong Province Engineering Laboratory for Marine Biological Products, Guangdong Provincial Engineering Technology Research Center of Seafood, Guangdong Provincial Engineering Technology Research Center of Prefabricated Seafood Processing and Quality Control, Zhanjiang 524088, China
- Yi Liu
- Universidade de Vigo, Nutrition and Bromatology Group, Department of Analytical and Food Chemistry, Faculty of Sciences, Ourense 32004, Spain
- Qinxiu Sun
- College of Food Science and Technology, Guangdong Ocean University, Guangdong Provincial Key Laboratory of Aquatic Product Processing and Safety, Guangdong Province Engineering Laboratory for Marine Biological Products, Guangdong Provincial Engineering Technology Research Center of Seafood, Guangdong Provincial Engineering Technology Research Center of Prefabricated Seafood Processing and Quality Control, Zhanjiang 524088, China
- Bin Zhang
- College of Food Science and Pharmacy, Zhejiang Ocean University, Zhoushan 316022, China
- Hongwu Ji
- College of Food Science and Technology, Guangdong Ocean University, Guangdong Provincial Key Laboratory of Aquatic Product Processing and Safety, Guangdong Province Engineering Laboratory for Marine Biological Products, Guangdong Provincial Engineering Technology Research Center of Seafood, Guangdong Provincial Engineering Technology Research Center of Prefabricated Seafood Processing and Quality Control, Zhanjiang 524088, China; Collaborative Innovation Center of Seafood Deep Processing, Dalian Polytechnic University, Dalian 116034, China
- Hui Cao
- College of Food Science and Technology, Guangdong Ocean University, Guangdong Provincial Key Laboratory of Aquatic Product Processing and Safety, Guangdong Province Engineering Laboratory for Marine Biological Products, Guangdong Provincial Engineering Technology Research Center of Seafood, Guangdong Provincial Engineering Technology Research Center of Prefabricated Seafood Processing and Quality Control, Zhanjiang 524088, China
- Shucheng Liu
- College of Food Science and Technology, Guangdong Ocean University, Guangdong Provincial Key Laboratory of Aquatic Product Processing and Safety, Guangdong Province Engineering Laboratory for Marine Biological Products, Guangdong Provincial Engineering Technology Research Center of Seafood, Guangdong Provincial Engineering Technology Research Center of Prefabricated Seafood Processing and Quality Control, Zhanjiang 524088, China; Collaborative Innovation Center of Seafood Deep Processing, Dalian Polytechnic University, Dalian 116034, China.
3
Bothma NP, L'abbé EN, Liebenberg L. Evaluating postcranial macromorphoscopic traits to estimate population variation among modern South Africans. Forensic Sci Int 2024; 356:111954. [PMID: 38382241 DOI: 10.1016/j.forsciint.2024.111954] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2023] [Revised: 12/20/2023] [Accepted: 01/31/2024] [Indexed: 02/23/2024]
Abstract
Population overlap and the variation within and among populations have been globally observed but are often difficult to quantify. To achieve this, numerous different methods need to be explored and validated to assist with the creation of an accurate biological profile. The current lack of databases for postcranial macromorphoscopic (MMS) traits indicates the need to further investigate whether the method can be employed repeatably in a forensic context. The current study aimed to assess the prevalence of eleven postcranial macromorphoscopic traits in a South African sample. A total of 271 postcrania of adult black, coloured, and white South Africans were assessed. The intra- and inter-observer agreement ranged from fair to almost perfect except for the accessory transverse foramen of C1, which had poor agreement between observers. Only seven traits differed significantly between at least two of the groups. Univariate and multivariate random forest models were created to test the positive predictive performance of the traits to classify population affinity. The classification accuracies ranged from 33.3% to 53.0% for the univariate models and from 54.6% to 62.1% for the multivariate models. Based on the variable importance, the traits assessing spinous process bifurcation were the most discriminatory variables. The results indicate that the postcranial MMS approach does not outperform current methods employed to estimate population affinity. Further research needs to be done for the method to have practical applicability for medicolegal casework in South Africa.
Affiliation(s)
- N P Bothma
- University of Pretoria, Department of Anatomy, Pretoria, South Africa, Private Bag x323, Gezina 0031, South Africa.
- E N L'abbé
- University of Pretoria, Department of Anatomy, Pretoria, South Africa, Private Bag x323, Gezina 0031, South Africa
- L Liebenberg
- University of Pretoria, Department of Anatomy, Pretoria, South Africa, Private Bag x323, Gezina 0031, South Africa
4
Hong SM, Yoon IH, Cho KH. Predicting the distribution coefficient of cesium in solid phase groups using machine learning. Chemosphere 2024; 352:141462. [PMID: 38364923 DOI: 10.1016/j.chemosphere.2024.141462] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2023] [Revised: 02/06/2024] [Accepted: 02/13/2024] [Indexed: 02/18/2024]
Abstract
The migration and retention of radioactive contaminants such as cesium-137 (137Cs) in various environmental media pose significant long-term storage challenges for nuclear waste. The distribution coefficient (Kd) is a critical parameter for assessing the mobility of radioactive contaminants and is influenced by various environmental conditions. This study presents machine-learning models based on the Japan Atomic Energy Agency Sorption Database (JAEA-SDB) to predict the Kd values for Cs in solid phase groups. We used three different machine learning models: random forest (RF), artificial neural network (ANN), and convolutional neural network (CNN). The models were trained on 14 input variables from the JAEA-SDB, including factors such as the Cs concentration, solid-phase properties, and solution conditions, which were preprocessed by normalization and log-transformation. The performances of the models were evaluated using the coefficient of determination (R²) and root mean squared error (RMSE). The RF, ANN, and CNN models achieved R² values greater than 0.97, 0.86, and 0.88, respectively. We also analyzed the variable importance of RF using an out-of-bag (OOB) and a CNN with an attention module. Our results showed that the environmental media, initial radionuclide concentration, solid phase properties, and solution conditions were significant variables for Kd prediction. Our models accurately predict Kd values for different environmental conditions and can assess the environmental risk by analyzing the behavior of radionuclides in solid phase groups. The results of this study can improve safety analyses and long-term risk assessments related to waste disposal and prevent potential hazards and sources of contamination in the surrounding environment.
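The two evaluation metrics named in this abstract, the coefficient of determination and the root mean squared error, can be illustrated with a minimal Python sketch. The function names and the log-transformed Kd values below are hypothetical and invented purely for illustration; they are not taken from the JAEA-SDB or from the authors' code.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical log10(Kd) values, invented for illustration only.
log_kd_observed = [1.0, 2.0, 3.0, 4.0]
log_kd_predicted = [1.1, 1.9, 3.2, 3.8]

print("RMSE:", rmse(log_kd_observed, log_kd_predicted))
print("R^2: ", r_squared(log_kd_observed, log_kd_predicted))
```

A lower RMSE and an R² closer to 1 both indicate a closer fit, but RMSE is expressed in the units of the (here log-transformed) response, whereas R² is unitless.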
Affiliation(s)
- Seok Min Hong
- Department of Civil, Urban, Earth and Environmental Engineering, Ulsan National Institute of Science and Technology, Ulsan, 44919, Republic of Korea
- In-Ho Yoon
- Korea Atomic Energy Research Institute, Daejeon, Republic of Korea.
- Kyung Hwa Cho
- School of Civil, Environmental and Architectural Engineering, Korea University, Seoul, 02841, Republic of Korea.
5
Tian H, Tom BDM, Burgess S. A data-adaptive method for investigating effect heterogeneity with high-dimensional covariates in Mendelian randomization. BMC Med Res Methodol 2024; 24:34. [PMID: 38341532 PMCID: PMC10858611 DOI: 10.1186/s12874-024-02153-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/16/2023] [Accepted: 01/17/2024] [Indexed: 02/12/2024]
Abstract
BACKGROUND Mendelian randomization is a popular method for causal inference with observational data that uses genetic variants as instrumental variables. Similarly to a randomized trial, a standard Mendelian randomization analysis estimates the population-averaged effect of an exposure on an outcome. Dividing the population into subgroups can reveal effect heterogeneity to inform who would most benefit from intervention on the exposure. However, as covariates are measured post-"randomization", naive stratification typically induces collider bias in stratum-specific estimates. METHOD We extend a previously proposed stratification method (the "doubly-ranked method") to form strata based on a single covariate, and introduce a data-adaptive random forest method to calculate stratum-specific estimates that are robust to collider bias based on a high-dimensional covariate set. We also propose measures based on the Q statistic to assess heterogeneity between stratum-specific estimates (to understand whether estimates are more variable than expected due to chance alone) and variable importance (to identify the key drivers of effect heterogeneity). RESULT We show that the effect of body mass index (BMI) on lung function is heterogeneous, depending most strongly on hip circumference and weight. While for most individuals, the predicted effect of increasing BMI on lung function is negative, it is positive for some individuals and strongly negative for others. CONCLUSION Our data-adaptive approach allows for the exploration of effect heterogeneity in the relationship between an exposure and an outcome within a Mendelian randomization framework. This can yield valuable insights into disease aetiology and help identify specific groups of individuals who would derive the greatest benefit from targeted interventions on the exposure.
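As a rough illustration of the kind of heterogeneity measure mentioned in this abstract, the sketch below computes the standard Cochran's Q statistic for a set of stratum-specific estimates. This is the textbook inverse-variance form, assumed here for illustration; the abstract does not specify the exact formulation the authors use, and the function name, estimates, and standard errors are all hypothetical.

```python
def cochran_q(estimates, std_errors):
    """Cochran's Q for stratum-specific estimates with inverse-variance weights.

    Returns (Q, pooled_estimate, degrees_of_freedom).
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    return q, pooled, len(estimates) - 1


# Hypothetical stratum-specific effect estimates and standard errors.
stratum_estimates = [-0.2, 0.0, 0.2]
stratum_std_errors = [0.1, 0.1, 0.1]

q, pooled, df = cochran_q(stratum_estimates, stratum_std_errors)
print("Q =", q, "pooled =", pooled, "df =", df)
```

Under the null hypothesis of a common effect, Q is approximately chi-squared distributed with one fewer degrees of freedom than the number of strata, so a Q well above that reference distribution suggests the estimates vary more than expected by chance alone.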
Affiliation(s)
- Haodong Tian
- MRC Biostatistics Unit, School of Clinical Medicine, University of Cambridge, Cambridge, UK.
- Brian D M Tom
- MRC Biostatistics Unit, School of Clinical Medicine, University of Cambridge, Cambridge, UK
- Stephen Burgess
- MRC Biostatistics Unit, School of Clinical Medicine, University of Cambridge, Cambridge, UK
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
6
Markert N, Guhl B, Feld CK. Water quality deterioration remains a major stressor for macroinvertebrate, diatom and fish communities in German rivers. Sci Total Environ 2024; 907:167994. [PMID: 37875194 DOI: 10.1016/j.scitotenv.2023.167994] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2023] [Revised: 09/18/2023] [Accepted: 10/19/2023] [Indexed: 10/26/2023]
Abstract
About 60 % of Europe's rivers fail to meet ecological quality standards derived from biological criteria. The causes are manifold, but recent reports suggest a dominant role of hydro-morphological and water quality-related stressors. Yet, in particular micropollutants and hydrological stressors often tend to be underrepresented in multiple-stressor studies. Using monitoring data from four Federal States in Germany, this study investigated the effects of 19 stressor variables from six stressor groups (nutrients, salt ions, dissolved oxygen/water temperature, mixture toxicity of 51 micropollutants, hydrological alteration and morphological habitat quality) on three biological assemblages (fishes, macroinvertebrates, benthic diatoms). Biological effects were analyzed for 35 community metrics and quantified using Random Forest (RF) analyses to put the stressor groups into a hierarchical context. To compare metric responses, metrics were grouped into categories reflecting important characteristics of biological communities, such as sensitivity, functional traits, diversity and community composition as well as composite indices that integrate several metrics into one single index (e.g., ecological quality class). Water quality-related stressors - but not micropollutants - turned out to dominate the responses of all assemblages. In contrast, the effects of hydro-morphological stressors were less pronounced and stronger for hydrological stressors than for morphological stressors. Explained variances of RF models ranged 23-64 % for macroinvertebrates, 16-40 % for benthic diatoms and 18-48 % for fishes. Despite a high variability of responses across assemblages and stressor groups, sensitivity metrics tended to reveal stronger responses to individual stressors and a higher explained variance in RF models than composite indices. 
The results of this study suggest that (physico-chemical) water quality deterioration continues to impact biological assemblages in many German rivers, despite the extensive progress in wastewater treatment during the past decades. To detect water quality deterioration, monitoring schemes need to target relevant physico-chemical stressors and micropollutants. Furthermore, monitoring needs to integrate measures of hydrological alteration (e.g., flow magnitude and dynamics). At present, hydro-morphological surveys rarely address the degree of hydrological alteration. In order to achieve a good ecological status, river restoration and management need to address both water quality-related and hydro-morphological stressors. Restricting analyses to a single organism group (e.g., macroinvertebrates) or only selected metrics (e.g., ecological quality class) may hamper stressor identification and its hierarchical classification and thus may mislead river management.
Affiliation(s)
- Nele Markert
- North Rhine-Westphalian Office of Nature, Environment and Consumer Protection (LANUV NRW), 40208 Düsseldorf, Germany; University Duisburg-Essen, Faculty of Biology, Aquatic Ecology, Universitätsstr. 5, 45141 Essen, Germany.
- Barbara Guhl
- North Rhine-Westphalian Office of Nature, Environment and Consumer Protection (LANUV NRW), 40208 Düsseldorf, Germany
- Christian K Feld
- University Duisburg-Essen, Faculty of Biology, Aquatic Ecology, Universitätsstr. 5, 45141 Essen, Germany; University Duisburg-Essen, Centre for Water and Environmental Research (ZWU), Universitätsstr. 5, 45141 Essen, Germany
7
Boileau P, Qi NT, van der Laan MJ, Dudoit S, Leng N. A flexible approach for predictive biomarker discovery. Biostatistics 2023; 24:1085-1105. [PMID: 35861622 DOI: 10.1093/biostatistics/kxac029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2022] [Revised: 06/01/2022] [Accepted: 06/27/2022] [Indexed: 11/14/2022]
Abstract
An endeavor central to precision medicine is predictive biomarker discovery; they define patient subpopulations which stand to benefit most, or least, from a given treatment. The identification of these biomarkers is often the byproduct of the related but fundamentally different task of treatment rule estimation. Using treatment rule estimation methods to identify predictive biomarkers in clinical trials where the number of covariates exceeds the number of participants often results in high false discovery rates. The higher than expected number of false positives translates to wasted resources when conducting follow-up experiments for drug target identification and diagnostic assay development. Patient outcomes are in turn negatively affected. We propose a variable importance parameter for directly assessing the importance of potentially predictive biomarkers and develop a flexible nonparametric inference procedure for this estimand. We prove that our estimator is double robust and asymptotically linear under loose conditions in the data-generating process, permitting valid inference about the importance metric. The statistical guarantees of the method are verified in a thorough simulation study representative of randomized control trials with moderate and high-dimensional covariate vectors. Our procedure is then used to discover predictive biomarkers from among the tumor gene expression data of metastatic renal cell carcinoma patients enrolled in recently completed clinical trials. We find that our approach more readily discerns predictive from nonpredictive biomarkers than procedures whose primary purpose is treatment rule estimation. An open-source software implementation of the methodology, the uniCATE R package, is briefly introduced.
Affiliation(s)
- Philippe Boileau
- Graduate Group in Biostatistics and Center for Computational Biology, University of California, Berkeley, Berkeley, CA 94720, USA
- Nina Ting Qi
- Genentech Inc., 1 DNA Way, South San Francisco, CA 94080, USA
- Mark J van der Laan
- Division of Biostatistics, Department of Statistics, Center for Computational Biology, University of California, Berkeley, Berkeley, CA 94720, USA
- Sandrine Dudoit
- Division of Biostatistics, Department of Statistics, Center for Computational Biology, University of California, Berkeley, Berkeley, CA 94720, USA
- Ning Leng
- Genentech Inc., 1 DNA Way, South San Francisco, CA 94080, USA
8
Sheikhalishahi S, Bhattacharyya A, Celi LA, Osmani V. An interpretable deep learning model for time-series electronic health records: Case study of delirium prediction in critical care. Artif Intell Med 2023; 144:102659. [PMID: 37783541 DOI: 10.1016/j.artmed.2023.102659] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/24/2022] [Revised: 07/19/2023] [Accepted: 09/04/2023] [Indexed: 10/04/2023]
Abstract
Deep Learning (DL) models have received increasing attention in the clinical setting, particularly in intensive care units (ICU). In this context, the interpretability of the outcomes estimated by the DL models is an essential step towards increasing adoption of DL models in clinical practice. To address this challenge, we propose an ante-hoc, interpretable neural network model. Our proposed model, named double self-attention architecture (DSA), uses two attention-based mechanisms, including self-attention and effective attention. It can capture the importance of input variables in general, as well as changes in importance along the time dimension for the outcome of interest. We evaluated our model using two real-world clinical datasets covering 22840 patients in predicting onset of delirium 12 h and 48 h in advance. Additionally, we compare the descriptive performance of our model with three post-hoc interpretable algorithms as well as with the opinion of clinicians based on the published literature and clinical experience. We find that our model covers the majority of the top-10 variables ranked by the three post-hoc interpretable algorithms as well as by clinical opinion, with the advantage of taking into account both the dependencies among variables and the dependencies between time-steps. Finally, our results show that our model can improve descriptive performance without sacrificing predictive performance.
Affiliation(s)
- Leo Anthony Celi
- Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA; Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA, USA; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Venet Osmani
- Fondazione Bruno Kessler Research Institute, Trento, Italy; Information School, University of Sheffield, UK.
9
Abstract
Recent reform efforts have pushed toward a better understanding of the distinction between exploratory and confirmatory research, and appropriate use of each. As some utilize more exploratory tools, it may be tempting to employ multiple linear regression models. In this paper, we advocate for the use of random forest (RF) models. RF is able to obtain better predictive performance than traditional regression, while also inherently protecting against overfitting as well as detecting nonlinear effects and interactions among predictors. Given the advantages of RF compared to other statistical procedures, it is a tool commonly used within a plethora of industries, including stock trading, banking, pharmaceuticals, and patient healthcare planning. However, we find RF is used within the field of psychology comparatively less frequently. In the current paper, we advocate for RF as an important statistical tool within the context of behavioral and psychological research. In hopes of increasing the use of RF in the field of psychology, we provide information pertaining to the limitations one might confront in using RF and how to overcome such limitations. Moreover, we discuss various methods for how to optimally utilize RF with psychological data, such as nonparametric modeling, interaction and nonlinearity detection, variable selection, prediction and classification modeling, and assessing parameters of Monte Carlo simulations. Throughout, we illustrate the use of RF with visualization strategies, aimed to make RF models more comprehensible and intuitive.
10
Alakus C, Larocque D, Labbe A. Covariance regression with random forests. BMC Bioinformatics 2023; 24:258. [PMID: 37330468 DOI: 10.1186/s12859-023-05377-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2022] [Accepted: 06/02/2023] [Indexed: 06/19/2023]
Abstract
Capturing the conditional covariances or correlations among the elements of a multivariate response vector based on covariates is important to various fields including neuroscience, epidemiology and biomedicine. We propose a new method called Covariance Regression with Random Forests (CovRegRF) to estimate the covariance matrix of a multivariate response given a set of covariates, using a random forest framework. Random forest trees are built with a splitting rule specially designed to maximize the difference between the sample covariance matrix estimates of the child nodes. We also propose a significance test for the partial effect of a subset of covariates. We evaluate the performance of the proposed method and significance test through a simulation study which shows that the proposed method provides accurate covariance matrix estimates and that the Type-1 error is well controlled. An application of the proposed method to thyroid disease data is also presented. CovRegRF is implemented in a freely available R package on CRAN.
Affiliation(s)
- Cansu Alakus
- Department of Decision Sciences, HEC Montréal, Montréal, Canada.
- Denis Larocque
- Department of Decision Sciences, HEC Montréal, Montréal, Canada
- Aurélie Labbe
- Department of Decision Sciences, HEC Montréal, Montréal, Canada
11
Markus AF, Fridgeirsson EA, Kors JA, Verhamme KMC, Rijnbeek PR. Challenges of Estimating Global Feature Importance in Real-World Health Care Data. Stud Health Technol Inform 2023; 302:1057-1061. [PMID: 37203580 DOI: 10.3233/shti230346] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/20/2023]
Abstract
Feature importance is often used to explain clinical prediction models. In this work, we examine three challenges using experiments with electronic health record data: computational feasibility, choosing between methods, and interpretation of the resulting explanation. This work aims to create awareness of the disagreement between feature importance methods and underscores the need for guidance for practitioners on how to deal with these discrepancies.
Affiliation(s)
- Aniek F Markus
- Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Egill A Fridgeirsson
- Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Jan A Kors
- Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Katia M C Verhamme
- Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Peter R Rijnbeek
- Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
12
Sheikholeslami R, Hall JW. Global patterns and key drivers of stream nitrogen concentration: A machine learning approach. Sci Total Environ 2023; 868:161623. [PMID: 36657680 PMCID: PMC10933795 DOI: 10.1016/j.scitotenv.2023.161623] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2022] [Revised: 12/22/2022] [Accepted: 01/11/2023] [Indexed: 06/17/2023]
Abstract
Anthropogenic loading of nitrogen to river systems can pose serious health hazards and create critical environmental threats. Quantification of the magnitude and impact of freshwater nitrogen requires identifying key controls of nitrogen dynamics and analyzing both the past and present patterns of nitrogen flows. To tackle this challenge, we adopted a machine learning (ML) approach and built an ML-driven representation that captures spatiotemporal variability in nitrogen concentrations at global scale. Our model uses random forests to regress a large sample of monthly measured stream nitrogen concentrations onto a set of 17 predictors with a spatial resolution of 0.5 degrees over the period 1990-2013, including observations within the pixel and upstream drivers. The model was validated with data from rivers outside the training dataset and was used to predict nitrogen concentrations in 520 major river basins of the world, including many with scarce or no observations. We predicted that the regions with the highest median nitrogen concentrations in their rivers (in 2013) were: United States (Mississippi), Pakistan, Bangladesh, India (Indus, Ganges), China (Yellow, Yangtze, Yongding, Huai), and most of Europe (Rhine, Danube, Vistula, Thames, Trent, Severn). Other major hotspots were the river basins of the Sebou (Morocco), Nakdong (South Korea), Kitakami (Japan), and Egypt's Nile Delta. Our analysis showed that the rate of increase in nitrogen concentration between the 1990s and 2000s was greatest in rivers located in eastern China, eastern and central parts of Canada, the Baltic states, Pakistan, mainland southeast Asia, and south-eastern Australia. Using a new grouped variable importance measure, we also found that temporality (month of the year and cumulative month count) is the most influential predictor, followed by factors representing hydroclimatic conditions, diffuse nutrient emissions from agriculture, and topographic features. 
Our model can be further applied to assess strategies designed to reduce nitrogen pollution in freshwater bodies at large spatial scales.
Affiliation(s)
- Razi Sheikholeslami
- School of Geography and the Environment, University of Oxford, Oxford, UK; Environmental Change Institute, University of Oxford, Oxford, UK; Department of Civil Engineering, Sharif University of Technology, Tehran, Iran.
- Jim W Hall
- School of Geography and the Environment, University of Oxford, Oxford, UK; Environmental Change Institute, University of Oxford, Oxford, UK
|
13
|
Jayaramu V, Zulkafli Z, De Stercke S, Buytaert W, Rahmat F, Abdul Rahman RZ, Ishak AJ, Tahir W, Ab Rahman J, Mohd Fuzi NMH. Leptospirosis modelling using hydrometeorological indices and random forest machine learning. Int J Biometeorol 2023; 67:423-437. [PMID: 36719482 DOI: 10.1007/s00484-022-02422-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2022] [Revised: 12/21/2022] [Accepted: 12/26/2022] [Indexed: 06/18/2023]
Abstract
Leptospirosis is a zoonosis that has been linked to hydrometeorological variability. Hydrometeorological averages and extremes have been used before as drivers in the statistical prediction of disease. However, their importance and predictive capacity are still little known. In this study, the use of a random forest classifier was explored to analyze the relative importance of hydrometeorological indices in developing the leptospirosis model and to evaluate the performance of models based on the type of indices used, using case data from three districts in Kelantan, Malaysia, that experience annual monsoonal rainfall and flooding. First, hydrometeorological data including rainfall, streamflow, water level, relative humidity, and temperature were transformed into 164 weekly average and extreme indices in accordance with the Expert Team on Climate Change Detection and Indices (ETCCDI). Then, weekly case occurrences were classified into binary classes "high" and "low" based on an average threshold. Seventeen models based on "average," "extreme," and "mixed" indices were trained by optimizing the feature subsets based on the model computed mean decrease Gini (MDG) scores. The variable importance was assessed through cross-correlation analysis and the MDG score. The average and extreme models showed similar prediction accuracy ranges (61.5-76.1% and 72.3-77.0%) while the mixed models showed an improvement (71.7-82.6% prediction accuracy). An extreme model was the most sensitive while an average model was the most specific. The time lag associated with the driving indices agreed with the seasonality of the monsoon. The rainfall variable (extreme) was the most important in classifying the leptospirosis occurrence while streamflow was the least important despite showing higher correlations with leptospirosis.
Affiliation(s)
- Veianthan Jayaramu
- Department of Civil Engineering, Universiti Putra Malaysia, Serdang, Malaysia
- Zed Zulkafli
- Department of Civil Engineering, Universiti Putra Malaysia, Serdang, Malaysia.
- Simon De Stercke
- Department of Civil and Environmental Engineering, Imperial College London, London, UK
- Wouter Buytaert
- Department of Civil and Environmental Engineering, Imperial College London, London, UK
- Fariq Rahmat
- Department of Electrical and Electronic Engineering, Universiti Putra Malaysia, Serdang, Malaysia
- Asnor Juraiza Ishak
- Department of Electrical and Electronic Engineering, Universiti Putra Malaysia, Serdang, Malaysia
- Wardah Tahir
- Flood Control Research Group, Faculty of Civil Engineering, Universiti Teknologi Mara, Shah Alam, Malaysia
- Jamalludin Ab Rahman
- Department of Community Medicine, Kulliyyah of Medicine, International Islamic University Malaysia, Kuantan, Malaysia
|
14
|
Gu Y, Liu D, Arvin R, Khattak AJ, Han LD. Predicting intersection crash frequency using connected vehicle data: A framework for geographical random forest. Accid Anal Prev 2023; 179:106880. [PMID: 36345113 DOI: 10.1016/j.aap.2022.106880] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 03/22/2022] [Revised: 10/06/2022] [Accepted: 10/20/2022] [Indexed: 06/16/2023]
Abstract
Accurate crash frequency prediction is critical for proactive safety management. The emerging connected vehicles technology provides us with a wealth of vehicular motion data, which enables a better connection between crash frequency and driving behaviors. However, appropriately dealing with the spatial dependence of crash frequency and multitudinous driving features has been a difficult but critical challenge in the prediction process. To this end, this study aims to investigate a new Artificial Intelligence technique called Geographical Random Forest (GRF) that can address spatial heterogeneity and retain all potential predictors. By harnessing more than 2.2 billion high-resolution connected vehicle Basic Safety Message (BSM) observations from the Safety Pilot Model Deployment in Ann Arbor, MI, 30 indicators of driving volatility are extracted, including speed, longitudinal and lateral acceleration, and yaw rate. The developed GRF was implemented to predict rear-end crash frequency at intersections. The results show that: 1) rear-end crashes are more likely to happen at intersections connecting minor roads compared to major roads; 2) a higher number of hard acceleration and deceleration events beyond two standard deviations in the longitudinal direction is a leading indicator of rear-end crashes; 3) the optimal GRF significantly outperforms Global Random Forest, with a 9% lower test error and a substantially better fit; and 4) geographical visualization of variable importance highlights the presence of spatial non-stationarity. The proposed framework can proactively identify at-risk intersections and alert drivers when leading indicators of driving volatility tend to worsen.
Affiliation(s)
- Yangsong Gu
- Department of Civil and Environmental Engineering, University of Tennessee, Knoxville, TN, USA.
- Diyi Liu
- Department of Civil and Environmental Engineering, University of Tennessee, Knoxville, TN, USA.
- Ramin Arvin
- Department of Civil and Environmental Engineering, University of Tennessee, Knoxville, TN, USA.
- Asad J Khattak
- Department of Civil and Environmental Engineering, University of Tennessee, Knoxville, TN, USA.
- Lee D Han
- Department of Civil and Environmental Engineering, University of Tennessee, Knoxville, TN, USA.
|
15
|
Sun T, Ji C, Li F, Shan X, Wu H. The legacy effect of microplastics on aquatic animals in the depuration phase: Kinetic characteristics and recovery potential. Environ Int 2022; 168:107467. [PMID: 35985106 DOI: 10.1016/j.envint.2022.107467] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/23/2022] [Revised: 07/25/2022] [Accepted: 08/09/2022] [Indexed: 06/15/2023]
Abstract
The prevalence of microplastics (MPs) in global aquatic environments has received considerable attention. Concerns have recently been raised by reports that the adverse effects of MPs on aquatic animals in the exposure phase may not be (completely) reversed in the depuration phase. To provide insights into the legacy effect of MPs from the depuration phase, this study evaluated the kinetic characteristics and recovery potential of aquatic animals after exposure to MPs. More specifically, a total of 68 depuration kinetic curves were closely fitted to estimate the retention time of MPs. The retention time ranged from 1.26 to 3.01 days, corresponding to the egestion of 90% to 99% of ingested MPs. The retention time decreased with the increased retention rate. Furthermore, variables potentially affecting the retention time were ranked by the decision tree-based eXtreme Gradient Boosting (XGBoost) algorithm, suggesting that particle size and tested species were of great importance for explaining the difference in retention time of MPs. Moreover, a biomarker profile was recompiled to determine the toxic changes. Results indicated that MP-induced toxicity was significantly reduced in the depuration phase, as evidenced by the recovery of energy reserves and metabolism, hepatotoxicity, immunotoxicity, hematological parameters, neurotoxicity and oxidative stress. However, the continuous detoxification and remarkable genotoxicity implied that the toxicity was not completely alleviated. In addition, current knowledge gaps are highlighted, with recommendations proposed for future research.
Affiliation(s)
- Tao Sun
- CAS Key Laboratory of Coastal Environmental Processes and Ecological Remediation, Yantai Institute of Coastal Zone Research (YIC), Chinese Academy of Sciences (CAS), Shandong Key Laboratory of Coastal Environmental Processes, YICCAS, Yantai 264003, PR China; University of Chinese Academy of Sciences, Beijing 100049, PR China
- Chenglong Ji
- CAS Key Laboratory of Coastal Environmental Processes and Ecological Remediation, Yantai Institute of Coastal Zone Research (YIC), Chinese Academy of Sciences (CAS), Shandong Key Laboratory of Coastal Environmental Processes, YICCAS, Yantai 264003, PR China; Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao 266237, PR China; Center for Ocean Mega-Science, Chinese Academy of Sciences (CAS), Qingdao 266071, PR China
- Fei Li
- CAS Key Laboratory of Coastal Environmental Processes and Ecological Remediation, Yantai Institute of Coastal Zone Research (YIC), Chinese Academy of Sciences (CAS), Shandong Key Laboratory of Coastal Environmental Processes, YICCAS, Yantai 264003, PR China; Center for Ocean Mega-Science, Chinese Academy of Sciences (CAS), Qingdao 266071, PR China
- Xiujuan Shan
- Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao 266237, PR China
- Huifeng Wu
- CAS Key Laboratory of Coastal Environmental Processes and Ecological Remediation, Yantai Institute of Coastal Zone Research (YIC), Chinese Academy of Sciences (CAS), Shandong Key Laboratory of Coastal Environmental Processes, YICCAS, Yantai 264003, PR China; Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao 266237, PR China; Center for Ocean Mega-Science, Chinese Academy of Sciences (CAS), Qingdao 266071, PR China.
|
16
|
Behrouz MS, Yazdi MN, Sample DJ. Using Random Forest, a machine learning approach to predict nitrogen, phosphorus, and sediment event mean concentrations in urban runoff. J Environ Manage 2022; 317:115412. [PMID: 35649331 DOI: 10.1016/j.jenvman.2022.115412] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2022] [Revised: 05/22/2022] [Accepted: 05/24/2022] [Indexed: 06/15/2023]
Abstract
Estimating pollutant loads from developed watersheds is vitally important to reduce nonpoint source pollution from urban areas, as a key tool in meeting water quality goals is the implementation of Stormwater Control Measures (SCMs). SCMs are selected and sized based on influent pollutant loads. A common method used to estimate pollutant loads in urban runoff is the Event Mean Concentration (EMC) method. In this study, we develop and apply data-driven models using Random Forest (RF), a machine learning approach, to predict Total Nitrogen (TN), Total Phosphorus (TP), Total Suspended Solids (TSS), and Ortho-Phosphorus (Ortho-P) EMCs in urban runoff. The parameters considered in this study were climatological characteristics (i.e., Antecedent Dry Period or ADP, Precipitation Depth or P, Duration or D, and Intensity or I) and catchment characteristics, comprising land use-related parameters (Imperviousness or Imp, Saturated Hydraulic Conductivity or Ksat, and Available Water Capacity or AWC) and site-specific parameters (Slope or S, and Catchment Size or A). Stormwater quality data for this study were obtained from the National Stormwater Quality Database (NSQD), which is the largest repository of stormwater quality data in the U.S. Results demonstrate that land use-related characteristics (i.e., Imp, Ksat, and AWC) were the most effective variables for predicting all EMCs. For TP, TSS, and Ortho-P, site-specific characteristics (S and A) had a greater effect than climatological characteristics (i.e., ADP, P, D, and I). However, for TN, climatological characteristics had a greater effect than site-specific characteristics (S and A). In addition, for TN, TP, and TSS, precipitation characteristics (P, D, and I) were found to be more effective parameters for estimating EMCs than ADP. This study highlights the most influential parameters affecting EMCs, which can be used by stakeholders and SCM designers to improve estimates of nutrient and sediment EMCs. The selection and design of the highest performing SCMs is essential in achieving effective treatment of stormwater, attaining water quality goals, and protecting downstream waterbodies.
Affiliation(s)
- Mina Shahed Behrouz
- Department of Biological System Engineering, Virginia Polytechnic Institute and State University, Seitz Hall, 155 Ag-Quad Ln, Blacksburg, VA, 24060, United States; Hampton Roads Agricultural Research and Extension Center, Virginia Polytechnic and State University, 1444 Diamond Springs Rd, Virginia Beach, VA, 23455, United States.
- Mohammad Nayeb Yazdi
- Department of Biological System Engineering, Virginia Polytechnic Institute and State University, Seitz Hall, 155 Ag-Quad Ln, Blacksburg, VA, 24060, United States; Hampton Roads Agricultural Research and Extension Center, Virginia Polytechnic and State University, 1444 Diamond Springs Rd, Virginia Beach, VA, 23455, United States.
- David J Sample
- Department of Biological System Engineering, Virginia Polytechnic Institute and State University, Seitz Hall, 155 Ag-Quad Ln, Blacksburg, VA, 24060, United States.
|
17
|
Lin JYJ, Hu L, Huang C, Jiayi J, Lawrence S, Govindarajulu U. A flexible approach for variable selection in large-scale healthcare database studies with missing covariate and outcome data. BMC Med Res Methodol 2022; 22:132. [PMID: 35508974 PMCID: PMC9066834 DOI: 10.1186/s12874-022-01608-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/04/2021] [Accepted: 04/19/2022] [Indexed: 12/17/2022] Open
Abstract
Background Prior work has shown that combining bootstrap imputation with tree-based machine learning variable selection methods can provide good performance, comparable to that achievable on fully observed data, when covariate and outcome data are missing at random (MAR). This approach, however, is computationally expensive, especially on large-scale datasets. Methods We propose an inference-based method, called RR-BART, which leverages the likelihood-based Bayesian machine learning technique, Bayesian additive regression trees, and uses Rubin’s rule to combine the estimates and variances of the variable importance measures on multiply imputed datasets for variable selection in the presence of MAR data. We conduct a representative simulation study to investigate the practical operating characteristics of RR-BART, and compare it with the bootstrap imputation based methods. We further demonstrate the methods via a case study of risk factors for 3-year incidence of metabolic syndrome among middle-aged women using data from the Study of Women’s Health Across the Nation (SWAN). Results The simulation study suggests that even in complex conditions of nonlinearity and nonadditivity with a large percentage of missingness, RR-BART can reasonably recover both the prediction and the variable selection performance achievable on the fully observed data. RR-BART provides the best performance that the bootstrap imputation based methods can achieve with the optimal selection threshold value. In addition, RR-BART demonstrates a substantially stronger ability to detect discrete predictors. Furthermore, RR-BART offers substantial computational savings. When implemented on the SWAN data, RR-BART adds to the literature by selecting a set of predictors that had been less commonly identified as risk factors but had substantial biological justifications. Conclusion The proposed variable selection method for MAR data, RR-BART, offers both computational efficiency and good operating characteristics and is practical for large-scale healthcare database studies. Supplementary Information The online version contains supplementary material available at (10.1186/s12874-022-01608-7).
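The pooling step named in the abstract, Rubin's rule, can be sketched as follows. This is only an illustration of the standard rule for combining per-imputation estimates and variances; it does not fit BART or apply the paper's selection criterion, and the function name `pool_rubin` and the example numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their variances with Rubin's rules.

    estimates: one point estimate per imputed dataset (at least two).
    variances: the matching within-imputation variances.
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    pooled = mean(estimates)        # pooled point estimate
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical importance scores for one predictor across m = 4 imputed datasets.
pooled, total = pool_rubin([0.5, 0.25, 0.75, 0.5], [0.0625, 0.0625, 0.0625, 0.0625])
```

The total variance grows with the disagreement between imputations, so a predictor whose importance is unstable across imputed datasets is pooled with a correspondingly wider uncertainty.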
Affiliation(s)
- Jung-Yi Joyce Lin
- Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, 1425 Madison Ave, New York, 10029, USA
- Liangyuan Hu
- Department of Biostatistics and Epidemiology, Rutgers University, 683 Hoes Lane West, Piscataway, 08854, USA.
- Chuyue Huang
- Primary Research Solution LLC., 115 W 18th St, New York, 10011, USA
- Ji Jiayi
- Department of Biostatistics and Epidemiology, Rutgers University, 683 Hoes Lane West, Piscataway, 08854, USA
- Steven Lawrence
- Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, 1425 Madison Ave, New York, 10029, USA
- Usha Govindarajulu
- Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, 1425 Madison Ave, New York, 10029, USA
|
18
|
Luo Y, Yan J, McClure SC, Li F. Socioeconomic and environmental factors of poverty in China using geographically weighted random forest regression model. Environ Sci Pollut Res Int 2022; 29:33205-33217. [PMID: 35022975 PMCID: PMC8754530 DOI: 10.1007/s11356-021-17513-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/03/2021] [Accepted: 11/09/2021] [Indexed: 06/07/2023]
Abstract
Correlations between socioeconomic factors and poverty in regression models do not reflect actual relationships, especially when data exhibit patterns of spatial heterogeneity. Spatial regression models can estimate the relationships between socioeconomic factors and poverty in defined geographical areas, explaining the imbalanced distribution of poverty. However, the relationships between these factors and poverty are not always linear, and conventional simple linear local regression models do not accurately capture these nonlinear relationships. To fill this gap, we used a local regression method, geographically weighted random forest regression (GW-RFR), that integrates a spatial weight matrix (SWM) and random forest (RF). The GW-RFR evaluates the spatial variations in the nonlinear relationships between variables. A county-level poverty data set of China was employed to evaluate the performance of the GW-RFR against the RF. In this poverty application, the value of [Formula: see text] for the GW-RFR was 0.128 higher than that of the RF, the NRMSE value was 1.6% lower than that of the RF, and the MAE value was 0.295 lower than that of the RF. These results showed that the relationship between poverty factors and poverty varies with space at the county level in China, and that the GW-RFR is suitable for dealing with nonlinear relationships in local regression analysis.
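Two of the error metrics used in this comparison, MAE and NRMSE, can be computed as in the minimal sketch below. The abstract does not say how the RMSE was normalised; dividing by the range of the observed values is an assumption made here for illustration, and the data are invented.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def nrmse(y_true, y_pred):
    """Root mean squared error divided by the observed range.

    Assumption: range normalisation; other conventions divide by the mean
    or the standard deviation of the observations.
    """
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse) / (max(y_true) - min(y_true))


# Invented observations and predictions, for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
mae_value = mae(observed, predicted)
nrmse_value = nrmse(observed, predicted)
```

Lower values of both metrics indicate a closer fit, which is the sense in which the abstract reports the GW-RFR improving on the RF.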
Affiliation(s)
- Yaowen Luo
- Chinese Antarctic Center of Surveying and Mapping, Wuhan University, Wuhan, 430070, China
- Jianguo Yan
- State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, 430070, China.
- Stephen C McClure
- State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, 430070, China
- Fei Li
- Chinese Antarctic Center of Surveying and Mapping, Wuhan University, Wuhan, 430070, China.
- State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, 430070, China.
|
19
|
Parks J, McLean KE, McCandless L, de Souza RJ, Brook JR, Scott J, Turvey SE, Mandhane PJ, Becker AB, Azad MB, Moraes TJ, Lefebvre DL, Sears MR, Subbarao P, Takaro TK. Assessing secondhand and thirdhand tobacco smoke exposure in Canadian infants using questionnaires, biomarkers, and machine learning. J Expo Sci Environ Epidemiol 2022; 32:112-123. [PMID: 34175887 PMCID: PMC8770125 DOI: 10.1038/s41370-021-00350-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 11/27/2020] [Revised: 05/28/2021] [Accepted: 05/28/2021] [Indexed: 06/02/2023]
Abstract
BACKGROUND As smoking prevalence has decreased in Canada, particularly during pregnancy and around children, and technological improvements have lowered detection limits, the use of traditional tobacco smoke biomarkers in infant populations requires re-evaluation. OBJECTIVE We evaluated concentrations of urinary nicotine biomarkers, cotinine and trans-3'-hydroxycotinine (3HC), and questionnaire responses. We used machine learning and prediction modeling to understand sources of tobacco smoke exposure for infants from the CHILD Cohort Study. METHODS Multivariable linear regression models, chosen through a combination of conceptual and data-driven strategies including random forest regression, assessed the ability of questionnaires to predict variation in urinary cotinine and 3HC concentrations of 2017 3-month-old infants. RESULTS Although only 2% of mothers reported smoking prior to and throughout their pregnancy, cotinine and 3HC were detected in 76 and 89% of the infants' urine (n = 2017). Questionnaire-based models explained 31 and 41% of the variance in cotinine and 3HC levels, respectively. Observed concentrations suggest 0.25 and 0.50 ng/mL as cut-points in cotinine and 3HC to characterize SHS exposure. These cut-points suggest that 23.5% of infants had moderate or regular smoke exposure. SIGNIFICANCE Though most people make efforts to reduce their infants' exposure, parents do not appear to consider the pervasiveness and persistence of secondhand and thirdhand smoke. More than half of the variation in urinary cotinine and 3HC in infants could not be predicted with modeling. The pervasiveness of thirdhand smoke, the potential for dermal and oral routes of nicotine exposure, along with changes in public perceptions of smoking exposure and risk warrant further exploration.
Affiliation(s)
- Jaclyn Parks
- Faculty of Health Sciences, Simon Fraser University, Burnaby, BC, Canada
- Russell J de Souza
- Department of Health Research Methods, Evidence, and Impact, Faculty of Health Sciences, McMaster University, Hamilton, ON, Canada
- Jeffrey R Brook
- Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
- James Scott
- Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
- Stuart E Turvey
- Department of Pediatrics, Faculty of Medicine, University of British Columbia, Vancouver, BC, Canada
- Piush J Mandhane
- Department of Pediatrics, University of Alberta, Edmonton, AB, Canada
- Allan B Becker
- Department of Pediatrics and Child Health, University of Manitoba, Winnipeg, MB, Canada
- Meghan B Azad
- Department of Pediatrics and Child Health, University of Manitoba, Winnipeg, MB, Canada
- Children's Hospital Research Institute of Manitoba, Winnipeg, MB, Canada
- Theo J Moraes
- Hospital for Sick Children, Toronto, ON, Canada
- Department of Pediatrics, University of Toronto, Toronto, ON, Canada
- Diana L Lefebvre
- Department of Medicine, Faculty of Health Sciences, McMaster University, Hamilton, ON, Canada
- Malcolm R Sears
- Department of Medicine, Faculty of Health Sciences, McMaster University, Hamilton, ON, Canada
- Padmaja Subbarao
- Hospital for Sick Children, Toronto, ON, Canada
- Department of Pediatrics, University of Toronto, Toronto, ON, Canada
- Tim K Takaro
- Faculty of Health Sciences, Simon Fraser University, Burnaby, BC, Canada.
|
20
|
Pickett KL, Suresh K, Campbell KR, Davis S, Juarez-Colunga E. Random survival forests for dynamic predictions of a time-to-event outcome using a longitudinal biomarker. BMC Med Res Methodol 2021; 21:216. [PMID: 34657597 PMCID: PMC8520610 DOI: 10.1186/s12874-021-01375-x] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 02/17/2021] [Accepted: 08/21/2021] [Indexed: 11/24/2022] Open
Abstract
BACKGROUND Risk prediction models for time-to-event outcomes play a vital role in personalized decision-making. A patient's biomarker values, such as medical lab results, are often measured over time, but traditional prediction models ignore their longitudinal nature, using only baseline information. Dynamic prediction incorporates longitudinal information to produce updated survival predictions during follow-up. Existing methods for dynamic prediction include joint modeling, which often suffers from computational complexity and poor performance under misspecification, and landmarking, which has a straightforward implementation but typically relies on a proportional hazards model. Random survival forests (RSF), a machine learning algorithm for time-to-event outcomes, can capture complex relationships between the predictors and survival without requiring prior specification and have been shown to have superior predictive performance. METHODS We propose an alternative approach for dynamic prediction using random survival forests in a landmarking framework. With a simulation study, we compared the predictive performance of our proposed method with Cox landmarking and joint modeling in situations where the proportional hazards assumption does not hold and the longitudinal marker(s) have a complex relationship with the survival outcome. We illustrated the use of the RSF landmark approach in two clinical applications to assess the performance of various RSF model building decisions and to demonstrate its use in obtaining dynamic predictions. RESULTS In simulation studies, RSF landmarking outperformed joint modeling and Cox landmarking when a complex relationship between the survival and longitudinal marker processes was present. It was also useful in application when there were several predictors for which the clinical relevance was unknown and multiple longitudinal biomarkers were present. Individualized dynamic predictions can be obtained from this method, and the variable importance metric is useful for examining the changing predictive power of variables over time. In addition, RSF landmarking is easily implementable in standard software and, using suggested specifications, requires less computation time than joint modeling. CONCLUSIONS RSF landmarking is a nonparametric, machine learning alternative to current methods for obtaining dynamic predictions when there are complex or unknown relationships present. It requires little upfront decision-making, has comparable predictive performance, and offers preferable computational speed.
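The data preparation behind landmarking can be sketched as below. This is a simplified illustration, not the authors' implementation: it keeps only subjects still at risk at the landmark time and summarises each biomarker history by its most recent value (last observation carried forward, an assumption). The dictionary keys and the function name `landmark_dataset` are hypothetical; a survival model such as a random survival forest would then be fitted to the resulting rows with a separate library.

```python
def landmark_dataset(subjects, landmark_time):
    """Build the analysis rows for one landmark time.

    subjects: list of dicts with hypothetical keys
      'id', 'event_time', 'event' (1 = event, 0 = censored),
      'measurements' (list of (time, value) pairs).
    """
    rows = []
    for s in subjects:
        if s["event_time"] <= landmark_time:
            continue  # event or censoring already occurred: not at risk
        history = [(t, v) for t, v in s["measurements"] if t <= landmark_time]
        if not history:
            continue  # no biomarker value observed by the landmark
        _, last_value = max(history, key=lambda tv: tv[0])  # most recent value
        rows.append({
            "id": s["id"],
            "biomarker": last_value,
            "residual_time": s["event_time"] - landmark_time,  # clock reset at landmark
            "event": s["event"],
        })
    return rows


# Invented follow-up data for three subjects.
subjects = [
    {"id": 1, "event_time": 5.0, "event": 1,
     "measurements": [(0.0, 1.0), (2.0, 1.5), (4.0, 2.0)]},
    {"id": 2, "event_time": 1.5, "event": 1, "measurements": [(0.0, 3.0)]},
    {"id": 3, "event_time": 6.0, "event": 0,
     "measurements": [(0.0, 0.5), (3.5, 0.7)]},
]
rows = landmark_dataset(subjects, landmark_time=3.0)
```

Repeating this for a grid of landmark times gives one dataset per time point, which is what allows predictions to be updated as new measurements arrive.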
Affiliation(s)
- Kaci L Pickett
- Department of Pediatrics, University of Colorado Anschutz Medical Campus, Aurora, 80045 Colorado USA
- Krithika Suresh
- Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, 80045 Colorado USA
- Adult and Child Consortium for Health Outcomes and Delivery Science, University of Colorado Anschutz Medical Campus, Aurora, 80045 Colorado USA
- Kristen R Campbell
- Department of Pediatrics, University of Colorado Anschutz Medical Campus, Aurora, 80045 Colorado USA
- Scott Davis
- Division of Renal Diseases and Hypertension, University of Colorado Anschutz Medical Campus, Aurora, 80045 Colorado USA
- Elizabeth Juarez-Colunga
- Department of Pediatrics, University of Colorado Anschutz Medical Campus, Aurora, 80045 Colorado USA
- Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, 80045 Colorado USA
|
21
|
Nath A. Prediction for understanding the effectiveness of antiviral peptides. Comput Biol Chem 2021; 95:107588. [PMID: 34655913 DOI: 10.1016/j.compbiolchem.2021.107588] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/12/2021] [Revised: 10/01/2021] [Accepted: 10/02/2021] [Indexed: 11/20/2022]
Abstract
The low efficacy of current antivirals, in conjunction with the resistance of viruses to existing antiviral drugs, has created demand for novel antiviral agents. Antiviral peptides (AVPs) are bioactive peptides with virucidal activity that can be developed into promising antiviral drugs. They are short peptides able to halt the progression of viral infections. The use of antiviral peptides in therapeutics has recently attracted the attention of the research community. The development and identification of AVPs is imperative for the discovery of novel therapeutics for viral infections. In the present work, a meta classifier (stacking) based approach is implemented for the prediction of IC50 (half maximal inhibitory concentration) and pIC50 (negative log of half maximal inhibitory concentration) values. The best prediction model, with evolutionary information and local alignment scores as features, achieved correlation coefficient values of 0.670 and 0.753 on the training and testing sets, respectively, for IC50. Further, the prediction of pIC50 reached correlation coefficient values of 0.797 and 0.789 for the training and testing sets, respectively. For the development of machine learning models involved in the prediction of IC50, the use of pIC50 over IC50 is recommended as the target variable. Further, a systematic comparison of AVPs with high and low IC50 values revealed that higher mean charge and tiny amino acids are preferred, and greater length and consecutive hydrophilic amino acids are avoided, in the former.
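The conversion from IC50 to pIC50 mentioned above can be written as a short function. The sketch assumes the usual convention of a base-10 logarithm of the concentration in molar units; the abstract does not state the units of its IC50 values, so the unit table and the function name `pic50_from_ic50` are illustrative assumptions.

```python
import math

# Conversion factors to molar units (assumed set of supported units).
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}


def pic50_from_ic50(ic50, unit="uM"):
    """Return pIC50 = -log10(IC50 expressed in mol/L)."""
    if ic50 <= 0:
        raise ValueError("IC50 must be positive")
    molar = ic50 * UNIT_TO_MOLAR[unit]
    return -math.log10(molar)


# A more potent peptide (lower IC50) gets a higher pIC50.
weak = pic50_from_ic50(50.0, "uM")
strong = pic50_from_ic50(0.5, "uM")
```

Because the logarithm compresses IC50 values that span several orders of magnitude, pIC50 is often a better-behaved regression target, which is consistent with the recommendation in the abstract.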
|
22
|
Singha S, Pasupuleti S, Singha SS, Singh R, Kumar S. Prediction of groundwater quality using efficient machine learning technique. Chemosphere 2021; 276:130265. [PMID: 34088106 DOI: 10.1016/j.chemosphere.2021.130265] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 10/14/2020] [Revised: 03/07/2021] [Accepted: 03/11/2021] [Indexed: 06/12/2023]
Abstract
To ensure safe drinking water sources in the future, it is imperative to understand the quality and pollution level of existing groundwater. Predicting water quality with high accuracy is key to controlling water pollution and improving water management. In this study, a deep learning (DL) based model is proposed for predicting groundwater quality and compared with three other machine learning (ML) models, namely random forest (RF), eXtreme gradient boosting (XGBoost), and artificial neural network (ANN). A total of 226 groundwater samples were collected from an agriculturally intensive area, Arang, in Raipur district, Chhattisgarh, India, and various physicochemical parameters were measured to compute the entropy weight-based groundwater quality index (EWQI). The prediction performance of the models was determined using five error metrics. Results showed that the DL model is the best prediction model, with the highest accuracy in terms of R2 (R2 = 0.996), against RF (R2 = 0.886), XGBoost (R2 = 0.927), and ANN (R2 = 0.917). The uncertainty of the DL model output was cross-verified by running the proposed algorithm ten times with a newly randomized dataset, and only minor deviations in the mean values of the performance metrics were observed. Moreover, input variable importance computed by the prediction models highlights that the DL model is the most realistic and accurate approach for predicting groundwater quality.
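The R2 values compared in this abstract can be read against a small sketch of how such a score is commonly computed. It assumes R2 means the coefficient of determination (one minus the ratio of residual to total sum of squares); the abstract does not say which definition was used, and the sample values below are hypothetical, not data from the study.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R2 is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical quality-index values, for illustration only.
observed = [42.0, 55.0, 61.0, 78.0, 90.0]
predicted = [44.0, 53.0, 63.0, 77.0, 88.0]
print(round(r_squared(observed, predicted), 3))
```

Under this definition a score of 1 means the predictions match the observations exactly, and a score of 0 means the model does no better than predicting the mean.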
Affiliation(s)
- Sudhakar Singha
  - Department of Civil Engineering, Indian Institute of Technology (Indian School of Mines), Dhanbad, 826004, Jharkhand, India
- Srinivas Pasupuleti
  - Department of Civil Engineering, Indian Institute of Technology (Indian School of Mines), Dhanbad, 826004, Jharkhand, India.
- Soumya S Singha
  - Department of Civil Engineering, Indian Institute of Technology (Indian School of Mines), Dhanbad, 826004, Jharkhand, India
- Rambabu Singh
  - Exploration Department, Central Mine Planning and Design Institute Limited, Bilaspur, 495006, Chhattisgarh, India
- Suresh Kumar
  - Central Ground Water Board, Patna, 800001, Bihar, India
|
23
|
Zhang J, Lv L, Lu D, Kong D, Al-Alashaari MAA, Zhao X. Variable selection from a feature representing protein sequences: a case of classification on bacterial type IV secreted effectors. BMC Bioinformatics 2020; 21:480. [PMID: 33109082 PMCID: PMC7590791 DOI: 10.1186/s12859-020-03826-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 07/21/2020] [Accepted: 10/19/2020] [Indexed: 12/13/2022] Open
Abstract
Background Classification of certain proteins with specific functions is momentous for biological research. Encoding approaches of protein sequences for feature extraction play an important role in protein classification. Many computational methods (namely classifiers) are used for classification on protein sequences according to various encoding approaches. Commonly, protein sequences keep certain labels corresponding to different categories of biological functions (e.g., bacterial type IV secreted effectors or not), which makes protein prediction a fantasy. As to protein prediction, a kernel set of protein sequences keeping certain labels certified by biological experiments should exist in advance. However, this has hardly ever been seen in prevailing research. Therefore, unsupervised learning rather than supervised learning (e.g., classification) should be considered. As to protein classification, various classifiers may help to evaluate the effectiveness of different encoding approaches. Besides, variable selection from an encoded feature representing protein sequences is an important issue that also needs to be considered. Results Focusing on the latter problem, we propose a new method for variable selection from an encoded feature representing protein sequences. Taking a benchmark dataset containing 1947 protein sequences (399 T4SE and 1548 non-T4SE) as a case, experiments are conducted to identify bacterial type IV secreted effectors (T4SE) from protein sequences. Comparable and quantified results are obtained using only certain components of the encoded feature, i.e., the position-specific scoring matrix, which indicates the effectiveness of our method. Conclusions Certain variables, rather than the whole encoded feature they belong to, do work for discrimination between different types of proteins. In addition, ensemble classifiers with an automatic assignment of different base classifiers do achieve a better classification result.
Affiliation(s)
- Jian Zhang
  - College of Artificial Intelligence, Wuxi Vocational College of Science and Technology, No. 8 Xinxi Road, Wuxi, 214028, China
- Lixin Lv
  - College of Artificial Intelligence, Wuxi Vocational College of Science and Technology, No. 8 Xinxi Road, Wuxi, 214028, China
- Donglei Lu
  - College of Artificial Intelligence, Wuxi Vocational College of Science and Technology, No. 8 Xinxi Road, Wuxi, 214028, China
- Denan Kong
  - College of Information and Computer Engineering, Northeast Forestry University, No. 26 Hexing Road, Harbin, 150040, China
- Xudong Zhao
  - College of Information and Computer Engineering, Northeast Forestry University, No. 26 Hexing Road, Harbin, 150040, China.
|
24
|
Crager MR. Extensions of the absolute standardized hazard ratio and connections with measures of explained variation and variable importance. Lifetime Data Anal 2020; 26:872-892. [PMID: 32705583 DOI: 10.1007/s10985-020-09504-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 03/05/2019] [Accepted: 07/09/2020] [Indexed: 06/11/2023]
Abstract
The absolute standardized hazard ratio (ASHR) is a scale-invariant scalar measure of the strength of association of a vector of covariates with the risk of an event. It is derived from proportional hazards regression. The ASHR is useful for making comparisons among different sets of covariates. Extensions of the ASHR concept and practical considerations regarding its computation are discussed. These include a new method to conduct preliminary checks for collinearity among covariates, a partial ASHR to evaluate the association with event risk of some of the covariates conditioning on others, and the ASHR for interactions. To put the ASHR in context, its relationship to measures of explained variation and other measures of separation of risk is discussed. A new measure of the contribution of each covariate to the risk score variance is proposed. This measure, which is derived from the ASHR calculations, is interpretable as variable importance within the context of the multivariable model.
Affiliation(s)
- Michael R Crager
  - Department of Biostatistics, Exact Sciences Corporation, 301 Penobscot Drive, Redwood City, CA, 94063, USA.
|
25
|
Pourghasemi HR, Sadhasivam N, Yousefi S, Tavangar S, Ghaffari Nazarlou H, Santosh M. Using machine learning algorithms to map the groundwater recharge potential zones. J Environ Manage 2020; 265:110525. [PMID: 32275245 DOI: 10.1016/j.jenvman.2020.110525] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 10/05/2019] [Revised: 03/23/2020] [Accepted: 03/28/2020] [Indexed: 06/11/2023]
Abstract
Groundwater recharge is indispensable for the sustainable management of freshwater resources, especially in arid regions. Here we address some important aspects of groundwater recharge through machine learning algorithms (MLAs). Three MLAs, namely SVM, MARS, and RF, were validated for prediction accuracy in generating groundwater recharge potential maps (GRPMs). Accordingly, soil permeability samples were prepared and arbitrarily divided into training (70%) and validation (30%) samples. The GRPMs were generated using sixteen effective factors: elevation (from a digital elevation model; DEM), aspect, slope angle, TWI (topographic wetness index), fault density, MRVBF (multiresolution index of valley bottom flatness), rainfall, lithology, land use, drainage density, distance from rivers, distance from faults, annual ETP (evapo-transpiration), minimum temperature, maximum temperature, and 24-hr rainfall. Subsequently, the VI (variable importance) was assessed based on the LASSO algorithm. The GRPMs of the three MLAs were validated using the ROC-AUC (receiver operating characteristic-area under curve) and various other measures, including true positive rate (TPR), false positive rate (FPR), F-measures, fallout, sensitivity, specificity, true skill statistics (TSS), and corrected classified instances (CCI). Based on the validation, the RF algorithm performed better (AUC = 0.987) than the SVM (AUC = 0.963) and MARS (AUC = 0.962) algorithms. Furthermore, the accuracy of all three MLAs falls in the excellent class, based on the ROC curve threshold. Our case study shows that GRPMs are potential guidelines for decision-makers in drafting policies related to the sustainable management of groundwater resources.
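The AUC figures used to rank the three algorithms can be illustrated with a minimal sketch. It computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The scores and labels are hypothetical, and the study's own tooling is not described in the abstract.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical validation scores (predicted recharge potential) and
# binary labels (1 = high-permeability site, 0 = otherwise).
scores = [0.92, 0.81, 0.67, 0.55, 0.40, 0.23]
labels = [1, 1, 0, 1, 0, 0]
print(roc_auc(scores, labels))
```

The pairwise loop is quadratic in the number of samples, which is fine for a small validation set; a sort-based rank computation would be the usual choice for larger ones.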
Affiliation(s)
- Hamid Reza Pourghasemi
  - Department of Natural Resources and Environmental Engineering, College of Agriculture, Shiraz University, Shiraz, Iran.
- Nitheshnirmal Sadhasivam
  - Department of Geography, School of Earth Science, Bharathidasan University, Tiruchirappalli, 620 024, Tamil Nadu, India
- Saleh Yousefi
  - Soil Conservation and Watershed Management Research Department, Chaharmahal and Bakhtiari Agricultural and Natural Resources Research and Education Center, AREEO, Shahrekord, Iran
- Shahla Tavangar
  - Department of Watershed Management Engineering, Faculty of Natural Resources and Marine Sciences, Tarbiat Modare University, Iran
- M Santosh
  - School of Earth Sciences and Resources, China University of Geosciences Beijing, Beijing, 100083, PR China; Department of Earth Sciences, University of Adelaide, Adelaide, SA, 5005, Australia
|
26
|
Fathololoumi S, Vaezi AR, Alavipanah SK, Ghorbani A, Biswas A. Comparison of spectral and spatial-based approaches for mapping the local variation of soil moisture in a semi-arid mountainous area. Sci Total Environ 2020; 724:138319. [PMID: 32408464 DOI: 10.1016/j.scitotenv.2020.138319] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 01/14/2020] [Revised: 03/28/2020] [Accepted: 03/28/2020] [Indexed: 06/11/2023]
Abstract
Accurate information on soil moisture (SM) is critical in various applications including agriculture, climate, hydrology, soil and drought. In this paper, various predictive approaches, including regression (Multiple Linear Regression, MLR), machine learning (Random Forest, RF; Triangular regression, Tr) and spatial modeling (Inverse Distance Weighting, IDW, and Ordinary Kriging, OK), were compared to estimate SM in a semi-arid mountainous watershed. In developing the predictive relationships, remote sensing (RS) datasets were utilized, including surface biophysical characteristics derived from Landsat 8 satellite imagery, surface topographical characteristics derived from the ASTER digital elevation model (DEM), climatic data recorded at the synoptic station, and in situ SM data measured at Landsat 8 overpass time, while in spatial modeling, point-based SM measurements were interpolated. Of the measured SM data, 70% (calibration set) were used for modeling and 30% (validation set) were used to evaluate modeling accuracy. Finally, SM uncertainty maps were created for the different models based on a bootstrapping approach. Among the environmental parameter sets, land surface temperature (LST) showed the highest impact on the spatial distribution of SM in the region on all dates. Mean R2 (RMSE) values between measured and modeled SM on three dates obtained from the MLR, RF, IDW, OK, and Tr models were 0.70 (1.97%), 0.72 (1.92%), 0.59 (2.38%), 0.59 (2.27%) and 0.71 (1.99%), respectively. The results showed that RF and IDW produced the highest and lowest performance in SM modeling, respectively. Generally, the performance of RS-based models was higher than that of interpolation models for estimating SM, owing to the combined influence of topographic parameters and surface biophysical characteristics. Modeled SM uncertainty varies across the study area and between models. The highest uncertainty in SM modeling was observed in the northern part of the study area, where surface heterogeneity is high. Using RS data increased the accuracy of SM modeling because they can capture the heterogeneity of surface biophysical characteristics and topographical properties.
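Of the spatial methods this abstract compares, inverse distance weighting is the simplest to illustrate. The following is a minimal sketch, assuming planar coordinates and a distance exponent of 2 (a common default that the abstract does not state); the function name and sample points are hypothetical.

```python
import math


def idw_interpolate(samples, x, y, power=2.0):
    """Estimate a value at (x, y) from (xi, yi, value) samples.

    Each sample is weighted by 1 / distance**power, so nearby
    measurements dominate the estimate.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for xi, yi, value in samples:
        distance = math.hypot(x - xi, y - yi)
        if distance == 0.0:
            return value  # query point coincides with a measurement
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("no samples provided")
    return weighted_sum / weight_total


# Hypothetical point measurements of soil moisture (%), for illustration.
samples = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
print(idw_interpolate(samples, 1.0, 0.0))  # midway between the two points
print(idw_interpolate(samples, 0.5, 0.0))  # closer to the first point
```

The estimate depends only on the measured values and their distances to the query point, so covariates such as land surface temperature or topography play no role in it.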
Affiliation(s)
- Solmaz Fathololoumi
  - Department of Soil Science, Faculty of Agriculture, University of Zanjan, Iran.
- Ali Reza Vaezi
  - Department of Soil Science, Faculty of Agriculture, University of Zanjan, Iran.
- Seyed Kazem Alavipanah
  - Department of Remote Sensing & GIS, Faculty of Geography, University of Tehran, Iran; Department of Geography, Humboldt University Berlin, Berlin, Germany.
- Ardavan Ghorbani
  - Department of Natural Resources, Faculty of Agriculture and Natural Resources, University of Mohaghegh Ardebili, Ardabil, Iran.
- Asim Biswas
  - School of Environmental Sciences, University of Guelph, Canada.
|
27
|
Chen V, Zhang H. Depth importance in precision medicine (DIPM): a tree- and forest-based method for right-censored survival outcomes. Biostatistics 2020; 23:157-172. [PMID: 32424406 DOI: 10.1093/biostatistics/kxaa021] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/16/2019] [Revised: 04/09/2020] [Accepted: 04/13/2020] [Indexed: 12/26/2022] Open
Abstract
Many clinical trials have been conducted to compare right-censored survival outcomes between interventions. Such comparisons are typically made on the basis of the entire group receiving one intervention versus the others. In order to identify subgroups for which the preferential treatment may differ from the overall group, we propose the depth importance in precision medicine (DIPM) method for such data within the precision medicine framework. The approach first modifies the split criteria of the traditional classification tree to fit the precision medicine setting. Then, a random forest of trees is constructed at each node. The forest is used to calculate depth variable importance scores for each candidate split variable. The variable with the highest score is identified as the best variable to split the node. The importance score is a flexible and simply constructed measure that makes use of the observation that more important variables tend to be selected closer to the root nodes of trees. The DIPM method is primarily designed for the analysis of clinical data with two treatment groups. We also present the extension to the case of more than two treatment groups. We use simulation studies to demonstrate the accuracy of our method and provide the results of applications to two real-world data sets. In the case of one data set, the DIPM method outperforms an existing method, and a primary motivation of this article is the ability of the DIPM method to address the shortcomings of this existing method. Altogether, the DIPM method yields promising results that demonstrate its capacity to guide personalized treatment decisions in cases with right-censored survival outcomes.
Affiliation(s)
- Victoria Chen
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
- Heping Zhang
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
|
28
|
Abstract
An important problem that hinders the use of supervised classification algorithms for brain imaging is that the number of variables per single subject far exceeds the number of training subjects available. Deriving multivariate measures of variable importance becomes a challenge in such scenarios. This paper proposes a new measure of variable importance termed sign-consistency bagging (SCB). The SCB captures variable importance by analyzing the sign consistency of the corresponding weights in an ensemble of linear support vector machine (SVM) classifiers. Further, the SCB variable importances are enhanced by means of transductive conformal analysis. This extra step is important when the data can be assumed to be heterogeneous. Finally, the proposal of these SCB variable importance measures is completed with the derivation of a parametric hypothesis test of variable importance. The new importance measures were compared with a t-test based univariate variable importance and an SVM-based multivariate variable importance using anatomical and functional magnetic resonance imaging data. The obtained results demonstrated that the new SCB-based importance measures were superior to the compared methods in terms of reproducibility and classification accuracy.
Affiliation(s)
- Vanessa Gómez-Verdejo
  - Department of Signal Processing and Communications, Universidad Carlos III de Madrid, Leganés, Spain
- Emilio Parrado-Hernández
  - Department of Signal Processing and Communications, Universidad Carlos III de Madrid, Leganés, Spain
- Jussi Tohka
  - A.I. Virtanen Institute for Molecular Sciences, University of Eastern Finland, Kuopio, Finland.
|
29
|
de Menezes MD, Bispo FHA, Faria WM, Gonçalves MGM, Curi N, Guilherme LRG. Modeling arsenic content in Brazilian soils: What is relevant? Sci Total Environ 2020; 712:136511. [PMID: 32050379 DOI: 10.1016/j.scitotenv.2020.136511] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 04/24/2019] [Revised: 12/30/2019] [Accepted: 01/02/2020] [Indexed: 06/10/2023]
Abstract
Arsenic accumulation in the environment poses ecological and human health risks. A greater knowledge about soil total As content variability and its main drivers is strategic for maintaining soil security, helping public policies and environmental surveys. Considering the poor history of As studies in Brazil at the country's geographical scale, this work aimed to generate predictive models of topsoil As content using machine learning (ML) algorithms based on several environmental covariables representing soil forming factors, ranking their importance as explanatory covariables and for feeding group analysis. An unprecedented databank based on laboratory analyses (including rare earth elements), proximal and remote sensing, geographical information system operations, and pedological information were surveyed. The median soil As content ranged from 0.14 to 41.1 mg kg-1 in reference soils, and 0.28 to 58.3 mg kg-1 in agricultural soils. Recursive Feature Elimination Random Forest outperformed other ML algorithms, ranking as most important environmental covariables: temperature, soil organic carbon (SOC), clay, sand, and TiO2. Four natural groups were statistically suggested (As content ± standard error in mg kg-1): G1) with coarser texture, lower SOC, higher temperatures, and the lowest TiO2 contents, has the lowest As content (2.24 ± 0.50), accomplishing different environmental conditions; G2) organic soils located in floodplains, medium TiO2 and temperature, whose As content (3.78 ± 2.05) is slightly higher than G1, but lower than G3 and G4; G3) medium contents of As (7.14 ± 1.30), texture, SOC, TiO2, and temperature, representing the largest number of points widespread throughout Brazil; G4) the largest contents of As (11.97 ± 1.62), SOC, and TiO2, and the lowest sand content, with points located mainly across Southeastern Brazil with milder temperature. In the absence of soil As content, a common scenario in Brazil and in many Latin American countries, such natural groups could work as environmental indicators.
Affiliation(s)
- Nilton Curi
  - Department of Soil Science, Federal University of Lavras, Lavras, MG, Brazil
|
30
|
Yao D, Zhan X, Zhan X, Kwoh CK, Li P, Wang J. A random forest based computational model for predicting novel lncRNA-disease associations. BMC Bioinformatics 2020; 21:126. [PMID: 32216744 DOI: 10.1186/s12859-020-3458-1] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 12/19/2019] [Accepted: 03/18/2020] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Accumulated evidence shows that the abnormal regulation of long non-coding RNA (lncRNA) is associated with various human diseases. Accurately identifying disease-associated lncRNAs helps in studying the mechanisms of lncRNAs in diseases and exploring new therapies. Many lncRNA-disease association (LDA) prediction models have been implemented by integrating multiple kinds of data resources. However, most of the existing models ignore the interference of noisy and redundant information among these data resources. RESULTS To improve the ability of LDA prediction models, we implemented a random forest and feature selection based LDA prediction model (RFLDA for short). First, the RFLDA integrates the experiment-supported miRNA-disease associations (MDAs) and LDAs, the disease semantic similarity (DSS), the lncRNA functional similarity (LFS) and the lncRNA-miRNA interactions (LMI) as input features. Then, the RFLDA chooses the most useful features for training the prediction model by feature selection based on the random forest variable importance score, which takes into account not only the effect of each individual feature on prediction results but also the joint effects of multiple features. Finally, a random forest regression model is trained to score potential lncRNA-disease associations. With an area under the receiver operating characteristic curve (AUC) of 0.976 and an area under the precision-recall curve (AUPR) of 0.779 under 5-fold cross-validation, the RFLDA performs better than several state-of-the-art LDA prediction models. Moreover, case studies on three cancers demonstrate that 43 of the 45 lncRNAs predicted by the RFLDA are validated by experimental data, and the other two predicted lncRNAs are supported by other LDA prediction models. CONCLUSIONS Cross-validation and case studies indicate that the RFLDA has excellent ability to identify potential disease-associated lncRNAs.
|
31
|
Aziz F, Malek S, Mhd Ali A, Wong MS, Mosleh M, Milow P. Determining hypertensive patients' beliefs towards medication and associations with medication adherence using machine learning methods. PeerJ 2020; 8:e8286. [PMID: 32206445 PMCID: PMC7075362 DOI: 10.7717/peerj.8286] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 04/03/2018] [Accepted: 11/24/2019] [Indexed: 01/31/2023] Open
Abstract
Background This study assesses the feasibility of using machine learning methods such as Random Forests (RF), Artificial Neural Networks (ANN), Support Vector Regression (SVR) and Self-Organizing Feature Maps (SOM) to identify and determine factors associated with hypertensive patients' adherence levels. Hypertension is the medical term for systolic and diastolic blood pressure higher than 140/90 mmHg. A conventional medication adherence scale was used to identify patients' adherence to their prescribed medication. Using machine learning applications to predict precise numeric adherence scores in hypertensive patients has not yet been reported in the literature. Methods Data from 160 hypertensive patients from a tertiary hospital in Kuala Lumpur, Malaysia, were used in this study. Variables were ranked based on their significance to adherence levels using the RF variable importance method. The backward elimination method was then performed using RF to obtain the variables significantly associated with the patients' adherence levels. RF, SVR and ANN models were developed to predict adherence using the identified significant variables. Visualizations of the relationships between hypertensive patients' adherence levels and variables were generated using SOM. Results Machine learning models constructed using the selected variables reported RMSE values of 1.42 for ANN, 1.53 for RF, and 1.55 for SVR. The accuracy of the dichotomised scores, calculated as the percentage of correctly identified adherence values, was used as an additional model performance measure, resulting in accuracies of 65% (ANN), 78% (RF) and 79% (SVR), respectively. The Wilcoxon signed-rank test reported that there was no significant difference between the predictions of the machine learning models and the actual scores. The significant variables identified from the RF variable importance method were educational level, marital status, General Overuse, monthly income, and Specific Concern. Conclusion This study suggests an effective alternative to conventional methods in identifying the key variables to understand hypertensive patients' adherence levels. This can be used as a tool to educate patients on the importance of medication in managing hypertension.
Affiliation(s)
- Firdaus Aziz
  - Bioinformatics Science Programme, Institute of Biological Sciences, University of Malaya, Kuala Lumpur, Malaysia
- Sorayya Malek
  - Bioinformatics Science Programme, Institute of Biological Sciences, University of Malaya, Kuala Lumpur, Malaysia
- Adliah Mhd Ali
  - Quality Use of Medicines Research Group, Faculty of Pharmacy, Universiti Kebangsaan Malaysia, Kuala Lumpur, Malaysia
- Mee Sieng Wong
  - Quality Use of Medicines Research Group, Faculty of Pharmacy, Universiti Kebangsaan Malaysia, Kuala Lumpur, Malaysia
- Mogeeb Mosleh
  - Software Engineering Department, Faculty of Engineering & Information Technology, Taiz University, Taiz, Yemen
- Pozi Milow
  - Environmental Management Programme, Institute of Biological Sciences, University of Malaya, Kuala Lumpur, Malaysia
|
32
|
Zhou Y, Zuo Z, Xu F, Wang Y. Origin identification of Panax notoginseng by multi-sensor information fusion strategy of infrared spectra combined with random forest. Spectrochim Acta A Mol Biomol Spectrosc 2020; 226:117619. [PMID: 31606667 DOI: 10.1016/j.saa.2019.117619] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Received: 05/08/2019] [Revised: 07/14/2019] [Accepted: 10/06/2019] [Indexed: 06/10/2023]
Abstract
The traditional Chinese medicine Panax notoginseng is a valuable geo-authentic herbal material. Differences in growth environment between producing areas have a significant influence on the quality of traditional Chinese medicine, and origin identification is an important part of the quality assessment of P. notoginseng. In this study, Fourier transform mid-infrared (FT-MIR) and near infrared (NIR) sensor technologies were combined with single spectra analysis and a multi-sensor information fusion strategy (low-, mid- and high-level) for the origin identification of 210 P. notoginseng samples from five cities in Yunnan Province, China. FT-MIR spectra were considered to play a greater role in data analysis than NIR spectra. Random forest (RF) was used to establish classification models. The results of the random forest Boruta (RF-Bo) model and the random forest variable selection (RF-Vs) model based on the high-level multi-sensor information fusion strategy were satisfactory. In addition, the RF-Bo model based on the high-level multi-sensor information fusion strategy was faster and simpler in data analysis, with an accuracy of 95.6%.
Affiliation(s)
- Yuhou Zhou
  - College of Traditional Chinese Medicine, Yunnan University of Chinese Medicine, Kunming, 650500, PR China
- Zhitian Zuo
  - Institute of Medicinal Plants, Yunnan Academy of Agricultural Sciences, Kunming, 650200, PR China
- Furong Xu
  - College of Traditional Chinese Medicine, Yunnan University of Chinese Medicine, Kunming, 650500, PR China.
- Yuanzhong Wang
  - Institute of Medicinal Plants, Yunnan Academy of Agricultural Sciences, Kunming, 650200, PR China.
|
33
|
Ozigis MS, Kaduk JD, Jarvis CH, da Conceição Bispo P, Balzter H. Detection of oil pollution impacts on vegetation using multifrequency SAR, multispectral images with fuzzy forest and random forest methods. Environ Pollut 2020; 256:113360. [PMID: 31672372 DOI: 10.1016/j.envpol.2019.113360] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 03/31/2019] [Revised: 09/28/2019] [Accepted: 10/06/2019] [Indexed: 05/22/2023]
Abstract
Oil pollution harms terrestrial ecosystems. There is an urgent requirement to improve on existing methods for detecting, mapping and establishing the precise extent of oil-impacted and oil-free vegetation. This is needed to quantify existing spill extents, formulate effective remediation strategies and enable effective pipeline monitoring strategies that identify leakages at an early stage. An effective oil spill detection algorithm based on optical image spectral responses can benefit immensely from the inclusion of multi-frequency Synthetic Aperture Radar (SAR) data, especially when the effect of multi-collinearity is sufficiently reduced. This study compared the Fuzzy Forest (FF) and Random Forest (RF) methods in detecting and mapping oil-impacted vegetation from a post-spill multispectral optical Sentinel-2 image and multifrequency C- and X-band Sentinel-1, COSMO-SkyMed and TanDEM-X SAR images. FF and RF classifiers were employed to discriminate oil-spill impacted and oil-free vegetation in a study area in Nigeria. Fuzzy Forest uses specific functions for the selection and use of uncorrelated variables in the classification process to yield an improved result. This method proved an efficient variable selection technique addressing the effects of high dimensionality and multi-collinearity, as the optimization and use of different SAR and optical image variables generated more accurate results than the RF algorithm in densely vegetated areas. An Overall Accuracy (OA) of 75% was obtained for the dense (Tree Cover Area, TCA) vegetation, while cropland and grassland areas had 59.4% and 65% OA, respectively. However, RF performed better in cropland areas, with OA = 75% when SAR-optical image variables were used for classification, while both methods performed equally well in grassland areas, with OA = 65%. Similarly, significant backscatter differences (P < 0.005) were observed in the C-band backscatter sample mean of polluted and oil-free TCA, while strong linear associations existed between LAI and backscatter in grassland and TCA. This study demonstrates that SAR-based monitoring of petroleum hydrocarbon impacts on vegetation is feasible and has high potential for establishing oil-impacted areas and oil pipeline monitoring.
Affiliation(s)
- Mohammed S Ozigis
  - Centre for Landscape and Climate Research, School of Geography, Geology and Environment, University of Leicester, Leicester, United Kingdom; Department of Strategic Space Applications, National Space Research and Development Agency (NASRDA), Abuja, Nigeria.
- Jorg D Kaduk
  - Centre for Landscape and Climate Research, School of Geography, Geology and Environment, University of Leicester, Leicester, United Kingdom; Centre for Landscape and Climate Research, Space Park Leicester, University of Leicester, United Kingdom
- Claire H Jarvis
  - Centre for Landscape and Climate Research, School of Geography, Geology and Environment, University of Leicester, Leicester, United Kingdom
- Polyanna da Conceição Bispo
  - Centre for Landscape and Climate Research, School of Geography, Geology and Environment, University of Leicester, Leicester, United Kingdom; National Centre for Earth Observation, University of Leicester, Leicester, United Kingdom; Department of Geography, School of Environment, Education and Development, University of Manchester, Manchester, United Kingdom
- Heiko Balzter
  - Centre for Landscape and Climate Research, School of Geography, Geology and Environment, University of Leicester, Leicester, United Kingdom; National Centre for Earth Observation, University of Leicester, Leicester, United Kingdom; Centre for Landscape and Climate Research, Space Park Leicester, University of Leicester, United Kingdom
|
34
|
Abstract
Aims: The aim of this study is to compare several machine learning methods with traditional parametric statistical analysis using logistic regression to investigate the relationship of risk factors for diabetes and cardiovascular disease (cardiometabolic risk) in U.S. adults, using cross-sectional data from participants in a wellness improvement program. Methods: Logistic regression was used to find the relationship between individual risk factors (predictors) and cardiometabolic risk. Supervised machine learning methods were used to predict risk and produce a ranking of variables' importance. A clustering method was used to identify subpopulations of interest. Predictors were divided into those that are nonmodifiable and those that are modifiable. Results: The population comprised 217,254 adults, of whom 8.1% had diabetes. Using logistic regression, six variables were identified as negatively related and eleven as positively related to cardiometabolic risk. Three supervised machine learning classifiers (random forest, gradient boosting, and bagging) were applied, with an average AUC of 0.806. Each classifier also produced a ranking of variables' importance. Four subgroups were identified with a k-medoid clustering algorithm; these were mainly distinguished by gender and diabetes status. Conclusions: The study illustrates that machine learning is an important addition to traditional logistic regression for identifying important cardiometabolic risk factors, ranking their importance, and assessing the potential for interventions based on lifestyle and medications at an individual level.
Affiliation(s)
- Xiyue Liao
  - Department of Statistics and Applied Probability, University of California Santa Barbara, Santa Barbara, California
- David Kerr
  - Sansum Diabetes Research Institute, Santa Barbara, California
- Ian Duncan
  - Department of Statistics and Applied Probability, University of California Santa Barbara, Santa Barbara, California
|
35
|
Pedersen KB, Jensen PE, Ottosen LM, Barlindhaug J. Applying multivariate analysis for optimising the electrodialytic removal of Cu and Pb from shooting range soils. J Hazard Mater 2019; 368:869-876. [PMID: 30322811 DOI: 10.1016/j.jhazmat.2018.10.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2017] [Revised: 09/20/2018] [Accepted: 10/03/2018] [Indexed: 06/08/2023]
Abstract
Multivariate analysis was applied to simultaneously evaluate the influence of soil properties and experimental variables on electrodialytic removal of Cu and Pb from three shooting range soils. Both stationary and stirred set-ups at laboratory scale were tested, representing in-situ and ex-situ remediation conditions, respectively. Within the same experimental space, higher removal of the targeted metals, Cu and Pb, was observed in the stirred set-up (9-81%) compared to the stationary set-up (0-41%). Multivariate analysis (projections onto latent structures) revealed that the influence of soil type on the remediation efficiency was dependent on the metal and differed between the stationary and stirred set-ups. Optimising the removal of Cu by adjusting the experimental settings was easier to achieve in the stirred set-up and could be done by increasing the current density. Optimising the removal of Pb could be done by prolonging the treatment and, in the stirred set-up, also by increasing the current density.
Affiliation(s)
- Kristine B Pedersen
  - Akvaplan-niva AS, High North Research Centre for Climate and the Environment, Hjalmar Johansens Gate 14, 9007, Tromsø, Norway.
- Pernille E Jensen
  - Arctic Technology Centre, Department of Civil Engineering, Technical University of Denmark, Building 118, 2800, Lyngby, Denmark
- Lisbeth M Ottosen
  - Arctic Technology Centre, Department of Civil Engineering, Technical University of Denmark, Building 118, 2800, Lyngby, Denmark
|
36
|
Ozigis MS, Kaduk JD, Jarvis CH. Mapping terrestrial oil spill impact using machine learning random forest and Landsat 8 OLI imagery: a case site within the Niger Delta region of Nigeria. Environ Sci Pollut Res Int 2019; 26:3621-3635. [PMID: 30535661 PMCID: PMC6513793 DOI: 10.1007/s11356-018-3824-y] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 09/18/2018] [Accepted: 11/21/2018] [Indexed: 04/12/2023]
Abstract
Terrestrial oil pollution is one of the major causes of ecological damage within the Niger Delta region of Nigeria and has caused a considerable loss of mangroves and arable croplands since the discovery of crude oil in 1956. The exact extent of landcover loss due to oil pollution remains uncertain due to the variability in factors such as volume and size of the oil spills, the age of oil, and its effects on the different vegetation types. Here, the feasibility of identifying oil-impacted land in the Niger Delta region of Nigeria with a machine learning random forest classifier using Landsat 8 (OLI spectral bands) and Vegetation Health Indices is explored. Oil spill incident data for the years 2015 and 2016 were obtained from published records of the National Oil Spill Detection and Response Agency and Shell Petroleum Development Corporation. Various health indices and spectral wavelengths from visible, near-infrared, and shortwave infrared bands were fused and classified using the machine learning random forest classifier to distinguish between oil-free and oil spill-impacted landcover. This provided the basis for the identification of the best variables for discriminating oil-polluted from unpolluted land. Results showed that oil-free and oil-polluted landcovers were discriminated more accurately when individual landcover types were classified separately than when the full study area image, including all landcover types, was classified at once. Similarly, the results showed that biomass density plays a significant role in the characterization and classification of oil-contaminated and oil-free pixels, as tree cover areas showed higher classification accuracy than cropland and grassland.
Affiliation(s)
- Mohammed S. Ozigis
  - Department of Geography, University of Leicester, Leicester, United Kingdom
  - Department of Strategic Space Applications, National Space Research and Development Agency (NASRDA), Abuja, Nigeria
- Jorg D. Kaduk
  - Department of Geography, University of Leicester, Leicester, United Kingdom
- Claire H. Jarvis
  - Department of Geography, University of Leicester, Leicester, United Kingdom
|
37
|
Huber M, Kurz C, Leidl R. Predicting patient-reported outcomes following hip and knee replacement surgery using supervised machine learning. BMC Med Inform Decis Mak 2019; 19:3. [PMID: 30621670 PMCID: PMC6325823 DOI: 10.1186/s12911-018-0731-6] [Citation(s) in RCA: 53] [Impact Index Per Article: 10.6] [Received: 07/27/2018] [Accepted: 12/27/2018] [Indexed: 12/28/2022] Open
Abstract
BACKGROUND Machine-learning classifiers mostly offer good predictive performance and are increasingly used to support shared decision-making in clinical practice. Focusing on performance and practicability, this study evaluates prediction of patient-reported outcomes (PROs) following hip and knee replacement surgery by eight supervised classifiers, including a linear model. METHODS NHS PRO data (130,945 observations) from April 2015 to April 2017 were used to train and test eight classifiers to predict binary postoperative improvement based on minimal important differences. Area under the receiver operating characteristic curve, J-statistic and several other metrics were calculated. The dependent outcomes were generic and disease-specific improvement based on the EQ-5D-3L visual analogue scale (VAS) as well as the Oxford Hip and Knee Score (Q score). RESULTS The area under the receiver operating characteristic curve of the best training models was around 0.87 (VAS) and 0.78 (Q score) for hip replacement, while it was around 0.86 (VAS) and 0.70 (Q score) for knee replacement surgery. Extreme gradient boosting, random forests, multistep elastic net and linear model provided the highest overall J-statistics. Based on variable importance, the most important predictors for postoperative outcomes were preoperative VAS, Q score and single Q score dimensions. Sensitivity analysis for hip replacement VAS evaluated the influence of minimal important difference, patient selection criteria as well as additional data years. Together with a small benchmark of the NHS prediction model, robustness of our results was confirmed. CONCLUSIONS Supervised machine-learning implementations, like extreme gradient boosting, can provide better performance than linear models and should be considered when high predictive performance is needed. Preoperative VAS, Q score and specific dimensions like limping are the most important predictors for postoperative hip and knee PROMs.
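The J-statistic mentioned in this abstract is commonly taken to be Youden's J, defined as sensitivity plus specificity minus one. The following is a minimal sketch of that calculation, assuming binary predictions summarised as confusion-matrix counts. The function name and the counts are illustrative assumptions, not values from the study.

```python
def youden_j(tp: int, fn: int, tn: int, fp: int) -> float:
    """Youden's J statistic: sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity + specificity - 1.0


# Hypothetical confusion-matrix counts, not taken from the study.
j = youden_j(tp=80, fn=20, tn=70, fp=30)
print(round(j, 2))
```

J ranges from -1 to 1; a value of 0 corresponds to a classifier no better than chance at the chosen threshold, and 1 to perfect separation.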
Affiliation(s)
- Manuel Huber
  - German Research Center for Environmental Health, Institute for Health Economics and Health Care Management, Helmholtz Zentrum München, Postfach 1129, 85758 Neuherberg, Germany
- Christoph Kurz
  - German Research Center for Environmental Health, Institute for Health Economics and Health Care Management, Helmholtz Zentrum München, Postfach 1129, 85758 Neuherberg, Germany
- Reiner Leidl
  - German Research Center for Environmental Health, Institute for Health Economics and Health Care Management, Helmholtz Zentrum München, Postfach 1129, 85758 Neuherberg, Germany
  - Munich Center of Health Sciences, Ludwig-Maximilians-University, Ludwigstr. 28, 80539 Munich, RG Germany
|
38
|
Ding C, Chen P, Jiao J. Non-linear effects of the built environment on automobile-involved pedestrian crash frequency: A machine learning approach. Accid Anal Prev 2018; 112:116-126. [PMID: 29329016 PMCID: PMC10388697 DOI: 10.1016/j.aap.2017.12.026] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 08/24/2017] [Revised: 11/12/2017] [Accepted: 12/31/2017] [Indexed: 06/07/2023]
Abstract
Although a growing body of literature focuses on the relationship between the built environment and pedestrian crashes, limited evidence is provided about the relative importance of many built environment attributes by accounting for their mutual interaction effects and their non-linear effects on automobile-involved pedestrian crashes. This study adopts the approach of Multiple Additive Poisson Regression Trees (MAPRT) to fill such gaps using pedestrian collision data collected from Seattle, Washington. Traffic analysis zones are chosen as the analytical unit. The factors investigated for their effects on pedestrian crash frequency include characteristics of the road network, street elements, land use patterns, and traffic demand. Density and the degree of mixed land use have major effects on pedestrian crash frequency, accounting for approximately 66% of the effects in total. More importantly, some factors show clear non-linear relationships with pedestrian crash frequency, challenging the linearity assumption commonly used in existing studies which employ statistical models. With various accurately identified non-linear relationships between the built environment and pedestrian crashes, this study suggests that local agencies adopt geo-spatially differentiated policies to establish a safe walking environment. These findings, especially the effective ranges of the built environment, provide evidence to support transport and land use planning, policy recommendations, and road safety programs.
Affiliation(s)
- Chuan Ding
  - School of Transportation Science and Engineering, Beijing Key Laboratory for Cooperative Vehicle Infrastructure System and Safety Control, Beihang University, Beijing, China.
- Peng Chen
  - School of Architecture and Urban Planning, Harbin Institute of Technology Shenzhen Campus, Shenzhen, China.
- Junfeng Jiao
  - School of Architecture, University of Texas at Austin, Austin, USA.
|
39
|
Park H, Haghani A, Samuel S, Knodler MA. Real-time prediction and avoidance of secondary crashes under unexpected traffic congestion. Accid Anal Prev 2018; 112:39-49. [PMID: 29306687 DOI: 10.1016/j.aap.2017.11.025] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 05/13/2017] [Revised: 10/15/2017] [Accepted: 11/18/2017] [Indexed: 06/07/2023]
Abstract
According to the Federal Highway Administration, nonrecurring congestion contributes to nearly half of the overall congestion. Temporal disruptions impact the effective use of the complete roadway, due to speed reduction and rubbernecking resulting from primary incidents that in turn provoke secondary incidents. There is an additional reduction of discharge flow caused by secondary incidents that significantly increases total delay. Therefore, it is important to sequentially predict the probability of secondary incidents and develop appropriate countermeasures to reduce the associated risk. Advanced computing techniques were used to easily understand and reliably predict secondary incident occurrences that have a low sample mean and a small sample size. The likelihood of a secondary incident was sequentially predicted from the point of incident response to the eventual road clearance. The quality of predictions improved with the availability of additional information. The prediction performance of the principled Bayesian learning approach to neural networks (BNN) was compared to that of Stochastic Gradient Boosted Decision Trees (GBDT). A pedagogical rule extraction approach, TREPAN, which extracts comprehensible rules from the neural networks, improved the ability to understand secondary incidents in a simplified manner. With an acceptable accuracy, GBDT is a useful tool that presents the relative importance of the predictor variables. Unexpected traffic congestion incurred by an incident is a dominant causative factor for the occurrence of secondary incidents at different stages of incident clearance. This symbolic description represents a series of decisions that may assist emergency operators by improving their decision-making capabilities. Analyzing causes and effects of traffic incidents helps traffic operators develop incident-specific strategic plans for prompt emergency response and clearance. Application of the model in connected vehicle environments will help drivers receive proactive corrective feedback before a crash. The proposed methodology can be used to alert drivers about potential highway conditions and may increase the drivers' awareness of potential events when no rerouting is possible, optimal or otherwise.
Affiliation(s)
- Hyoshin Park
  - Department of Computational Science & Engineering, North Carolina Agricultural & Technical State University, United States.
- Ali Haghani
  - Department of Civil & Environmental Engineering, University of Maryland, College Park, United States
- Siby Samuel
  - Department of Mechanical & Industrial Engineering, University of Massachusetts, Amherst, United States
- Michael A Knodler
  - Department of Civil & Environmental Engineering, University of Massachusetts, Amherst, United States
|
40
|
Abstract
Background: In silico tools based on quantitative structure–activity relationship (QSAR) models are widely used to screen huge databases of compounds in order to determine the biological properties of chemical molecules based on their chemical structure. With the passage of time, the exponentially growing amount of data on synthesized and known chemicals demands computationally efficient automated QSAR modeling tools, available to researchers who may lack extensive knowledge of machine learning modeling. Thus, a fully automated and advanced modeling platform can be an important addition to the QSAR community. Results: In the presented workflow, the process from data preparation to model building and validation has been completely automated. The focus was on the most critical modeling tasks (data curation, data set characteristics evaluation, variable selection and validation) that largely influence the performance of QSAR models. Also included is the ability to quickly evaluate the feasibility of modeling a given data set. The developed framework is tested on data sets of thirty different problems. The best-optimized feature selection methodology in the developed workflow is able to remove 62–99% of all redundant data. On average, feature selection reduced the prediction error by about 19% and increased the percentage of variance explained (PVE) by 49% compared to models without feature selection. For models with a modelability score above 0.6, the average PVE score was 0.71. A strong correlation was verified between the modelability scores and the PVE of the models produced with variable selection. Conclusions: We developed an extendable and highly customizable fully automated QSAR modeling framework. The workflow requires no advanced parameterization and does not depend on users' decisions or expertise in machine learning or programming. With just a given target or problem, the workflow follows an unbiased standard protocol to develop reliable QSAR models by directly accessing online manually curated databases or by using private data sets. The other distinctive features of the workflow include prior estimation of data modelability to avoid time-consuming modeling trials for non-modelable data sets, an efficient variable selection procedure, and the availability of output at each modeling task for diverse applications and reproduction of historical predictions. The results reached on a selection of thirty QSAR problems suggest that the approach is capable of building reliable models even for challenging problems. Electronic supplementary material: The online version of this article (10.1186/s13321-017-0256-5) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Samina Kausar
  - LaSIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016, Lisbon, Portugal; BioISI: Biosystems and Integrative Sciences Institute, Faculdade de Ciências, Universidade de Lisboa, 1749-016, Lisbon, Portugal
- Andre O Falcao
  - LaSIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016, Lisbon, Portugal; BioISI: Biosystems and Integrative Sciences Institute, Faculdade de Ciências, Universidade de Lisboa, 1749-016, Lisbon, Portugal.
|
41
|
Abstract
Background: Random forests have often been claimed to uncover interaction effects. However, if and how interaction effects can be differentiated from marginal effects remains unclear. In extensive simulation studies, we investigate whether random forest variable importance measures capture or detect gene-gene interactions. We define capturing an interaction as the ability to identify a variable that acts through an interaction with another one, while detection is the ability to identify an interaction effect as such. Results: Of the single importance measures, the Gini importance captured interaction effects in most of the simulated scenarios; however, they were masked by marginal effects in other variables. With the permutation importance, the proportion of captured interactions was lower in all cases. Pairwise importance measures performed about equally, with a slight advantage for the joint variable importance method. However, the overall fraction of detected interactions was low. In almost all scenarios the detection fraction in a model with only marginal effects was larger than in a model with an interaction effect only. Conclusions: Random forests are generally capable of capturing gene-gene interactions, but current variable importance measures are unable to detect them as interactions. In most cases, interactions are masked by marginal effects and cannot be differentiated from them. Consequently, caution is warranted when claiming that random forests uncover interactions. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-0995-8) contains supplementary material, which is available to authorized users.
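The distinction drawn in this abstract between capturing and detecting an interaction can be illustrated with a toy permutation importance calculation. The sketch below is not the authors' simulation design: it assumes a hypothetical stand-in "model" that already knows the true rule (the label is the XOR of columns 0 and 1), and it measures importance as the drop in accuracy after shuffling one column. All names are illustrative.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) equals the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, col, rng):
    """Accuracy drop after shuffling one column across rows."""
    shuffled = [r[col] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled)]
    return accuracy(predict, rows, labels) - accuracy(predict, permuted, labels)


# Toy data: the label is a pure interaction (XOR) of columns 0 and 1;
# column 2 is noise. The "model" is a stand-in that knows the true rule.
rng = random.Random(0)
rows = [tuple(rng.randint(0, 1) for _ in range(3)) for _ in range(400)]
labels = [r[0] ^ r[1] for r in rows]
model = lambda r: r[0] ^ r[1]

scores = [permutation_importance(model, rows, labels, c, rng) for c in range(3)]
print(scores)
```

Columns 0 and 1 both receive a positive score, so the interacting variables are captured, while the noise column scores zero. Nothing in the two single-variable scores indicates that the effect is an interaction rather than two marginal effects, which is the limitation the abstract describes.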
Affiliation(s)
- Marvin N Wright
  - Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, Lübeck, 23562, Germany
- Andreas Ziegler
  - Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, Lübeck, 23562, Germany; Zentrum für Klinische Studien, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany; School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, South Africa
- Inke R König
  - Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, Lübeck, 23562, Germany.
|
42
|
Szymczak S, Holzinger E, Dasgupta A, Malley JD, Molloy AM, Mills JL, Brody LC, Stambolian D, Bailey-Wilson JE. r2VIM: A new variable selection method for random forests in genome-wide association studies. BioData Min 2016; 9:7. [PMID: 26839594 PMCID: PMC4736152 DOI: 10.1186/s13040-016-0087-3] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.0] [Received: 07/14/2015] [Accepted: 01/19/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Machine learning methods and in particular random forests (RFs) are a promising alternative to standard single SNP analyses in genome-wide association studies (GWAS). RFs provide variable importance measures (VIMs) to rank SNPs according to their predictive power. However, in contrast to the established genome-wide significance threshold, no clear criteria exist to determine how many SNPs should be selected for downstream analyses. RESULTS We propose a new variable selection approach, recurrent relative variable importance measure (r2VIM). Importance values are calculated relative to an observed minimal importance score for several runs of RF and only SNPs with large relative VIMs in all of the runs are selected as important. Evaluations on simulated GWAS data show that the new method controls the number of false-positives under the null hypothesis. Under a simple alternative hypothesis with several independent main effects it is only slightly less powerful than logistic regression. In an experimental GWAS data set, the same strong signal is identified while the approach selects none of the SNPs in an underpowered GWAS. CONCLUSIONS The novel variable selection method r2VIM is a promising extension to standard RF for objectively selecting relevant SNPs in GWAS while controlling the number of false-positive results.
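The selection rule described in this abstract (importance relative to the minimal observed importance, required to be large in every run) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name, the threshold `factor`, and the toy importance values are all hypothetical, and the importances are assumed to come from separate random forest runs computed elsewhere.

```python
def select_recurrent(importances_per_run, factor=3.0):
    """Return names whose relative importance is >= factor in every run.

    importances_per_run: list of dicts mapping variable name -> raw importance.
    Relative importance = raw importance / |minimum importance in that run|.
    """
    selected = None
    for run in importances_per_run:
        min_abs = abs(min(run.values()))
        if min_abs == 0:
            raise ValueError("minimal importance is zero; relative VIM undefined")
        passing = {name for name, vim in run.items() if vim / min_abs >= factor}
        # Keep only variables that passed in all runs seen so far.
        selected = passing if selected is None else selected & passing
    return sorted(selected or [])


# Hypothetical importances from three forest runs (toy numbers).
runs = [
    {"snp1": 0.90, "snp2": 0.20, "snp3": -0.10, "snp4": 0.05},
    {"snp1": 0.80, "snp2": 0.45, "snp3": -0.12, "snp4": -0.02},
    {"snp1": 1.00, "snp2": 0.10, "snp3": -0.08, "snp4": 0.30},
]
print(select_recurrent(runs))
```

Requiring a variable to pass in every run, rather than on average, is what makes the rule conservative: a variable that looks important in only one run by chance is dropped.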
Affiliation(s)
- Silke Szymczak
  - Statistical Genetics Section, Inherited Disease Research Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Dr, 21224 Baltimore, USA; Current address: Institute of Medical Informatics and Statistics, University of Kiel, Brunswiker Str. 10, 24105 Kiel, Germany
- Emily Holzinger
  - Statistical Genetics Section, Inherited Disease Research Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Dr, 21224 Baltimore, USA
- Abhijit Dasgupta
  - Clinical Trials and Outcomes Branch, National Institute of Arthritis and Musculoskeletal and Skin Diseases, National Institutes of Health, 1 AMS Circle, 20892 Bethesda, USA
- James D Malley
  - Division of Computational Bioscience, Center for Information Technology, National Institutes of Health, 12 South Dr, 20892 Bethesda, USA
- Anne M Molloy
  - Department of Clinical Medicine, School of Medicine, Trinity College Dublin, 152-160 Pearse Street, 2 Dublin, Ireland
- James L Mills
  - Division of Intramural Population Health Research, Eunice Shriver National Institute of Child Health and Human Development, National Institutes of Health, 6100 Executive Blvd, 20892 Bethesda, USA
- Lawrence C Brody
  - Molecular Pathogenesis Section, Medical Genomics and Metabolic Genetics Branch, National Human Genome Research Institute, National Institutes of Health, 50 South Dr, 20892 Bethesda, USA
- Dwight Stambolian
  - Department of Ophthalmology, University of Pennsylvania, 422 Curie Blvd, 19104 Philadelphia, USA
- Joan E Bailey-Wilson
  - Statistical Genetics Section, Inherited Disease Research Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Dr, 21224 Baltimore, USA
|
43
|
Yun YH, Deng BC, Cao DS, Wang WT, Liang YZ. Variable importance analysis based on rank aggregation with applications in metabolomics for biomarker discovery. Anal Chim Acta 2016; 911:27-34. [PMID: 26893083 DOI: 10.1016/j.aca.2015.12.043] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 09/15/2015] [Revised: 12/28/2015] [Accepted: 12/30/2015] [Indexed: 11/17/2022]
Abstract
Biomarker discovery is one important goal in metabolomics, which is typically modeled as selecting the most discriminating metabolites for classification and often referred to as variable importance analysis or variable selection. Until now, a number of variable importance analysis methods to discover biomarkers in metabolomics studies have been proposed. However, different methods are most likely to generate different variable ranking results due to their different principles. Each method generates a variable ranking list just as an expert presents an opinion. The problem of inconsistency between different variable ranking methods is often ignored. To address this problem, a simple and ideal solution is that every ranking should be taken into account. In this study, a strategy called rank aggregation was employed. It is an indispensable tool for merging individual ranking lists into a single "super"-list reflective of the overall preference or importance within the population. This "super"-list is regarded as the final ranking for biomarker discovery. Finally, it was used for biomarker discovery and for selecting the best variable subset with the highest predictive classification accuracy. Nine methods were used, including three univariate filtering and six multivariate methods. When applied to two metabolic datasets (Childhood overweight dataset and Tubulointerstitial lesions dataset), the results show that rank aggregation greatly improved prediction accuracy compared with using all variables. Moreover, it also outperformed a penalized method, the least absolute shrinkage and selection operator (LASSO), with higher prediction accuracy or fewer selected variables, which are more interpretable.
Affiliation(s)
- Yong-Huan Yun
  - College of Chemistry and Chemical Engineering, Central South University, Changsha, 410083, PR China
- Bai-Chuan Deng
  - College of Animal Science, South China Agricultural University, Guangzhou, 510642, PR China
- Dong-Sheng Cao
  - College of Pharmaceutical Sciences, Central South University, Changsha, 410083, PR China
- Wei-Ting Wang
  - College of Chemistry and Chemical Engineering, Central South University, Changsha, 410083, PR China
- Yi-Zeng Liang
  - College of Chemistry and Chemical Engineering, Central South University, Changsha, 410083, PR China.
|
44
|
Saha D, Alluri P, Gan A. Prioritizing Highway Safety Manual's crash prediction variables using boosted regression trees. Accid Anal Prev 2015; 79:133-144. [PMID: 25823903 DOI: 10.1016/j.aap.2015.03.011] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 09/15/2014] [Revised: 02/14/2015] [Accepted: 03/10/2015] [Indexed: 06/04/2023]
Abstract
The Highway Safety Manual (HSM) recommends using the empirical Bayes (EB) method with locally derived calibration factors to predict an agency's safety performance. However, the data needs for deriving these local calibration factors are significant, requiring very detailed roadway characteristics information. Many of the data variables identified in the HSM are currently unavailable in the states' databases. Moreover, the process of collecting and maintaining all the HSM data variables is cost-prohibitive. Prioritization of the variables based on their impact on crash predictions would, therefore, help to identify influential variables for which data could be collected and maintained for continued updates. This study aims to determine the impact of each independent variable identified in the HSM on crash predictions. A relatively recent data mining approach called boosted regression trees (BRT) is used to investigate the association between the variables and crash predictions. The BRT method can effectively handle different types of predictor variables, identify very complex and non-linear associations among variables, and compute variable importance. Five years of crash data from 2008 to 2012 on two urban and suburban facility types, two-lane undivided arterials and four-lane divided arterials, were analyzed to estimate the influence of variables on crash predictions. Variables were found to exhibit non-linear and sometimes complex relationships with predicted crash counts. In addition, only a few variables were found to explain most of the variation in the crash data.
Affiliation(s)
- Dibakar Saha
  - Department of Civil and Environmental Engineering, Florida International University, 10555 West Flagler Street, EC 3680, Miami, FL 33174, United States.
- Priyanka Alluri
  - Department of Civil and Environmental Engineering, Florida International University, 10555 West Flagler Street, EC 3680, Miami, FL 33174, United States.
- Albert Gan
  - Department of Civil and Environmental Engineering, Florida International University, 10555 West Flagler Street, EC 3680, Miami, FL 33174, United States.
|