Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

Total Articles: 44 (from Reference Citation Analysis)

Searched Name: Variable importance

Liebenberg L, L'Abbé EN, Stull KE. Exploring cranial macromorphoscopic variation and classification accuracy in a South African sample. Int J Legal Med 2024:10.1007/s00414-024-03230-2. [PMID: 38622313 DOI: 10.1007/s00414-024-03230-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2024] [Accepted: 04/03/2024] [Indexed: 04/17/2024]

Liu Z, Wei S, Xiao N, Liu Y, Sun Q, Zhang B, Ji H, Cao H, Liu S. Insight into the correlation of key taste substances and key volatile substances from shrimp heads at different temperatures. Food Chem 2024;450:139150. [PMID: 38688226 DOI: 10.1016/j.foodchem.2024.139150] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/09/2023] [Revised: 03/23/2024] [Accepted: 03/24/2024] [Indexed: 05/02/2024]

Affiliation(s)

Zhenyang Liu: College of Food Science and Technology, Guangdong Ocean University, Guangdong Provincial Key Laboratory of Aquatic Product Processing and Safety, Guangdong Province Engineering Laboratory for Marine Biological Products, Guangdong Provincial Engineering Technology Research Center of Seafood, Guangdong Provincial Engineering Technology Research Center of Prefabricated Seafood Processing and Quality Control, Zhanjiang 524088, China; Universidade de Vigo, Nutrition and Bromatology Group, Department of Analytical and Food Chemistry, Faculty of Sciences, Ourense 32004, Spain
Shuai Wei: College of Food Science and Technology, Guangdong Ocean University, Guangdong Provincial Key Laboratory of Aquatic Product Processing and Safety, Guangdong Province Engineering Laboratory for Marine Biological Products, Guangdong Provincial Engineering Technology Research Center of Seafood, Guangdong Provincial Engineering Technology Research Center of Prefabricated Seafood Processing and Quality Control, Zhanjiang 524088, China
Naiyong Xiao: College of Food Science and Technology, Guangdong Ocean University, Guangdong Provincial Key Laboratory of Aquatic Product Processing and Safety, Guangdong Province Engineering Laboratory for Marine Biological Products, Guangdong Provincial Engineering Technology Research Center of Seafood, Guangdong Provincial Engineering Technology Research Center of Prefabricated Seafood Processing and Quality Control, Zhanjiang 524088, China
Yi Liu: Universidade de Vigo, Nutrition and Bromatology Group, Department of Analytical and Food Chemistry, Faculty of Sciences, Ourense 32004, Spain
Qinxiu Sun: College of Food Science and Technology, Guangdong Ocean University, Guangdong Provincial Key Laboratory of Aquatic Product Processing and Safety, Guangdong Province Engineering Laboratory for Marine Biological Products, Guangdong Provincial Engineering Technology Research Center of Seafood, Guangdong Provincial Engineering Technology Research Center of Prefabricated Seafood Processing and Quality Control, Zhanjiang 524088, China
Bin Zhang: College of Food Science and Pharmacy, Zhejiang Ocean University, Zhoushan 316022, China
Hongwu Ji: College of Food Science and Technology, Guangdong Ocean University, Guangdong Provincial Key Laboratory of Aquatic Product Processing and Safety, Guangdong Province Engineering Laboratory for Marine Biological Products, Guangdong Provincial Engineering Technology Research Center of Seafood, Guangdong Provincial Engineering Technology Research Center of Prefabricated Seafood Processing and Quality Control, Zhanjiang 524088, China; Collaborative Innovation Center of Seafood Deep Processing, Dalian Polytechnic University, Dalian 116034, China
Hui Cao: College of Food Science and Technology, Guangdong Ocean University, Guangdong Provincial Key Laboratory of Aquatic Product Processing and Safety, Guangdong Province Engineering Laboratory for Marine Biological Products, Guangdong Provincial Engineering Technology Research Center of Seafood, Guangdong Provincial Engineering Technology Research Center of Prefabricated Seafood Processing and Quality Control, Zhanjiang 524088, China
Shucheng Liu: College of Food Science and Technology, Guangdong Ocean University, Guangdong Provincial Key Laboratory of Aquatic Product Processing and Safety, Guangdong Province Engineering Laboratory for Marine Biological Products, Guangdong Provincial Engineering Technology Research Center of Seafood, Guangdong Provincial Engineering Technology Research Center of Prefabricated Seafood Processing and Quality Control, Zhanjiang 524088, China; Collaborative Innovation Center of Seafood Deep Processing, Dalian Polytechnic University, Dalian 116034, China.

Bothma NP, L'abbé EN, Liebenberg L. Evaluating postcranial macromorphoscopic traits to estimate population variation among modern South Africans. Forensic Sci Int 2024;356:111954. [PMID: 38382241 DOI: 10.1016/j.forsciint.2024.111954] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2023] [Revised: 12/20/2023] [Accepted: 01/31/2024] [Indexed: 02/23/2024]

Hong SM, Yoon IH, Cho KH. Predicting the distribution coefficient of cesium in solid phase groups using machine learning. Chemosphere 2024;352:141462. [PMID: 38364923 DOI: 10.1016/j.chemosphere.2024.141462] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2023] [Revised: 02/06/2024] [Accepted: 02/13/2024] [Indexed: 02/18/2024]

Tian H, Tom BDM, Burgess S. A data-adaptive method for investigating effect heterogeneity with high-dimensional covariates in Mendelian randomization. BMC Med Res Methodol 2024;24:34. [PMID: 38341532 PMCID: PMC10858611 DOI: 10.1186/s12874-024-02153-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/16/2023] [Accepted: 01/17/2024] [Indexed: 02/12/2024] Open

Abstract

BACKGROUND

Mendelian randomization is a popular method for causal inference with observational data that uses genetic variants as instrumental variables. Similarly to a randomized trial, a standard Mendelian randomization analysis estimates the population-averaged effect of an exposure on an outcome. Dividing the population into subgroups can reveal effect heterogeneity to inform who would most benefit from intervention on the exposure. However, as covariates are measured post-"randomization", naive stratification typically induces collider bias in stratum-specific estimates.

METHOD

We extend a previously proposed stratification method (the "doubly-ranked method") to form strata based on a single covariate, and introduce a data-adaptive random forest method to calculate stratum-specific estimates that are robust to collider bias based on a high-dimensional covariate set. We also propose measures based on the Q statistic to assess heterogeneity between stratum-specific estimates (to understand whether estimates are more variable than expected due to chance alone) and variable importance (to identify the key drivers of effect heterogeneity).

RESULT

We show that the effect of body mass index (BMI) on lung function is heterogeneous, depending most strongly on hip circumference and weight. While for most individuals, the predicted effect of increasing BMI on lung function is negative, it is positive for some individuals and strongly negative for others.

CONCLUSION

Our data-adaptive approach allows for the exploration of effect heterogeneity in the relationship between an exposure and an outcome within a Mendelian randomization framework. This can yield valuable insights into disease aetiology and help identify specific groups of individuals who would derive the greatest benefit from targeted interventions on the exposure.
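
The Method paragraph above refers to heterogeneity measures based on the Q statistic. As a rough illustration only, the sketch below computes the standard Cochran's Q for a set of stratum-specific estimates using inverse-variance weights. It is not the authors' implementation; the function name `cochran_q` and the example numbers are hypothetical.

```python
from math import fsum


def cochran_q(estimates, std_errors):
    """Return (pooled_estimate, Q, degrees_of_freedom) for stratum estimates.

    Standard Cochran's Q with inverse-variance weights; an assumption about
    what "based on the Q statistic" means, not the paper's exact measure.
    """
    if len(estimates) != len(std_errors) or len(estimates) < 2:
        raise ValueError("need at least two estimates with matching standard errors")
    weights = [1.0 / se ** 2 for se in std_errors]  # inverse-variance weights
    pooled = fsum(w * b for w, b in zip(weights, estimates)) / fsum(weights)
    q = fsum(w * (b - pooled) ** 2 for w, b in zip(weights, estimates))
    return pooled, q, len(estimates) - 1


# Hypothetical stratum-specific estimates and standard errors (not from the paper).
pooled, q, df = cochran_q([-0.10, -0.25, 0.05], [0.05, 0.05, 0.05])
print(f"pooled={pooled:.3f}, Q={q:.1f}, df={df}")
```

Under the null hypothesis of no heterogeneity, Q is usually compared against a chi-squared distribution with `df` degrees of freedom; a value much larger than `df` suggests the estimates vary more than chance alone would explain.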

Markert N, Guhl B, Feld CK. Water quality deterioration remains a major stressor for macroinvertebrate, diatom and fish communities in German rivers. Sci Total Environ 2024;907:167994. [PMID: 37875194 DOI: 10.1016/j.scitotenv.2023.167994] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2023] [Revised: 09/18/2023] [Accepted: 10/19/2023] [Indexed: 10/26/2023]

Abstract

About 60 % of Europe's rivers fail to meet ecological quality standards derived from biological criteria. The causes are manifold, but recent reports suggest a dominant role of hydro-morphological and water quality-related stressors. Yet, in particular micropollutants and hydrological stressors often tend to be underrepresented in multiple-stressor studies. Using monitoring data from four Federal States in Germany, this study investigated the effects of 19 stressor variables from six stressor groups (nutrients, salt ions, dissolved oxygen/water temperature, mixture toxicity of 51 micropollutants, hydrological alteration and morphological habitat quality) on three biological assemblages (fishes, macroinvertebrates, benthic diatoms). Biological effects were analyzed for 35 community metrics and quantified using Random Forest (RF) analyses to put the stressor groups into a hierarchical context. To compare metric responses, metrics were grouped into categories reflecting important characteristics of biological communities, such as sensitivity, functional traits, diversity and community composition as well as composite indices that integrate several metrics into one single index (e.g., ecological quality class). Water quality-related stressors - but not micropollutants - turned out to dominate the responses of all assemblages. In contrast, the effects of hydro-morphological stressors were less pronounced and stronger for hydrological stressors than for morphological stressors. Explained variances of RF models ranged 23-64 % for macroinvertebrates, 16-40 % for benthic diatoms and 18-48 % for fishes. Despite a high variability of responses across assemblages and stressor groups, sensitivity metrics tended to reveal stronger responses to individual stressors and a higher explained variance in RF models than composite indices. 
The results of this study suggest that (physico-chemical) water quality deterioration continues to impact biological assemblages in many German rivers, despite the extensive progress in wastewater treatment during the past decades. To detect water quality deterioration, monitoring schemes need to target relevant physico-chemical stressors and micropollutants. Furthermore, monitoring needs to integrate measures of hydrological alteration (e.g., flow magnitude and dynamics). At present, hydro-morphological surveys rarely address the degree of hydrological alteration. In order to achieve a good ecological status, river restoration and management need to address both water quality-related and hydro-morphological stressors. Restricting analyses to a single organism group (e.g., macroinvertebrates) or only selected metrics (e.g., ecological quality class) may hamper stressor identification and its hierarchical classification and thus may mislead river management.

Boileau P, Qi NT, van der Laan MJ, Dudoit S, Leng N. A flexible approach for predictive biomarker discovery. Biostatistics 2023;24:1085-1105. [PMID: 35861622 DOI: 10.1093/biostatistics/kxac029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2022] [Revised: 06/01/2022] [Accepted: 06/27/2022] [Indexed: 11/14/2022] Open

Sheikhalishahi S, Bhattacharyya A, Celi LA, Osmani V. An interpretable deep learning model for time-series electronic health records: Case study of delirium prediction in critical care. Artif Intell Med 2023;144:102659. [PMID: 37783541 DOI: 10.1016/j.artmed.2023.102659] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/24/2022] [Revised: 07/19/2023] [Accepted: 09/04/2023] [Indexed: 10/04/2023]

Fife DA, D'Onofrio J. Common, uncommon, and novel applications of random forest in psychological research. Behav Res Methods 2023;55:2447-2466. [PMID: 35915361 DOI: 10.3758/s13428-022-01901-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Accepted: 06/05/2022] [Indexed: 01/08/2023]

Alakus C, Larocque D, Labbe A. Covariance regression with random forests. BMC Bioinformatics 2023;24:258. [PMID: 37330468 DOI: 10.1186/s12859-023-05377-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2022] [Accepted: 06/02/2023] [Indexed: 06/19/2023] Open

Markus AF, Fridgeirsson EA, Kors JA, Verhamme KMC, Rijnbeek PR. Challenges of Estimating Global Feature Importance in Real-World Health Care Data. Stud Health Technol Inform 2023;302:1057-1061. [PMID: 37203580 DOI: 10.3233/shti230346] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/20/2023]

Sheikholeslami R, Hall JW. Global patterns and key drivers of stream nitrogen concentration: A machine learning approach. Sci Total Environ 2023;868:161623. [PMID: 36657680 PMCID: PMC10933795 DOI: 10.1016/j.scitotenv.2023.161623] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2022] [Revised: 12/22/2022] [Accepted: 01/11/2023] [Indexed: 06/17/2023]

Abstract

Anthropogenic loading of nitrogen to river systems can pose serious health hazards and create critical environmental threats. Quantification of the magnitude and impact of freshwater nitrogen requires identifying key controls of nitrogen dynamics and analyzing both the past and present patterns of nitrogen flows. To tackle this challenge, we adopted a machine learning (ML) approach and built an ML-driven representation that captures spatiotemporal variability in nitrogen concentrations at the global scale. Our model uses random forests to regress a large sample of monthly measured stream nitrogen concentrations onto a set of 17 predictors with a spatial resolution of 0.5 degrees over the period 1990-2013, including observations within the pixel and upstream drivers. The model was validated with data from rivers outside the training dataset and was used to predict nitrogen concentrations in 520 major river basins of the world, including many with scarce or no observations. We predicted that the regions with the highest median nitrogen concentrations in their rivers (in 2013) were: United States (Mississippi), Pakistan, Bangladesh, India (Indus, Ganges), China (Yellow, Yangtze, Yongding, Huai), and most of Europe (Rhine, Danube, Vistula, Thames, Trent, Severn). Other major hotspots were the river basins of the Sebou (Morocco), Nakdong (South Korea), Kitakami (Japan), and Egypt's Nile Delta. Our analysis showed that the rate of increase in nitrogen concentration between the 1990s and 2000s was greatest in rivers located in eastern China, eastern and central parts of Canada, the Baltic states, Pakistan, mainland southeast Asia, and south-eastern Australia. Using a new grouped variable importance measure, we also found that temporality (month of the year and cumulative month count) is the most influential predictor, followed by factors representing hydroclimatic conditions, diffuse nutrient emissions from agriculture, and topographic features.
Our model can be further applied to assess strategies designed to reduce nitrogen pollution in freshwater bodies at large spatial scales.

Jayaramu V, Zulkafli Z, De Stercke S, Buytaert W, Rahmat F, Abdul Rahman RZ, Ishak AJ, Tahir W, Ab Rahman J, Mohd Fuzi NMH. Leptospirosis modelling using hydrometeorological indices and random forest machine learning. Int J Biometeorol 2023;67:423-437. [PMID: 36719482 DOI: 10.1007/s00484-022-02422-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2022] [Revised: 12/21/2022] [Accepted: 12/26/2022] [Indexed: 06/18/2023]

Abstract

Leptospirosis is a zoonosis that has been linked to hydrometeorological variability. Hydrometeorological averages and extremes have been used before as drivers in the statistical prediction of disease. However, their importance and predictive capacity are still little known. In this study, the use of a random forest classifier was explored to analyze the relative importance of hydrometeorological indices in developing the leptospirosis model and to evaluate the performance of models based on the type of indices used, using case data from three districts in Kelantan, Malaysia, that experience annual monsoonal rainfall and flooding. First, hydrometeorological data including rainfall, streamflow, water level, relative humidity, and temperature were transformed into 164 weekly average and extreme indices in accordance with the Expert Team on Climate Change Detection and Indices (ETCCDI). Then, weekly case occurrences were classified into binary classes "high" and "low" based on an average threshold. Seventeen models based on "average," "extreme," and "mixed" indices were trained by optimizing the feature subsets based on the model computed mean decrease Gini (MDG) scores. The variable importance was assessed through cross-correlation analysis and the MDG score. The average and extreme models showed similar prediction accuracy ranges (61.5-76.1% and 72.3-77.0%) while the mixed models showed an improvement (71.7-82.6% prediction accuracy). An extreme model was the most sensitive while an average model was the most specific. The time lag associated with the driving indices agreed with the seasonality of the monsoon. The rainfall variable (extreme) was the most important in classifying the leptospirosis occurrence while streamflow was the least important despite showing higher correlations with leptospirosis.
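
The abstract above ranks indices by mean decrease Gini (MDG). As a hedged sketch of the quantity underneath that score, the code below computes the Gini impurity decrease produced by a single binary split. In a random forest, MDG for a variable is typically obtained by accumulating such decreases over every split that uses the variable and averaging across trees; exact weighting and normalization differ between implementations. The function names and the toy "high"/"low" labels are hypothetical and are not taken from the study's data.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Toy weekly case classes split on a hypothetical rainfall threshold.
parent = ["high"] * 4 + ["low"] * 4
left = ["high", "high", "high", "low"]   # weeks above the threshold
right = ["high", "low", "low", "low"]    # weeks at or below the threshold
decrease = gini_decrease(parent, left, right)
print(decrease)
```

A split that separated the two classes perfectly would remove all of the parent's impurity, so larger decreases indicate a more useful splitting variable at that node.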

Gu Y, Liu D, Arvin R, Khattak AJ, Han LD. Predicting intersection crash frequency using connected vehicle data: A framework for geographical random forest. Accid Anal Prev 2023;179:106880. [PMID: 36345113 DOI: 10.1016/j.aap.2022.106880] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 03/22/2022] [Revised: 10/06/2022] [Accepted: 10/20/2022] [Indexed: 06/16/2023]

Sun T, Ji C, Li F, Shan X, Wu H. The legacy effect of microplastics on aquatic animals in the depuration phase: Kinetic characteristics and recovery potential. Environ Int 2022;168:107467. [PMID: 35985106 DOI: 10.1016/j.envint.2022.107467] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/23/2022] [Revised: 07/25/2022] [Accepted: 08/09/2022] [Indexed: 06/15/2023]

Affiliation(s)

Tao Sun: CAS Key Laboratory of Coastal Environmental Processes and Ecological Remediation, Yantai Institute of Coastal Zone Research (YIC), Chinese Academy of Sciences (CAS), Shandong Key Laboratory of Coastal Environmental Processes, YICCAS, Yantai 264003, PR China; University of Chinese Academy of Sciences, Beijing 100049, PR China
Chenglong Ji: CAS Key Laboratory of Coastal Environmental Processes and Ecological Remediation, Yantai Institute of Coastal Zone Research (YIC), Chinese Academy of Sciences (CAS), Shandong Key Laboratory of Coastal Environmental Processes, YICCAS, Yantai 264003, PR China; Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao 266237, PR China; Center for Ocean Mega-Science, Chinese Academy of Sciences (CAS), Qingdao 266071, PR China
Fei Li: CAS Key Laboratory of Coastal Environmental Processes and Ecological Remediation, Yantai Institute of Coastal Zone Research (YIC), Chinese Academy of Sciences (CAS), Shandong Key Laboratory of Coastal Environmental Processes, YICCAS, Yantai 264003, PR China; Center for Ocean Mega-Science, Chinese Academy of Sciences (CAS), Qingdao 266071, PR China
Xiujuan Shan: Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao 266237, PR China
Huifeng Wu: CAS Key Laboratory of Coastal Environmental Processes and Ecological Remediation, Yantai Institute of Coastal Zone Research (YIC), Chinese Academy of Sciences (CAS), Shandong Key Laboratory of Coastal Environmental Processes, YICCAS, Yantai 264003, PR China; Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao 266237, PR China; Center for Ocean Mega-Science, Chinese Academy of Sciences (CAS), Qingdao 266071, PR China.

Behrouz MS, Yazdi MN, Sample DJ. Using Random Forest, a machine learning approach to predict nitrogen, phosphorus, and sediment event mean concentrations in urban runoff. J Environ Manage 2022;317:115412. [PMID: 35649331 DOI: 10.1016/j.jenvman.2022.115412] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2022] [Revised: 05/22/2022] [Accepted: 05/24/2022] [Indexed: 06/15/2023]

Abstract

Estimating pollutant loads from developed watersheds is vitally important to reduce nonpoint source pollution from urban areas, as a key tool in meeting water quality goals is the implementation of Stormwater Control Measures (SCMs). SCMs are selected and sized based on influent pollutant loads. A common method used to estimate pollutant loads in urban runoff is the Event Mean Concentration (EMC) method. In this study, we develop and apply data-driven models using Random Forest (RF), a machine learning approach, to predict Total Nitrogen (TN), Total Phosphorus (TP), Total Suspended Solids (TSS), and Ortho-Phosphorus (Ortho-P) EMCs in urban runoff. The parameters considered in this study were climatological characteristics (i.e., Antecedent Dry Period or ADP, Precipitation Depth or P, Duration or D, and Intensity or I) and catchment characteristics, comprising land use-related parameters (Imperviousness or Imp, Saturated Hydraulic Conductivity or K_sat, and Available Water Capacity or AWC) and site-specific parameters (Slope or S, and Catchment Size or A). Stormwater quality data for this study were obtained from the National Stormwater Quality Database (NSQD), which is the largest repository of stormwater quality data in the U.S. Results demonstrate that land use-related characteristics (i.e., Imp, K_sat, and AWC) were the most effective variables for predicting all EMCs. For TP, TSS, and Ortho-P, site-specific characteristics (S and A) had a greater effect than climatological characteristics (i.e., ADP, P, D, and I). However, for TN, climatological characteristics had a greater effect than site-specific characteristics (S and A). In addition, for TN, TP, and TSS, precipitation characteristics (P, D, and I) were found to be more effective parameters for estimating EMCs than ADP. This study highlights the most influential parameters affecting EMCs, which can be used by stakeholders and SCM designers to improve estimates of nutrient and sediment EMCs.
The selection and design of the highest performing SCMs is essential in achieving effective treatment of stormwater, attaining water quality goals, and protecting downstream waterbodies.
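
The abstract above predicts Event Mean Concentrations but does not spell out the quantity. As a minimal sketch of the usual definition, an EMC is the total pollutant mass carried by a runoff event divided by the total runoff volume, which amounts to a flow-weighted mean concentration. The function name, the units, and the sample values below are assumptions for illustration and do not come from the NSQD.

```python
def event_mean_concentration(concentrations_mg_per_l, flows_l_per_s, dt_s):
    """Flow-weighted mean concentration (mg/L) over one runoff event.

    Assumes paired concentration and flow samples taken at a fixed
    interval of dt_s seconds (a simplifying assumption for this sketch).
    """
    if len(concentrations_mg_per_l) != len(flows_l_per_s) or not flows_l_per_s:
        raise ValueError("need matching, non-empty concentration and flow series")
    volumes = [q * dt_s for q in flows_l_per_s]  # litres discharged per interval
    total_volume = sum(volumes)
    if total_volume <= 0:
        raise ValueError("total runoff volume must be positive")
    total_mass = sum(c * v for c, v in zip(concentrations_mg_per_l, volumes))  # mg
    return total_mass / total_volume


# Hypothetical storm event sampled every 60 seconds.
emc = event_mean_concentration([2.0, 4.0, 1.0], [10.0, 30.0, 10.0], 60.0)
event_load_mg = emc * sum(q * 60.0 for q in [10.0, 30.0, 10.0])
print(emc, event_load_mg)
```

Multiplying an EMC by the event's runoff volume gives back the pollutant load, which is why a predicted EMC can feed directly into the sizing of a control measure.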

Lin JYJ, Hu L, Huang C, Jiayi J, Lawrence S, Govindarajulu U. A flexible approach for variable selection in large-scale healthcare database studies with missing covariate and outcome data. BMC Med Res Methodol 2022;22:132. [PMID: 35508974 PMCID: PMC9066834 DOI: 10.1186/s12874-022-01608-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/04/2021] [Accepted: 04/19/2022] [Indexed: 12/17/2022] Open

Abstract

Background

Prior work has shown that combining bootstrap imputation with tree-based machine learning variable selection methods can recover the good performance achievable on fully observed data when covariate and outcome data are missing at random (MAR). This approach, however, is computationally expensive, especially on large-scale datasets.

Methods

We propose an inference-based method, called RR-BART, which leverages the likelihood-based Bayesian machine learning technique, Bayesian additive regression trees, and uses Rubin’s rule to combine the estimates and variances of the variable importance measures on multiply imputed datasets for variable selection in the presence of MAR data. We conduct a representative simulation study to investigate the practical operating characteristics of RR-BART, and compare it with the bootstrap imputation based methods. We further demonstrate the methods via a case study of risk factors for 3-year incidence of metabolic syndrome among middle-aged women using data from the Study of Women’s Health Across the Nation (SWAN).

Results

The simulation study suggests that even in complex conditions of nonlinearity and nonadditivity with a large percentage of missingness, RR-BART can reasonably recover both the prediction and the variable selection performance achievable on the fully observed data. RR-BART provides the best performance that the bootstrap imputation based methods can achieve with the optimal selection threshold value. In addition, RR-BART demonstrates a substantially stronger ability to detect discrete predictors. Furthermore, RR-BART offers substantial computational savings. When implemented on the SWAN data, RR-BART adds to the literature by selecting a set of predictors that had been less commonly identified as risk factors but had substantial biological justifications.

Conclusion

The proposed variable selection method for MAR data, RR-BART, offers both computational efficiency and good operating characteristics and is utilitarian in large-scale healthcare database studies.

Supplementary Information

The online version contains supplementary material available at (10.1186/s12874-022-01608-7).
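
The Methods paragraph above says RR-BART uses Rubin's rule to combine estimates and variances of variable importance measures across multiply imputed datasets. The sketch below shows only the generic pooling step, under the standard form of Rubin's rules; it does not fit BART or reproduce the authors' selection criterion. The function name `rubin_pool` and the numbers are hypothetical.

```python
from statistics import mean, variance


def rubin_pool(estimates, within_variances):
    """Pool one quantity across m imputed datasets with Rubin's rules.

    Returns (pooled_estimate, total_variance), where the total variance is
    the mean within-imputation variance plus (1 + 1/m) times the
    between-imputation variance.
    """
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need at least two imputations with matching variances")
    pooled_estimate = mean(estimates)
    within = mean(within_variances)
    between = variance(estimates)  # sample variance, divides by m - 1
    total_variance = within + (1.0 + 1.0 / m) * between
    return pooled_estimate, total_variance


# Hypothetical importance estimates for one predictor from three imputed datasets.
pooled_estimate, total_variance = rubin_pool([0.2, 0.3, 0.4], [0.01, 0.01, 0.01])
print(pooled_estimate, total_variance)
```

The `(1 + 1/m)` factor inflates the between-imputation variance to account for using a finite number of imputations, so the pooled variance reflects uncertainty from the missing data as well as from sampling.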

Luo Y, Yan J, McClure SC, Li F. Socioeconomic and environmental factors of poverty in China using geographically weighted random forest regression model. Environ Sci Pollut Res Int 2022;29:33205-33217. [PMID: 35022975 PMCID: PMC8754530 DOI: 10.1007/s11356-021-17513-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/03/2021] [Accepted: 11/09/2021] [Indexed: 06/07/2023]

Parks J, McLean KE, McCandless L, de Souza RJ, Brook JR, Scott J, Turvey SE, Mandhane PJ, Becker AB, Azad MB, Moraes TJ, Lefebvre DL, Sears MR, Subbarao P, Takaro TK. Assessing secondhand and thirdhand tobacco smoke exposure in Canadian infants using questionnaires, biomarkers, and machine learning. J Expo Sci Environ Epidemiol 2022;32:112-123. [PMID: 34175887 PMCID: PMC8770125 DOI: 10.1038/s41370-021-00350-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 11/27/2020] [Revised: 05/28/2021] [Accepted: 05/28/2021] [Indexed: 06/02/2023]

Pickett KL, Suresh K, Campbell KR, Davis S, Juarez-Colunga E. Random survival forests for dynamic predictions of a time-to-event outcome using a longitudinal biomarker. BMC Med Res Methodol 2021;21:216. [PMID: 34657597 PMCID: PMC8520610 DOI: 10.1186/s12874-021-01375-x] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 02/17/2021] [Accepted: 08/21/2021] [Indexed: 11/24/2022] Open

Abstract

BACKGROUND

Risk prediction models for time-to-event outcomes play a vital role in personalized decision-making. A patient's biomarker values, such as medical lab results, are often measured over time but traditional prediction models ignore their longitudinal nature, using only baseline information. Dynamic prediction incorporates longitudinal information to produce updated survival predictions during follow-up. Existing methods for dynamic prediction include joint modeling, which often suffers from computational complexity and poor performance under misspecification, and landmarking, which has a straightforward implementation but typically relies on a proportional hazards model. Random survival forests (RSF), a machine learning algorithm for time-to-event outcomes, can capture complex relationships between the predictors and survival without requiring prior specification and has been shown to have superior predictive performance.

METHODS

We propose an alternative approach for dynamic prediction using random survival forests in a landmarking framework. With a simulation study, we compared the predictive performance of our proposed method with Cox landmarking and joint modeling in situations where the proportional hazards assumption does not hold and the longitudinal marker(s) have a complex relationship with the survival outcome. We illustrated the use of the RSF landmark approach in two clinical applications to assess the performance of various RSF model building decisions and to demonstrate its use in obtaining dynamic predictions.

RESULTS

In simulation studies, RSF landmarking outperformed joint modeling and Cox landmarking when a complex relationship between the survival and longitudinal marker processes was present. It was also useful in application when there were several predictors for which the clinical relevance was unknown and multiple longitudinal biomarkers were present. Individualized dynamic predictions can be obtained from this method and the variable importance metric is useful for examining the changing predictive power of variables over time. In addition, RSF landmarking is easily implementable in standard software and using suggested specifications requires less computation time than joint modeling.

CONCLUSIONS

RSF landmarking is a nonparametric, machine learning alternative to current methods for obtaining dynamic predictions when there are complex or unknown relationships present. It requires little upfront decision-making, has comparable predictive performance, and has preferable computational speed.
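As an illustration of the landmarking framework this abstract refers to, the sketch below builds a landmark data set that a survival model (such as a random survival forest) could then be fitted to. It is a minimal sketch under stated assumptions, not the authors' implementation: the `Subject` record, the `landmark_rows` helper, the use of the last observed biomarker value, and the censoring at a fixed prediction horizon are all hypothetical choices that the abstract does not specify.

```python
from typing import NamedTuple


class Subject(NamedTuple):
    """Hypothetical record layout; not taken from the cited paper."""
    id: str
    event_time: float   # time of the event or of censoring
    event: bool         # True if the event was observed
    measurements: list  # (time, biomarker value) pairs


def landmark_rows(subjects, landmark, horizon):
    """Build one row per subject still at risk at the landmark time.

    Assumptions (ours, for illustration): the most recent biomarker value
    at or before the landmark is carried forward, and follow-up is
    censored at landmark + horizon.
    """
    rows = []
    for s in subjects:
        if s.event_time <= landmark:
            continue  # event or censoring already happened: not at risk
        past = [(t, v) for t, v in s.measurements if t <= landmark]
        if not past:
            continue  # no biomarker history available at the landmark
        last_value = max(past, key=lambda tv: tv[0])[1]
        end = min(s.event_time, landmark + horizon)
        observed = s.event and s.event_time <= landmark + horizon
        rows.append({
            "id": s.id,
            "marker": last_value,
            "time": end - landmark,  # residual time measured from the landmark
            "event": observed,
        })
    return rows
```

Repeating this for several landmark times gives one data set per landmark; a separate model fitted to each would then supply the updated predictions.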

Nath A. Prediction for understanding the effectiveness of antiviral peptides. Comput Biol Chem 2021;95:107588. [PMID: 34655913 DOI: 10.1016/j.compbiolchem.2021.107588] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/12/2021] [Revised: 10/01/2021] [Accepted: 10/02/2021] [Indexed: 11/20/2022]

Singha S, Pasupuleti S, Singha SS, Singh R, Kumar S. Prediction of groundwater quality using efficient machine learning technique. Chemosphere 2021;276:130265. [PMID: 34088106 DOI: 10.1016/j.chemosphere.2021.130265] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 10/14/2020] [Revised: 03/07/2021] [Accepted: 03/11/2021] [Indexed: 06/12/2023]

Zhang J, Lv L, Lu D, Kong D, Al-Alashaari MAA, Zhao X. Variable selection from a feature representing protein sequences: a case of classification on bacterial type IV secreted effectors. BMC Bioinformatics 2020;21:480. [PMID: 33109082 PMCID: PMC7590791 DOI: 10.1186/s12859-020-03826-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 07/21/2020] [Accepted: 10/19/2020] [Indexed: 12/13/2022]

Abstract

Background

Classification of certain proteins with specific functions is momentous for biological research. Encoding approaches of protein sequences for feature extraction play an important role in protein classification. Many computational methods (namely classifiers) are used for classification on protein sequences according to various encoding approaches. Commonly, protein sequences keep certain labels corresponding to different categories of biological functions (e.g., bacterial type IV secreted effectors or not), which makes protein prediction a fantasy. For protein prediction, a kernel set of protein sequences keeping certain labels certified by biological experiments should exist in advance. However, this has hardly ever been seen in prevailing research. Therefore, unsupervised learning rather than supervised learning (e.g., classification) should be considered. For protein classification, various classifiers may help to evaluate the effectiveness of different encoding approaches. Besides, variable selection from an encoded feature representing protein sequences is an important issue that also needs to be considered.

Results

Focusing on the latter problem, we propose a new method for variable selection from an encoded feature representing protein sequences. Taking a benchmark dataset containing 1947 protein sequences as a case, experiments are made to identify bacterial type IV secreted effectors (T4SE) from protein sequences, which are composed of 399 T4SE and 1548 non-T4SE. Comparable and quantified results are obtained using only certain components of the encoded feature, i.e., the position-specific scoring matrix, which indicates the effectiveness of our method.

Conclusions

Certain variables, rather than the whole encoded feature they belong to, do work for discrimination between different types of proteins. In addition, ensemble classifiers with an automatic assignment of different base classifiers do achieve a better classification result.

Crager MR. Extensions of the absolute standardized hazard ratio and connections with measures of explained variation and variable importance. Lifetime Data Anal 2020;26:872-892. [PMID: 32705583 DOI: 10.1007/s10985-020-09504-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 03/05/2019] [Accepted: 07/09/2020] [Indexed: 06/11/2023]

Pourghasemi HR, Sadhasivam N, Yousefi S, Tavangar S, Ghaffari Nazarlou H, Santosh M. Using machine learning algorithms to map the groundwater recharge potential zones. J Environ Manage 2020;265:110525. [PMID: 32275245 DOI: 10.1016/j.jenvman.2020.110525] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 10/05/2019] [Revised: 03/23/2020] [Accepted: 03/28/2020] [Indexed: 06/11/2023]

Fathololoumi S, Vaezi AR, Alavipanah SK, Ghorbani A, Biswas A. Comparison of spectral and spatial-based approaches for mapping the local variation of soil moisture in a semi-arid mountainous area. Sci Total Environ 2020;724:138319. [PMID: 32408464 DOI: 10.1016/j.scitotenv.2020.138319] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 01/14/2020] [Revised: 03/28/2020] [Accepted: 03/28/2020] [Indexed: 06/11/2023]

Abstract

Accurate information on soil moisture (SM) is critical in various applications including agriculture, climate, hydrology, soil and drought. In this paper, various predictive relationships including regression (Multiple Linear Regression, MLR), machine learning (Random Forest, RF; Triangular regression, Tr) and spatial modeling (Inverse Distance Weighting, IDW and Ordinary kriging, OK) approaches were compared to estimate SM in a semi-arid mountainous watershed. In developing the predictive relationships, Remote Sensing (RS) datasets including Landsat 8 satellite imagery derived surface biophysical characteristics, ASTER digital elevation model (DEM) derived surface topographical characteristics, climatic data recorded at the synoptic station and in situ SM data measured at Landsat 8 overpass time were utilized, while in spatial modeling, point-based SM measurements were interpolated. While 70% (calibration set) of the measured SM data were used for modeling, 30% (validation set) were used to evaluate modeling accuracy. Finally, the SM uncertainty maps were created for different models based on a bootstrapping approach. Among the environmental parameter sets, land surface temperature (LST) showed the highest impact on the spatial distribution of SM in the region at all dates. Mean R² (RMSE) between measured and modeled SM on three dates obtained from the MLR, RF, IDW, OK, and Tr models were 0.70 (1.97%), 0.72 (1.92%), 0.59 (2.38%), 0.59 (2.27%) and 0.71 (1.99%), respectively. The results showed that RF and IDW produced the highest and lowest performance in SM modeling, respectively. Generally, the performance of RS-based models was higher than that of interpolation models for estimating SM due to the influence of the combination of topographic parameters and surface biophysical characteristics. Modeled SM uncertainty with different models varies in the study area. The highest uncertainty in SM modeling was observed in the northern part of the study area, where the surface heterogeneity is high. Using RS data increased the accuracy of SM modeling because they can capture the heterogeneity of surface biophysical characteristics and topographical properties.
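To make two of the quantities in this abstract concrete, the sketch below shows textbook inverse distance weighting (IDW) interpolation and the root mean squared error (RMSE) used to score predictions on a hold-out set. It is only an illustration: the function names `idw_predict` and `rmse`, the planar Euclidean distance, and the exponent of 2 are assumptions, and the abstract does not state the settings the authors used.

```python
import math


def idw_predict(known, query, power=2.0):
    """Interpolate a value at `query` from (x, y, value) samples.

    Each sample is weighted by 1 / distance**power; the exponent of 2 is
    a common default, assumed here and not taken from the paper.
    """
    numerator = 0.0
    denominator = 0.0
    for x, y, value in known:
        distance = math.hypot(query[0] - x, query[1] - y)
        if distance == 0.0:
            return value  # query coincides with a sample point
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


def rmse(observed, predicted):
    """Root mean squared error between paired observations and predictions."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))
```

In a 70/30 design like the one described, `known` would hold the calibration points and `rmse` would be computed over the validation points only.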

Chen V, Zhang H. Depth importance in precision medicine (DIPM): a tree- and forest-based method for right-censored survival outcomes. Biostatistics 2020;23:157-172. [PMID: 32424406 DOI: 10.1093/biostatistics/kxaa021] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/16/2019] [Revised: 04/09/2020] [Accepted: 04/13/2020] [Indexed: 12/26/2022]

Gómez-Verdejo V, Parrado-Hernández E, Tohka J. Sign-Consistency Based Variable Importance for Machine Learning in Brain Imaging. Neuroinformatics 2020;17:593-609. [PMID: 30919255 PMCID: PMC6841656 DOI: 10.1007/s12021-019-9415-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 01/12/2023]

de Menezes MD, Bispo FHA, Faria WM, Gonçalves MGM, Curi N, Guilherme LRG. Modeling arsenic content in Brazilian soils: What is relevant? Sci Total Environ 2020;712:136511. [PMID: 32050379 DOI: 10.1016/j.scitotenv.2020.136511] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 04/24/2019] [Revised: 12/30/2019] [Accepted: 01/02/2020] [Indexed: 06/10/2023]

Abstract

Arsenic accumulation in the environment poses ecological and human health risks. A greater knowledge about soil total As content variability and its main drivers is strategic for maintaining soil security, helping public policies and environmental surveys. Considering the poor history of As studies in Brazil at the country's geographical scale, this work aimed to generate predictive models of topsoil As content using machine learning (ML) algorithms based on several environmental covariables representing soil forming factors, ranking their importance as explanatory covariables and for feeding group analysis. An unprecedented databank based on laboratory analyses (including rare earth elements), proximal and remote sensing, geographical information system operations, and pedological information were surveyed. The median soil As content ranged from 0.14 to 41.1 mg kg⁻¹ in reference soils, and 0.28 to 58.3 mg kg⁻¹ in agricultural soils. Recursive Feature Elimination Random Forest outperformed other ML algorithms, ranking as most important environmental covariables: temperature, soil organic carbon (SOC), clay, sand, and TiO₂. Four natural groups were statistically suggested (As content ± standard error in mg kg⁻¹): G1) with coarser texture, lower SOC, higher temperatures, and the lowest TiO₂ contents, has the lowest As content (2.24 ± 0.50), accomplishing different environmental conditions; G2) organic soils located in floodplains, medium TiO₂ and temperature, whose As content (3.78 ± 2.05) is slightly higher than G1, but lower than G3 and G4; G3) medium contents of As (7.14 ± 1.30), texture, SOC, TiO₂, and temperature, representing the largest number of points widespread throughout Brazil; G4) the largest contents of As (11.97 ± 1.62), SOC, and TiO₂, and the lowest sand content, with points located mainly across Southeastern Brazil with milder temperature. In the absence of soil As content, a common scenario in Brazil and in many Latin American countries, such natural groups could work as environmental indicators.

Yao D, Zhan X, Zhan X, Kwoh CK, Li P, Wang J. A random forest based computational model for predicting novel lncRNA-disease associations. BMC Bioinformatics 2020;21:126. [PMID: 32216744 DOI: 10.1186/s12859-020-3458-1] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 12/19/2019] [Accepted: 03/18/2020] [Indexed: 02/06/2023]

Abstract

BACKGROUND

Accumulated evidence shows that the abnormal regulation of long non-coding RNA (lncRNA) is associated with various human diseases. Accurately identifying disease-associated lncRNAs is helpful to study the mechanism of lncRNAs in diseases and explore new therapies of diseases. Many lncRNA-disease association (LDA) prediction models have been implemented by integrating multiple kinds of data resources. However, most of the existing models ignore the interference of noisy and redundant information among these data resources.

RESULTS

To improve the ability of LDA prediction models, we implemented a random forest and feature selection based LDA prediction model (RFLDA in short). First, the RFLDA integrates the experiment-supported miRNA-disease associations (MDAs) and LDAs, the disease semantic similarity (DSS), the lncRNA functional similarity (LFS) and the lncRNA-miRNA interactions (LMI) as input features. Then, the RFLDA chooses the most useful features to train the prediction model by feature selection based on the random forest variable importance score, which takes into account not only the effect of each individual feature on prediction results but also the joint effects of multiple features on prediction results. Finally, a random forest regression model is trained to score potential lncRNA-disease associations. With an area under the receiver operating characteristic curve (AUC) of 0.976 and an area under the precision-recall curve (AUPR) of 0.779 under 5-fold cross-validation, the performance of the RFLDA is better than several state-of-the-art LDA prediction models. Moreover, case studies on three cancers demonstrate that 43 of the 45 lncRNAs predicted by the RFLDA are validated by experimental data, and the other two predicted lncRNAs are supported by other LDA prediction models.

CONCLUSIONS

Cross-validation and case studies indicate that the RFLDA has excellent ability to identify potential disease-associated lncRNAs.

Aziz F, Malek S, Mhd Ali A, Wong MS, Mosleh M, Milow P. Determining hypertensive patients' beliefs towards medication and associations with medication adherence using machine learning methods. PeerJ 2020;8:e8286. [PMID: 32206445 PMCID: PMC7075362 DOI: 10.7717/peerj.8286] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 04/03/2018] [Accepted: 11/24/2019] [Indexed: 01/31/2023]

Abstract

Background

This study assesses the feasibility of using machine learning methods such as Random Forests (RF), Artificial Neural Networks (ANN), Support Vector Regression (SVR) and Self-Organizing Feature Maps (SOM) to identify and determine factors associated with hypertensive patients' adherence levels. Hypertension is the medical term for systolic and diastolic blood pressure higher than 140/90 mmHg. A conventional medication adherence scale was used to identify patients' adherence to their prescribed medication. Using machine learning applications to predict precise numeric adherence scores in hypertensive patients has not yet been reported in the literature.

Methods

Data from 160 hypertensive patients from a tertiary hospital in Kuala Lumpur, Malaysia, were used in this study. Variables were ranked based on their significance to adherence levels using the RF variable importance method. The backward elimination method was then performed using RF to obtain the variables significantly associated with the patients' adherence levels. RF, SVR and ANN models were developed to predict adherence using the identified significant variables. Visualizations of the relationships between hypertensive patients' adherence levels and variables were generated using SOM.

Result

Machine learning models constructed using the selected variables reported RMSE values of 1.42 for ANN, 1.53 for RF, and 1.55 for SVR. The accuracy of the dichotomised scores, calculated based on a percentage of correctly identified adherence values, was used as an additional model performance measure, resulting in accuracies of 65% (ANN), 78% (RF) and 79% (SVR), respectively. The Wilcoxon signed-rank test reported that there was no significant difference between the predictions of the machine learning models and the actual scores. The significant variables identified from the RF variable importance method were educational level, marital status, General Overuse, monthly income, and Specific Concern.

Conclusion

This study suggests an effective alternative to conventional methods in identifying the key variables to understand hypertensive patients' adherence levels. This can be used as a tool to educate patients on the importance of medication in managing hypertension.
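The abstract above ranks variables with a random forest variable importance measure without saying which variant was used. One widely used, model-agnostic variant is permutation importance: shuffle one column, measure how much the prediction error grows, and average over repeats. The sketch below illustrates that idea only; the names `mse` and `permutation_importance`, the row-of-lists data layout, and the use of mean squared error are assumptions for illustration, not details from the study.

```python
import random


def mse(model, X, y):
    """Mean squared error of `model` (a callable on one row) over the data."""
    return sum((model(row) - target) ** 2 for row, target in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Average increase in MSE when each feature column is shuffled.

    X is a list of rows (lists of floats). A larger value means the model
    relies more on that feature; a feature the model never uses scores 0.
    """
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by shuffled values.
            permuted = [row[:j] + [value] + row[j + 1:]
                        for row, value in zip(X, column)]
            increases.append(mse(model, permuted, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances
```

A backward elimination loop like the one the study describes could drop the lowest-ranked variable, refit, and repeat; the stopping rule would be a further assumption.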

Zhou Y, Zuo Z, Xu F, Wang Y. Origin identification of Panax notoginseng by multi-sensor information fusion strategy of infrared spectra combined with random forest. Spectrochim Acta A Mol Biomol Spectrosc 2020;226:117619. [PMID: 31606667 DOI: 10.1016/j.saa.2019.117619] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Received: 05/08/2019] [Revised: 07/14/2019] [Accepted: 10/06/2019] [Indexed: 06/10/2023]

Ozigis MS, Kaduk JD, Jarvis CH, da Conceição Bispo P, Balzter H. Detection of oil pollution impacts on vegetation using multifrequency SAR, multispectral images with fuzzy forest and random forest methods. Environ Pollut 2020;256:113360. [PMID: 31672372 DOI: 10.1016/j.envpol.2019.113360] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 03/31/2019] [Revised: 09/28/2019] [Accepted: 10/06/2019] [Indexed: 05/22/2023]

Abstract

Oil pollution harms terrestrial ecosystems. There is an urgent requirement to improve on existing methods for detecting, mapping and establishing the precise extent of oil-impacted and oil-free vegetation. This is needed to quantify existing spill extents, formulate effective remediation strategies and to enable effective pipeline monitoring strategies to identify leakages at an early stage. An effective oil spill detection algorithm based on optical image spectral responses can benefit immensely from the inclusion of multi-frequency Synthetic Aperture Radar (SAR) data, especially when the effect of multi-collinearity is sufficiently reduced. This study compared the Fuzzy Forest (FF) and Random Forest (RF) methods in detecting and mapping oil-impacted vegetation from a post-spill multispectral optical Sentinel-2 image and multifrequency C and X Band Sentinel-1, COSMO-SkyMed and TanDEM-X SAR images. FF and RF classifiers were employed to discriminate oil-spill impacted and oil-free vegetation in a study area in Nigeria. Fuzzy Forest uses specific functions for the selection and use of uncorrelated variables in the classification process to yield an improved result. This method proved an efficient variable selection technique addressing the effects of high dimensionality and multi-collinearity, as the optimization and use of different SAR and optical image variables generated more accurate results than the RF algorithm in densely vegetated areas. An Overall Accuracy (OA) of 75% was obtained for the dense (Tree Cover Area, TCA) vegetation, while cropland and grassland areas had 59.4% and 65% OA respectively. However, RF performed better in cropland areas with OA = 75% when SAR-optical image variables were used for classification, while both methods performed equally well in grassland areas with OA = 65%. Similarly, significant backscatter differences (P < 0.005) were observed in the C-Band backscatter sample mean of polluted and oil-free TCA, while strong linear associations existed between LAI and backscatter in grassland and TCA. This study demonstrates that SAR-based monitoring of petroleum hydrocarbon impacts on vegetation is feasible and has high potential for establishing oil-impacted areas and oil pipeline monitoring.

Liao X, Kerr D, Morales J, Duncan I. Application of Machine Learning to Identify Clustering of Cardiometabolic Risk Factors in U.S. Adults. Diabetes Technol Ther 2019;21:245-253. [PMID: 30969131 DOI: 10.1089/dia.2018.0390] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 01/22/2023]

Pedersen KB, Jensen PE, Ottosen LM, Barlindhaug J. Applying multivariate analysis for optimising the electrodialytic removal of Cu and Pb from shooting range soils. J Hazard Mater 2019;368:869-876. [PMID: 30322811 DOI: 10.1016/j.jhazmat.2018.10.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2017] [Revised: 09/20/2018] [Accepted: 10/03/2018] [Indexed: 06/08/2023]

Ozigis MS, Kaduk JD, Jarvis CH. Mapping terrestrial oil spill impact using machine learning random forest and Landsat 8 OLI imagery: a case site within the Niger Delta region of Nigeria. Environ Sci Pollut Res Int 2019;26:3621-3635. [PMID: 30535661 PMCID: PMC6513793 DOI: 10.1007/s11356-018-3824-y] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 09/18/2018] [Accepted: 11/21/2018] [Indexed: 04/12/2023]

Huber M, Kurz C, Leidl R. Predicting patient-reported outcomes following hip and knee replacement surgery using supervised machine learning. BMC Med Inform Decis Mak 2019;19:3. [PMID: 30621670 PMCID: PMC6325823 DOI: 10.1186/s12911-018-0731-6] [Citation(s) in RCA: 53] [Impact Index Per Article: 10.6] [Received: 07/27/2018] [Accepted: 12/27/2018] [Indexed: 12/28/2022]

Abstract

BACKGROUND

Machine-learning classifiers mostly offer good predictive performance and are increasingly used to support shared decision-making in clinical practice. Focusing on performance and practicability, this study evaluates prediction of patient-reported outcomes (PROs) by eight supervised classifiers including a linear model, following hip and knee replacement surgery.

METHODS

NHS PRO data (130,945 observations) from April 2015 to April 2017 were used to train and test eight classifiers to predict binary postoperative improvement based on minimal important differences. Area under the receiver operating characteristic, J-statistic and several other metrics were calculated. The dependent outcomes were generic and disease-specific improvement based on the EQ-5D-3L visual analogue scale (VAS) as well as the Oxford Hip and Knee Score (Q score).

RESULTS

The area under the receiver operating characteristic of the best training models was around 0.87 (VAS) and 0.78 (Q score) for hip replacement, while it was around 0.86 (VAS) and 0.70 (Q score) for knee replacement surgery. Extreme gradient boosting, random forests, multistep elastic net and linear model provided the highest overall J-statistics. Based on variable importance, the most important predictors for post-operative outcomes were preoperative VAS, Q score and single Q score dimensions. Sensitivity analysis for hip replacement VAS evaluated the influence of minimal important difference, patient selection criteria as well as additional data years. Together with a small benchmark of the NHS prediction model, robustness of our results was confirmed.

CONCLUSIONS

Supervised machine-learning implementations, like extreme gradient boosting, can provide better performance than linear models and should be considered when high predictive performance is needed. Preoperative VAS, Q score and specific dimensions like limping are the most important predictors for postoperative hip and knee PROMs.
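The J-statistic reported in this abstract is presumably Youden's J, defined as sensitivity + specificity - 1 for a binary classifier; that reading is an assumption, since the abstract does not define the term. The sketch below is a minimal illustration with a hypothetical `youden_j` helper, taking true and predicted binary labels.

```python
def youden_j(y_true, y_pred):
    """Youden's J = sensitivity + specificity - 1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity + specificity - 1.0
```

J ranges from -1 to 1; a value of 0 means the classifier does no better than chance, and 1 means perfect separation. The helper assumes both classes are present, otherwise a division by zero occurs.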

Ding C, Chen P, Jiao J. Non-linear effects of the built environment on automobile-involved pedestrian crash frequency: A machine learning approach. Accid Anal Prev 2018;112:116-126. [PMID: 29329016 PMCID: PMC10388697 DOI: 10.1016/j.aap.2017.12.026] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 08/24/2017] [Revised: 11/12/2017] [Accepted: 12/31/2017] [Indexed: 06/07/2023]

Park H, Haghani A, Samuel S, Knodler MA. Real-time prediction and avoidance of secondary crashes under unexpected traffic congestion. Accid Anal Prev 2018;112:39-49. [PMID: 29306687 DOI: 10.1016/j.aap.2017.11.025] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 05/13/2017] [Revised: 10/15/2017] [Accepted: 11/18/2017] [Indexed: 06/07/2023]

Abstract

According to the Federal Highway Administration, nonrecurring congestion contributes to nearly half of the overall congestion. Temporal disruptions impact the effective use of the complete roadway, due to speed reduction and rubbernecking resulting from primary incidents that in turn provoke secondary incidents. There is an additional reduction of discharge flow caused by a secondary incident that significantly increases total delay. Therefore, it is important to sequentially predict the probability of secondary incidents and develop appropriate countermeasures to reduce the associated risk. Advanced computing techniques were used to easily understand and reliably predict secondary incident occurrences that have low sample mean and a small sample size. The likelihood of a secondary incident was sequentially predicted from the point of incident response to the eventual road clearance. The quality of predictions improved with the availability of additional information. The prediction performance of the principled Bayesian learning approach to neural networks (bnn) was compared to that of Stochastic Gradient Boosted Decision Trees (gbdt). A pedagogical rule extraction approach, trepan, which extracts comprehensible rules from the neural networks, improved the ability to understand secondary incidents in a simplified manner. With an acceptable accuracy, gbdt is a useful tool that presents the relative importance of the predictor variables. Unexpected traffic congestion incurred by an incident is a dominant causative factor for the occurrence of secondary incidents at different stages of incident clearance. This symbolic description represents a series of decisions that may assist emergency operators by improving their decision-making capabilities. Analyzing causes and effects of traffic incidents helps traffic operators develop incident-specific strategic plans for prompt emergency response and clearance. Application of the model in connected vehicle environments will help drivers receive proactive corrective feedback before a crash. The proposed methodology can be used to alert drivers about potential highway conditions and may increase the drivers' awareness of potential events when no rerouting is possible, optimal or otherwise.

Kausar S, Falcao AO. An automated framework for QSAR model building. J Cheminform 2018;10:1. [PMID: 29340790 PMCID: PMC5770354 DOI: 10.1186/s13321-017-0256-5] [Citation(s) in RCA: 39] [Impact Index Per Article: 6.5] [Received: 05/31/2017] [Accepted: 12/27/2017] [Indexed: 01/13/2023]

Abstract

Background

Tools based on in-silico quantitative structure–activity relationship (QSAR) models are widely used to screen huge databases of compounds in order to determine the biological properties of chemical molecules based on their chemical structure. The exponentially growing amount of data on synthesized and known chemicals demands computationally efficient automated QSAR modeling tools, available to researchers who may lack extensive knowledge of machine learning modeling. Thus, a fully automated and advanced modeling platform can be an important addition to the QSAR community.

Results

In the presented workflow, the process from data preparation to model building and validation has been completely automated. The focus was on the most critical modeling tasks (data curation, data set characteristics evaluation, variable selection and validation) that largely influence the performance of QSAR models. Also included is the ability to quickly evaluate the feasibility of modeling a given data set. The developed framework was tested on data sets of thirty different problems. The best-optimized feature selection methodology in the developed workflow is able to remove 62–99% of all redundant data. On average, feature selection reduced the prediction error by about 19% and increased the percentage of variance explained (PVE) by 49% compared to models without feature selection. When only the models with a modelability score above 0.6 were selected, the average PVE score was 0.71. A strong correlation was verified between the modelability scores and the PVE of the models produced with variable selection.

Conclusions

We developed an extendable and highly customizable fully automated QSAR modeling framework. The workflow neither requires any advanced parameterization nor depends on users' decisions or expertise in machine learning or programming. With just a given target or problem, the workflow follows an unbiased standard protocol to develop reliable QSAR models by directly accessing online manually curated databases or by using private data sets. Other distinctive features of the workflow include prior estimation of data modelability to avoid time-consuming modeling trials for non-modelable data sets, an efficient variable selection procedure, and the availability of output at each modeling task for diverse applications and the reproduction of historical predictions. The results reached on a selection of thirty QSAR problems suggest that the approach is capable of building reliable models even for challenging problems.

Electronic supplementary material

The online version of this article (10.1186/s13321-017-0256-5) contains supplementary material, which is available to authorized users.

Wright MN, Ziegler A, König IR. Do little interactions get lost in dark random forests? BMC Bioinformatics 2016;17:145. [PMID: 27029549 PMCID: PMC4815164 DOI: 10.1186/s12859-016-0995-8] [Citation(s) in RCA: 51] [Impact Index Per Article: 6.4] [Received: 01/12/2016] [Accepted: 03/21/2016] [Indexed: 12/16/2022] Open

Szymczak S, Holzinger E, Dasgupta A, Malley JD, Molloy AM, Mills JL, Brody LC, Stambolian D, Bailey-Wilson JE. r2VIM: A new variable selection method for random forests in genome-wide association studies. BioData Min 2016;9:7. [PMID: 26839594 PMCID: PMC4736152 DOI: 10.1186/s13040-016-0087-3] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.0] [Received: 07/14/2015] [Accepted: 01/19/2016] [Indexed: 11/10/2022] Open

Yun YH, Deng BC, Cao DS, Wang WT, Liang YZ. Variable importance analysis based on rank aggregation with applications in metabolomics for biomarker discovery. Anal Chim Acta 2016;911:27-34. [PMID: 26893083 DOI: 10.1016/j.aca.2015.12.043] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 09/15/2015] [Revised: 12/28/2015] [Accepted: 12/30/2015] [Indexed: 11/17/2022]

Saha D, Alluri P, Gan A. Prioritizing Highway Safety Manual's crash prediction variables using boosted regression trees. Accid Anal Prev 2015;79:133-144. [PMID: 25823903 DOI: 10.1016/j.aap.2015.03.011] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 09/15/2014] [Revised: 02/14/2015] [Accepted: 03/10/2015] [Indexed: 06/04/2023]