Reference Citation Analysis

For: Degenhardt F, Seifert S, Szymczak S. Evaluation of variable selection methods for random forests and omics data sets. Brief Bioinform 2019;20:492-503. [PMID: 29045534 PMCID: PMC6433899 DOI: 10.1093/bib/bbx124] [Citation(s) in RCA: 233] [Impact Index Per Article: 46.6] [Received: 04/28/2017] [Revised: 09/06/2017] [Indexed: 12/28/2022] Open

Number

Cited by Other Article(s)

201

Seifert S. Application of random forest based approaches to surface-enhanced Raman scattering data. Sci Rep 2020;10:5436. [PMID: 32214194 PMCID: PMC7096517 DOI: 10.1038/s41598-020-62338-8] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 11/12/2019] [Accepted: 02/26/2020] [Indexed: 01/08/2023] Open

202

Machine learning analysis of motor evoked potential time series to predict disability progression in multiple sclerosis. BMC Neurol 2020;20:105. [PMID: 32199461 PMCID: PMC7085864 DOI: 10.1186/s12883-020-01672-w] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Received: 03/15/2019] [Accepted: 03/02/2020] [Indexed: 11/25/2022] Open

Abstract

Background

Evoked potentials (EPs) are a measure of the conductivity of the central nervous system. They are used to monitor disease progression in multiple sclerosis patients. Previous studies extracted only a few variables from the EPs, which are often further condensed into a single variable: the EP score. We perform a machine learning analysis of motor EPs that uses the whole time series, instead of a few variables, to predict disability progression after two years. Obtaining realistic performance estimates for this task has been difficult because of small data set sizes. We recently extracted a data set of EPs from the Rehabilitation & MS Center in Overpelt, Belgium. Our data set is large enough to obtain, for the first time, a performance estimate on an independent test set containing different patients.

Methods

We extracted a large number of time series features from the motor EPs with the highly comparative time series analysis software package. Mutual information with the target and the Boruta method are used to find features which contain information not included in the features studied in the literature. We use random forests (RF) and logistic regression (LR) classifiers to predict disability progression after two years. Statistical significance of the performance increase when adding extra features is checked.
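The mutual-information screening mentioned above can be illustrated with a small sketch. It assumes features that have already been discretised into bins; the abstract does not say which estimator the authors used, and the function names, feature names, and toy data below are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_xy = c / n
        # p(x, y) / (p(x) * p(y)), written with raw counts
        mi += p_xy * math.log2(p_xy * n * n / (count_x[x] * count_y[y]))
    return mi


def rank_features(features, target):
    """Sort (name, score) pairs by decreasing mutual information with the target."""
    scores = {name: mutual_information(values, target)
              for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: two binned features and a binary progression label.
target = [0, 0, 1, 1, 0, 1, 0, 1]
features = {
    "informative_bin": [0, 0, 1, 1, 0, 1, 0, 1],  # identical to the label
    "noise_bin": [0, 1, 0, 1, 1, 0, 0, 1],        # independent of the label
}
ranking = rank_features(features, target)
```

In this toy case the feature that copies the label scores 1 bit and the independent feature scores 0, so a ranking like this could serve as a first filter before a wrapper method such as Boruta.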

Results

Including extra time series features in motor EPs leads to a statistically significant improvement compared to using only the known features, although the effect is limited in magnitude (ΔAUC = 0.02 for RF and ΔAUC = 0.05 for LR). RF with extra time series features obtains the best performance (AUC = 0.75±0.07 (mean and standard deviation)), which is good considering the limited number of biomarkers in the model. RF (a nonlinear classifier) outperforms LR (a linear classifier).

Conclusions

Using machine learning methods on EPs shows promising predictive performance. Using additional EP time series features beyond those already in use leads to a modest increase in performance. Larger datasets, preferably multi-center, are needed for further research. Given a large enough dataset, these models may be used to support clinicians in their decision-making process regarding future treatment.

203

Lv Z, Zhang J, Ding H, Zou Q. RF-PseU: A Random Forest Predictor for RNA Pseudouridine Sites. Front Bioeng Biotechnol 2020;8:134. [PMID: 32175316 PMCID: PMC7054385 DOI: 10.3389/fbioe.2020.00134] [Citation(s) in RCA: 62] [Impact Index Per Article: 15.5] [Received: 01/17/2020] [Accepted: 02/10/2020] [Indexed: 12/21/2022] Open

204

Chen P, Yang Y, Zhang Y, Jiang S, Li X, Wan J. Identification of prognostic immune-related genes in the tumor microenvironment of endometrial cancer. Aging (Albany NY) 2020;12:3371-3387. [PMID: 32074080 PMCID: PMC7066904 DOI: 10.18632/aging.102817] [Citation(s) in RCA: 39] [Impact Index Per Article: 9.8] [Received: 10/25/2019] [Accepted: 01/27/2020] [Indexed: 12/24/2022]

205

Schachtschneider KM, Welge ME, Auvil LS, Chaki S, Rund LA, Madsen O, Elmore MR, Johnson RW, Groenen MA, Schook LB. Altered Hippocampal Epigenetic Regulation Underlying Reduced Cognitive Development in Response to Early Life Environmental Insults. Genes (Basel) 2020;11:genes11020162. [PMID: 32033187 PMCID: PMC7074491 DOI: 10.3390/genes11020162] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 12/13/2019] [Revised: 01/30/2020] [Accepted: 02/01/2020] [Indexed: 12/13/2022] Open

Affiliation(s)

Kyle M. Schachtschneider Department of Radiology, University of Illinois at Chicago, Chicago, IL 60607, USA; Department of Biochemistry and Molecular Genetics, University of Illinois at Chicago, Chicago, IL 60607, USA; National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL 61820, USA
Michael E. Welge National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL 61820, USA
Loretta S. Auvil National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL 61820, USA
Sulalita Chaki Department of Animal Sciences, University of Illinois at Urbana-Champaign, Urbana, IL 616280, USA
Laurie A. Rund Department of Animal Sciences, University of Illinois at Urbana-Champaign, Urbana, IL 616280, USA
Ole Madsen Animal Breeding and Genomics, Wageningen University, 6708 Wageningen, The Netherlands
Monica R.P. Elmore Department of Animal Sciences, University of Illinois at Urbana-Champaign, Urbana, IL 616280, USA
Rodney W. Johnson Department of Animal Sciences, University of Illinois at Urbana-Champaign, Urbana, IL 616280, USA
Martien A.M. Groenen Animal Breeding and Genomics, Wageningen University, 6708 Wageningen, The Netherlands
Lawrence B. Schook Department of Radiology, University of Illinois at Chicago, Chicago, IL 60607, USA; National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL 61820, USA; Department of Animal Sciences, University of Illinois at Urbana-Champaign, Urbana, IL 616280, USA

206

Radiomics approaches in gastric cancer: a frontier in clinical decision making. Chin Med J (Engl) 2020;132:1983-1989. [PMID: 31348029 PMCID: PMC6708697 DOI: 10.1097/cm9.0000000000000360] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Indexed: 12/24/2022] Open

207

Classification and prediction of diabetes disease using machine learning paradigm. Health Inf Sci Syst 2020;8:7. [PMID: 31949894 DOI: 10.1007/s13755-019-0095-z] [Citation(s) in RCA: 45] [Impact Index Per Article: 11.3] [Received: 08/21/2019] [Accepted: 12/21/2019] [Indexed: 12/19/2022] Open

208

Xue M, Su Y, Li C, Wang S, Yao H. Identification of Potential Type II Diabetes in a Large-Scale Chinese Population Using a Systematic Machine Learning Framework. J Diabetes Res 2020;2020:6873891. [PMID: 33029536 PMCID: PMC7532405 DOI: 10.1155/2020/6873891] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 03/12/2020] [Revised: 08/01/2020] [Accepted: 09/02/2020] [Indexed: 12/19/2022] Open

Abstract

BACKGROUND

An estimated 425 million people globally have diabetes, accounting for 12% of the world's health expenditures, and the number continues to grow, placing a huge burden on the healthcare system, especially in remote, underserved areas.

METHODS

A total of 584,168 adult subjects who participated in the national physical examination were enrolled in this study. The risk factors for type II diabetes mellitus (T2DM) were identified by p values and odds ratios, using logistic regression (LR) based on variables from physical measurements and a questionnaire. Using the risk factors selected by LR, we applied a decision tree, a random forest, AdaBoost with a decision tree (AdaBoost), and an extreme gradient boosting decision tree (XGBoost) to identify individuals with T2DM, compared the performance of the four machine learning classifiers, and used the best-performing classifier to output variable importance scores for T2DM.

RESULTS

The results indicated that XGBoost had the best performance (accuracy = 0.906, precision = 0.910, recall = 0.902, F1 = 0.906, and AUC = 0.968). The variable importance scores from XGBoost showed that BMI was the most important feature, followed by age, waist circumference, systolic pressure, ethnicity, smoking amount, fatty liver, hypertension, physical activity, drinking status, dietary ratio (meat to vegetables), drink amount, smoking status, and diet habit (oil loving).
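The reported F1 value can be cross-checked against the reported precision and recall, because F1 is their harmonic mean. The sketch below is only a consistency check on the published numbers; the function and variable names are illustrative and do not come from the paper.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values as reported in the abstract for XGBoost.
reported_precision = 0.910
reported_recall = 0.902

f1 = f1_score(reported_precision, reported_recall)
print(round(f1, 3))  # prints 0.906, matching the reported F1
```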

CONCLUSIONS

We proposed a classifier based on LR-XGBoost that uses fourteen easily obtained, noninvasive patient variables as predictors to identify potential cases of T2DM. The classifier can accurately screen for the risk of diabetes in the early phase, and the variable importance scores give clues for preventing the occurrence of diabetes.

209

García-Timermans C, Rubbens P, Heyse J, Kerckhof FM, Props R, Skirtach AG, Waegeman W, Boon N. Discriminating Bacterial Phenotypes at the Population and Single-Cell Level: A Comparison of Flow Cytometry and Raman Spectroscopy Fingerprinting. Cytometry A 2019;97:713-726. [PMID: 31889414 DOI: 10.1002/cyto.a.23952] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 09/20/2019] [Revised: 11/20/2019] [Accepted: 11/26/2019] [Indexed: 12/26/2022]

210

Krzykalla J, Benner A, Kopp‐Schneider A. Exploratory identification of predictive biomarkers in randomized trials with normal endpoints. Stat Med 2019;39:923-939. [DOI: 10.1002/sim.8452] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 08/20/2019] [Revised: 12/02/2019] [Accepted: 12/02/2019] [Indexed: 11/10/2022]

211

Speiser JL, Miller ME, Tooze J, Ip E. A Comparison of Random Forest Variable Selection Methods for Classification Prediction Modeling. Expert Systems with Applications 2019;134:93-101. [PMID: 32968335 PMCID: PMC7508310 DOI: 10.1016/j.eswa.2019.05.028] [Citation(s) in RCA: 207] [Impact Index Per Article: 41.4] [Indexed: 05/17/2023]

212

González-Riano C, Dudzik D, Garcia A, Gil-de-la-Fuente A, Gradillas A, Godzien J, López-Gonzálvez Á, Rey-Stolle F, Rojo D, Ruperez FJ, Saiz J, Barbas C. Recent Developments along the Analytical Process for Metabolomics Workflows. Anal Chem 2019;92:203-226. [PMID: 31625723 DOI: 10.1021/acs.analchem.9b04553] [Citation(s) in RCA: 62] [Impact Index Per Article: 12.4] [Indexed: 12/14/2022]

Affiliation(s)

Carolina González-Riano Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
Danuta Dudzik Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain; Department of Biopharmaceutics and Pharmacodynamics, Faculty of Pharmacy, Medical University of Gdańsk, 80-210 Gdańsk, Poland
Antonia Garcia Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
Alberto Gil-de-la-Fuente Department of Information Technology, Escuela Politécnica Superior, Universidad San Pablo-CEU, 28003 Madrid, Spain
Ana Gradillas Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
Joanna Godzien Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain; Clinical Research Centre, Medical University of Bialystok, 15-089 Bialystok, Poland
Ángeles López-Gonzálvez Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
Fernanda Rey-Stolle Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
David Rojo Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
Francisco J Ruperez Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
Jorge Saiz Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
Coral Barbas Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain

213

Li J, Veeranampalayam-Sivakumar AN, Bhatta M, Garst ND, Stoll H, Stephen Baenziger P, Belamkar V, Howard R, Ge Y, Shi Y. Principal variable selection to explain grain yield variation in winter wheat from features extracted from UAV imagery. Plant Methods 2019;15:123. [PMID: 31695728 PMCID: PMC6824016 DOI: 10.1186/s13007-019-0508-7] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 05/24/2019] [Accepted: 10/19/2019] [Indexed: 05/23/2023]

Abstract

BACKGROUND

Automated phenotyping technologies are continually advancing the breeding process. However, collecting various secondary traits throughout the growing season and processing massive amounts of data still take great effort and time. Selecting a minimum number of secondary traits that have the maximum predictive power has the potential to reduce phenotyping efforts. The objective of this study was to select the principal features extracted from UAV imagery and the critical growth stages that contributed the most to explaining winter wheat grain yield. Five dates of multispectral images and seven dates of RGB images were collected by a UAV system during the spring growing season in 2018. Two classes of features (variables), totaling 172 variables, were extracted for each plot from the vegetation index and plant height maps, including pixel statistics and dynamic growth rates. A parametric algorithm, LASSO regression (the least absolute shrinkage and selection operator), and a non-parametric algorithm, random forest, were applied for variable selection. The regression coefficients estimated by LASSO and the permutation importance scores provided by random forest were used to determine the ten most important variables influencing grain yield from each algorithm.

RESULTS

Both selection algorithms assigned the highest importance score to the variables related to plant height around the grain filling stage. Some vegetation index related variables were also selected by the algorithms, mainly at early to mid growth stages and during senescence. Compared with yield prediction using all 172 variables derived from measured phenotypes, prediction using the selected variables performed comparably or even better. We also noticed that the prediction accuracy on the adapted NE lines (r = 0.58-0.81) was higher than on the other lines (r = 0.21-0.59) included in this study, which had different genetic backgrounds.

CONCLUSIONS

With the ultra-high resolution plot imagery obtained by UAS-based phenotyping, we are now able to derive more features, such as the variation of plant height or vegetation indices within a plot rather than just an averaged number, that are potentially very useful for breeding purposes. However, too many features or variables can be derived in this way. The promising results from this study suggest that a selected subset of those variables can achieve grain yield prediction accuracies comparable to those of the full set, while possibly allowing a better allocation of effort and resources for phenotypic data collection and processing.

214

Nembrini S, König IR, Wright MN. The revival of the Gini importance? Bioinformatics 2019;34:3711-3718. [PMID: 29757357 PMCID: PMC6198850 DOI: 10.1093/bioinformatics/bty373] [Citation(s) in RCA: 204] [Impact Index Per Article: 40.8] [Received: 01/02/2018] [Accepted: 05/08/2018] [Indexed: 11/14/2022] Open

215

Stanstrup J, Broeckling CD, Helmus R, Hoffmann N, Mathé E, Naake T, Nicolotti L, Peters K, Rainer J, Salek RM, Schulze T, Schymanski EL, Stravs MA, Thévenot EA, Treutler H, Weber RJM, Willighagen E, Witting M, Neumann S. The metaRbolomics Toolbox in Bioconductor and beyond. Metabolites 2019;9:E200. [PMID: 31548506 PMCID: PMC6835268 DOI: 10.3390/metabo9100200] [Citation(s) in RCA: 51] [Impact Index Per Article: 10.2] [Received: 08/11/2019] [Revised: 09/16/2019] [Accepted: 09/17/2019] [Indexed: 11/17/2022] Open

Affiliation(s)

Jan Stanstrup Preventive and Clinical Nutrition, University of Copenhagen, Rolighedsvej 30, 1958 Frederiksberg C, Denmark.
Corey D Broeckling Proteomics and Metabolomics Facility, Colorado State University, Fort Collins, CO 80523, USA.
Rick Helmus Institute for Biodiversity and Ecosystem Dynamics, University of Amsterdam, 1098 XH Amsterdam, The Netherlands.
Nils Hoffmann Leibniz-Institut für Analytische Wissenschaften-ISAS-e.V., Otto-Hahn-Straße 6b, 44227 Dortmund, Germany.
Ewy Mathé Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA.
Thomas Naake Max Planck Institute of Molecular Plant Physiology, 14476 Potsdam-Golm, Germany.
Luca Nicolotti The Australian Wine Research Institute, Metabolomics Australia, PO Box 197, Adelaide SA 5064, Australia.
Kristian Peters Leibniz Institute of Plant Biochemistry (IPB Halle), Bioinformatics and Scientific Data, 06120 Halle, Germany.
Johannes Rainer Institute for Biomedicine, Eurac Research, Affiliated Institute of the University of Lübeck, 39100 Bolzano, Italy.
Reza M Salek The International Agency for Research on Cancer, 150 cours Albert Thomas, CEDEX 08, 69372 Lyon, France.
Tobias Schulze Department of Effect-Directed Analysis, Helmholtz Centre for Environmental Research-UFZ, Permoserstraße 15, 04318 Leipzig, Germany.
Emma L Schymanski Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 6 avenue du Swing, L-4367 Belvaux, Luxembourg.
Michael A Stravs Eawag, Swiss Federal Institute of Aquatic Science and Technology, Überlandstrasse 133, 8600 Dubendorf, Switzerland.
Etienne A Thévenot CEA, LIST, Laboratory for Data Sciences and Decision, MetaboHUB, Gif-Sur-Yvette F-91191, France.
Hendrik Treutler Leibniz Institute of Plant Biochemistry (IPB Halle), Bioinformatics and Scientific Data, 06120 Halle, Germany.
Ralf J M Weber Phenome Centre Birmingham and School of Biosciences, University of Birmingham, Edgbaston, Birmingham B15 2TT, UK.
Egon Willighagen Department of Bioinformatics-BiGCaT, NUTRIM, Maastricht University, 6229 ER Maastricht, The Netherlands.
Michael Witting Research Unit Analytical BioGeoChemistry, Helmholtz Zentrum München, 85764 Neuherberg, Germany. Chair of Analytical Food Chemistry, Technische Universität München, 85354 Weihenstephan, Germany.
Steffen Neumann Leibniz Institute of Plant Biochemistry (IPB Halle), Bioinformatics and Scientific Data, 06120 Halle, Germany. German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Deutscher Platz 5e, 04103 Leipzig, Germany.

216

Randomized Lasso Links Microbial Taxa with Aquatic Functional Groups Inferred from Flow Cytometry. mSystems 2019;4:4/5/e00093-19. [PMID: 31506260 PMCID: PMC6739098 DOI: 10.1128/msystems.00093-19] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Indexed: 11/20/2022] Open

Abstract

High-nucleic-acid (HNA) and low-nucleic-acid (LNA) bacteria are two operational groups identified by flow cytometry (FCM) in aquatic systems. A number of reports have shown that HNA cell density correlates strongly with heterotrophic production, while LNA cell density does not. However, which taxa are specifically associated with these groups, and by extension, productivity has remained elusive. Here, we addressed this knowledge gap by using a machine learning-based variable selection approach that integrated FCM and 16S rRNA gene sequencing data collected from 14 freshwater lakes spanning a broad range in physicochemical conditions. There was a strong association between bacterial heterotrophic production and HNA absolute cell abundances (R² = 0.65), but not with the more abundant LNA cells. This solidifies findings, mainly from marine systems, that HNA and LNA bacteria could be considered separate functional groups, the former contributing a disproportionately large share of carbon cycling. Taxa selected by the models could predict HNA and LNA absolute cell abundances at all taxonomic levels. Selected operational taxonomic units (OTUs) ranged from low to high relative abundance and were mostly lake system specific (89.5% to 99.2%). A subset of selected OTUs was associated with both LNA and HNA groups (12.5% to 33.3%), suggesting either phenotypic plasticity or within-OTU genetic and physiological heterogeneity. These findings may lead to the identification of system-specific putative ecological indicators for heterotrophic productivity. Generally, our approach allows for the association of OTUs with specific functional groups in diverse ecosystems in order to improve our understanding of (microbial) biodiversity-ecosystem functioning relationships.

IMPORTANCE A major goal in microbial ecology is to understand how microbial community structure influences ecosystem functioning. Various methods to directly associate bacterial taxa to functional groups in the environment are being developed. In this study, we applied machine learning methods to relate taxonomic data obtained from marker gene surveys to functional groups identified by flow cytometry. This allowed us to identify the taxa that are associated with heterotrophic productivity in freshwater lakes and indicated that the key contributors were highly system specific, regularly rare members of the community, and that some could possibly switch between being low and high contributors. Our approach provides a promising framework to identify taxa that contribute to ecosystem functioning and can be further developed to explore microbial contributions beyond heterotrophic production.

217

Utilizing Precision Medicine to Estimate Timing for Surgical Closure of Traumatic Extremity Wounds. Ann Surg 2019;270:535-543. [DOI: 10.1097/sla.0000000000003470] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 12/19/2022]

218

Küntzel A, Weber M, Gierschner P, Trefz P, Miekisch W, Schubert JK, Reinhold P, Köhler H. Core profile of volatile organic compounds related to growth of Mycobacterium avium subspecies paratuberculosis - A comparative extract of three independent studies. PLoS One 2019;14:e0221031. [PMID: 31415617 PMCID: PMC6695172 DOI: 10.1371/journal.pone.0221031] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 05/19/2019] [Accepted: 07/29/2019] [Indexed: 11/22/2022] Open

219

Lodise TP, Bonine NG, Ye JM, Folse HJ, Gillard P. Development of a bedside tool to predict the probability of drug-resistant pathogens among hospitalized adult patients with gram-negative infections. BMC Infect Dis 2019;19:718. [PMID: 31412809 PMCID: PMC6694572 DOI: 10.1186/s12879-019-4363-y] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 10/19/2018] [Accepted: 08/06/2019] [Indexed: 01/27/2023] Open

Abstract

Background

We developed a clinical bedside tool to simultaneously estimate the probabilities of third-generation cephalosporin-resistant Enterobacteriaceae (3GC-R), carbapenem-resistant Enterobacteriaceae (CRE), and multidrug-resistant Pseudomonas aeruginosa (MDRP) among hospitalized adult patients with Gram-negative infections.

Methods

Data were obtained from a retrospective observational study of the Premier Hospital that included hospitalized adult patients with a complicated urinary tract infection (cUTI), complicated intra-abdominal infection (cIAI), hospital-acquired/ventilator-associated pneumonia (HAP/VAP), or bloodstream infection (BSI) due to Gram-negative bacteria between 2011 and 2015. Risk factors for 3GC-R, CRE, and MDRP were ascertained by multivariate logistic regression, and separate models were developed for patients with community-acquired versus hospital-acquired infections for each resistance phenotype (N = 6). Models were converted to a single user-friendly interface to estimate the probabilities of a patient having an infection due to 3GC-R, CRE, or MDRP when ≥ 1 risk factor was present.

Results

Overall, 124,068 patients contributed to the dataset. Percentages of patients admitted for cUTI, cIAI, HAP/VAP, and BSI were 61.6, 4.6, 16.5, and 26.4%, respectively (some patients contributed > 1 infection type). Resistant infection rates were 1.90% for CRE, 12.09% for 3GC-R, and 3.91% for MDRP. A greater percentage of the resistant infections were community-acquired relative to hospital-acquired (CRE, 1.30% vs 0.62% of 1.90%; 3GC-R, 9.27% vs 3.42% of 12.09%; MDRP, 2.39% vs 1.59% of 3.91%). The most important predictors of having a 3GC-R, CRE, or MDRP infection were prior number of antibiotics; infection site; infection during the previous 3 months; and hospital prevalence of 3GC-R, CRE, or MDRP. To enable application of the six predictive multivariate logistic regression models to real-world clinical practice, we developed a user-friendly interface that estimates the risk of 3GC-R, CRE, and MDRP simultaneously in a given patient with a Gram-negative infection based on their risk (Additional file 1).

Conclusions

We developed a clinical prediction tool to estimate the probabilities of 3GC-R, CRE, and MDRP among hospitalized adult patients with confirmed community- and hospital-acquired Gram-negative infections. Our predictive model has been implemented as a user-friendly bedside tool for use by clinicians/healthcare professionals to predict the probability of resistant infections in individual patients, to guide early appropriate therapy.

Electronic supplementary material

The online version of this article (10.1186/s12879-019-4363-y) contains supplementary material, which is available to authorized users.

220

Zhang C, Chen Y, Xu B, Xue Y, Ren Y. How to predict biodiversity in space? An evaluation of modelling approaches in marine ecosystems. Divers Distrib 2019. [DOI: 10.1111/ddi.12970] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 11/27/2022] Open

221

Banerji A, Bagley MJ, Shoemaker JA, Tettenhorst DR, Nietch CT, Allen HJ, Santo Domingo JW. Evaluating putative ecological drivers of microcystin spatiotemporal dynamics using metabarcoding and environmental data. Harmful Algae 2019;86:84-95. [PMID: 31358280 PMCID: PMC7877229 DOI: 10.1016/j.hal.2019.05.004] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 02/05/2019] [Revised: 04/19/2019] [Accepted: 05/07/2019] [Indexed: 05/03/2023]

222

Niu SY, Liu B, Ma Q, Chou WC. rSeqTU-A Machine-Learning Based R Package for Prediction of Bacterial Transcription Units. Front Genet 2019;10:374. [PMID: 31156694 PMCID: PMC6529933 DOI: 10.3389/fgene.2019.00374] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2019] [Accepted: 04/09/2019] [Indexed: 11/13/2022] Open

223

Lee MY, Kim TK, Walters KA, Wang K. A biological function based biomarker panel optimization process. Sci Rep 2019;9:7365. [PMID: 31089177 PMCID: PMC6517383 DOI: 10.1038/s41598-019-43779-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/11/2018] [Accepted: 04/26/2019] [Indexed: 11/09/2022] Open

224

Luo L, Hudson LG, Lewis J, Lee JH. Two-step approach for assessing the health effects of environmental chemical mixtures: application to simulated datasets and real data from the Navajo Birth Cohort Study. Environ Health 2019;18:46. [PMID: 31072361 PMCID: PMC6507239 DOI: 10.1186/s12940-019-0482-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2018] [Accepted: 04/16/2019] [Indexed: 05/07/2023]

Abstract

BACKGROUND

There is increasing interest in examining the consequences of simultaneous exposures to chemical mixtures. However, there is no well-established consensus or recommendation on how to select the statistical approach for analyzing the health effects of mixture exposures that best aligns with study goals. We recognize the limitations that existing methods have in effectively reducing data dimension and detecting interaction effects when analyzing chemical mixture exposures collected in high-dimensional datasets with varying degrees of variable intercorrelation. In this research, we aim to examine the performance of a two-step statistical approach in addressing the analytical challenges of chemical mixture exposures using two simulated data sets and an existing data set from the Navajo Birth Cohort Study as a representative case study.

METHODS

We propose to use a two-step approach: a robust variable selection step using the random forest approach, followed by adaptive lasso methods that incorporate both dimensionality reduction and quantification of the degree of association between the chemical exposures and the outcome of interest, including interaction terms. We compared the proposed method with other approaches, including (1) single-step adaptive lasso and (2) a two-step approach using classification and regression trees (CART) followed by adaptive lasso.

RESULTS

Utilizing simulated data sets and applying the method to a real-life dataset from the Navajo Birth Cohort Study, we have demonstrated good performance of the proposed two-step approach. Results from the simulation datasets indicated the effectiveness of variable dimension reduction and reliable identification of a parsimonious model compared to other methods: single-step adaptive lasso or two-step CART followed by adaptive lasso method.

CONCLUSIONS

Our proposed two-step approach provides a robust way of analyzing the effects of high-throughput chemical mixture exposures on health outcomes by combining the strengths of variable selection and adaptive shrinkage strategies.

225

Ogink PT, Karhade AV, Thio QCBS, Gormley WB, Oner FC, Verlaan JJ, Schwab JH. Predicting discharge placement after elective surgery for lumbar spinal stenosis using machine learning methods. Eur Spine J 2019;28:1433-1440. [PMID: 30941521 DOI: 10.1007/s00586-019-05928-z] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/26/2018] [Revised: 01/11/2019] [Accepted: 02/21/2019] [Indexed: 11/29/2022]

226

Leveraging Machine Learning to Extend Ontology-Driven Geographic Object-Based Image Analysis (O-GEOBIA): A Case Study in Forest-Type Mapping. Remote Sens 2019. [DOI: 10.3390/rs11050503] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]

227

Sun S, Miao Z, Ratcliffe B, Campbell P, Pasch B, El-Kassaby YA, Balasundaram B, Chen C. SNP variable selection by generalized graph domination. PLoS One 2019;14:e0203242. [PMID: 30677030 PMCID: PMC6345469 DOI: 10.1371/journal.pone.0203242] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2018] [Accepted: 01/08/2019] [Indexed: 11/19/2022] Open

Abstract

BACKGROUND

High-throughput sequencing technology has revolutionized both medical and biological research by generating exceedingly large numbers of genetic variants. The resulting datasets share a number of common characteristics that might lead to poor generalization capacity. Concerns include noise accumulated due to the large number of predictors, sparse information regarding the p≫n problem, and overfitting and model mis-identification resulting from spurious collinearity. Additionally, complex correlation patterns are present among variables. As a consequence, reliable variable selection techniques play a pivotal role in predictive analysis, generalization capability, and robustness in clustering, as well as interpretability of the derived models.

METHODS AND FINDINGS

K-dominating set, a parameterized graph-theoretic generalization model, was used to model SNP (single nucleotide polymorphism) data as a similarity network and to search for representative SNP variables. In particular, each SNP was represented as a vertex in the graph; (dis)similarity measures such as correlation coefficients or pairwise linkage disequilibrium were estimated to describe the relationship between each pair of SNPs; and a pair of vertices are adjacent, i.e. joined by an edge, if the pairwise similarity measure exceeds a user-specified threshold. A minimum k-dominating set in the SNP graph was then defined as the smallest subset such that every SNP excluded from the subset has at least k neighbors among the selected ones. The strength of k-dominating set selection in identifying independent variables, and in culling representative variables that are highly correlated with others, was demonstrated using a simulated dataset. The advantages of k-dominating set variable selection were also illustrated in two applications: pedigree reconstruction using SNP profiles of 1,372 Douglas-fir trees, and species delineation for 226 grasshopper mouse samples. C++ source code that implements SNP-SELECT and uses the Gurobi optimization solver for the k-dominating set variable selection is available (https://github.com/transgenomicsosu/SNP-SELECT).
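
The selection rule described in this abstract can be illustrated with a small sketch. The code below is not the authors' SNP-SELECT implementation, which is written in C++ and uses the Gurobi solver to search for a minimum set. It is a simple greedy heuristic in Python that returns a valid, but not necessarily minimum, k-dominating set. The function names, SNP labels, similarity values, and threshold are all hypothetical and chosen only for illustration.

```python
def build_graph(names, similarity, threshold):
    """Join two SNPs by an edge when their similarity exceeds the threshold."""
    adjacency = {name: set() for name in names}
    for (u, v), value in similarity.items():
        if value > threshold:
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency


def greedy_k_dominating_set(adjacency, k):
    """Greedy heuristic (an assumption, not the exact solver from the paper).

    Returns a set such that every SNP outside it has at least k selected
    neighbours. The result is valid but may be larger than the minimum.
    """
    selected = set()
    # need[v]: how many more selected neighbours v requires while unselected.
    need = {v: k for v in adjacency}

    def gain(v):
        # Selecting v clears its own remaining need and lowers the need of
        # each unselected neighbour that is not yet dominated by one.
        needy_neighbours = sum(
            1 for u in adjacency[v] if u not in selected and need[u] > 0
        )
        return need[v] + needy_neighbours

    while True:
        candidates = sorted(v for v in adjacency if v not in selected)
        if not any(need[v] > 0 for v in candidates):
            return selected
        best = max(candidates, key=gain)  # ties broken by sorted name order
        selected.add(best)
        need[best] = 0
        for u in adjacency[best]:
            if u not in selected and need[u] > 0:
                need[u] -= 1


def is_k_dominating(adjacency, selected, k):
    """Check that every excluded SNP has at least k selected neighbours."""
    return all(
        len(adjacency[v] & selected) >= k
        for v in adjacency
        if v not in selected
    )


# Hypothetical toy data: pairwise similarities between five SNPs.
names = ["snp1", "snp2", "snp3", "snp4", "snp5"]
similarity = {
    ("snp1", "snp2"): 0.90,
    ("snp1", "snp3"): 0.85,
    ("snp1", "snp4"): 0.95,
    ("snp2", "snp3"): 0.30,
    ("snp4", "snp5"): 0.10,
}
adjacency = build_graph(names, similarity, threshold=0.8)
selected = greedy_k_dominating_set(adjacency, k=1)
print(sorted(selected))
```

In this toy graph, snp1 is adjacent to three other SNPs and can represent them, while snp5 has no neighbours above the threshold and therefore has to be kept on its own. Larger values of k require more selected neighbours per excluded SNP and so tend to produce larger subsets.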

228

Long NP, Park S, Anh NH, Nghi TD, Yoon SJ, Park JH, Lim J, Kwon SW. High-Throughput Omics and Statistical Learning Integration for the Discovery and Validation of Novel Diagnostic Signatures in Colorectal Cancer. Int J Mol Sci 2019;20:E296. [PMID: 30642095 PMCID: PMC6358915 DOI: 10.3390/ijms20020296] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2018] [Revised: 12/31/2018] [Accepted: 01/04/2019] [Indexed: 02/07/2023] Open

229

Huynh-Thu VA, Geurts P. Unsupervised Gene Network Inference with Decision Trees and Random Forests. Methods Mol Biol 2019;1883:195-215. [PMID: 30547401 DOI: 10.1007/978-1-4939-8882-2_8] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/11/2022]

230

Li XX, Yin J, Tang J, Li Y, Yang Q, Xiao Z, Zhang R, Wang Y, Hong J, Tao L, Xue W, Zhu F. Determining the Balance Between Drug Efficacy and Safety by the Network and Biological System Profile of Its Therapeutic Target. Front Pharmacol 2018;9:1245. [PMID: 30429792 PMCID: PMC6220079 DOI: 10.3389/fphar.2018.01245] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/28/2018] [Accepted: 10/12/2018] [Indexed: 12/14/2022] Open

231

Ahamed NU, Kobsar D, Benson L, Clermont C, Kohrs R, Osis ST, Ferber R. Using wearable sensors to classify subject-specific running biomechanical gait patterns based on changes in environmental weather conditions. PLoS One 2018;13:e0203839. [PMID: 30226903 PMCID: PMC6143236 DOI: 10.1371/journal.pone.0203839] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/08/2018] [Accepted: 08/28/2018] [Indexed: 01/07/2023] Open

Abstract

Running-related overuse injuries can result from a combination of various intrinsic (e.g., gait biomechanics) and extrinsic (e.g., running surface) risk factors. However, it is unknown how changes in environmental weather conditions affect running gait biomechanical patterns, since these data cannot be collected in a laboratory setting. Therefore, the purpose of this study was to develop a classification model based on subject-specific changes in biomechanical running patterns across two different environmental weather conditions using data obtained from wearable sensors in real-world environments. Running gait data were recorded during winter and spring sessions, with recorded average air temperatures of -10 °C and +6 °C, respectively. Classification was performed based on measurements of pelvic drop, ground contact time, braking, vertical oscillation of pelvis, pelvic rotation, and cadence obtained from 66,370 strides (~11,000/runner) from a group of recreational runners. A non-linear ensemble machine learning algorithm, random forest (RF), was used to classify and to compute a heuristic for determining the importance of each variable in the prediction model. To validate the developed subject-specific model, two cross-validation methods (one-against-another and partitioning datasets) were used to obtain experimental mean classification accuracies of 87.18% and 95.42%, respectively, indicating an excellent discriminatory ability of the RF-based model. Additionally, the ranked order of variable importance differed across the individual runners. The results from the RF-based machine-learning algorithm demonstrate that processing gait biomechanical signals from a single wearable sensor can successfully detect changes to an individual's running patterns based on data obtained in real-world environments.

232

Toward proactive social inclusion powered by machine learning. Knowl Inf Syst 2018. [DOI: 10.1007/s10115-018-1230-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/28/2022]

233

Kirpich A, Ainsworth EA, Wedow JM, Newman JRB, Michailidis G, McIntyre LM. Variable selection in omics data: A practical evaluation of small sample sizes. PLoS One 2018;13:e0197910. [PMID: 29927942 PMCID: PMC6013185 DOI: 10.1371/journal.pone.0197910] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2017] [Accepted: 05/10/2018] [Indexed: 01/04/2023] Open

234

Wu Q, Boueiz A, Bozkurt A, Masoomi A, Wang A, DeMeo DL, Weiss ST, Qiu W. Deep Learning Methods for Predicting Disease Status Using Genomic Data. J Biom Biostat 2018;9:417. [PMID: 31131151 PMCID: PMC6530791] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]