201
Seifert S. Application of random forest based approaches to surface-enhanced Raman scattering data. Sci Rep 2020; 10:5436. [PMID: 32214194 PMCID: PMC7096517 DOI: 10.1038/s41598-020-62338-8] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 11/12/2019] [Accepted: 02/26/2020] [Indexed: 01/08/2023] Open
Abstract
Surface-enhanced Raman scattering (SERS) is a valuable analytical technique for the analysis of biological samples. However, due to the nature of SERS it is often challenging to exploit the generated data to obtain the desired information when no reporter or label molecules are used. Here, the suitability of random forest based approaches is evaluated using SERS data generated by a simulation framework that is also presented. More specifically, it is demonstrated that important SERS signals can be identified, the relevance of predefined spectral groups can be evaluated, and the relations of different SERS signals can be analyzed. It is shown that for the selection of important SERS signals Boruta and surrogate minimal depth (SMD) and for the analysis of spectral groups the competing method Learner of Functional Enrichment (LeFE) should be applied. In general, this investigation demonstrates that the combination of random forest approaches and SERS data is very promising for sophisticated analysis of complex biological samples.
Affiliation(s)
- Stephan Seifert
- Kiel University, University Hospital Schleswig-Holstein, Institute of Medical Informatics and Statistics, Kiel, 24105, Germany.
- University of Hamburg, Hamburg School of Food Science, Institute of Food Chemistry, Hamburg, 20146, Germany.
202
Machine learning analysis of motor evoked potential time series to predict disability progression in multiple sclerosis. BMC Neurol 2020; 20:105. [PMID: 32199461 PMCID: PMC7085864 DOI: 10.1186/s12883-020-01672-w] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Received: 03/15/2019] [Accepted: 03/02/2020] [Indexed: 11/25/2022] Open
Abstract
Background: Evoked potentials (EPs) are a measure of the conductivity of the central nervous system. They are used to monitor disease progression of multiple sclerosis patients. Previous studies only extracted a few variables from the EPs, which are often further condensed into a single variable: the EP score. We perform a machine learning analysis of motor EPs that uses the whole time series, instead of a few variables, to predict disability progression after two years. Obtaining realistic performance estimates for this task has been difficult because of small dataset sizes. We recently extracted a dataset of EPs from the Rehabilitation & MS Center in Overpelt, Belgium. Our dataset is large enough to obtain, for the first time, a performance estimate on an independent test set containing different patients. Methods: We extracted a large number of time series features from the motor EPs with the highly comparative time series analysis software package. Mutual information with the target and the Boruta method are used to find features that contain information not included in the features studied in the literature. We use random forest (RF) and logistic regression (LR) classifiers to predict disability progression after two years. The statistical significance of the performance increase when adding extra features is checked. Results: Including extra time series features in motor EPs leads to a statistically significant improvement compared to using only the known features, although the effect is limited in magnitude (ΔAUC = 0.02 for RF and ΔAUC = 0.05 for LR). RF with extra time series features obtains the best performance (AUC = 0.75 ± 0.07, mean ± standard deviation), which is good considering the limited number of biomarkers in the model. RF (a nonlinear classifier) outperforms LR (a linear classifier). Conclusions: Using machine learning methods on EPs shows promising predictive performance. Using additional EP time series features beyond those already in use leads to a modest increase in performance. Larger datasets, preferably multi-center, are needed for further research. Given a large enough dataset, these models may be used to support clinicians in their decision-making process regarding future treatment.
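The AUC values quoted in this abstract can be read as the probability that a randomly chosen positive case (here, a patient who progresses) receives a higher classifier score than a randomly chosen negative case. The following is a minimal sketch of that rank-based definition in plain Python. The function name `auc_from_scores` and the labels and scores are hypothetical illustrations, not data or code from the study.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier outputs (1 = progression, 0 = no progression).
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
print(auc_from_scores(labels, scores))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for small clinical datasets; a sort-based rank computation would be the usual choice for larger ones.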
203
Lv Z, Zhang J, Ding H, Zou Q. RF-PseU: A Random Forest Predictor for RNA Pseudouridine Sites. Front Bioeng Biotechnol 2020; 8:134. [PMID: 32175316 PMCID: PMC7054385 DOI: 10.3389/fbioe.2020.00134] [Citation(s) in RCA: 62] [Impact Index Per Article: 15.5] [Received: 01/17/2020] [Accepted: 02/10/2020] [Indexed: 12/21/2022] Open
Abstract
One of the ubiquitous chemical modifications in RNA, pseudouridine modification is crucial for various cellular biological and physiological processes. To gain more insight into the functional mechanisms involved, it is of fundamental importance to precisely identify pseudouridine sites in RNA. Several useful machine learning approaches have become available recently, with the increasing progress of next-generation sequencing technology; however, existing methods cannot predict sites with high accuracy. Thus, a more accurate predictor is required. In this study, a random forest-based predictor named RF-PseU is proposed for prediction of pseudouridylation sites. To optimize feature representation and obtain a better model, the light gradient boosting machine algorithm and incremental feature selection strategy were used to select the optimum feature space vector for training the random forest model RF-PseU. Compared with previous state-of-the-art predictors, the results on the same benchmark data sets of three species demonstrate that RF-PseU performs better overall. The integrated average leave-one-out cross-validation and independent testing accuracy scores were 71.4% and 74.7%, respectively, representing increments of 3.63% and 4.77% versus the best existing predictor. Moreover, the final RF-PseU model for prediction was built on leave-one-out cross-validation and provides a reliable and robust tool for identifying pseudouridine sites. A web server with a user-friendly interface is accessible at http://148.70.81.170:10228/rfpseu.
Affiliation(s)
- Zhibin Lv
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Jun Zhang
- Rehabilitation Department, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
| | - Hui Ding
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
204
Chen P, Yang Y, Zhang Y, Jiang S, Li X, Wan J. Identification of prognostic immune-related genes in the tumor microenvironment of endometrial cancer. Aging (Albany NY) 2020; 12:3371-3387. [PMID: 32074080 PMCID: PMC7066904 DOI: 10.18632/aging.102817] [Citation(s) in RCA: 39] [Impact Index Per Article: 9.8] [Received: 10/25/2019] [Accepted: 01/27/2020] [Indexed: 12/24/2022]
Abstract
Endometrial cancer (EC) is one of the most common gynecologic malignancies. To identify potential prognostic biomarkers for EC, we analyzed the relationship between the EC tumor microenvironment and gene expression profiles. Using the ESTIMATE R tool, we found that immune and stromal scores correlated with clinical data and the prognosis of EC patients. Based on the immune and stromal scores, 387 intersection differentially expressed genes were identified. Eight immune-related genes were then identified using two machine learning algorithms. Functional enrichment analysis revealed that these genes were mainly associated with T cell activation and response. Kaplan-Meier survival analysis showed that expression of TMEM150B, CACNA2D2, TRPM5, NOL4, CTSW, and SIGLEC1 significantly correlated with overall survival times of EC patients. In addition, using the TIMER algorithm, we found that expression of TMEM150B, SIGLEC1, and CTSW correlated positively with the tumor infiltration levels of B cells, CD8+ T cells, CD4+ T cells, macrophages, and dendritic cells. These findings indicate that the composition of the tumor microenvironment affects the clinical outcomes of EC patients, and suggest that it may provide a basis for the development of novel prognostic biomarkers and immunotherapies for EC patients.
Affiliation(s)
- Peigen Chen
- Department of Gynecology, The Third Affiliated Hospital of Sun Yat-Sen University, Guangzhou, Guangdong Province, China
| | - Yuebo Yang
- Department of Gynecology, The Third Affiliated Hospital of Sun Yat-Sen University, Guangzhou, Guangdong Province, China
| | - Yu Zhang
- Department of Gynecology, The Third Affiliated Hospital of Sun Yat-Sen University, Guangzhou, Guangdong Province, China
| | - Senwei Jiang
- Department of Gynecology, The Third Affiliated Hospital of Sun Yat-Sen University, Guangzhou, Guangdong Province, China
| | - Xiaomao Li
- Department of Gynecology, The Third Affiliated Hospital of Sun Yat-Sen University, Guangzhou, Guangdong Province, China
| | - Jing Wan
- Department of Gynecology, The Third Affiliated Hospital of Sun Yat-Sen University, Guangzhou, Guangdong Province, China
205
Schachtschneider KM, Welge ME, Auvil LS, Chaki S, Rund LA, Madsen O, Elmore MR, Johnson RW, Groenen MA, Schook LB. Altered Hippocampal Epigenetic Regulation Underlying Reduced Cognitive Development in Response to Early Life Environmental Insults. Genes (Basel) 2020; 11:genes11020162. [PMID: 32033187 PMCID: PMC7074491 DOI: 10.3390/genes11020162] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 12/13/2019] [Revised: 01/30/2020] [Accepted: 02/01/2020] [Indexed: 12/13/2022] Open
Abstract
The hippocampus is involved in learning and memory and undergoes significant growth and maturation during the neonatal period. Environmental insults during this developmental timeframe can have lasting effects on brain structure and function. This study assessed hippocampal DNA methylation and gene transcription from two independent studies reporting reduced cognitive development stemming from early life environmental insults (iron deficiency and porcine reproductive and respiratory syndrome virus (PRRSv) infection) using porcine biomedical models. In total, 420 differentially expressed genes (DEGs) were identified between the reduced cognition and control groups, including genes involved in neurodevelopment and function. Gene ontology (GO) terms enriched for DEGs were associated with immune responses, angiogenesis, and cellular development. In addition, 116 differentially methylated regions (DMRs) were identified, which overlapped 125 genes. While no GO terms were enriched for genes overlapping DMRs, many of these genes are known to be involved in neurodevelopment and function, angiogenesis, and immunity. The observed altered methylation and expression of genes involved in neurological function suggest reduced cognition in response to early life environmental insults is due to altered cholinergic signaling and calcium regulation. Finally, two DMRs overlapped with two DEGs, VWF and LRRC32, which are associated with blood brain barrier permeability and regulatory T-cell activation, respectively. These results support the role of altered hippocampal DNA methylation and gene expression in early life environmentally-induced reductions in cognitive development across independent studies.
Affiliation(s)
- Kyle M. Schachtschneider
- Department of Radiology, University of Illinois at Chicago, Chicago, IL 60607, USA;
- Department of Biochemistry and Molecular Genetics, University of Illinois at Chicago, Chicago, IL 60607, USA
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL 61820, USA; (M.E.W.); (L.S.A.)
| | - Michael E. Welge
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL 61820, USA; (M.E.W.); (L.S.A.)
| | - Loretta S. Auvil
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL 61820, USA; (M.E.W.); (L.S.A.)
| | - Sulalita Chaki
- Department of Animal Sciences, University of Illinois at Urbana-Champaign, Urbana, IL 616280, USA; (S.C.); (L.A.R.); (M.R.P.E.); (R.W.J.)
| | - Laurie A. Rund
- Department of Animal Sciences, University of Illinois at Urbana-Champaign, Urbana, IL 616280, USA; (S.C.); (L.A.R.); (M.R.P.E.); (R.W.J.)
| | - Ole Madsen
- Animal Breeding and Genomics, Wageningen University, 6708 Wageningen, The Netherlands; (O.M.); (M.A.M.G.)
| | - Monica R.P. Elmore
- Department of Animal Sciences, University of Illinois at Urbana-Champaign, Urbana, IL 616280, USA; (S.C.); (L.A.R.); (M.R.P.E.); (R.W.J.)
| | - Rodney W. Johnson
- Department of Animal Sciences, University of Illinois at Urbana-Champaign, Urbana, IL 616280, USA; (S.C.); (L.A.R.); (M.R.P.E.); (R.W.J.)
| | - Martien A.M. Groenen
- Animal Breeding and Genomics, Wageningen University, 6708 Wageningen, The Netherlands; (O.M.); (M.A.M.G.)
| | - Lawrence B. Schook
- Department of Radiology, University of Illinois at Chicago, Chicago, IL 60607, USA;
- National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL 61820, USA; (M.E.W.); (L.S.A.)
- Department of Animal Sciences, University of Illinois at Urbana-Champaign, Urbana, IL 616280, USA; (S.C.); (L.A.R.); (M.R.P.E.); (R.W.J.)
206
Abstract
Objective: To review the application of radiomics in gastric cancer, its challenges, and its future prospects. Data sources: A search for relevant studies was performed in PubMed using the terms “radiomics,” “texture analysis,” and “gastric cancer.” The search was updated until February 28th, 2019. Study selection: All original articles investigating texture analysis or radiomics in gastric cancer were retrieved. Only papers written in English were included. Results: A total of 17 original articles were finally selected. Radiomics has yielded moderate to excellent performance across a range of tasks, including differential diagnosis, assessment of the degree of histological differentiation, evaluation of tumor stage, prediction of response to therapy, and prognosis in gastric cancer. Yet a number of challenges face both radiomics itself and its application in gastric cancer. Conclusions: Radiomics holds great potential for facilitating decision-making in gastric cancer. With the standardization of workflows and the advancement of machine learning methods, radiomics is expected to make major breakthroughs in precision medicine for gastric cancer.
207
Classification and prediction of diabetes disease using machine learning paradigm. Health Inf Sci Syst 2020; 8:7. [PMID: 31949894 DOI: 10.1007/s13755-019-0095-z] [Citation(s) in RCA: 45] [Impact Index Per Article: 11.3] [Received: 08/21/2019] [Accepted: 12/21/2019] [Indexed: 12/19/2022] Open
Abstract
Background and objectives: Diabetes is a chronic disease characterized by high blood sugar. It can cause many complications, such as stroke, kidney failure, and heart attack. About 422 million people worldwide were affected by diabetes in 2014, and the figure is expected to reach 642 million in 2040. The main objective of this study is to develop a machine learning (ML)-based system for predicting diabetic patients. Materials and methods: Logistic regression (LR) is used to identify the risk factors for diabetes based on p value and odds ratio (OR). We adopted four classifiers, naïve Bayes (NB), decision tree (DT), Adaboost (AB), and random forest (RF), to predict diabetic patients. Three partition protocols (K2, K5, and K10) were also adopted, and each protocol was repeated for 20 trials. The performance of these classifiers is evaluated using accuracy (ACC) and area under the curve (AUC). Results: We used a diabetes dataset derived from the National Health and Nutrition Examination Survey, conducted in 2009-2012. The dataset consists of 6561 respondents, with 657 diabetic patients and 5904 controls. The LR model indicates that 7 of the 14 factors (age, education, BMI, systolic BP, diastolic BP, direct cholesterol, and total cholesterol) are risk factors for diabetes. The overall ACC of the ML-based system is 90.62%. The combination of LR-based feature selection and the RF-based classifier gives 94.25% ACC and 0.95 AUC for the K10 protocol. Conclusion: The combination of LR and the RF-based classifier performs better. This combination will be very helpful for predicting diabetic patients.
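The partition protocols mentioned in this abstract can be illustrated with a short sketch. It assumes that K2, K5, and K10 denote 2-, 5-, and 10-fold cross-validation and that each protocol is reshuffled for every one of the 20 trials; the abstract does not spell this out, so both points are assumptions. The function `repeated_kfold` and the sample size of 100 are hypothetical.

```python
import random


def repeated_kfold(n_samples, k, n_trials, seed=0):
    """Yield (trial, fold, train_idx, test_idx) for n_trials reshuffled k-fold splits."""
    rng = random.Random(seed)
    for trial in range(n_trials):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # a fresh random partition for every trial
        folds = [idx[i::k] for i in range(k)]  # k nearly equal-sized folds
        for f, test_idx in enumerate(folds):
            # Training indices are all folds except the held-out one.
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield trial, f, train_idx, test_idx


# Assumed reading of the three protocols: K2, K5, K10 -> k = 2, 5, 10.
for k in (2, 5, 10):
    splits = list(repeated_kfold(n_samples=100, k=k, n_trials=20))
    print(k, len(splits))
```

Each split would then be used to fit a classifier on `train_idx` and score it on `test_idx`, with ACC and AUC averaged over all folds and trials. With an outcome as imbalanced as the one described (657 cases among 6561 respondents), a stratified variant that preserves the class ratio in every fold would usually be preferable; this sketch does not stratify.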
208
Xue M, Su Y, Li C, Wang S, Yao H. Identification of Potential Type II Diabetes in a Large-Scale Chinese Population Using a Systematic Machine Learning Framework. J Diabetes Res 2020; 2020:6873891. [PMID: 33029536 PMCID: PMC7532405 DOI: 10.1155/2020/6873891] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 03/12/2020] [Revised: 08/01/2020] [Accepted: 09/02/2020] [Indexed: 12/19/2022] Open
Abstract
BACKGROUND An estimated 425 million people globally have diabetes, accounting for 12% of the world's health expenditures, and the number continues to grow, placing a huge burden on the healthcare system, especially in remote, underserved areas. METHODS A total of 584,168 adult subjects who participated in the national physical examination were enrolled in this study. The risk factors for type II diabetes mellitus (T2DM) were identified by p values and odds ratios, using logistic regression (LR) based on variables from physical measurements and a questionnaire. Combined with the risk factors selected by LR, we used a decision tree, a random forest, AdaBoost with a decision tree (AdaBoost), and an extreme gradient boosting decision tree (XGBoost) to identify individuals with T2DM, compared the performance of the four machine learning classifiers, and used the best-performing classifier to output the variable importance scores for T2DM. RESULTS The results indicated that XGBoost had the best performance (accuracy = 0.906, precision = 0.910, recall = 0.902, F-1 = 0.906, and AUC = 0.968). The variable importance scores in XGBoost showed that BMI was the most significant feature, followed by age, waist circumference, systolic pressure, ethnicity, smoking amount, fatty liver, hypertension, physical activity, drinking status, dietary ratio (meat to vegetables), drink amount, smoking status, and diet habit (oil loving). CONCLUSIONS We proposed a classifier based on LR-XGBoost that uses fourteen easily obtained, noninvasive patient variables as predictors to identify potential incidents of T2DM. The classifier can accurately screen for the risk of diabetes in the early phase, and the variable importance scores give a clue for preventing diabetes occurrence.
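The reported metrics can be checked for internal consistency, since the F-1 score is the harmonic mean of precision and recall. A minimal worked example follows, using only the precision, recall, and F-1 values quoted in the abstract; the helper name `f1_score` is chosen here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values as quoted in the abstract for XGBoost.
reported = {"precision": 0.910, "recall": 0.902, "f1": 0.906}
computed = f1_score(reported["precision"], reported["recall"])
print(round(computed, 3))
```

Rounded to three decimals, the computed value agrees with the reported F-1 of 0.906.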
Affiliation(s)
- Mingyue Xue
- Hospital of Traditional Chinese Medicine Affiliated to the Fourth Clinical Medical College of Xinjiang Medical University, Urumqi, China
- College of Public Health, Xinjiang Medical University, Urumqi, China
| | - Yinxia Su
- College of Public Health, Xinjiang Medical University, Urumqi, China
| | - Chen Li
- The First Affiliated Hospital of Xinjiang Medical University, Urumqi, China
| | - Shuxia Wang
- Center of Health Management, The First Affiliated Hospital, Xinjiang Medical University, Urumqi, China
| | - Hua Yao
- Center of Health Management, The First Affiliated Hospital, Xinjiang Medical University, Urumqi, China
209
García-Timermans C, Rubbens P, Heyse J, Kerckhof FM, Props R, Skirtach AG, Waegeman W, Boon N. Discriminating Bacterial Phenotypes at the Population and Single-Cell Level: A Comparison of Flow Cytometry and Raman Spectroscopy Fingerprinting. Cytometry A 2019; 97:713-726. [PMID: 31889414 DOI: 10.1002/cyto.a.23952] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 09/20/2019] [Revised: 11/20/2019] [Accepted: 11/26/2019] [Indexed: 12/26/2022]
Abstract
Investigating phenotypic heterogeneity can help to better understand and manage microbial communities. However, characterizing phenotypic heterogeneity remains a challenge, as there is no standardized analysis framework. Several optical tools are available, such as flow cytometry and Raman spectroscopy, which describe optical properties of the individual cell. In this work, we compare Raman spectroscopy and flow cytometry to study phenotypic heterogeneity in bacterial populations. The growth stages of three replicate Escherichia coli populations were characterized using both technologies. Our findings show that flow cytometry detects and quantifies shifts in phenotypic heterogeneity at the population level due to its high-throughput nature. Raman spectroscopy, on the other hand, offers a much higher resolution at the single-cell level (i.e., more biochemical information is recorded). Therefore, it can identify distinct phenotypic populations when coupled with analyses tailored toward single-cell data. In addition, it provides information about biomolecules that are present, which can be linked to cell functionality. We propose a computational workflow to distinguish between bacterial phenotypic populations using Raman spectroscopy and validated this approach with an external data set. We recommend using flow cytometry to quantify phenotypic heterogeneity at the population level, and Raman spectroscopy to perform a more in-depth analysis of heterogeneity at the single-cell level. © 2019 International Society for Advancement of Cytometry.
Affiliation(s)
| | - Peter Rubbens
- KERMIT, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
| | - Jasmine Heyse
- CMET, Center for Microbial Technology and Ecology, Ghent University, Ghent, Belgium
| | | | - Ruben Props
- CMET, Center for Microbial Technology and Ecology, Ghent University, Ghent, Belgium
| | - Andre G Skirtach
- Nano-BioTechnology Group, Department of Biotechnology, Ghent University, Ghent, Belgium
| | - Willem Waegeman
- KERMIT, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
| | - Nico Boon
- CMET, Center for Microbial Technology and Ecology, Ghent University, Ghent, Belgium
210
Krzykalla J, Benner A, Kopp‐Schneider A. Exploratory identification of predictive biomarkers in randomized trials with normal endpoints. Stat Med 2019; 39:923-939. [DOI: 10.1002/sim.8452] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 08/20/2019] [Revised: 12/02/2019] [Accepted: 12/02/2019] [Indexed: 11/10/2022]
Affiliation(s)
- Julia Krzykalla
- Division of Biostatistics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Medizinische Fakultät, Universität Heidelberg, Germany
| | - Axel Benner
- Division of Biostatistics, German Cancer Research Center (DKFZ), Heidelberg, Germany
211
Speiser JL, Miller ME, Tooze J, Ip E. A Comparison of Random Forest Variable Selection Methods for Classification Prediction Modeling. EXPERT SYSTEMS WITH APPLICATIONS 2019; 134:93-101. [PMID: 32968335 PMCID: PMC7508310 DOI: 10.1016/j.eswa.2019.05.028] [Citation(s) in RCA: 207] [Impact Index Per Article: 41.4] [Indexed: 05/17/2023]
Abstract
Random forest classification is a popular machine learning method for developing prediction models in many research settings. Often in prediction modeling, a goal is to reduce the number of variables needed to obtain a prediction in order to reduce the burden of data collection and improve efficiency. Several variable selection methods exist for the setting of random forest classification; however, there is a paucity of literature to guide users as to which method may be preferable for different types of datasets. Using 311 classification datasets freely available online, we evaluate the prediction error rates, number of variables, computation times and area under the receiver operating curve for many random forest variable selection methods. We compare random forest variable selection methods for different types of datasets (datasets with binary outcomes, datasets with many predictors, and datasets with imbalanced outcomes) and for different types of methods (standard random forest versus conditional random forest methods and test based versus performance based methods). Based on our study, the best variable selection methods for most datasets are Jiang's method and the method implemented in the VSURF R package. For datasets with many predictors, the methods implemented in the R packages varSelRF and Boruta are preferable due to computational efficiency. A significant contribution of this study is the ability to assess different variable selection techniques in the setting of random forest classification in order to identify preferable methods based on applications in expert and intelligent systems.
Affiliation(s)
- Jaime Lynn Speiser
- Department of Biostatistical Sciences, Wake Forest School of Medicine, Winston-Salem, NC 27157, USA
| | - Michael E. Miller
- Department of Biostatistical Sciences, Wake Forest School of Medicine, Winston-Salem, NC 27157, USA
| | - Janet Tooze
- Department of Biostatistical Sciences, Wake Forest School of Medicine, Winston-Salem, NC 27157, USA
| | - Edward Ip
- Department of Biostatistical Sciences, Wake Forest School of Medicine, Winston-Salem, NC 27157, USA
212
González-Riano C, Dudzik D, Garcia A, Gil-de-la-Fuente A, Gradillas A, Godzien J, López-Gonzálvez Á, Rey-Stolle F, Rojo D, Ruperez FJ, Saiz J, Barbas C. Recent Developments along the Analytical Process for Metabolomics Workflows. Anal Chem 2019; 92:203-226. [PMID: 31625723 DOI: 10.1021/acs.analchem.9b04553] [Citation(s) in RCA: 62] [Impact Index Per Article: 12.4] [Indexed: 12/14/2022]
Affiliation(s)
- Carolina González-Riano
- Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
| | - Danuta Dudzik
- Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
- Department of Biopharmaceutics and Pharmacodynamics, Faculty of Pharmacy, Medical University of Gdańsk, 80-210 Gdańsk, Poland
| | - Antonia Garcia
- Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
| | - Alberto Gil-de-la-Fuente
- Department of Information Technology, Escuela Politécnica Superior, Universidad San Pablo-CEU, 28003 Madrid, Spain
| | - Ana Gradillas
- Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
| | - Joanna Godzien
- Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
- Clinical Research Centre, Medical University of Bialystok, 15-089 Bialystok, Poland
| | - Ángeles López-Gonzálvez
- Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
| | - Fernanda Rey-Stolle
- Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
| | - David Rojo
- Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
| | - Francisco J Ruperez
- Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
| | - Jorge Saiz
- Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
| | - Coral Barbas
- Centre for Metabolomics and Bioanalysis (CEMBIO), Chemistry and Biochemistry Department, Pharmacy Faculty, Universidad San Pablo-CEU, Boadilla del Monte, 28668 Madrid, Spain
213
Li J, Veeranampalayam-Sivakumar AN, Bhatta M, Garst ND, Stoll H, Stephen Baenziger P, Belamkar V, Howard R, Ge Y, Shi Y. Principal variable selection to explain grain yield variation in winter wheat from features extracted from UAV imagery. PLANT METHODS 2019; 15:123. [PMID: 31695728 PMCID: PMC6824016 DOI: 10.1186/s13007-019-0508-7] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 05/24/2019] [Accepted: 10/19/2019] [Indexed: 05/23/2023]
Abstract
BACKGROUND Automated phenotyping technologies are continually advancing the breeding process. However, collecting various secondary traits throughout the growing season and processing massive amounts of data still take great effort and time. Selecting a minimum number of secondary traits that have the maximum predictive power has the potential to reduce phenotyping efforts. The objective of this study was to select principal features extracted from UAV imagery and critical growth stages that contributed the most in explaining winter wheat grain yield. Five dates of multispectral images and seven dates of RGB images were collected by a UAV system during the spring growing season in 2018. Two classes of features (variables), totaling 172 variables, were extracted for each plot from the vegetation index and plant height maps, including pixel statistics and dynamic growth rates. A parametric algorithm, LASSO regression (the least absolute shrinkage and selection operator), and a non-parametric algorithm, random forest, were applied for variable selection. The regression coefficients estimated by LASSO and the permutation importance scores provided by random forest were used to determine the ten most important variables influencing grain yield from each algorithm.
RESULTS Both selection algorithms assigned the highest importance score to the variables related to plant height around the grain filling stage. Some vegetation index-related variables were also selected by the algorithms, mainly at early to mid growth stages and during senescence. Compared with yield prediction using all 172 variables derived from measured phenotypes, prediction using the selected variables performed comparably or even better. We also noticed that the prediction accuracy on the adapted NE lines (r = 0.58-0.81) was higher than on the other lines with different genetic backgrounds included in this study (r = 0.21-0.59).
CONCLUSIONS With the ultra-high resolution plot imagery obtained by UAS-based phenotyping, we are now able to derive more features, such as the variation of plant height or vegetation indices within a plot rather than just an averaged number, that are potentially very useful for breeding purposes. However, too many features or variables can be derived in this way. The promising results from this study suggest that a selected subset of those variables can achieve prediction accuracies for grain yield comparable to those of the full set, while possibly allowing a better allocation of effort and resources to phenotypic data collection and processing.
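The permutation importance mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy data, the stand-in `model` callable, and the function names are assumptions made for illustration, and a real analysis would score a fitted random forest on held-out or out-of-bag samples.

```python
import random

def mse(model, X, y):
    """Mean squared error of `model` (a callable mapping a row to a prediction)."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Average increase in MSE when one column at a time is shuffled."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    scores = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(model, shuffled, y) - baseline)
        scores.append(sum(increases) / n_repeats)
    return scores

# Hypothetical toy data: column 0 drives the target, column 1 is pure noise.
data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [3.0 * row[0] for row in X]
model = lambda row: 3.0 * row[0]  # stand-in for a fitted model (assumption)

importance = permutation_importance(model, X, y)
```

Ranking variables by these scores and keeping the top ten would mirror the selection step described above, under the stated assumptions.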
Affiliation(s)
- Jiating Li
- Department of Biological Systems Engineering, University of Nebraska-Lincoln, Lincoln, NE 68583 USA
| | | | - Madhav Bhatta
- Department of Agronomy, University of Wisconsin-Madison, Madison, WI 53706 USA
| | - Nicholas D. Garst
- Department of Agronomy and Horticulture, University of Nebraska-Lincoln, Lincoln, NE 68583 USA
| | - Hannah Stoll
- Department of Agronomy and Horticulture, University of Nebraska-Lincoln, Lincoln, NE 68583 USA
| | - P. Stephen Baenziger
- Department of Agronomy and Horticulture, University of Nebraska-Lincoln, Lincoln, NE 68583 USA
| | - Vikas Belamkar
- Department of Agronomy and Horticulture, University of Nebraska-Lincoln, Lincoln, NE 68583 USA
| | - Reka Howard
- Department of Statistics, University of Nebraska-Lincoln, Lincoln, NE 68583 USA
| | - Yufeng Ge
- Department of Biological Systems Engineering, University of Nebraska-Lincoln, Lincoln, NE 68583 USA
| | - Yeyin Shi
- Department of Biological Systems Engineering, University of Nebraska-Lincoln, Lincoln, NE 68583 USA
| |
|
214
|
Nembrini S, König IR, Wright MN. The revival of the Gini importance? Bioinformatics 2019; 34:3711-3718. [PMID: 29757357 PMCID: PMC6198850 DOI: 10.1093/bioinformatics/bty373] [Citation(s) in RCA: 204] [Impact Index Per Article: 40.8] [Received: 01/02/2018] [Accepted: 05/08/2018] [Indexed: 11/14/2022] Open
Abstract
Motivation Random forests are fast, flexible and represent a robust approach to analyze high dimensional data. A key advantage over alternative machine learning algorithms is the availability of variable importance measures, which can be used to identify relevant features or perform variable selection. Measures based on the impurity reduction of splits, such as the Gini importance, are popular because they are simple and fast to compute. However, they are biased in favor of variables with many possible split points and high minor allele frequency.
Results We set up a fast approach to debias impurity-based variable importance measures for classification, regression and survival forests. We show that it creates a variable importance measure which is unbiased with regard to the number of categories and minor allele frequency and almost as fast as the standard impurity importance. As a result, it is now possible to compute reliable importance estimates without the extra computing cost of permutations. Further, we combine the importance measure with a fast testing procedure, producing p-values for variable importance with almost no computational overhead beyond the creation of the random forest. Applications to gene expression and genome-wide association data show that the proposed method is powerful and computationally efficient.
Availability and implementation The procedure is included in the ranger package, available at https://cran.r-project.org/package=ranger and https://github.com/imbs-hl/ranger.
Supplementary information Supplementary data are available at Bioinformatics online.
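As background for the impurity-based measures discussed here, the following sketch computes the Gini impurity reduction of a single split, the quantity that the standard (uncorrected) Gini importance accumulates over all splits on a variable. It does not reproduce the debiasing procedure of the paper; the labels and function names are illustrative assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    return gini(parent) - len(left) / n * gini(left) - len(right) / n * gini(right)

# Hypothetical node with four cases and four controls.
parent = ["case"] * 4 + ["control"] * 4
perfect = impurity_decrease(parent, parent[:4], parent[4:])      # classes fully separated
useless = impurity_decrease(parent, parent[::2], parent[1::2])   # children mirror the parent
```

A variable offering many candidate split points has more chances to find a split with a high decrease purely by chance, which is one intuition for the bias the abstract describes.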
Affiliation(s)
- Stefano Nembrini
- Department of Epidemiology, College of Public Health and Health Professions & College of Medicine, University of Florida, Gainesville, FL, USA
| | - Inke R König
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany
| | - Marvin N Wright
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany.,Leibniz Institute for Prevention Research and Epidemiology - BIPS, Bremen, Germany
| |
|
215
|
Stanstrup J, Broeckling CD, Helmus R, Hoffmann N, Mathé E, Naake T, Nicolotti L, Peters K, Rainer J, Salek RM, Schulze T, Schymanski EL, Stravs MA, Thévenot EA, Treutler H, Weber RJM, Willighagen E, Witting M, Neumann S. The metaRbolomics Toolbox in Bioconductor and beyond. Metabolites 2019; 9:E200. [PMID: 31548506 PMCID: PMC6835268 DOI: 10.3390/metabo9100200] [Citation(s) in RCA: 51] [Impact Index Per Article: 10.2] [Received: 08/11/2019] [Revised: 09/16/2019] [Accepted: 09/17/2019] [Indexed: 11/17/2022] Open
Abstract
Metabolomics aims to measure and characterise the complex composition of metabolites in a biological system. Metabolomics studies involve sophisticated analytical techniques such as mass spectrometry and nuclear magnetic resonance spectroscopy, and generate large amounts of high-dimensional and complex experimental data. Open source processing and analysis tools are of major interest in light of innovative, open and reproducible science. The scientific community has developed a wide range of open source software, providing freely available advanced processing and analysis approaches. The programming and statistics environment R has emerged as one of the most popular environments to process and analyse Metabolomics datasets. A major benefit of such an environment is the possibility of connecting different tools into more complex workflows. Combining reusable data processing R scripts with the experimental data thus allows for open, reproducible research. This review provides an extensive overview of existing packages in R for different steps in a typical computational metabolomics workflow, including data processing, biostatistics, metabolite annotation and identification, and biochemical network and pathway analysis. Multifunctional workflows, possible user interfaces and integration into workflow management systems are also reviewed. In total, this review summarises more than two hundred metabolomics specific packages primarily available on CRAN, Bioconductor and GitHub.
Affiliation(s)
- Jan Stanstrup
- Preventive and Clinical Nutrition, University of Copenhagen, Rolighedsvej 30, 1958 Frederiksberg C, Denmark.
| | - Corey D Broeckling
- Proteomics and Metabolomics Facility, Colorado State University, Fort Collins, CO 80523, USA.
| | - Rick Helmus
- Institute for Biodiversity and Ecosystem Dynamics, University of Amsterdam, 1098 XH Amsterdam, The Netherlands.
| | - Nils Hoffmann
- Leibniz-Institut für Analytische Wissenschaften-ISAS-e.V., Otto-Hahn-Straße 6b, 44227 Dortmund, Germany.
| | - Ewy Mathé
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA.
| | - Thomas Naake
- Max Planck Institute of Molecular Plant Physiology, 14476 Potsdam-Golm, Germany.
| | - Luca Nicolotti
- The Australian Wine Research Institute, Metabolomics Australia, PO Box 197, Adelaide SA 5064, Australia.
| | - Kristian Peters
- Leibniz Institute of Plant Biochemistry (IPB Halle), Bioinformatics and Scientific Data, 06120 Halle, Germany.
| | - Johannes Rainer
- Institute for Biomedicine, Eurac Research, Affiliated Institute of the University of Lübeck, 39100 Bolzano, Italy.
| | - Reza M Salek
- The International Agency for Research on Cancer, 150 cours Albert Thomas, CEDEX 08, 69372 Lyon, France.
| | - Tobias Schulze
- Department of Effect-Directed Analysis, Helmholtz Centre for Environmental Research-UFZ, Permoserstraße 15, 04318 Leipzig, Germany.
| | - Emma L Schymanski
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 6 avenue du Swing, L-4367 Belvaux, Luxembourg.
| | - Michael A Stravs
- Eawag, Swiss Federal Institute of Aquatic Science and Technology, Überlandstrasse 133, 8600 Dübendorf, Switzerland.
| | - Etienne A Thévenot
- CEA, LIST, Laboratory for Data Sciences and Decision, MetaboHUB, Gif-Sur-Yvette F-91191, France.
| | - Hendrik Treutler
- Leibniz Institute of Plant Biochemistry (IPB Halle), Bioinformatics and Scientific Data, 06120 Halle, Germany.
| | - Ralf J M Weber
- Phenome Centre Birmingham and School of Biosciences, University of Birmingham, Edgbaston, Birmingham B15 2TT, UK.
| | - Egon Willighagen
- Department of Bioinformatics-BiGCaT, NUTRIM, Maastricht University, 6229 ER Maastricht, The Netherlands.
| | - Michael Witting
- Research Unit Analytical BioGeoChemistry, Helmholtz Zentrum München, 85764 Neuherberg, Germany.
- Chair of Analytical Food Chemistry, Technische Universität München, 85354 Weihenstephan, Germany.
| | - Steffen Neumann
- Leibniz Institute of Plant Biochemistry (IPB Halle), Bioinformatics and Scientific Data, 06120 Halle, Germany.
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Deutscher Platz 5e, 04103 Leipzig, Germany.
| |
|
216
|
Abstract
High-nucleic-acid (HNA) and low-nucleic-acid (LNA) bacteria are two operational groups identified by flow cytometry (FCM) in aquatic systems. A number of reports have shown that HNA cell density correlates strongly with heterotrophic production, while LNA cell density does not. However, which taxa are specifically associated with these groups, and by extension, productivity has remained elusive. Here, we addressed this knowledge gap by using a machine learning-based variable selection approach that integrated FCM and 16S rRNA gene sequencing data collected from 14 freshwater lakes spanning a broad range in physicochemical conditions. There was a strong association between bacterial heterotrophic production and HNA absolute cell abundances (R2 = 0.65), but not with the more abundant LNA cells. This solidifies findings, mainly from marine systems, that HNA and LNA bacteria could be considered separate functional groups, the former contributing a disproportionately large share of carbon cycling. 
Taxa selected by the models could predict HNA and LNA absolute cell abundances at all taxonomic levels. Selected operational taxonomic units (OTUs) ranged from low to high relative abundance and were mostly lake system specific (89.5% to 99.2%). A subset of selected OTUs was associated with both LNA and HNA groups (12.5% to 33.3%), suggesting either phenotypic plasticity or within-OTU genetic and physiological heterogeneity. These findings may lead to the identification of system-specific putative ecological indicators for heterotrophic productivity. Generally, our approach allows for the association of OTUs with specific functional groups in diverse ecosystems in order to improve our understanding of (microbial) biodiversity-ecosystem functioning relationships. IMPORTANCE A major goal in microbial ecology is to understand how microbial community structure influences ecosystem functioning. Various methods to directly associate bacterial taxa to functional groups in the environment are being developed. In this study, we applied machine learning methods to relate taxonomic data obtained from marker gene surveys to functional groups identified by flow cytometry. This allowed us to identify the taxa that are associated with heterotrophic productivity in freshwater lakes and indicated that the key contributors were highly system specific, regularly rare members of the community, and that some could possibly switch between being low and high contributors. Our approach provides a promising framework to identify taxa that contribute to ecosystem functioning and can be further developed to explore microbial contributions beyond heterotrophic production.
|
217
|
Utilizing Precision Medicine to Estimate Timing for Surgical Closure of Traumatic Extremity Wounds. Ann Surg 2019; 270:535-543. [DOI: 10.1097/sla.0000000000003470] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 12/19/2022]
|
218
|
Küntzel A, Weber M, Gierschner P, Trefz P, Miekisch W, Schubert JK, Reinhold P, Köhler H. Core profile of volatile organic compounds related to growth of Mycobacterium avium subspecies paratuberculosis - A comparative extract of three independent studies. PLoS One 2019; 14:e0221031. [PMID: 31415617 PMCID: PMC6695172 DOI: 10.1371/journal.pone.0221031] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 05/19/2019] [Accepted: 07/29/2019] [Indexed: 11/22/2022] Open
Abstract
Analysis of volatile organic compounds (VOC) derived from bacterial metabolism during cultivation is considered an innovative approach to accelerate in vitro detection of slowly growing bacteria. This also applies to Mycobacterium avium subsp. paratuberculosis (MAP), the causative agent of paratuberculosis, a debilitating chronic enteritis of ruminants. Diagnostic application demands robust VOC profiles that are reproducible under variable culture conditions. In this study, the VOC patterns of pure bacterial cultures, derived from three independent in vitro studies performed previously, were comparatively analyzed. Different statistical analyses were linked to extract the VOC core profile of MAP and to prove its robustness, which is a prerequisite for further development towards diagnostic application. Despite methodological variability of bacterial cultivation and sample pre-extraction, a common profile of 28 VOCs indicating cultural growth of MAP was defined. The substances cover six chemical classes. Four of the substances decreased above MAP and 24 increased. Random forest classification was applied to rank the compounds by importance and to classify MAP versus control samples. The top-ranked compound alone already achieved high discrimination (AUC 0.85), which was further increased by utilizing all compounds of the VOC core profile of MAP (AUC 0.91). The discriminatory power of this tool for the characterization of natural diagnostic samples, in particular its diagnostic specificity for MAP, has to be confirmed in future studies.
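The AUC values quoted above can be read as the probability that a randomly chosen MAP sample receives a higher classifier score than a randomly chosen control. A minimal sketch of that pairwise definition follows; the scores are invented for illustration and are not data from the study.

```python
def auc(scores_pos, scores_neg):
    """Share of positive/negative pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical classifier scores (assumption, not taken from the paper).
map_scores = [0.9, 0.8, 0.6, 0.4]
control_scores = [0.7, 0.3, 0.2, 0.1]
value = auc(map_scores, control_scores)  # 14 of 16 pairs are ordered correctly
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation.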
Affiliation(s)
- Anne Küntzel
- Institute of Molecular Pathogenesis, Friedrich-Loeffler-Institut (FLI), Federal Research Institute for Animal Health, Jena, Germany
| | - Michael Weber
- Institute of Molecular Pathogenesis, Friedrich-Loeffler-Institut (FLI), Federal Research Institute for Animal Health, Jena, Germany
| | - Peter Gierschner
- Rostock Medical Breath Research Analytics and Technologies (RoMBAT), Department of Anaesthesia and Intensive Care, Rostock University Medical Center, Rostock, Germany
| | - Phillip Trefz
- Rostock Medical Breath Research Analytics and Technologies (RoMBAT), Department of Anaesthesia and Intensive Care, Rostock University Medical Center, Rostock, Germany
| | - Wolfram Miekisch
- Rostock Medical Breath Research Analytics and Technologies (RoMBAT), Department of Anaesthesia and Intensive Care, Rostock University Medical Center, Rostock, Germany
| | - Jochen K. Schubert
- Rostock Medical Breath Research Analytics and Technologies (RoMBAT), Department of Anaesthesia and Intensive Care, Rostock University Medical Center, Rostock, Germany
| | - Petra Reinhold
- Institute of Molecular Pathogenesis, Friedrich-Loeffler-Institut (FLI), Federal Research Institute for Animal Health, Jena, Germany
| | - Heike Köhler
- Institute of Molecular Pathogenesis, Friedrich-Loeffler-Institut (FLI), Federal Research Institute for Animal Health, Jena, Germany
- National Reference Laboratory for Paratuberculosis, FLI, Jena, Germany
| |
|
219
|
Lodise TP, Bonine NG, Ye JM, Folse HJ, Gillard P. Development of a bedside tool to predict the probability of drug-resistant pathogens among hospitalized adult patients with gram-negative infections. BMC Infect Dis 2019; 19:718. [PMID: 31412809 PMCID: PMC6694572 DOI: 10.1186/s12879-019-4363-y] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 10/19/2018] [Accepted: 08/06/2019] [Indexed: 01/27/2023] Open
Abstract
Background We developed a clinical bedside tool to simultaneously estimate the probabilities of third-generation cephalosporin-resistant Enterobacteriaceae (3GC-R), carbapenem-resistant Enterobacteriaceae (CRE), and multidrug-resistant Pseudomonas aeruginosa (MDRP) among hospitalized adult patients with Gram-negative infections.
Methods Data were obtained from a retrospective observational study of the Premier Hospital that included hospitalized adult patients with a complicated urinary tract infection (cUTI), complicated intra-abdominal infection (cIAI), hospital-acquired/ventilator-associated pneumonia (HAP/VAP), or bloodstream infection (BSI) due to Gram-negative bacteria between 2011 and 2015. Risk factors for 3GC-R, CRE, and MDRP were ascertained by multivariate logistic regression, and separate models were developed for patients with community-acquired versus hospital-acquired infections for each resistance phenotype (N = 6). Models were converted to a single user-friendly interface to estimate the probabilities of a patient having an infection due to 3GC-R, CRE, or MDRP when ≥ 1 risk factor was present.
Results Overall, 124,068 patients contributed to the dataset. Percentages of patients admitted for cUTI, cIAI, HAP/VAP, and BSI were 61.6, 4.6, 16.5, and 26.4%, respectively (some patients contributed > 1 infection type). Resistant infection rates were 1.90% for CRE, 12.09% for 3GC-R, and 3.91% for MDRP. A greater percentage of the resistant infections were community-acquired relative to hospital-acquired (CRE, 1.30% vs 0.62% of 1.90%; 3GC-R, 9.27% vs 3.42% of 12.09%; MDRP, 2.39% vs 1.59% of 3.91%). The most important predictors of having a 3GC-R, CRE, or MDRP infection were prior number of antibiotics; infection site; infection during the previous 3 months; and hospital prevalence of 3GC-R, CRE, or MDRP. To enable application of the six predictive multivariate logistic regression models to real-world clinical practice, we developed a user-friendly interface that estimates the risk of 3GC-R, CRE, and MDRP simultaneously in a given patient with a Gram-negative infection based on their risk (Additional file 1).
Conclusions We developed a clinical prediction tool to estimate the probabilities of 3GC-R, CRE, and MDRP among hospitalized adult patients with confirmed community- and hospital-acquired Gram-negative infections. Our predictive model has been implemented as a user-friendly bedside tool for use by clinicians/healthcare professionals to predict the probability of resistant infections in individual patients, to guide early appropriate therapy.
Electronic supplementary material The online version of this article (10.1186/s12879-019-4363-y) contains supplementary material, which is available to authorized users.
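How a fitted logistic regression model turns a patient's risk factors into a probability can be sketched as follows. The intercept, coefficients, and variable names are hypothetical placeholders chosen for illustration; they are not the published model, whose values are not given in this abstract.

```python
import math

def predicted_probability(intercept, coefficients, risk_factors):
    """Logistic model: probability = 1 / (1 + exp(-(intercept + sum of coefficient * value)))."""
    linear = intercept + sum(
        coefficients[name] * value for name, value in risk_factors.items()
    )
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical coefficients and patient (assumptions, not values from the study).
coefficients = {
    "prior_antibiotics": 0.4,
    "infection_last_3_months": 0.9,
    "hospital_prevalence": 2.0,
}
patient = {
    "prior_antibiotics": 2,
    "infection_last_3_months": 1,
    "hospital_prevalence": 0.05,
}
p = predicted_probability(-4.0, coefficients, patient)
```

A bedside tool of the kind described would evaluate one such function per resistance phenotype and acquisition setting, six in total according to the abstract.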
Affiliation(s)
- Thomas P Lodise
- Albany College of Pharmacy and Health Sciences, Albany, NY, 12208-3492, USA.
| | | | | | | | | |
|
220
|
Zhang C, Chen Y, Xu B, Xue Y, Ren Y. How to predict biodiversity in space? An evaluation of modelling approaches in marine ecosystems. DIVERS DISTRIB 2019. [DOI: 10.1111/ddi.12970] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 11/27/2022] Open
Affiliation(s)
| | - Yong Chen
- School of Marine Sciences University of Maine Orono ME USA
| | - Binduo Xu
- College of Fisheries Ocean University of China Qingdao China
| | - Ying Xue
- College of Fisheries Ocean University of China Qingdao China
| | - Yiping Ren
- College of Fisheries Ocean University of China Qingdao China
- Qingdao National Laboratory for Marine Science and Technology Qingdao China
| |
|
221
|
Banerji A, Bagley MJ, Shoemaker JA, Tettenhorst DR, Nietch CT, Allen HJ, Santo Domingo JW. Evaluating putative ecological drivers of microcystin spatiotemporal dynamics using metabarcoding and environmental data. HARMFUL ALGAE 2019; 86:84-95. [PMID: 31358280 PMCID: PMC7877229 DOI: 10.1016/j.hal.2019.05.004] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 02/05/2019] [Revised: 04/19/2019] [Accepted: 05/07/2019] [Indexed: 05/03/2023]
Abstract
Microcystin is a cyanobacterial hepatotoxin of global concern. Understanding the environmental factors that cause high concentrations of microcystin is crucial to the development of lake management strategies that minimize harmful exposures. While the literature is replete with studies linking cyanobacterial production of microcystin to changes in various nutrients, abiotic stressors, grazers, and competitors, no single biotic or abiotic factor has been shown to be reliably predictive of microcystin concentrations in complex ecosystems. We performed random forest regression analyses with 16S and 18S rRNA gene sequencing data and environmental data to determine which putative ecological drivers best explained spatiotemporal variation in total microcystin and several individual congeners in a eutrophic freshwater reservoir. Model performance was best for predicting concentrations of the congener MC-LR, with ca. 88% of spatiotemporal variance explained. Most of the variance was associated with changes in the relative abundance of the cyanobacterial genus Microcystis. Follow-up RF regression analyses revealed that factors that were the most important in predicting MC-LR were also the most important in predicting Microcystis population dynamics. We discuss how these results relate to prevailing ecological hypotheses regarding the function of microcystin.
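The "variance explained" figure reported for the MC-LR model is commonly computed as one minus the ratio of residual to total sum of squares. The sketch below shows that calculation on invented numbers; whether the study used out-of-bag predictions for this statistic is not stated in the abstract, so that detail is left open.

```python
def variance_explained(observed, predicted):
    """1 - SS_res / SS_tot, i.e. the share of variance captured by the predictions."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical concentrations and model predictions (assumption, not study data).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 3.0, 3.5]
share = variance_explained(observed, predicted)
```

The value can be negative when predictions are worse than simply using the mean of the observations.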
Affiliation(s)
- A Banerji
- US Environmental Protection Agency, Cincinnati, OH, 45268, USA
| | - M J Bagley
- US Environmental Protection Agency, Cincinnati, OH, 45268, USA
| | - J A Shoemaker
- US Environmental Protection Agency, Cincinnati, OH, 45268, USA
| | - D R Tettenhorst
- US Environmental Protection Agency, Cincinnati, OH, 45268, USA
| | - C T Nietch
- US Environmental Protection Agency, Cincinnati, OH, 45268, USA
| | - H J Allen
- US Environmental Protection Agency, Cincinnati, OH, 45268, USA
| | | |
|
222
|
Niu SY, Liu B, Ma Q, Chou WC. rSeqTU-A Machine-Learning Based R Package for Prediction of Bacterial Transcription Units. Front Genet 2019; 10:374. [PMID: 31156694 PMCID: PMC6529933 DOI: 10.3389/fgene.2019.00374] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 02/15/2019] [Accepted: 04/09/2019] [Indexed: 11/13/2022] Open
Abstract
A transcription unit (TU) is composed of one or multiple adjacent genes on the same strand that are co-transcribed, mostly in prokaryotes. Accurate identification of TUs is a crucial first step to delineate the transcriptional regulatory networks and elucidate the dynamic regulatory mechanisms encoded in various prokaryotic genomes. Many genomic features, for example, gene intergenic distance, and transcriptomic features, including continuous and stable RNA-seq read count signals, have been collected from a large amount of experimental data and integrated into classification techniques to computationally predict genome-wide TUs. Although some tools and web servers are able to predict TUs based on bacterial RNA-seq data and genome sequences, there is a need for an improved machine learning prediction approach and a more comprehensive pipeline handling QC, TU prediction, and TU visualization. To enable users to efficiently perform TU identification on their local computers or high-performance clusters and to provide more accurate predictions, we developed an R package named rSeqTU. rSeqTU uses a random forest algorithm to select essential features describing TUs and then uses a support vector machine (SVM) to build TU prediction models. rSeqTU (available at https://s18692001.github.io/rSeqTU/) has six computational functionalities including read quality control, read mapping, training set generation, random forest-based feature selection, TU prediction, and TU visualization.
Affiliation(s)
- Sheng-Yong Niu
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, United States
| | - Binqiang Liu
- School of Mathematics, Shandong University, Jinan, China
| | - Qin Ma
- Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, United States
| | - Wen-Chi Chou
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, United States
| |
|
223
|
Lee MY, Kim TK, Walters KA, Wang K. A biological function based biomarker panel optimization process. Sci Rep 2019; 9:7365. [PMID: 31089177 PMCID: PMC6517383 DOI: 10.1038/s41598-019-43779-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2018] [Accepted: 04/26/2019] [Indexed: 11/09/2022] Open
Abstract
Multi-gene biomarker panels identified from high throughput data, including microarray or next generation sequencing, need to be adapted to a platform suitable for a clinical setting, such as quantitative polymerase chain reaction. However, technical challenges when transitioning from one measurement platform to another, such as inconsistent measurement results, can affect panel development. We describe a process to overcome these challenges by replacing poorly performing genes during platform transition and reducing the number of features without impacting classification performance. This approach assumes that a diagnostic panel reflects the effect of dysregulated biological processes associated with a disease, and that genes involved in the same biological processes and coordinately affected by a disease share a similar discriminatory power. The utility of this optimization process was assessed using a published sepsis diagnostic panel. Substitution of more than half of the genes and/or reducing genes based on biological processes did not negatively affect the performance of the sepsis diagnostic panel. Our results suggest that a systematic gene substitution and reduction process based on biological function can be used to alleviate the challenges associated with clinical development of biomarker panels.
Affiliation(s)
- Min Young Lee
- Institute for Systems Biology, Seattle, Washington, United States of America
| | - Taek-Kyun Kim
- Institute for Systems Biology, Seattle, Washington, United States of America
| | - Kathie-Anne Walters
- Institute for Systems Biology, Seattle, Washington, United States of America
| | - Kai Wang
- Institute for Systems Biology, Seattle, Washington, United States of America.
| |
|
224
|
Luo L, Hudson LG, Lewis J, Lee JH. Two-step approach for assessing the health effects of environmental chemical mixtures: application to simulated datasets and real data from the Navajo Birth Cohort Study. Environ Health 2019; 18:46. [PMID: 31072361 PMCID: PMC6507239 DOI: 10.1186/s12940-019-0482-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 11/19/2018] [Accepted: 04/16/2019] [Indexed: 05/07/2023]
Abstract
BACKGROUND There is increasing interest in examining the consequences of simultaneous exposures to chemical mixtures. However, a consensus or recommendations on how to appropriately select the statistical approach analyzing the health effects of mixture exposures which best aligns with study goals has not been well established. We recognize the limitations that existing methods have in effectively reducing data dimension and detecting interaction effects when analyzing chemical mixture exposures collected in high dimensional datasets with varying degrees of variable intercorrelations. In this research, we aim to examine the performance of a two-step statistical approach in addressing the analytical challenges of chemical mixture exposures using two simulated data sets, and an existing data set from the Navajo Birth Cohort Study as a representative case study.
METHODS We propose to use a two-step approach: a robust variable selection step using the random forest approach followed by adaptive lasso methods that incorporate both dimensionality reduction and quantification of the degree of association between the chemical exposures and the outcome of interest, including interaction terms. We compared the proposed method with other approaches including (1) single step adaptive lasso; and (2) two-step Classification and regression trees (CART) followed by adaptive lasso method.
RESULTS Utilizing simulated data sets and applying the method to a real-life dataset from the Navajo Birth Cohort Study, we have demonstrated good performance of the proposed two-step approach. Results from the simulation datasets indicated the effectiveness of variable dimension reduction and reliable identification of a parsimonious model compared to other methods: single-step adaptive lasso or two-step CART followed by adaptive lasso method.
CONCLUSIONS Our proposed two-step approach provides a robust way of analyzing the effects of high-throughput chemical mixture exposures on health outcomes by combining the strengths of variable selection and adaptive shrinkage strategies.
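For orientation, the adaptive lasso used in the second step is usually written as a weighted L1-penalized least squares problem. The form below is the standard textbook formulation for a linear model, given here as an assumption about notation; the abstract does not state the exact specification used in the study (for example how interaction terms enter, or how the initial estimates were obtained).

```latex
\hat{\beta}
  = \arg\min_{\beta}\;
    \sum_{i=1}^{n} \bigl( y_i - x_i^{\top}\beta \bigr)^{2}
    + \lambda \sum_{j=1}^{p} w_j \,\lvert \beta_j \rvert ,
\qquad
w_j = \frac{1}{\lvert \tilde{\beta}_j \rvert^{\gamma}}, \quad \gamma > 0 .
```

Here \(\tilde{\beta}_j\) is an initial coefficient estimate, \(\lambda\) is the penalty strength, and \(p\) would be the number of exposures retained by the first selection step. Variables with small initial estimates receive large weights and are shrunk to zero more readily.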
Affiliation(s)
- Li Luo
- Department of Internal Medicine, MSC10-5550, 1 University of New Mexico, Albuquerque, NM, 87131, USA.
- University of New Mexico Comprehensive Cancer Center, Albuquerque, NM, USA.
| | - Laurie G Hudson
- Department of Pharmaceutical Sciences, College of Pharmacy, University of New Mexico, Albuquerque, NM, USA
| | - Johnnye Lewis
- Community Environmental Health Program, College of Pharmacy, University of New Mexico, Albuquerque, NM, USA
| | - Ji-Hyun Lee
- Department of Internal Medicine, MSC10-5550, 1 University of New Mexico, Albuquerque, NM, 87131, USA
- University of New Mexico Comprehensive Cancer Center, Albuquerque, NM, USA
- Present Address: Division of Quantitative Sciences, University of Florida Health Cancer Center; Department of Biostatistics, University of Florida, Gainesville, Florida, USA
| |
|
225
|
Ogink PT, Karhade AV, Thio QCBS, Gormley WB, Oner FC, Verlaan JJ, Schwab JH. Predicting discharge placement after elective surgery for lumbar spinal stenosis using machine learning methods. Eur Spine J 2019; 28:1433-1440. [PMID: 30941521 DOI: 10.1007/s00586-019-05928-z] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Received: 09/26/2018] [Revised: 01/11/2019] [Accepted: 02/21/2019] [Indexed: 11/29/2022]
Abstract
PURPOSE An excessive amount of total hospitalization is caused by delays due to patients waiting to be placed in a rehabilitation facility or skilled nursing facility (RF/SNF). An accurate preoperative prediction of who would need an RF/SNF place after surgery could reduce costs and allow more efficient organizational planning. We aimed to develop a machine learning algorithm that predicts non-home discharge after elective surgery for lumbar spinal stenosis. METHODS We used the American College of Surgeons National Surgical Quality Improvement Program to select patients who underwent elective surgery for lumbar spinal stenosis between 2009 and 2016. The primary outcome measure for the algorithm was non-home discharge. Four machine learning algorithms were developed to predict non-home discharge. Performance of the algorithms was measured with discrimination, calibration, and an overall performance score. RESULTS We included 28,600 patients with a median age of 67 (interquartile range 58-74). The non-home discharge rate was 18.2%. Our final model consisted of the following variables: age, sex, body mass index, diabetes, functional status, ASA class, level, fusion, preoperative hematocrit, and preoperative serum creatinine. The neural network was the best model based on discrimination (c-statistic = 0.751), calibration (slope = 0.933; intercept = 0.037), and overall performance (Brier score = 0.131). CONCLUSIONS A machine learning algorithm is able to predict discharge placement after surgery for lumbar spinal stenosis with both good discrimination and calibration. Implementing this type of algorithm in clinical practice could avert risks associated with delayed discharge and lower costs. These slides can be retrieved under Electronic Supplementary Material.
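The discrimination and overall-performance measures named in this abstract (c-statistic and Brier score) can be illustrated with a minimal sketch. The toy outcomes and predicted probabilities below are hypothetical and are not taken from the study; the function names are likewise illustrative only.

```python
from itertools import product


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def c_statistic(y_true, y_prob):
    """Probability that a random positive case is scored above a random negative case.

    Ties count as one half, which makes this equal to the area under the ROC curve.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a, b in product(pos, neg))
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = non-home discharge, 0 = home discharge.
outcomes = [0, 0, 1, 0, 1, 1, 0, 0]
predicted = [0.10, 0.30, 0.70, 0.20, 0.40, 0.90, 0.50, 0.05]

print(f"Brier score: {brier_score(outcomes, predicted):.3f}")
print(f"c-statistic: {c_statistic(outcomes, predicted):.3f}")
```

A lower Brier score and a higher c-statistic both indicate better performance; the calibration slope and intercept reported in the abstract would require fitting a logistic model and are not covered by this sketch.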
Affiliation(s)
- Paul T Ogink
- UMC Utrecht, Heidelberglaan 100, 3584 CX, Utrecht, The Netherlands.
| | - Aditya V Karhade
- Massachusetts General Hospital - Harvard Medical School, Boston, MA, USA
| | - Quirina C B S Thio
- Massachusetts General Hospital - Harvard Medical School, Boston, MA, USA
| | - William B Gormley
- Brigham and Women's Hospital - Harvard Medical School, Boston, MA, USA
| | - Fetullah C Oner
- UMC Utrecht, Heidelberglaan 100, 3584 CX, Utrecht, The Netherlands
| | - Jorrit J Verlaan
- UMC Utrecht, Heidelberglaan 100, 3584 CX, Utrecht, The Netherlands
| | - Joseph H Schwab
- Massachusetts General Hospital - Harvard Medical School, Boston, MA, USA
| |
|
226
|
Leveraging Machine Learning to Extend Ontology-Driven Geographic Object-Based Image Analysis (O-GEOBIA): A Case Study in Forest-Type Mapping. Remote Sens 2019. [DOI: 10.3390/rs11050503] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Indexed: 11/17/2022]
Abstract
Ontology-driven Geographic Object-Based Image Analysis (O-GEOBIA) contributes to the identification of meaningful objects. In fusing data from multiple sensors, the number of feature variables is increased and object identification becomes a challenging task. We propose a methodological contribution that extends feature variable characterisation. This method is illustrated with a case study in forest-type mapping in Tasmania, Australia. Satellite images, airborne LiDAR (Light Detection and Ranging) and expert photo-interpretation data are fused for feature extraction and classification. Two machine learning algorithms, Random Forest and Boruta, are used to identify important and relevant feature variables. A variogram is used to describe textural and spatial features. Different variogram features are used as input for rule-based classifications. The rule-based classifications employ (i) spectral features, (ii) vegetation indices, (iii) LiDAR, and (iv) variogram features, and resulted in overall classification accuracies of 77.06%, 78.90%, 73.39% and 77.06% respectively. Following data fusion, the use of combined feature variables resulted in a higher classification accuracy (81.65%). Using relevant features extracted from the Boruta algorithm, the classification accuracy is further improved (82.57%). The results demonstrate that the use of relevant variogram features together with spectral and LiDAR features resulted in improved classification accuracy.
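The variogram mentioned in this abstract summarizes how dissimilar values become as the distance between them grows. As a minimal sketch, assuming an evenly spaced one-dimensional transect (a simplification; the study works on image objects), the empirical semivariogram is gamma(h) = sum((z[i] - z[i+h])^2) / (2 N(h)), where N(h) is the number of pairs at lag h. The function name and the sample heights are hypothetical.

```python
def empirical_semivariogram(values, max_lag):
    """Return {lag: gamma(lag)} for an evenly spaced 1-D transect."""
    gamma = {}
    for h in range(1, max_lag + 1):
        # All pairs of observations separated by exactly h steps.
        pairs = [(values[i], values[i + h]) for i in range(len(values) - h)]
        if not pairs:
            break  # the transect is too short for this lag
        gamma[h] = sum((a - b) ** 2 for a, b in pairs) / (2 * len(pairs))
    return gamma


# Hypothetical canopy-height transect in metres.
transect = [12.0, 12.5, 13.0, 18.0, 18.5, 19.0, 12.0, 12.5]
for lag, g in empirical_semivariogram(transect, 3).items():
    print(lag, round(g, 3))
```

Summary values read off such a curve could then serve as texture features for a classifier, though the specific variogram features used in the study are not stated in the abstract.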
|
227
|
Sun S, Miao Z, Ratcliffe B, Campbell P, Pasch B, El-Kassaby YA, Balasundaram B, Chen C. SNP variable selection by generalized graph domination. PLoS One 2019; 14:e0203242. [PMID: 30677030 PMCID: PMC6345469 DOI: 10.1371/journal.pone.0203242] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 08/15/2018] [Accepted: 01/08/2019] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND High-throughput sequencing technology has revolutionized both medical and biological research by generating exceedingly large numbers of genetic variants. The resulting datasets share a number of common characteristics that might lead to poor generalization capacity. Concerns include noise accumulated due to the large number of predictors, sparse information regarding the p≫n problem, and overfitting and model mis-identification resulting from spurious collinearity. Additionally, complex correlation patterns are present among variables. As a consequence, reliable variable selection techniques play a pivotal role in predictive analysis, generalization capability, and robustness in clustering, as well as interpretability of the derived models. METHODS AND FINDINGS The k-dominating set, a parameterized graph-theoretic generalization model, was used to model SNP (single nucleotide polymorphism) data as a similarity network and to search for representative SNP variables. In particular, each SNP was represented as a vertex in the graph, and (dis)similarity measures such as correlation coefficients or pairwise linkage disequilibrium were estimated to describe the relationship between each pair of SNPs; a pair of vertices is adjacent, i.e. joined by an edge, if the pairwise similarity measure exceeds a user-specified threshold. A minimum k-dominating set in the SNP graph was then defined as the smallest subset such that every SNP excluded from the subset has at least k neighbors among the selected ones. The strength of k-dominating set selection in identifying independent variables, and in culling representative variables that are highly correlated with others, was demonstrated on a simulated dataset. The advantages of k-dominating set variable selection were also illustrated in two applications: pedigree reconstruction using SNP profiles of 1,372 Douglas-fir trees, and species delineation for 226 grasshopper mouse samples.
C++ source code that implements SNP-SELECT and uses the Gurobi optimization solver for k-dominating set variable selection is available (https://github.com/transgenomicsosu/SNP-SELECT).
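The graph construction and the k-dominating set condition described in this abstract can be sketched as follows. This is not the authors' SNP-SELECT implementation, which solves for a minimum set with an optimization solver; the sketch swaps in a simple greedy heuristic that always returns a valid k-dominating set but not necessarily a smallest one. All function names and the toy similarity matrix are hypothetical.

```python
def similarity_graph(similarity, threshold):
    """Adjacency sets: i and j are neighbors when similarity[i][j] exceeds threshold."""
    n = len(similarity)
    return {
        i: {j for j in range(n) if j != i and similarity[i][j] > threshold}
        for i in range(n)
    }


def greedy_k_dominating_set(adj, k):
    """Greedy heuristic: grow a set until every excluded vertex has >= k selected neighbors."""
    selected = set()
    need = {v: k for v in adj}  # selected neighbors still required by each vertex

    def deficient(v):
        return v not in selected and need[v] > 0

    def gain(v):
        # Selecting v settles v itself and moves each deficient neighbor one step closer.
        return int(deficient(v)) + sum(1 for u in adj[v] if deficient(u))

    while any(deficient(v) for v in adj):
        candidates = [v for v in adj if v not in selected]
        best = max(candidates, key=lambda v: (gain(v), -v))  # ties go to the lowest index
        selected.add(best)
        for u in adj[best]:
            need[u] = max(0, need[u] - 1)
    return selected


# Hypothetical similarity matrix: SNPs 0-2 form one correlated block, SNPs 3-4 another.
similarity = [
    [1.0, 0.9, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.9, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.1, 0.8, 1.0],
]
adj = similarity_graph(similarity, threshold=0.5)
representatives = greedy_k_dominating_set(adj, k=1)
print(sorted(representatives))
```

With k = 1 each correlated block contributes one representative SNP. Raising k forces more redundancy, and any vertex with fewer than k neighbors must itself be selected.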
Affiliation(s)
- Shuzhen Sun
- Department of Biochemistry and Molecular Biology, Oklahoma State University, Stillwater, United States of America
- Department of Forest and Conservation Sciences, Faculty of Forestry, The University of British Columbia, Vancouver, B.C. Canada
| | - Zhuqi Miao
- Center for Health Systems Innovation, Oklahoma State University, Stillwater, United States of America
| | - Blaise Ratcliffe
- Department of Forest and Conservation Sciences, Faculty of Forestry, The University of British Columbia, Vancouver, B.C. Canada
| | - Polly Campbell
- Department of Integrative Biology, Oklahoma State University, Stillwater, United States of America
- Department of Evolution, Ecology and Organismal Biology, University of California, Riverside, Riverside, United States of America
| | - Bret Pasch
- Department of Biological Sciences, Northern Arizona University, Flagstaff, United States of America
| | - Yousry A. El-Kassaby
- Department of Forest and Conservation Sciences, Faculty of Forestry, The University of British Columbia, Vancouver, B.C. Canada
| | - Balabhaskar Balasundaram
- School of Industrial Engineering and Management, Oklahoma State University, Stillwater, United States of America
| | - Charles Chen
- Department of Biochemistry and Molecular Biology, Oklahoma State University, Stillwater, United States of America
| |
|
228
|
Long NP, Park S, Anh NH, Nghi TD, Yoon SJ, Park JH, Lim J, Kwon SW. High-Throughput Omics and Statistical Learning Integration for the Discovery and Validation of Novel Diagnostic Signatures in Colorectal Cancer. Int J Mol Sci 2019; 20:E296. [PMID: 30642095 PMCID: PMC6358915 DOI: 10.3390/ijms20020296] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 11/30/2018] [Revised: 12/31/2018] [Accepted: 01/04/2019] [Indexed: 02/07/2023] Open
Abstract
The advancement of bioinformatics and machine learning has facilitated the discovery and validation of omics-based biomarkers. This study employed a novel approach combining multi-platform transcriptomics and cutting-edge algorithms to introduce novel signatures for accurate diagnosis of colorectal cancer (CRC). Different random forest (RF)-based feature selection methods, including the area under the curve (AUC)-RF, Boruta, and Vita, were used, and the diagnostic performance of the proposed biosignatures was benchmarked using RF, logistic regression, naïve Bayes, and k-nearest neighbors models. All models showed satisfactory performance, among which RF appeared to be the best. For instance, regarding the RF model, the following were observed: mean accuracy 0.998 (standard deviation (SD) < 0.003), mean specificity 0.999 (SD < 0.003), and mean sensitivity 0.998 (SD < 0.004). Moreover, the proposed biomarker signatures were highly associated with multifaceted hallmarks in cancer. Some biomarkers were found to be enriched in epithelial cell signaling in Helicobacter pylori infection and inflammatory processes. The overexpression of TGFBI and S100A2 was associated with poor disease-free survival, while the down-regulation of NR5A2, SLC4A4, and CD177 was linked to worse overall survival of the patients. In conclusion, novel transcriptome signatures to improve the diagnostic accuracy in CRC are introduced for further validation in various clinical settings.
Affiliation(s)
- Nguyen Phuoc Long
- College of Pharmacy and Research Institute of Pharmaceutical Sciences, Seoul National University, Seoul 08826, Korea.
| | - Seongoh Park
- Department of Statistics, Seoul National University, Seoul 08826, Korea.
| | - Nguyen Hoang Anh
- College of Pharmacy and Research Institute of Pharmaceutical Sciences, Seoul National University, Seoul 08826, Korea.
| | - Tran Diem Nghi
- School of Medicine, Vietnam National University, Ho Chi Minh 70000, Vietnam.
| | - Sang Jun Yoon
- College of Pharmacy and Research Institute of Pharmaceutical Sciences, Seoul National University, Seoul 08826, Korea.
| | - Jeong Hill Park
- College of Pharmacy and Research Institute of Pharmaceutical Sciences, Seoul National University, Seoul 08826, Korea.
| | - Johan Lim
- Department of Statistics, Seoul National University, Seoul 08826, Korea.
| | - Sung Won Kwon
- College of Pharmacy and Research Institute of Pharmaceutical Sciences, Seoul National University, Seoul 08826, Korea.
| |
|
229
|
Huynh-Thu VA, Geurts P. Unsupervised Gene Network Inference with Decision Trees and Random Forests. Methods Mol Biol 2019; 1883:195-215. [PMID: 30547401 DOI: 10.1007/978-1-4939-8882-2_8] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 12/11/2022]
Abstract
In this chapter, we introduce the reader to a popular family of machine learning algorithms, called decision trees. We then review several approaches based on decision trees that have been developed for the inference of gene regulatory networks (GRNs). Decision trees have indeed several nice properties that make them well-suited for tackling this problem: they are able to detect multivariate interacting effects between variables, are non-parametric, have good scalability, and have very few parameters. In particular, we describe in detail the GENIE3 algorithm, a state-of-the-art method for GRN inference.
Affiliation(s)
- Vân Anh Huynh-Thu
- Department of Electrical Engineering and Computer Science, University of Liège, Liège, Belgium.
| | - Pierre Geurts
- Department of Electrical Engineering and Computer Science, University of Liège, Liège, Belgium
| |
|
230
|
Li XX, Yin J, Tang J, Li Y, Yang Q, Xiao Z, Zhang R, Wang Y, Hong J, Tao L, Xue W, Zhu F. Determining the Balance Between Drug Efficacy and Safety by the Network and Biological System Profile of Its Therapeutic Target. Front Pharmacol 2018; 9:1245. [PMID: 30429792 PMCID: PMC6220079 DOI: 10.3389/fphar.2018.01245] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 07/28/2018] [Accepted: 10/12/2018] [Indexed: 12/14/2022] Open
Abstract
One of the most challenging puzzles in drug discovery is the identification and characterization of candidate drugs with a well-balanced profile between efficacy and safety. So far, extensive efforts have been made to evaluate this balance by estimating the quantitative structure–therapeutic relationship and exploring the target profile of adverse drug reactions. Particularly, the therapeutic index (TI) has emerged as a key indicator illustrating this delicate balance, and a clinically successful agent requires a sufficient TI suitable for its corresponding indication. However, TI information is largely unknown for most drugs, and the mechanism underlying drugs with a narrow TI (NTI drugs) is still elusive. In this study, the collective effects of the human protein–protein interaction (PPI) network and biological system profile on the drugs' efficacy–safety balance were systematically evaluated. First, a comprehensive literature review of the FDA approved drugs confirmed their NTI status. Second, a popular feature selection algorithm based on artificial intelligence (AI) was adopted to identify key factors differentiating the target mechanism between NTI and non-NTI drugs. Finally, this work revealed that the targets of NTI drugs were highly centralized and connected in the human PPI network, and that the numbers of similarity proteins and affiliated signaling pathways of the corresponding targets were much higher than those of non-NTI drugs. These findings, together with the newly discovered features or feature groups, clarified the key factors indicating a drug's narrow TI and could thus provide a novel direction for determining the delicate drug efficacy-safety balance.
Affiliation(s)
- Xiao Xu Li
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China; School of Pharmaceutical Sciences and Collaborative Innovation Center for Brain Science, Chongqing University, Chongqing, China
| | - Jiayi Yin
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Jing Tang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China; School of Pharmaceutical Sciences and Collaborative Innovation Center for Brain Science, Chongqing University, Chongqing, China
| | - Yinghong Li
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China; School of Pharmaceutical Sciences and Collaborative Innovation Center for Brain Science, Chongqing University, Chongqing, China
| | - Qingxia Yang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China; School of Pharmaceutical Sciences and Collaborative Innovation Center for Brain Science, Chongqing University, Chongqing, China
| | - Ziyu Xiao
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Runyuan Zhang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Yunxia Wang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Jiajun Hong
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Lin Tao
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicine of Zhejiang Province, School of Medicine, Hangzhou Normal University, Hangzhou, China
| | - Weiwei Xue
- School of Pharmaceutical Sciences and Collaborative Innovation Center for Brain Science, Chongqing University, Chongqing, China
| | - Feng Zhu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China; School of Pharmaceutical Sciences and Collaborative Innovation Center for Brain Science, Chongqing University, Chongqing, China
| |
|
231
|
Ahamed NU, Kobsar D, Benson L, Clermont C, Kohrs R, Osis ST, Ferber R. Using wearable sensors to classify subject-specific running biomechanical gait patterns based on changes in environmental weather conditions. PLoS One 2018; 13:e0203839. [PMID: 30226903 PMCID: PMC6143236 DOI: 10.1371/journal.pone.0203839] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 05/08/2018] [Accepted: 08/28/2018] [Indexed: 01/07/2023] Open
Abstract
Running-related overuse injuries can result from a combination of various intrinsic (e.g., gait biomechanics) and extrinsic (e.g., running surface) risk factors. However, it is unknown how changes in environmental weather conditions affect running gait biomechanical patterns since these data cannot be collected in a laboratory setting. Therefore, the purpose of this study was to develop a classification model based on subject-specific changes in biomechanical running patterns across two different environmental weather conditions using data obtained from wearable sensors in real-world environments. Running gait data were recorded during winter and spring sessions, with recorded average air temperatures of -10 °C and +6 °C, respectively. Classification was performed based on measurements of pelvic drop, ground contact time, braking, vertical oscillation of pelvis, pelvic rotation, and cadence obtained from 66,370 strides (~11,000/runner) from a group of recreational runners. A non-linear ensemble machine learning algorithm, random forest (RF), was used to classify and compute a heuristic for determining the importance of each variable in the prediction model. To validate the developed subject-specific model, two cross-validation methods (one-against-another and partitioning datasets) were used to obtain experimental mean classification accuracies of 87.18% and 95.42%, respectively, indicating an excellent discriminatory ability of the RF-based model. Additionally, the ranked order of variable importance differed across the individual runners. The results from the RF-based machine-learning algorithm demonstrate that processing gait biomechanical signals from a single wearable sensor can successfully detect changes to an individual's running patterns based on data obtained in real-world environments.
Affiliation(s)
| | - Dylan Kobsar
- Faculty of Kinesiology, University of Calgary, Calgary, Alberta, Canada
| | - Lauren Benson
- Faculty of Kinesiology, University of Calgary, Calgary, Alberta, Canada
| | | | - Russell Kohrs
- Faculty of Kinesiology, University of Calgary, Calgary, Alberta, Canada
| | - Sean T. Osis
- Faculty of Kinesiology, University of Calgary, Calgary, Alberta, Canada
- Running Injury Clinic, University of Calgary, Calgary, Alberta, Canada
| | - Reed Ferber
- Faculty of Kinesiology, University of Calgary, Calgary, Alberta, Canada
- Running Injury Clinic, University of Calgary, Calgary, Alberta, Canada
- Faculty of Nursing, University of Calgary, Calgary, Alberta, Canada
| |
|
232
|
|
233
|
Kirpich A, Ainsworth EA, Wedow JM, Newman JRB, Michailidis G, McIntyre LM. Variable selection in omics data: A practical evaluation of small sample sizes. PLoS One 2018; 13:e0197910. [PMID: 29927942 PMCID: PMC6013185 DOI: 10.1371/journal.pone.0197910] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 09/30/2017] [Accepted: 05/10/2018] [Indexed: 01/04/2023] Open
Abstract
In omics experiments, variable selection involves a large number of metabolites/genes and a small number of samples (the n < p problem). The ultimate goal is often the identification of one or a few features that differ among conditions: a biomarker. Complicating biomarker identification, the p variables often contain a correlation structure due to the biology of the experiment, making it difficult to distinguish causal compounds from correlated compounds. Additionally, there may be elements in the experimental design (blocks, batches) that introduce structure in the data. While this problem has been discussed in the literature and various strategies proposed, the overfitting problems concomitant with such approaches are rarely acknowledged. Instead of viewing a single omics experiment as a definitive test for a biomarker, an unrealistic analytical goal, we propose to view such studies as screening studies where the goal is to reduce the number of features present in the second round of testing and to limit the Type II error. Using this perspective, the performance of LASSO, ridge regression and Elastic Net was compared with the performance of an ANOVA via a simulation study and two real data comparisons. Interestingly, a dramatic increase in the number of features had no effect on Type I error for the ANOVA approach. ANOVA, even without multiple test correction, has low false positive rates in the scenarios tested. The Elastic Net has an inflated Type I error (from 10 to 50%) for small numbers of features, which increases with sample size. The Type II error rate for the ANOVA is comparable to or lower than that for the Elastic Net, leading us to conclude that an ANOVA is an effective analytical tool for the initial screening of features in omics experiments.
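The screening idea in this abstract, testing each feature separately with an ANOVA and carrying only promising features into a second round, can be sketched as follows. The sketch ranks features by the one-way ANOVA F statistic rather than by p-value thresholds, which is an assumption made here to keep the example free of external libraries; the feature names and measurements are hypothetical.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups (each a list of numbers).

    Assumes at least two groups and non-zero within-group variation.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def screen_features(features, top):
    """Rank features by F statistic and keep the `top` highest for follow-up testing."""
    ranked = sorted(features, key=lambda name: anova_f(features[name]), reverse=True)
    return ranked[:top]


# Hypothetical measurements: each feature maps to [condition 1 samples, condition 2 samples].
features = {
    "metabolite_A": [[1.0, 2.0, 3.0], [7.0, 8.0, 9.0]],
    "metabolite_B": [[1.0, 5.0, 9.0], [2.0, 5.0, 8.0]],
    "metabolite_C": [[4.0, 5.0, 6.0], [5.0, 6.0, 7.0]],
}
print(screen_features(features, top=2))
```

Because every feature is tested on its own, the number of features does not change any single feature's test, which is consistent with the abstract's observation that adding features did not affect the Type I error of the ANOVA approach.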
Affiliation(s)
- Alexander Kirpich
- Department of Biology, University of Florida, Gainesville, FL, United States of America
- Informatics Institute, University of Florida, Gainesville, FL, United States of America
| | - Elizabeth A. Ainsworth
- Department of Plant Biology, University of Illinois at Urbana-Champaign, Urbana, IL, United States of America
- USDA ARS Global Change and Photosynthesis Research Unit, Urbana, IL, United States of America
| | - Jessica M. Wedow
- Department of Plant Biology, University of Illinois at Urbana-Champaign, Urbana, IL, United States of America
| | - Jeremy R. B. Newman
- Department of Biology, University of Florida, Gainesville, FL, United States of America
| | - George Michailidis
- Informatics Institute, University of Florida, Gainesville, FL, United States of America
- Department of Statistics, University of Florida, Gainesville, FL, United States of America
| | - Lauren M. McIntyre
- Department of Biology, University of Florida, Gainesville, FL, United States of America
- Informatics Institute, University of Florida, Gainesville, FL, United States of America
- Genetics Institute, University of Florida, Gainesville, FL, United States of America
| |
|
234
|
Wu Q, Boueiz A, Bozkurt A, Masoomi A, Wang A, DeMeo DL, Weiss ST, Qiu W. Deep Learning Methods for Predicting Disease Status Using Genomic Data. J Biom Biostat 2018; 9:417. [PMID: 31131151 PMCID: PMC6530791] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
Abstract
Predicting disease status for a complex human disease using genomic data is an important, yet challenging, step in personalized medicine. Among many challenges, the so-called curse of dimensionality results in unsatisfactory performance of many state-of-the-art machine learning algorithms. A major recent advance in machine learning is the rapid development of deep learning algorithms that can efficiently extract meaningful features from high-dimensional and complex datasets through a stacked and hierarchical learning process. Deep learning has shown breakthrough performance in several areas including image recognition, natural language processing, and speech recognition. However, the performance of deep learning in predicting disease status using genomic datasets is still not well studied. In this article, we reviewed the four relevant articles that we found through a thorough literature search. All four articles first used auto-encoders to project high-dimensional genomic data to a low-dimensional space and then applied state-of-the-art machine learning algorithms to predict disease status based on the low-dimensional representations. These deep learning approaches outperformed existing prediction methods, such as prediction based on transcript-wise screening and prediction based on principal component analysis. The limitations of the current deep learning approaches and possible improvements are also discussed.
Affiliation(s)
- Qianfan Wu
- Questrom School of Business, Boston University, 595 Commonwealth Avenue, Boston, MA, 02215, USA
| | - Adel Boueiz
- Channing Division of Network Medicine, Brigham and Women’s Hospital/Harvard Medical School, 181 Longwood Avenue, Boston MA 02115, USA; Department of Medicine, Pulmonary and Critical Care Division, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA
| | - Alican Bozkurt
- Department of Computer Science, Northeastern University, Boston, MA, USA
| | - Arya Masoomi
- Department of Computer Science, Northeastern University, Boston, MA, USA
| | | | - Dawn L DeMeo
- Channing Division of Network Medicine, Brigham and Women’s Hospital/Harvard Medical School, 181 Longwood Avenue, Boston MA 02115, USA
| | - Scott T Weiss
- Channing Division of Network Medicine, Brigham and Women’s Hospital/Harvard Medical School, 181 Longwood Avenue, Boston MA 02115, USA
| | - Weiliang Qiu
- Channing Division of Network Medicine, Brigham and Women’s Hospital/Harvard Medical School, 181 Longwood Avenue, Boston MA 02115, USA; Corresponding author: Weiliang Qiu, Channing Division of Network Medicine, Brigham and Women’s Hospital/Harvard Medical School, 181 Longwood Avenue, Boston MA 02115, USA, Tel: 6177325500
| |
|