1. Pattern analysis using lower body human walking data to identify the gaitprint. Comput Struct Biotechnol J 2024;24:281-291. PMID: 38644928; PMCID: PMC11033172; DOI: 10.1016/j.csbj.2024.04.017.
Abstract
All people have a fingerprint that is unique to them and persistent throughout life. Similarly, we propose that people have a gaitprint, a persistent walking pattern that contains unique information about an individual. To provide evidence of a unique gaitprint, we aimed to identify individuals based on basic spatiotemporal variables. Eighty-one adults were recruited to walk overground on an indoor track at their own pace for four minutes while wearing inertial measurement units. A total of 18 trials per participant were completed across two days, one week apart. Four methods of pattern analysis, a) Euclidean distance, b) cosine similarity, c) random forest, and d) support vector machine, were applied to basic spatiotemporal variables such as step and stride lengths to identify individuals. Our best accuracy (98.63%) was achieved by the random forest, followed by the support vector machine (98.40%) and the top 10 most similar trials from cosine similarity (98.40%). Our results clearly demonstrate a persistent walking pattern carrying sufficient information to make an individual identifiable, supporting the existence of a gaitprint.
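The nearest-match identification scheme behind the cosine-similarity result can be sketched in a few lines. This is an illustrative toy, not the authors' pipeline: the participant IDs, feature values, and one-trial-per-person gallery are hypothetical, and a real gaitprint model would use the full set of spatiotemporal variables across many trials.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two spatiotemporal feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(query, gallery):
    # Return the participant whose enrolled trial is most similar to the query.
    sims = {pid: cosine_similarity(query, vec) for pid, vec in gallery.items()}
    return max(sims, key=sims.get)

# Toy gallery: one enrolled trial per participant, with hypothetical features
# such as step length, stride length, and step time.
gallery = {
    "P01": np.array([0.62, 1.25, 0.51]),
    "P02": np.array([0.70, 1.41, 0.48]),
    "P03": np.array([0.55, 1.10, 0.60]),
}
query = np.array([0.61, 1.24, 0.52])  # a later trial from participant P01
print(identify(query, gallery))
```

In the full method, each query trial would be compared against all enrolled trials and the top-10 most similar matches voted on, rather than taking a single nearest neighbour.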

2. Drought and life-history strategies in Heliophila (Brassicaceae). New Phytol 2024;241:532-534. PMID: 38031508; DOI: 10.1111/nph.19352.

3. Benefiting from the intrinsic role of epigenetics to predict patterns of CTCF binding. Comput Struct Biotechnol J 2023;21:3024-3031. PMID: 37266407; PMCID: PMC10229758; DOI: 10.1016/j.csbj.2023.05.012.
Abstract
Motivation: One of the most relevant mechanisms involved in the determination of chromatin structure is the formation of structural loops, which are also related to the conservation of chromatin states. Many of these loops are stabilized by CCCTC-binding factor (CTCF) proteins at their base. Despite the relevance of chromatin structure and the key role of CTCF, the epigenetic factors that regulate CTCF binding, and thus the formation of structural loops in the chromatin, are not thoroughly understood. Results: Here we describe a CTCF binding predictor based on Random Forests that employs different epigenetic data and genomic features. Importantly, given the ability of Random Forests to determine the relevance of features for the prediction, our approach also shows how the different types of descriptors impact the binding of CTCF, confirming previous knowledge on the relevance of chromatin accessibility and DNA methylation while also demonstrating the effect of epigenetic modifications on the activity of CTCF. We compared our approach against other predictors and found improved performance in terms of the areas under the PR and ROC curves (PR-AUC and ROC-AUC), outperforming current state-of-the-art methods.

4. Hide and seek shark teeth in Random Forests: machine learning applied to Scyliorhinus canicula populations. PeerJ 2022;10:e13575. PMID: 35811817; PMCID: PMC9261926; DOI: 10.7717/peerj.13575.
Abstract
Shark populations that are distributed alongside a latitudinal gradient often display body size differences at sexual maturity and vicariance patterns related to their number of tooth files. Previous works have demonstrated that Scyliorhinus canicula populations differ between the northeastern Atlantic Ocean and the Mediterranean Sea based on biological features and genetic analysis. In this study, we sample more than 3,000 teeth from 56 S. canicula specimens caught incidentally off Roscoff and Banyuls-sur-Mer. We investigate population differences based on tooth shape and form by using two approaches. Classification results show that the classical geometric morphometric framework is outperformed by an original Random Forests-based framework. Visually, both S. canicula populations share similar ontogenetic trends and timing of gynandric heterodonty emergence, but the Atlantic population has bigger, blunter teeth and fewer accessory cusps than the Mediterranean population. According to the models, the populations are best differentiated based on their lateral tooth edges, which bear accessory cusps, and tooth centroid sizes significantly improve classification performance. The differences observed are discussed in light of dietary and behavioural habits of the populations considered. The method proposed in this study could be further adapted to complement DNA analyses to identify shark species or populations based on tooth morphologies. This process would be of particular interest for fisheries management and identification of shark fossils.

5. Prediction of acute kidney injury risk after cardiac surgery: using a hybrid machine learning algorithm. BMC Med Inform Decis Mak 2022;22:137. PMID: 35585624; PMCID: PMC9118758; DOI: 10.1186/s12911-022-01859-w.
Abstract
Background Acute kidney injury (AKI) is a serious complication after cardiac surgery. We derived and internally validated a machine learning (ML) preoperative model to predict cardiac surgery-associated AKI of any severity and compared its performance with parametric statistical models. Methods We conducted a retrospective study of adult patients who underwent major cardiac surgery requiring cardiopulmonary bypass between November 1, 2009 and March 31, 2015. AKI was defined according to the KDIGO criteria as stage 1 or greater within 7 days of surgery. We randomly split the cohort into derivation and validation datasets. We developed three AKI risk models: (1) a hybrid ML algorithm, using Random Forests for variable selection followed by high-performance logistic regression; (2) a traditional logistic regression model; and (3) an enhanced logistic regression model with 500 bootstraps and backward variable selection. For each model, we assigned risk scores to each of the retained covariates and assessed model discrimination (C statistic) and calibration (Hosmer–Lemeshow goodness-of-fit test) in the validation dataset. Results Of 6522 included patients, 1760 (27.0%) developed AKI. The best performance in predicting AKI of any severity was achieved by the hybrid ML algorithm. The ML and enhanced statistical models remained robust after internal validation (C statistic = 0.75, Hosmer–Lemeshow p = 0.804; and C statistic = 0.74, Hosmer–Lemeshow p = 0.347, respectively). Conclusions We demonstrated that a hybrid ML model provides higher accuracy without sacrificing parsimony, computational efficiency, or interpretability when compared with parametric statistical models. This score-based model can easily be used at the bedside to identify high-risk patients who may benefit from intensive perioperative monitoring and personalized management strategies.
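The two-stage "hybrid" idea (a Random Forest variable screen followed by a parsimonious logistic regression) can be sketched with scikit-learn. This is a generic illustration on synthetic data, not the authors' model: the feature counts, effect sizes, and the cutoff of five retained covariates are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.normal(size=(n, p))
# Synthetic binary outcome driven by the first three covariates only
# (a stand-in for AKI vs. no AKI).
logits = 1.2 * X[:, 0] - 0.8 * X[:, 1] + 0.5 * X[:, 2]
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# Stage 1: Random Forest ranks covariates by impurity-based importance.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top = np.argsort(rf.feature_importances_)[::-1][:5]

# Stage 2: logistic regression on the retained covariates gives an
# interpretable, score-friendly model.
lr = LogisticRegression().fit(X[:, top], y)
print("selected covariates:", sorted(int(f) for f in top))
print("in-sample accuracy:", round(lr.score(X[:, top], y), 2))
```

The screening step buys flexibility in variable selection while the final logistic model keeps the coefficients that a bedside risk score needs.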

6. A hybrid satellite and land use regression model of source-specific PM2.5 and PM2.5 constituents. Environ Int 2022;163:107233. PMID: 35429918; DOI: 10.1016/j.envint.2022.107233.
Abstract
Although PM2.5 mass varies in source and composition over time and space, most health effects assessments have made the inherent assumption that all PM2.5 mass has the same health implications, irrespective of composition. Nationwide estimates of source-specific PM2.5 mass and constituents at the local scale would allow epidemiological studies and health effects assessments to consider the variability in PM2.5 characteristics. In response, we developed US models of annual exposures at the census tract level for five major PM2.5 sources (traffic, soil, coal, oil, and biomass combustion) and six trace elements (elemental carbon, sulfur, silicon, selenium, nickel, and non-soil potassium) for 2001 through 2014. We employed Absolute Principal Component Analysis (APCA) to derive the source-specific PM2.5 impacts at monitoring stations. We then used random forest algorithms, rigorously tested by 10-fold cross-validation (CV), to estimate elemental and source-specific PM2.5 levels at non-monitored census tracts over the study years; the models incorporated predictors derived from satellite data, chemical transport model output, and census tract resolution land-use data on traffic, meteorology, and emissions. Model performances were moderate to good, with CV R2 ranging from 0.41 to 0.95. For PM2.5 sources, the highest CV R2 was attained for traffic PM2.5 (0.73), followed by coal (0.65), oil (0.62), soil (0.60), and biomass (0.41). Among constituents, the CV R2 was highest for sulfur (0.95). Our analyses provide highly resolved spatial estimates of annual elemental and source-specific PM2.5 concentrations at the census tract level for 2001 through 2014. This dataset offers exposure estimates in support of future nationwide long-term health effects studies of source-specific PM2.5 mass and constituents, enabling epidemiological research that addresses the fact that not all particles are the same.

7. Grafted and Vanishing Random Subspaces. Pattern Anal Appl 2022;25:89-124. PMID: 35370452; PMCID: PMC8975250; DOI: 10.1007/s10044-021-01029-0.
Abstract
The Random Subspace Method (RSM) is an ensemble procedure in which each constituent learner is constructed using a randomly chosen subset of the data features. Regression trees are ideal candidate learners in RSM ensembles. By constructing trees upon different feature subsets, RSM reduces correlation between trees, resulting in a stronger ensemble. Furthermore, it lessens the computational burden by only considering a subset of the features when building each tree. Despite its apparent advantages, RSM has a notable drawback: in some instances a randomly chosen subspace may lack informative features. This is especially true in situations in which the number of truly informative variables is small relative to the total number of variables. Trees that are constructed using feature subsets lacking informative features can be damaging to the ensemble. Here we present Grafted Random Subspaces (GRS) and Vanishing Random Subspaces (VRS), two novel ensemble procedures designed to remedy the aforementioned drawback by reusing information across trees. Both techniques borrow from RSM by growing individual trees on randomly selected feature subsets. For each tree in a GRS ensemble, the most important variable is identified and guaranteed inclusion in the next q feature subsets. This allows GRS to recycle a promising feature from one tree across several successive trees, effectively grafting the variable into the next q active subsets. In the VRS procedure, the least important feature is guaranteed exclusion from the next q feature subsets. This creates a more enriched pool of candidate variables from which the successive feature subsets are drawn.
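The grafting schedule itself, independent of any particular tree learner, can be sketched as follows. The importance "oracle" here is a stand-in for a fitted tree's variable-importance ranking, and all sizes (10 features, subsets of 4, q = 2) are made-up illustration values.

```python
import random

def grafted_subspaces(n_features, subset_size, n_trees, q, importance, seed=0):
    # Yield one feature subset per tree following the GRS schedule:
    # after each tree, its most important feature is guaranteed a seat
    # in the next q subsets ("grafted"), then lapses back into the pool.
    rng = random.Random(seed)
    grafted = {}  # feature -> number of future subsets it is still forced into
    for _ in range(n_trees):
        pool = [f for f in range(n_features) if f not in grafted]
        subset = sorted(set(grafted) | set(rng.sample(pool, subset_size - len(grafted))))
        yield subset
        grafted = {f: t - 1 for f, t in grafted.items() if t > 1}  # decay old grafts
        grafted[importance(subset)] = q  # graft this tree's winner

# Stand-in importance oracle: pretend feature 3 is the most informative
# variable whenever it appears; otherwise take the lowest-indexed feature.
oracle = lambda subset: 3 if 3 in subset else min(subset)
subsets = list(grafted_subspaces(10, 4, 6, 2, oracle))
for s in subsets:
    print(s)
```

VRS would invert the last step: record the *least* important feature and exclude it from the pool for the next q draws instead of forcing it in.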

8. Prediction of synergistic drug combinations using PCA-initialized deep learning. BioData Min 2021;14:46. PMID: 34670583; PMCID: PMC8527604; DOI: 10.1186/s13040-021-00278-3.
Abstract
Background Cancer is one of the main causes of death worldwide. Combination drug therapy has been a mainstay of cancer treatment for decades and has been shown to reduce host toxicity and prevent the development of acquired drug resistance. However, the immense number of possible drug combinations and large synergistic space makes it infeasible to screen all effective drug pairs experimentally. Therefore, it is crucial to develop computational approaches to predict drug synergy and guide experimental design for the discovery of rational combinations for therapy. Results We present a new deep learning approach to predict synergistic drug combinations by integrating gene expression profiles from cell lines and chemical structure data. Specifically, we use principal component analysis (PCA) to reduce the dimensionality of the chemical descriptor data and gene expression data. We then propagate the low-dimensional data through a neural network to predict drug synergy values. We apply our method to O’Neil’s high-throughput drug combination screening data as well as a dataset from the AstraZeneca-Sanger Drug Combination Prediction DREAM Challenge. We compare the neural network approach with and without dimension reduction. Additionally, we demonstrate the effectiveness of our deep learning approach and compare its performance with three state-of-the-art machine learning methods: Random Forests, XGBoost, and elastic net, with and without PCA-based dimensionality reduction. Conclusions Our developed approach outperforms other machine learning methods, and the use of dimension reduction dramatically decreases the computation time without sacrificing accuracy.
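The PCA-then-network pipeline can be sketched with scikit-learn. This is a schematic on synthetic low-rank data, not the paper's architecture: the real model consumed chemical descriptors and expression profiles, and the component count, layer size, and synthetic "synergy score" below are arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
n, p, k = 300, 100, 10
# Synthetic low-rank features: p observed descriptors driven by k latent factors,
# mimicking high-dimensional descriptor/expression data with shared structure.
latent = rng.normal(size=(n, k))
X = latent @ rng.normal(size=(k, p)) + 0.05 * rng.normal(size=(n, p))
y = latent[:, 0] - 0.5 * latent[:, 1]          # stand-in "synergy score"

Z = PCA(n_components=k).fit_transform(X)       # compress 100 features to 10
net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=3000,
                   random_state=0).fit(Z, y)   # propagate low-dim data through a NN
print("train R^2:", round(net.score(Z, y), 2))
```

The dimensionality reduction is what buys the reported speed-up: the network trains on 10 components rather than 100 raw features.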

9. Texture analysis in the classification of T2-weighted magnetic resonance images in persons with and without low back pain. J Orthop Res 2021;39:2187-2196. PMID: 33247597; DOI: 10.1002/jor.24930.
Abstract
Magnetic resonance imaging findings often do not distinguish between people with and without low back pain (LBP). However, many people still undergo magnetic resonance imaging to help determine the etiology of their back pain. Texture analysis shows promise for the classification of tissues that look similar, and machine learning can minimize the number of comparisons. This study aimed to determine whether texture features from lumbar spine magnetic resonance imaging differ between people with and without LBP. In total, 14 participants with chronic LBP were matched for age, weight, and gender with 14 healthy volunteers. Custom texture analysis software was used to construct gray-level co-occurrence matrices with offsets of one to four pixels in the 0° direction for the disc and the superior and inferior endplate regions. The Random Forest algorithm was used to select the most promising classifiers. Linear mixed-effect model analysis was used to compare groups (pain vs. pain-free) at each level, controlling for age. The Random Forest algorithm recommended focusing on the intervertebral discs and endplate zones at L4-5 and L5-S1. Differences were observed between groups for L5-S1 superior endplate contrast, homogeneity, and energy (p = .02). Differences were also observed for L5-S1 disc contrast and homogeneity (p < .01), as well as for inferior endplate contrast, homogeneity, and energy (p < .03). Magnetic resonance imaging texture features may have potential for identifying structures that could be the target of further investigations into the causes of LBP.
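The co-occurrence features named above (contrast, homogeneity, energy) can be computed from first principles. A hedged sketch: the 4x4 four-level "image" is a toy, a real pipeline would first quantise a T2-weighted region of interest to a small number of gray levels, and "energy" is taken here as the angular second moment.

```python
import numpy as np

def glcm_features(img, offset=(0, 1), levels=8):
    # Build a gray-level co-occurrence matrix for one pixel offset
    # ((0, 1) = one pixel to the right, i.e. the 0° direction), then derive
    # the contrast, homogeneity, and energy descriptors used above.
    dr, dc = offset
    glcm = np.zeros((levels, levels))
    rows, cols = img.shape
    for r in range(rows - dr):
        for c in range(cols - dc):
            glcm[img[r, c], img[r + dr, c + dc]] += 1
    glcm /= glcm.sum()  # normalise counts to co-occurrence probabilities
    i, j = np.indices((levels, levels))
    return {
        "contrast": float((glcm * (i - j) ** 2).sum()),
        "homogeneity": float((glcm / (1.0 + np.abs(i - j))).sum()),
        "energy": float((glcm ** 2).sum()),  # angular second moment
    }

# Toy four-level "image" with two homogeneous blocks.
img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [2, 2, 3, 3],
                [2, 2, 3, 3]])
print(glcm_features(img, levels=4))
```

Sweeping `offset` from `(0, 1)` to `(0, 4)` reproduces the one-to-four-pixel offsets described in the abstract.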

10. A permutation test for assessing the presence of individual differences in treatment effects. Stat Methods Med Res 2021;30:2369-2381. PMID: 34570622; DOI: 10.1177/09622802211033640.
Abstract
An important goal of personalized medicine is to identify heterogeneity in treatment effects and then use that heterogeneity to target the intervention to those most likely to benefit. Heterogeneity is assessed using the predicted individual treatment effects framework, and a permutation test is proposed to establish whether significant heterogeneity is present given the covariates and the predictive model or algorithm used for the predicted individual treatment effects. We first show evidence for heterogeneity in the effects of treatment across an illustrative example data set. We then use simulations with two different predictive methods (a linear regression model and Random Forests) to show that the permutation test has adequate type-I error control. Next, we use an example dataset as the basis for simulations to demonstrate the ability of the permutation test to find heterogeneity in treatment effects for a predicted individual treatment effects estimate as a function of both effect size and sample size. We find that the proposed test has good power for detecting heterogeneity in treatment effects whether the heterogeneity is due primarily to a single predictor or spread across the predictors. Power is greater for predictions from a linear model than from random forests. This non-parametric permutation test can be used to test for significant differences across individuals in predicted individual treatment effects obtained with a given set of covariates, using any predictive method, with no additional assumptions.
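The logic of such a test can be sketched with a linear interaction model standing in for the paper's predictive methods. Everything below (the S-learner-style estimator, the variance-of-predicted-effects statistic, the effect sizes and sample size) is an illustrative assumption, not the authors' implementation.

```python
import numpy as np

def predicted_ite_variance(X, t, y):
    # Fit a linear model with treatment-by-covariate interactions; the spread
    # of the predicted individual treatment effects (d y / d t per person)
    # serves as the heterogeneity statistic.
    D = np.column_stack([np.ones_like(t), X, t, t[:, None] * X])
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)
    p = X.shape[1]
    ite = beta[p + 1] + X @ beta[p + 2:]
    return ite.var()

def permutation_test(X, t, y, n_perm=500, seed=0):
    # Permuting treatment labels breaks any treatment-covariate interaction,
    # giving a null distribution for the heterogeneity statistic.
    rng = np.random.default_rng(seed)
    observed = predicted_ite_variance(X, t, y)
    null = [predicted_ite_variance(X, rng.permutation(t), y) for _ in range(n_perm)]
    return (1 + sum(s >= observed for s in null)) / (1 + n_perm)

rng = np.random.default_rng(42)
n = 400
X = rng.normal(size=(n, 2))
t = rng.integers(0, 2, n).astype(float)
y = X[:, 0] + t * (0.5 + 1.0 * X[:, 1]) + rng.normal(size=n)  # effect varies with X[:, 1]
print("permutation p-value:", permutation_test(X, t, y))
```

With a genuinely heterogeneous effect, as simulated here, the observed statistic should sit far in the tail of the permuted null and yield a small p-value.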

11. Modeling drivers' reaction when being tailgated: A Random Forests Method. J Safety Res 2021;78:28-35. PMID: 34399925; DOI: 10.1016/j.jsr.2021.05.004.
Abstract
BACKGROUND Tailgating is a common aggressive driving behavior that has been identified as one of the leading causes of rear-end crashes. Previous studies have explored the behavior of tailgating drivers and have reported effective solutions to decrease the amount or prevalence of tailgating. This paper tries to fill the research gap by focusing on understanding highway tailgating scenarios and examining the leading vehicles' reaction using existing naturalistic driving data. METHOD A total of 1,255 tailgating events were identified by using the one-second time headway threshold criterion. Four types of reactions from the leading vehicles were identified, including changing lanes, slowing down, speeding up, and making no response. A Random Forests algorithm was employed in this study to predict the leading vehicle's reaction based on corresponding factors including driver, vehicle, and environmental variables. RESULTS The analysis of the tailgating scenarios and associated factors showed that male drivers were more frequently involved in tailgating events than female drivers and that tailgating was more prevalent under sunny weather and in daytime conditions. Changing lanes was the most prevalent reaction from the leading vehicle during tailgating, which accounted for more than half of the total events. The results of Random Forests showed that mean time headway, duration of tailgating, and minimum time headway were three main factors, which had the greatest impact on the leading vehicle drivers' reaction. It was found that in 95% of the events, leading vehicles would change lanes when being tailgated for two minutes or longer. Practical Applications: Results of this study can help to better understand the behavior and decision making of drivers. This understanding can be used in designing countermeasures or assistance systems to reduce tailgating behavior and related negative safety consequences.

12. Computational intelligence identifies alkaline phosphatase (ALP), alpha-fetoprotein (AFP), and hemoglobin levels as most predictive survival factors for hepatocellular carcinoma. Health Informatics J 2021;27:1460458220984205. PMID: 33504243; DOI: 10.1177/1460458220984205.
Abstract
Liver cancer kills approximately 800,000 people annually worldwide, and its most common subtype is hepatocellular carcinoma (HCC), which usually affects people with cirrhosis. Predicting the survival of patients with HCC remains an important challenge, especially because the technologies needed for this purpose are not available in all hospitals. In this context, machine learning applied to medical records can be a fast, low-cost tool to predict survival and detect the most predictive features from health records. In this study, we analyzed medical data of 165 patients with HCC: we employed computational intelligence to predict their survival and to detect the most relevant clinical factors able to discriminate survivors from deceased patients. Afterwards, we compared our data mining results with those obtained through statistical tests and scientific literature findings. Our analysis revealed that blood levels of alkaline phosphatase (ALP), alpha-fetoprotein (AFP), and hemoglobin are the most effective prognostic factors in this dataset. We found literature supporting the association of these three factors with hepatoma, even though only AFP has been used in a prognostic index. Our results suggest that ALP and hemoglobin can be candidates for future HCC prognostic indexes, and that physicians could focus on ALP, AFP, and hemoglobin when studying HCC records.

13. Uncovering the Most Important Factors for Predicting Sexual Desire Using Explainable Machine Learning. J Sex Med 2021;18:1198-1216. PMID: 37057427; DOI: 10.1016/j.jsxm.2021.04.010.
Abstract
BACKGROUND Low sexual desire is the most common sexual problem reported, with 34% of women and 15% of men reporting a lack of desire for at least 3 months in a 12-month period. Sexual desire has previously been associated with both relationship and individual well-being, highlighting the importance of understanding factors that contribute to sexual desire, as improving sexual desire difficulties can help improve an individual's overall quality of life. AIM The purpose of the present study was to identify the most salient individual (eg, attachment style, attitudes toward sexuality, gender) and relational (eg, relationship satisfaction, sexual satisfaction, romantic love) predictors of dyadic and solitary sexual desire from a large number of predictor variables. METHODS Previous research has relied primarily on traditional statistical models, which are limited in their ability to estimate a large number of predictors, non-linear associations, and complex interactions. We used a machine learning algorithm, random forest (a highly non-linear ensemble of decision trees), to circumvent these issues and predict dyadic and solitary sexual desire from a large number of predictors across 2 online samples (N = 1,846; includes 754 individuals forming 377 couples). We also used a Shapley value technique to estimate the size and direction of the effect of each predictor variable on the model outcome. OUTCOMES The outcomes included total, dyadic, and solitary sexual desire measured using the Sexual Desire Inventory. RESULTS The models predicted around 40% of the variance in dyadic and solitary desire, with women's desire being more predictable than men's overall. Several variables consistently predicted dyadic sexual desire, such as sexual satisfaction and romantic love, and solitary desire, such as masturbation and attitudes toward sexuality. These predictors were similar for both men and women, and gender was not an important predictor of sexual desire.
CLINICAL TRANSLATION The results highlight the importance of addressing overall relationship satisfaction when sexual desire difficulties are presented in couples therapy. It is also important to understand clients' attitudes toward sexuality. STRENGTHS & LIMITATIONS The study improves on existing methodologies in the field and compares a large number of predictors of sexual desire. However, the data were cross-sectional and there may have been variables that are important for desire but were not present in the datasets. CONCLUSION Higher sexual satisfaction and feelings of romantic love toward one's partner are important predictors of dyadic sexual desire whereas regular masturbation and more permissive attitudes toward sexuality predicted solitary sexual desire. Vowels LM, Vowels MJ, Mark KP. Uncovering the Most Important Factors for Predicting Sexual Desire Using Explainable Machine Learning. J Sex Med 2021;18:1198-1216.

14. A Methodological Framework to Discover Pharmacogenomic Interactions Based on Random Forests. Genes (Basel) 2021;12:933. PMID: 34207374; PMCID: PMC8235396; DOI: 10.3390/genes12060933.
Abstract
The identification of genomic alterations in tumor tissues, including somatic mutations, deletions, and gene amplifications, produces large amounts of data, which can be correlated with a diversity of therapeutic responses. We aimed to provide a methodological framework to discover pharmacogenomic interactions based on Random Forests. We matched two databases, from the Cancer Cell Line Encyclopedia (CCLE) project and the Genomics of Drug Sensitivity in Cancer (GDSC) project. For a total of 648 shared cell lines, we considered 48,270 gene alterations from CCLE as input features and the area under the dose-response curve (AUC) for 265 drugs from GDSC as the outcomes. A three-step reduction to 501 alterations was performed, selecting known driver genes and excluding very frequent/infrequent alterations as well as redundant ones. For each model, we used the concordance correlation coefficient (CCC) to assess predictive performance and permutation importance to assess the contribution of each alteration. In a reasonable computational time (56 min), we identified 12 compounds whose response was at least fairly sensitive (CCC > 20) to the alteration profiles. Some differences were found among the sets of influential alterations, providing clues to discover significant drug-gene interactions. The proposed methodological framework can be helpful for mining pharmacogenomic interactions.
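The concordance correlation coefficient used as the performance metric can be computed directly. A minimal sketch, assuming the population (biased) moment definition of Lin's CCC; note the paper's threshold of CCC > 20 suggests a 0-100 scaling, whereas this function returns values in [-1, 1].

```python
import numpy as np

def ccc(y_true, y_pred):
    # Lin's concordance correlation coefficient: penalises both poor
    # correlation and shifts in location or scale between the two series.
    mx, my = y_true.mean(), y_pred.mean()
    vx, vy = y_true.var(), y_pred.var()
    cov = ((y_true - mx) * (y_pred - my)).mean()
    return 2 * cov / (vx + vy + (mx - my) ** 2)

obs = np.array([0.2, 0.5, 0.7, 0.9])      # e.g. observed dose-response AUCs
print(ccc(obs, obs))                      # perfect agreement -> 1.0
print(round(ccc(obs, obs + 0.3), 3))      # same correlation, shifted location
```

Unlike Pearson's r, which is 1.0 for both comparisons above, CCC drops when predictions are systematically shifted, which is why it suits prediction-vs-observation agreement.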

15. Identification and Functional Annotation of Genes Related to Bone Stability in Laying Hens Using Random Forests. Genes (Basel) 2021;12:702. PMID: 34066823; PMCID: PMC8151682; DOI: 10.3390/genes12050702.
Abstract
Skeletal disorders, including fractures and osteoporosis, in laying hens cause major welfare and economic problems. Although genetics have been shown to play a key role in bone integrity, little is yet known about the underlying genetic architecture of the traits. This study aimed to identify genes associated with bone breaking strength and bone mineral density of the tibiotarsus and the humerus in laying hens. Potentially informative single nucleotide polymorphisms (SNP) were identified using Random Forests classification. We then searched for genes known to be related to bone stability in close proximity to the SNPs and identified 16 potential candidates. Some of them had human orthologues. Based on our findings, we can support the assumption that multiple genes determine bone strength, with each of them having a rather small effect, as illustrated by our SNP effect estimates. Furthermore, the enrichment analysis showed that some of these candidates are involved in metabolic pathways critical for bone integrity. In conclusion, the identified candidates represent genes that may play a role in the bone integrity of chickens. Although further studies are needed to determine causality, the genes reported here are promising in terms of alleviating bone disorders in laying hens.

16. Combining Random Forests and a Signal Detection Method Leads to the Robust Detection of Genotype-Phenotype Associations. Genes (Basel) 2020;11:892. PMID: 32764260; PMCID: PMC7465705; DOI: 10.3390/genes11080892.
Abstract
Genome-wide association studies (GWAS) are a well-established methodology to identify genomic variants and genes that are responsible for traits of interest in all branches of the life sciences. Despite the long time this methodology has had to mature, the reliable detection of genotype-phenotype associations is still a challenge for many quantitative traits, mainly because of the large number of genomic loci with weak individual effects on the trait under investigation. Thus, it can be hypothesized that many genomic variants that have a small, however real, effect remain unnoticed in many GWAS approaches. Here, we propose a two-step procedure to address this problem. In a first step, cubic splines are fitted to the test statistic values, and genomic regions with spline peaks that are higher than expected by chance are considered quantitative trait loci (QTL). Then, the SNPs in these QTLs are prioritized with respect to the strength of their association with the phenotype using a Random Forests approach. As a case study, we apply our procedure to real data sets and find trustworthy numbers of partially novel genomic variants and genes involved in various egg quality traits.
Collapse
|
17
|
Machine learning uncovers the most robust self-report predictors of relationship quality across 43 longitudinal couples studies. Proc Natl Acad Sci U S A 2020; 117:19061-19071. [PMID: 32719123 DOI: 10.1073/pnas.1917036117] [Citation(s) in RCA: 68] [Impact Index Per Article: 17.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
Given the powerful implications of relationship quality for health and well-being, a central mission of relationship science is explaining why some romantic relationships thrive more than others. This large-scale project used machine learning (i.e., Random Forests) to 1) quantify the extent to which relationship quality is predictable and 2) identify which constructs reliably predict relationship quality. Across 43 dyadic longitudinal datasets from 29 laboratories, the top relationship-specific predictors of relationship quality were perceived-partner commitment, appreciation, sexual satisfaction, perceived-partner satisfaction, and conflict. The top individual-difference predictors were life satisfaction, negative affect, depression, attachment avoidance, and attachment anxiety. Overall, relationship-specific variables predicted up to 45% of variance at baseline, and up to 18% of variance at the end of each study. Individual differences also performed well (21% and 12%, respectively). Actor-reported variables (i.e., own relationship-specific and individual-difference variables) predicted two to four times more variance than partner-reported variables (i.e., the partner's ratings on those variables). Importantly, individual differences and partner reports had no predictive effects beyond actor-reported relationship-specific variables alone. These findings imply that the sum of all individual differences and partner experiences exert their influence on relationship quality via a person's own relationship-specific experiences, and effects due to moderation by individual differences and moderation by partner-reports may be quite small. Finally, relationship-quality change (i.e., increases or decreases in relationship quality over the course of a study) was largely unpredictable from any combination of self-report variables. This collective effort should guide future models of relationships.
Collapse
|
18
|
Growing Random Forests reveals that exposure and proficiency best account for individual variability in L2 (and L1) brain potentials for syntax and semantics. BRAIN AND LANGUAGE 2020; 204:104770. [PMID: 32114146 DOI: 10.1016/j.bandl.2020.104770] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/14/2019] [Revised: 01/18/2020] [Accepted: 02/01/2020] [Indexed: 06/10/2023]
Abstract
Late second language (L2) learners report difficulties in specific linguistic areas such as syntactic processing, presumably because brain plasticity declines with age (following the critical period hypothesis). While there is also evidence that L2 learners can achieve native-like online-processing with sufficient proficiency (following the convergence hypothesis), considering multiple mediating factors and their impact on language processing has proven challenging. We recorded EEG while native (n = 36) and L2-speakers of French (n = 40) read sentences that were either well-formed or contained a syntactic-category error, a lexical-semantic anomaly, or both. Consistent with the critical period hypothesis, group differences revealed that while native speakers showed a biphasic N400-P600 response to ungrammatical sentences, L2 learners as a group showed only an N400. However, individual data modeling using a Random Forests approach revealed that language exposure and proficiency are the most reliable predictors in explaining ERP responses, with N400 and P600 effects becoming larger as exposure to French as well as proficiency increased, as predicted by the convergence hypothesis.
Collapse
|
19
|
Locating Forest Management Units Using Remote Sensing and Geostatistical Tools in North-Central Washington, USA. SENSORS 2020; 20:s20092454. [PMID: 32357414 PMCID: PMC7249656 DOI: 10.3390/s20092454] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/05/2020] [Revised: 04/21/2020] [Accepted: 04/24/2020] [Indexed: 11/17/2022]
Abstract
In this study, we share an approach to locate and map forest management units with high accuracy and with relatively rapid turnaround. Our study area consists of private, state, and federal land holdings that cover four counties in North-Central Washington, USA (Kittitas, Okanogan, Chelan and Douglas). This area has a rich history of landscape change caused by frequent wildfires, insect attacks, disease outbreaks, and forest management practices, which is only partially documented across ownerships in an inconsistent fashion. To consistently quantify forest management activities for the entire study area, we leveraged Sentinel-2 satellite imagery, LANDFIRE existing vegetation types and disturbances, Monitoring Trends in Burn Severity fire perimeters, and Landsat 8 Burned Area products. Within our methodology, Sentinel-2 images were collected and transformed to orthogonal land cover change difference and ratio metrics using principal component analyses. In addition, the Normalized Difference Vegetation Index and the Relativized Burn Ratio index were estimated. These variables were used as predictors in Random Forests machine learning classification models. Known locations of forest treatment units were used to create samples to train the Random Forests models to estimate where changes in forest structure occurred between 2016 and 2019. We visually inspected each derived polygon to manually assign one treatment class, either clearcut or thinning. Landsat 8 Burned Area products were used to derive prescribed fire units for the same period. The bulk of analyses were performed using the RMRS Raster Utility toolbar that facilitated spatial, statistical, and machine learning tools, while significantly reducing the required processing time and storage space associated with analyzing these large datasets. The results were combined with existing LANDFIRE vegetation disturbance and forest treatment data to create a 21-year dataset (1999–2019) for the study area.
Collapse
|
20
|
Identification of Age-Specific and Common Key Regulatory Mechanisms Governing Eggshell Strength in Chicken Using Random Forests. Genes (Basel) 2020; 11:genes11040464. [PMID: 32344666 PMCID: PMC7230204 DOI: 10.3390/genes11040464] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/16/2020] [Revised: 04/08/2020] [Accepted: 04/21/2020] [Indexed: 12/21/2022] Open
Abstract
In today's chicken egg industry, maintaining the strength of eggshells in longer laying cycles is pivotal for improving the persistency of egg laying. Eggshell development and mineralization underlie a complex regulatory interplay of various proteins and signaling cascades involving multiple organ systems. Understanding the regulatory mechanisms influencing this dynamic trait over time is imperative, yet such understanding remains scarce. To investigate the temporal changes in the signaling cascades, we considered eggshell strength at two different time points during the egg production cycle and studied the genotype-phenotype associations by employing the Random Forests algorithm on chicken genotypic data. For the analysis of corresponding genes, we adopted a well-established systems biology approach to delineate gene regulatory pathways and master regulators underlying this important trait. Our results indicate that, while some of the master regulators (Slc22a1 and Sox11) and pathways are common at different laying stages of chicken, others (e.g., Scn11a, St8sia2, or the TGF-β pathway) represent age-specific functions. Overall, our results provide: (i) significant insights into age-specific and common molecular mechanisms underlying the regulation of eggshell strength; and (ii) new breeding targets to improve the eggshell quality during the later stages of the chicken production cycle.
Collapse
|
21
|
[Assessment of Heavy Metal Pollution in Surface Dust of Lanzhou Schools Based on Random Forests]. HUAN JING KE XUE= HUANJING KEXUE 2020; 41:1838-1846. [PMID: 32608692 DOI: 10.13227/j.hjkx.201908118] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
In this study, seven heavy metal elements and 11 characteristic parameters affecting heavy metal pollution and accumulation in surface dust were selected. Using the comprehensive pollution index (PN) and the potential ecological risk index (RI), calculated from the heavy metal content of school dust in the main urban area of Lanzhou City in 2018, as the training set, the PN and RI at the interpolation points were estimated with random forests. The temporal and spatial characteristics of heavy metals in school dust in the main urban area of Lanzhou were then analyzed. Finally, correlation coefficients were used to compare the traditional interpolation results with the random forest interpolation results. The results showed that the concentrations of heavy metals in the dust were higher than the local background values: the exceedance rate of individual samples was 100%, with Zn five times and Pb four times the background value. PN in the study area followed the order Chengguan > Xigu > Anning > Qilihe, and RI the order Chengguan > Xigu > Qilihe > Anning. PN and RI exhibited very similar spatial distributions, with hotspots located at transportation hubs and downtown areas. PN exhibited high values in winter and summer, and RI was also high in winter; the high winter values were attributable to the increase in coal-burning sources in that season. The comparison of spatial interpolation results shows that the correlation coefficients between the random forest interpolation results and traffic flow and the normalized building index are greater than those of the traditional algorithm.
Collapse
|
22
|
Suitable climatic habitat changes for Mexican conifers along altitudinal gradients under climatic change scenarios. ECOLOGICAL APPLICATIONS : A PUBLICATION OF THE ECOLOGICAL SOCIETY OF AMERICA 2020; 30:e02041. [PMID: 31758621 DOI: 10.1002/eap.2041] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/05/2019] [Revised: 08/23/2019] [Accepted: 09/04/2019] [Indexed: 06/10/2023]
Abstract
The high biodiversity of the Mexican montane forests is concentrated on the Trans-Mexican Volcanic Belt, where several Protected Natural Areas exist. Our study examines the projected changes in suitable climatic habitat for five conifer species that dominate these forests. The species are distributed sequentially in overlapping altitudinal bands: Pinus hartwegii at the upper timberline, followed by Abies religiosa, the overwintering host of the Monarch butterfly at the Monarch Butterfly Biosphere Reserve, P. pseudostrobus, the most important in economic terms, and P. devoniana and P. oocarpa, which are important for resin production and occupy low altitudes where montane conifers merge with tropical dry forests. We fit a bioclimatic model to presence-absence observations for each species using the Random Forests classification tree with ground plot data. The models are driven by climate normals from 1961 to 1990, which represents the reference period for climate-induced vegetation changes. Climate data from an ensemble of 17 general circulation models were run through the classification tree to project the species' distributions under climates described by the RCP 6.0 W/m² scenario for the decades centered on 2030, 2060 and 2090. The results suggest that, by 2060, the climate niche of each species will occur at elevations between 300 and 500 m higher than at present. By 2060, habitat loss could amount to 46-77%, mostly affecting the lower limits of distribution. The two species at the highest elevation, P. hartwegii and A. religiosa, would suffer the greatest losses while, at the lower elevations, P. oocarpa would gain the most niche space. Our results suggest that these conifers will require human assistance to migrate altitudinally upward in order to recouple populations with the climates to which they are adapted. Traditional in situ conservation measures are likely to be equivalent to inaction and will therefore be incapable of maintaining current forest compositions.
Collapse
|
23
|
Predicting in vitro human mesenchymal stromal cell expansion based on individual donor characteristics using machine learning. Cytotherapy 2020; 22:82-90. [PMID: 31987754 DOI: 10.1016/j.jcyt.2019.12.006] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/16/2019] [Revised: 11/20/2019] [Accepted: 12/08/2019] [Indexed: 12/21/2022]
Abstract
BACKGROUND Human mesenchymal stromal cells (hMSCs) have become attractive candidates for advanced medical cell-based therapies. An in vitro expansion step is routinely used to reach the required clinical quantities. However, this is influenced by many variables including donor characteristics, such as age and gender, and culture conditions, such as cell seeding density and available culture surface area. Computational modeling in general and machine learning in particular could play a significant role in deciphering the relationship between the individual donor characteristics and their growth dynamics. METHODS In this study, hMSCs obtained from 174 male and female donors, between 3 and 64 years of age with passage numbers ranging from 2 to 27, were studied. We applied a Random Forests (RF) technique to model the cell expansion procedure by predicting the population doubling time (PDT) for each passage, taking into account individual donor-related characteristics. RESULTS Using the RF model, the mean absolute error between model predictions and experimental results for the PDT in passages 1 to 4 is significantly lower compared with the errors obtained with theoretical estimates or historical data. Moreover, statistical analysis indicates that the PD and PDT in different age categories are significantly different, especially in the youngest group (younger than 10 years of age) compared with the other age groups. DISCUSSION In summary, we introduce a predictive computational model describing in vitro cell expansion dynamics based on individual donor characteristics, an approach that could greatly assist toward automation of a cell expansion culture process.
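A minimal sketch of this kind of model on entirely synthetic donor data (the real study used 174 donors and richer characteristics; the effect sizes below are invented): a Random Forests regressor predicts PDT from age, sex, and passage number, evaluated by mean absolute error.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical donor table: age (years), sex (0/1), passage number.
n = 500
age = rng.uniform(3, 64, n)
sex = rng.integers(0, 2, n)
passage = rng.integers(2, 28, n)

# Synthetic population doubling time (hours): grows with age and passage.
pdt = 20 + 0.2 * age + 0.8 * passage + rng.normal(0, 2, n)

X = np.column_stack([age, sex, passage])
X_tr, X_te, y_tr, y_te = train_test_split(X, pdt, random_state=0)

rf = RandomForestRegressor(n_estimators=300, random_state=0)
rf.fit(X_tr, y_tr)
mae = mean_absolute_error(y_te, rf.predict(X_te))
print(f"test MAE: {mae:.2f} h")
```

The study compared exactly this kind of per-passage MAE against theoretical and historical estimates.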
Collapse
|
24
|
Abstract
Extending previous work on quantile classifiers (q-classifiers), we propose the q*-classifier for the class imbalance problem. The classifier assigns a sample to the minority class if the minority class conditional probability exceeds q*, where 0 < q* < 1 equals the unconditional probability of observing a minority class sample. The motivation for q*-classification stems from a density-based approach and leads to the useful property that the q*-classifier maximizes the sum of the true positive and true negative rates. Moreover, because the procedure can be equivalently expressed as a cost-weighted Bayes classifier, it also minimizes weighted risk. Because of this dual optimization, the q*-classifier can achieve near-zero risk in imbalance problems, while simultaneously optimizing true positive and true negative rates. We use random forests to apply q*-classification. This new method, which we call RFQ, is shown to outperform or be competitive with existing techniques with respect to G-mean performance and variable selection. Extensions to the multiclass imbalanced setting are also considered.
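The q*-rule itself is easy to reproduce with any probabilistic classifier; the sketch below applies it with a Random Forest on a synthetic imbalanced problem (an illustration of the threshold rule, not necessarily identical to the authors' RFQ implementation).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Imbalanced toy problem: roughly 5% minority class.
X, y = make_classification(n_samples=4000, n_features=10, weights=[0.95],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
prob_minority = rf.predict_proba(X_te)[:, 1]

# q* equals the unconditional minority-class probability; predict the
# minority class whenever its conditional probability exceeds q*.
q_star = y_tr.mean()
pred_qstar = (prob_minority > q_star).astype(int)
pred_default = (prob_minority > 0.5).astype(int)   # usual 0.5 Bayes rule

def tpr_tnr(y_true, y_pred):
    tpr = y_pred[y_true == 1].mean()        # true positive rate
    tnr = 1 - y_pred[y_true == 0].mean()    # true negative rate
    return tpr, tnr

print("q* rule  TPR/TNR:", tpr_tnr(y_te, pred_qstar))
print("0.5 rule TPR/TNR:", tpr_tnr(y_te, pred_default))
```

Because q* is far below 0.5 under imbalance, the q*-rule trades a little true negative rate for a much larger true positive rate, which is exactly the TPR + TNR optimization the abstract describes.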
Collapse
|
25
|
A hospital wide predictive model for unplanned readmission using hierarchical ICD data. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2019; 173:177-183. [PMID: 30777619 DOI: 10.1016/j.cmpb.2019.02.007] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/31/2018] [Revised: 01/22/2019] [Accepted: 02/12/2019] [Indexed: 06/09/2023]
Abstract
BACKGROUND AND OBJECTIVE Hospitals already acquire a large amount of data, mainly for administrative, billing and registration purposes. Tapping these already available data for additional purposes offers a way to improve care without significant incremental effort or cost. This potential of secondary patient data is explored by modeling administrative and billing data, as well as the hierarchical structure of the pathology codes of the International Classification of Diseases (ICD), to predict unplanned readmissions, a clinically relevant outcome parameter that can be acted on in a quality improvement program. METHODS In this single-center, hospital-wide observational cohort study, we included all adult patients discharged in 2016 after applying an exclusion protocol (n = 29,702). In addition to administrative variables, such as age and length of stay, structured pathology data were taken into account in the predictive models. As a first research question, we compared logistic regression against penalized logistic regression, gradient boosting and Random Forests to predict unplanned readmission. As a second research goal, we investigated the level of hierarchy within the pathology data needed to achieve the best accuracy. Finally, we investigated which prediction variables play a prominent role in predicting hospital readmission. The performance of all models was evaluated using the Area Under the ROC Curve (AUC) measure. RESULTS Random Forests yielded the best predictive results, with an added value of 7% over a baseline method such as logistic regression. The best model, based on Random Forests, achieved an AUC of 0.77, using the diagnosis category and procedure code as the lowest level of the hierarchical pathology data. CONCLUSIONS The most accurate model to predict hospital-wide unplanned readmission is based on Random Forests and includes the ICD hierarchy, especially the diagnosis category.
Such an approach lowers the number of predictor variables and yields a higher interpretability than a model based on a detailed diagnosis. The performance of the model proved high enough to be used as a decision support tool.
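A hedged sketch of the winning configuration: detailed ICD codes collapsed to their diagnosis category, combined with administrative variables, feeding a Random Forests classifier scored by AUC. The codes, outcome model, and coefficients below are entirely synthetic stand-ins for the study's data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Hypothetical admissions: age, length of stay and a detailed ICD-10 code.
codes = ["I21.0", "I21.4", "E11.9", "E11.6", "J18.9", "J18.1"]
n = 2000
df = pd.DataFrame({
    "age": rng.integers(18, 95, n),
    "los": rng.integers(1, 30, n),
    "icd": rng.choice(codes, n),
})
# Collapse the detailed code to its diagnosis category (e.g. "I21.0" -> "I21"),
# the hierarchy level the study found most informative.
df["icd_cat"] = df["icd"].str.split(".").str[0]

# Synthetic readmission outcome loosely tied to age, stay length and category.
logit = 0.03 * df["age"] + 0.05 * df["los"] + 0.8 * (df["icd_cat"] == "I21") - 4
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = pd.get_dummies(df[["age", "los", "icd_cat"]], columns=["icd_cat"])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
print(f"readmission AUC: {auc:.2f}")
```

Truncating codes to the category level keeps the dummy-encoded feature matrix small, which is the interpretability advantage the conclusion refers to.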
Collapse
|
26
|
EEG Window Length Evaluation for the Detection of Alzheimer's Disease over Different Brain Regions. Brain Sci 2019; 9:E81. [PMID: 31013964 PMCID: PMC6523667 DOI: 10.3390/brainsci9040081] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/08/2019] [Revised: 04/10/2019] [Accepted: 04/10/2019] [Indexed: 12/31/2022] Open
Abstract
Alzheimer's Disease (AD) is a neurodegenerative disorder and the most common type of dementia, with a rapidly increasing world prevalence. In this paper, the ability of several statistical and spectral features to detect AD from electroencephalographic (EEG) recordings is evaluated. For this purpose, clinical EEG recordings from 14 patients with AD (8 with mild AD and 6 with moderate AD) and 10 healthy, age-matched individuals are analyzed. The EEG signals are initially segmented in nonoverlapping epochs of different lengths ranging from 5 s to 12 s. Then, a group of statistical and spectral features calculated for each EEG rhythm (δ, θ, α, β, and γ) are extracted, forming the feature vector used to train and test a Random Forests classifier. Six classification problems are addressed, including the discrimination from whole-brain dynamics and separately from specific brain regions in order to highlight any alterations of the cortical regions. The results indicated a high accuracy ranging from 88.79% to 96.78% for whole-brain classification. Also, the classification accuracy was higher at the posterior and central regions than at the frontal area and the right side of the temporal lobe for all classification problems.
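The epoching and spectral-feature step can be sketched as follows: the signal is cut into non-overlapping 5-s epochs and the relative power of each EEG rhythm is computed per epoch via Welch's method, yielding one feature row per epoch for a classifier such as Random Forests. The sampling rate and the simulated alpha rhythm are assumptions for the demo, not values from the study.

```python
import numpy as np
from scipy.signal import welch

fs = 256                      # sampling rate in Hz (an assumed value)
epoch_len = 5 * fs            # 5-s non-overlapping epochs

# Simulated single-channel EEG: white noise plus a 10 Hz (alpha) rhythm.
rng = np.random.default_rng(0)
t = np.arange(60 * fs) / fs
eeg = rng.normal(size=t.size) + 2.0 * np.sin(2 * np.pi * 10 * t)

bands = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 45)}

def band_powers(epoch):
    """Relative spectral power of each EEG rhythm in one epoch."""
    f, psd = welch(epoch, fs=fs, nperseg=fs)
    total = psd.sum()
    return [psd[(f >= lo) & (f < hi)].sum() / total for lo, hi in bands.values()]

epochs = eeg[: (eeg.size // epoch_len) * epoch_len].reshape(-1, epoch_len)
features = np.array([band_powers(e) for e in epochs])  # one row per epoch
print(features.shape)   # (12, 5): 12 epochs x 5 rhythms
```

With the injected 10 Hz rhythm, the alpha column dominates each row, which is the kind of band-level signature the classifier exploits.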
Collapse
|
27
|
Crash injury severity analysis using a two-layer Stacking framework. ACCIDENT; ANALYSIS AND PREVENTION 2019; 122:226-238. [PMID: 30390518 DOI: 10.1016/j.aap.2018.10.016] [Citation(s) in RCA: 60] [Impact Index Per Article: 12.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/20/2018] [Revised: 10/18/2018] [Accepted: 10/22/2018] [Indexed: 06/08/2023]
Abstract
Crash injury severity analysis is useful for traffic management agencies to further understand the severity of crashes. A two-layer Stacking framework is proposed in this study to predict crash injury severity: the first layer integrates the advantages of three base classification methods, RF (Random Forests), AdaBoost (Adaptive Boosting), and GBDT (Gradient Boosting Decision Tree); the second layer completes the classification of crash injury severity based on a Logistic Regression model. A total of 5538 crashes were recorded at 326 freeway diverge areas. In the model calibration, several parameters including the number of trees in the three base classification methods, the learning rate, and the regularization coefficient were optimized via a systematic grid search approach. In the model validation, the performance of the Stacking model is compared with several traditional models including the Support Vector Machine (SVM), Multi-Layer Perceptron (MLP) and Random Forests (RF) in multi-class classification experiments. The prediction results show that the Stacking model achieves superior performance as evaluated by two indicators: accuracy and recall. Furthermore, all the factors used in severity prediction are classified into different categories according to their influence on the results, and a sensitivity analysis of several significant factors is finally implemented to explore the impact of their value variation on the prediction accuracy.
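The described two-layer architecture maps directly onto scikit-learn's StackingClassifier; here is a sketch on stand-in data (the study's grid-searched hyperparameters and crash features are not reproduced):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in for the crash records: three injury-severity classes.
X, y = make_classification(n_samples=2000, n_features=12, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

stack = StackingClassifier(
    estimators=[                                        # first layer
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("ada", AdaBoostClassifier(random_state=0)),
        ("gbdt", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # second layer
    cv=5,
)
stack.fit(X_tr, y_tr)
acc = stack.score(X_te, y_te)
print(f"stacked test accuracy: {acc:.3f}")
```

The `cv=5` argument makes the first-layer predictions that feed the Logistic Regression out-of-fold, which prevents the second layer from simply memorizing base-model overfit.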
Collapse
|
28
|
Understanding multiple stressors in a Mediterranean basin: Combined effects of land use, water scarcity and nutrient enrichment. THE SCIENCE OF THE TOTAL ENVIRONMENT 2018; 624:1221-1233. [PMID: 29929235 DOI: 10.1016/j.scitotenv.2017.12.201] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/27/2017] [Revised: 12/16/2017] [Accepted: 12/18/2017] [Indexed: 06/08/2023]
Abstract
River basins are extremely complex hierarchical and directional systems that are affected by a multitude of interacting stressors. This complexity hampers the effective implementation of management and conservation planning, especially under climate change. The objective of this work is to provide a wide-scale approach to basin management by interpreting the effect of isolated and interacting factors on several biotic elements (fish, macroinvertebrates, phytobenthos and macrophytes). For that, a case study in the Sorraia basin (Central Portugal), a Mediterranean system mainly facing water scarcity and diffuse pollution problems, was chosen. To develop the proposed framework, a combination of process-based modelling to simulate hydrological and nutrient enrichment stressors and empirical modelling to relate these stressors - along with land use and natural background - with biotic indicators, was applied. Biotic indicators based on ecological quality ratios from WFD biomonitoring data were used as response variables. Temperature, river slope, % of agriculture in the upstream catchment and total N were the variables most frequently ranked as the most relevant. Both significant interactions found between single hydrological and nutrient enrichment stressors indicated antagonistic effects. This study demonstrates the potential of coupling process-based modelling with empirical modelling within a single framework, allowing relationships among different ecosystem states to be hierarchized, interpreted and predicted at multiple spatial and temporal scales. It also demonstrates how isolated and interacting stressors can have a different impact on biotic quality. When performing conservation or management plans, the stressor hierarchy should be considered as a way of prioritizing actions in a cost-effective perspective.
Collapse
|
29
|
Interannual Change Detection of Mediterranean Seagrasses Using RapidEye Image Time Series. FRONTIERS IN PLANT SCIENCE 2018; 9:96. [PMID: 29467777 PMCID: PMC5808188 DOI: 10.3389/fpls.2018.00096] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/25/2017] [Accepted: 01/17/2018] [Indexed: 05/25/2023]
Abstract
Recent research studies have highlighted the decrease in the coverage of Mediterranean seagrasses due to mainly anthropogenic activities. The lack of data on the distribution of these significant aquatic plants complicates the quantification of their decreasing tendency. While Mediterranean seagrasses are declining, satellite remote sensing technology is growing at an unprecedented pace, resulting in a wealth of spaceborne image time series. Here, we exploit recent advances in high spatial resolution sensors and machine learning to study Mediterranean seagrasses. We process a multispectral RapidEye time series between 2011 and 2016 to detect interannual seagrass dynamics in 888 submerged hectares of the Thermaikos Gulf, NW Aegean Sea, Greece (eastern Mediterranean Sea). We assess the extent change of two Mediterranean seagrass species, the dominant Posidonia oceanica and Cymodocea nodosa, following atmospheric and analytical water column correction, as well as machine learning classification, using Random Forests, of the RapidEye time series. Prior corrections are necessary to untangle the initially weak signal of the submerged seagrass habitats from satellite imagery. The central results of this study show that P. oceanica seagrass area has declined by 4.1%, with a trend of -11.2 ha/yr, while C. nodosa seagrass area has increased by 17.7% with a trend of +18 ha/yr throughout the 5-year study period. Trends of change in spatial distribution of seagrasses in the Thermaikos Gulf site are in line with reported trends in the Mediterranean. Our presented methodology could be a time- and cost-effective method toward the quantitative ecological assessment of seagrass dynamics elsewhere in the future. From small meadows to whole coastlines, knowledge of aquatic plant dynamics could resolve decline or growth trends and accurately highlight key units for future restoration, management, and conservation.
Collapse
|
30
|
A Regression Model for Predicting Shape Deformation after Breast Conserving Surgery. SENSORS 2018; 18:s18010167. [PMID: 29315279 PMCID: PMC5795402 DOI: 10.3390/s18010167] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/07/2017] [Revised: 01/03/2018] [Accepted: 01/05/2018] [Indexed: 01/12/2023]
Abstract
Breast cancer treatments can have a negative impact on breast aesthetics, particularly when surgery is required to remove the tumor. For many years mastectomy was the only surgical option, but more recently breast conserving surgery (BCS) has been promoted as a viable alternative to treat cancer while preserving most of the breast. However, a significant number of patients who undergo BCS remain dissatisfied with the result of the treatment, which leads to self-image issues and emotional distress. Surgeons recognize the value of a tool to predict the breast shape after BCS to facilitate surgeon/patient communication and allow more informed decisions; however, no such tool suited for clinical usage is available. Such a tool could serve as a way of visually conveying the aesthetic consequences of the treatment. In this research, we propose a methodology to predict the breast deformation after BCS using machine learning techniques. However, no appropriate dataset containing breast data before and after surgery exists with which to train a learning model. Therefore, an in-house semi-synthetic dataset is proposed to fulfill the requirements of this research. Using the proposed dataset, several learning methodologies were investigated, and promising outcomes were obtained.
Collapse
|
31
|
Abstract
Background Clustering plays a crucial role in several application domains, such as bioinformatics. In bioinformatics, clustering has been extensively used as an approach for detecting interesting patterns in genetic data. One application is population structure analysis, which aims to group individuals into subpopulations based on shared genetic variations, such as single nucleotide polymorphisms. Advances in DNA sequencing technology have facilitated the acquisition of genetic datasets of exceptional size. Genetic data usually contain hundreds of thousands of genetic markers genotyped for thousands of individuals, making an efficient means of handling such data desirable. Results Random Forests (RFs) has emerged as an efficient algorithm capable of handling high-dimensional data. RFs provides a proximity measure that can capture different levels of co-occurring relationships between variables. RFs has been widely considered a supervised learning method, although it can be converted into an unsupervised learning method. Therefore, an RF-derived proximity measure combined with a clustering technique may be well suited for determining the underlying structure of unlabeled data. This paper proposes RFcluE, a cluster ensemble approach for determining the underlying structure of genetic data based on RFs. The approach comprises a cluster ensemble framework to combine multiple runs of RF clustering. Experiments were conducted on a high-dimensional, real genetic dataset to evaluate the proposed approach. The experiments included an examination of the impact of parameter changes, comparing RFcluE performance against other clustering methods, and an assessment of the relationship between the diversity and quality of the ensemble and its effect on RFcluE performance. Conclusions This paper proposes RFcluE, a cluster ensemble approach based on RF clustering to address the problem of population structure analysis and demonstrate the effectiveness of the approach. 
The paper also illustrates that applying a cluster ensemble approach, combining multiple RF clusterings, produces more robust and higher-quality results as a consequence of feeding the ensemble with diverse views of high-dimensional genetic data obtained through bagging and random subspace, the two key features of the RF algorithm.
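A sketch of one RF clustering run of the kind the ensemble combines (the full RFcluE framework additionally aggregates many such runs): a synthetic class is built by independently permuting each column, a forest learns to separate real from synthetic rows, the leaf co-occurrence proximity is extracted, and hierarchical clustering is applied to the distance 1 - proximity. The toy "genotypes" below are simulated, not real SNP data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy genotype matrix: two subpopulations with shifted marker frequencies.
n, p = 120, 50
real = np.vstack([rng.binomial(2, 0.2, (60, p)),
                  rng.binomial(2, 0.8, (60, p))]).astype(float)

# Unsupervised RF: permute each column independently to build a synthetic
# class that keeps the marginals but destroys the joint structure, then
# train a forest to classify real vs synthetic.
synthetic = np.column_stack([rng.permutation(real[:, j]) for j in range(p)])
X = np.vstack([real, synthetic])
y = np.r_[np.ones(n), np.zeros(n)]
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Proximity: fraction of trees in which two real samples share a leaf.
leaves = rf.apply(real)                       # shape (n, n_trees)
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

# Cluster on the RF-derived distance 1 - proximity.
dist = squareform(1 - prox, checks=False)
labels = fcluster(linkage(dist, method="average"), t=2, criterion="maxclust")
print(np.bincount(labels))
```

The bagging and random feature subspaces mentioned in the conclusion are exactly what make repeated runs of this procedure diverse enough to be worth ensembling.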
Collapse
|
32
|
Variable importance-weighted Random Forests. QUANTITATIVE BIOLOGY 2017; 5:338-351. [PMID: 30034909 PMCID: PMC6051549] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
BACKGROUND Random Forests is a popular classification and regression method that has proven powerful for various prediction problems in biological studies. However, its performance often deteriorates when the number of features increases. To address this limitation, feature-elimination Random Forests was proposed, which uses only the features with the largest variable importance scores. Yet the performance of this method is not satisfactory, possibly due to its rigid feature selection and the increased correlation between the trees of the forest. METHODS We propose variable importance-weighted Random Forests, which, instead of sampling features with equal probability at each node when building trees, samples features according to their variable importance scores and then selects the best split from the randomly selected features. RESULTS We evaluate the performance of our method through comprehensive simulation and real data analyses, for both regression and classification. Compared to the standard Random Forests and the feature-elimination Random Forests methods, our proposed method has improved performance in most cases. CONCLUSIONS By incorporating the variable importance scores into the random feature selection step, our method can better utilize more informative features without completely ignoring less informative ones, and hence has improved prediction accuracy in the presence of weak signals and large noise. We have implemented an R package "viRandomForests" based on the original R package "randomForest"; it can be freely downloaded from http://zhaocenter.org/software.
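The key modification, sampling candidate split features in proportion to their importance scores, can be illustrated outside a full forest implementation. The sketch below (scikit-learn and NumPy; the two-stage design, the toy data, and the mtry value are assumptions for illustration) obtains importance scores from a standard forest and shows how the weighted draw shifts candidacy toward informative features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           n_redundant=0, random_state=0)

# Stage 1: a standard forest supplies the variable importance scores.
vi = RandomForestClassifier(n_estimators=100,
                            random_state=0).fit(X, y).feature_importances_

# Core idea: at each node, the mtry candidate split features are drawn with
# probability proportional to importance rather than uniformly; the best
# split is then chosen among those candidates as usual.
def sample_candidate_features(importances, mtry, rng):
    p = np.clip(importances, 0.0, None)
    return rng.choice(len(importances), size=mtry, replace=False, p=p / p.sum())

rng = np.random.default_rng(0)
draws = np.array([sample_candidate_features(vi, 7, rng) for _ in range(2000)])
freq = np.bincount(draws.ravel(), minlength=50) / len(draws)  # candidacy rate per feature
```

Under uniform sampling every feature would be a candidate in about 7/50 of the draws; weighting by importance raises that rate sharply for the informative features without ever excluding the rest, which is the paper's contrast with hard feature elimination.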
Collapse
|
33
|
Abstract
Statistical classification is a critical component of utilizing metabolomics data for examining the molecular determinants of phenotypes. Despite this, a comprehensive and rigorous evaluation of the accuracy of classification techniques for phenotype discrimination given metabolomics data has not been conducted. We conducted such an evaluation using both simulated and real metabolomics datasets, comparing Partial Least Squares-Discriminant Analysis (PLS-DA), Sparse PLS-DA, Random Forests, Support Vector Machines (SVM), Artificial Neural Network, k-Nearest Neighbors (k-NN), and Naïve Bayes classification techniques for discrimination. We evaluated the techniques on simulated data generated to mimic global untargeted metabolomics data by incorporating realistic block-wise correlation and partial correlation structures for mimicking the correlations and metabolite clustering generated by biological processes. Over the simulation studies, covariance structures, means, and effect sizes were stochastically varied to provide consistent estimates of classifier performance over a wide range of possible scenarios. The effects of the presence of non-normal error distributions, the introduction of biological and technical outliers, unbalanced phenotype allocation, missing values due to abundances below a limit of detection, and the effect of prior-significance filtering (dimension reduction) were evaluated via simulation. In each simulation, classifier parameters, such as the number of hidden nodes in a Neural Network, were optimized by cross-validation to minimize the probability of detecting spurious results due to poorly tuned classifiers. Classifier performance was then evaluated using real metabolomics datasets of varying sample medium, sample size, and experimental design. 
We report that in the most realistic simulation studies, which incorporated non-normal error distributions, unbalanced phenotype allocation, outliers, missing values, and dimension reduction, classifier performance (least to greatest error) was ranked as follows: SVM, Random Forest, Naïve Bayes, sPLS-DA, Neural Networks, PLS-DA, and k-NN. When non-normal error distributions were introduced, the performance of the PLS-DA and k-NN classifiers deteriorated further relative to the remaining techniques. Over the real datasets, the SVM and Random Forest classifiers likewise tended to perform better.
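A scaled-down version of the evaluation protocol, inner cross-validation for parameter tuning nested inside an outer accuracy estimate, might look as follows in scikit-learn. The toy data, the three classifiers shown, and the small parameter grids are placeholders, not the paper's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy stand-in for a metabolomics matrix: correlated features, modest n,
# mildly unbalanced phenotype allocation.
X, y = make_classification(n_samples=200, n_features=60, n_informative=8,
                           n_redundant=20, weights=[0.7, 0.3], random_state=1)

# Each classifier's parameters are tuned by inner CV before the outer
# accuracy estimate, so no model benefits from poorly tuned rivals.
models = {
    "SVM": GridSearchCV(make_pipeline(StandardScaler(), SVC()),
                        {"svc__C": [0.1, 1, 10]}, cv=3),
    "RandomForest": GridSearchCV(RandomForestClassifier(random_state=0),
                                 {"n_estimators": [50, 100]}, cv=3),
    "kNN": GridSearchCV(make_pipeline(StandardScaler(), KNeighborsClassifier()),
                        {"kneighborsclassifier__n_neighbors": [3, 5, 9]}, cv=3),
}
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
```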
Collapse
|
34
|
Multivariate binary classification of imbalanced datasets-A case study based on high-dimensional multiplex autoimmune assay data. Biom J 2017. [PMID: 28626952 DOI: 10.1002/bimj.201600207] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
Abstract
The classification of a population by a specific trait is a major task in medicine, for example when, in a diagnostic setting, groups of patients with specific diseases are identified, but also when, in predictive medicine, patients are classified into disease severity classes that might profit from different treatments. When those subgroups become small, for example in rare diseases, imbalances between the classes are more the rule than the exception, and they make statistical classification problematic: the error rate of the minority class is high, because many of its observations are classified as belonging to the majority class, while the error rate of the majority class is low. This case study investigates class imbalance for Random Forests and Powered Partial Least Squares Discriminant Analysis (PPLS-DA) and evaluates the performance of these classifiers when they are combined with methods that compensate for imbalance (sampling methods and cost-sensitive learning approaches). We evaluate all approaches with a scoring system that takes the classification results into consideration. The case study is based on one high-dimensional multiplex autoimmune assay dataset describing immune response to antigens and consisting of two classes of patients: Rheumatoid Arthritis (RA) and Systemic Lupus Erythematosus (SLE). Datasets with varying degrees of imbalance are created by successively reducing the class of RA patients. Our results indicate a possible benefit of cost-sensitive learning approaches for Random Forests. Although further research is needed to verify our findings by investigating other datasets or large-scale simulation studies, this work has the potential to raise practitioners' awareness of the problem of class imbalance, and it stresses the importance of considering methods that compensate for it.
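One of the compensation strategies considered, cost-sensitive learning for Random Forests, is directly available in common implementations as class weighting. A minimal sketch with scikit-learn follows; the toy imbalance ratio and forest settings are assumptions, not the study's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Imbalanced two-class toy problem standing in for the majority/minority setting.
X, y = make_classification(n_samples=600, n_features=30, n_informative=6,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Plain RF vs. a cost-sensitive RF: "balanced_subsample" reweights classes
# inversely to their frequency within each tree's bootstrap sample, raising
# the cost of misclassifying the minority class.
plain = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
costed = RandomForestClassifier(n_estimators=200,
                                class_weight="balanced_subsample",
                                random_state=0).fit(X_tr, y_tr)

# Balanced accuracy averages per-class error, so it is not dominated by
# the majority class the way plain accuracy is.
plain_bacc = balanced_accuracy_score(y_te, plain.predict(X_te))
costed_bacc = balanced_accuracy_score(y_te, costed.predict(X_te))
```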
Collapse
|
35
|
Optimal Subset Selection of Time-Series MODIS Images and Sample Data Transfer with Random Forests for Supervised Classification Modelling. SENSORS 2016; 16:s16111783. [PMID: 27792152 PMCID: PMC5134442 DOI: 10.3390/s16111783] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/09/2016] [Revised: 08/23/2016] [Accepted: 10/19/2016] [Indexed: 11/16/2022]
Abstract
Nowadays, various time-series Earth Observation data with multiple bands are freely available, such as Moderate Resolution Imaging Spectroradiometer (MODIS) datasets, including 8-day composites from NASA and 10-day composites from the Canada Centre for Remote Sensing (CCRS). It is challenging to use these time-series MODIS datasets efficiently for long-term environmental monitoring due to their vast volume and information redundancy. This challenge will grow when Sentinel 2-3 data become available. Another challenge researchers face is the lack of in-situ data for supervised modelling, especially for time-series data analysis. In this study, we attempt to tackle these two issues in a case study of land cover mapping using CCRS 10-day MODIS composites and two features of Random Forests: variable importance and outlier identification. The variable importance feature is used to analyze and select optimal subsets of time-series MODIS imagery for efficient land cover mapping, and the outlier identification feature is utilized for transferring sample data available from one year to an adjacent year for supervised classification modelling. The results of the case study of agricultural land cover classification at a regional scale show that, using only about half of the variables, we can achieve land cover classification accuracy close to that generated using the full dataset. The proposed simple but effective solution of sample transfer could make supervised modelling possible for applications lacking sample data.
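The variable importance-based subset selection can be sketched generically: rank the variables with a forest's importance scores, keep the top half, and compare cross-validated accuracy against the full set. The snippet below uses scikit-learn on synthetic data standing in for the stacked MODIS band-date variables; it illustrates the idea, not the study's workflow.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy stand-in for a stack of time-series variables (samples x band-date columns).
X, y = make_classification(n_samples=400, n_features=80, n_informative=10,
                           random_state=0)

# Rank the variables by RF importance, then keep only the top half.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top_half = np.argsort(rf.feature_importances_)[::-1][:40]

# Compare cross-validated accuracy: full stack vs. the selected subset.
full_acc = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=0),
                           X, y, cv=5).mean()
half_acc = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=0),
                           X[:, top_half], y, cv=5).mean()
```

When the informative variables concentrate in the top ranks, the halved stack loses little accuracy, mirroring the paper's finding for the MODIS composites.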
Collapse
|
36
|
Direct and indirect effects of climate change on projected future fire regimes in the western United States. THE SCIENCE OF THE TOTAL ENVIRONMENT 2016; 542:65-75. [PMID: 26519568 DOI: 10.1016/j.scitotenv.2015.10.093] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/22/2015] [Revised: 10/14/2015] [Accepted: 10/19/2015] [Indexed: 06/05/2023]
Abstract
We asked two research questions: (1) What are the relative effects of climate change and climate-driven vegetation shifts on different components of future fire regimes? (2) How does incorporating climate-driven vegetation change into future fire regime projections alter the results compared to projections based only on direct climate effects? We used the western United States (US) as study area to answer these questions. Future (2071-2100) fire regimes were projected using statistical models to predict spatial patterns of occurrence, size and spread for large fires (>400 ha) and a simulation experiment was conducted to compare the direct climatic effects and the indirect effects of climate-driven vegetation change on fire regimes. Results showed that vegetation change amplified climate-driven increases in fire frequency and size and had a larger overall effect on future total burned area in the western US than direct climate effects. Vegetation shifts, which were highly sensitive to precipitation pattern changes, were also a strong determinant of the future spatial pattern of burn rates and had different effects on fire in currently forested and grass/shrub areas. Our results showed that climate-driven vegetation change can exert strong localized effects on fire occurrence and size, which in turn drive regional changes in fire regimes. The effects of vegetation change for projections of the geographic patterns of future fire regimes may be at least as important as the direct effects of climate change, emphasizing that accounting for changing vegetation patterns in models of future climate-fire relationships is necessary to provide accurate projections at continental to global scales.
Collapse
|
37
|
The Random Forests statistical technique: An examination of its value for the study of reading. SCIENTIFIC STUDIES OF READING : THE OFFICIAL JOURNAL OF THE SOCIETY FOR THE SCIENTIFIC STUDY OF READING 2016; 20:20-33. [PMID: 26770056 PMCID: PMC4710485 DOI: 10.1080/10888438.2015.1107073] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/22/2023]
Abstract
Studies investigating individual differences in reading ability often involve data sets containing a large number of collinear predictors and a small number of observations. In this paper, we discuss the method of Random Forests and demonstrate its suitability for addressing the statistical concerns raised by such datasets. The method is contrasted with other methods of estimating relative variable importance, especially Dominance Analysis and Multimodel Inference. All methods were applied to a dataset that gauged eye-movements during reading and offline comprehension in the context of multiple ability measures with high collinearity due to their shared verbal core. We demonstrate that the Random Forests method surpasses other methods in its ability to handle model overfitting, and accounts for a comparable or larger amount of variance in reading measures relative to other methods.
Collapse
|
38
|
Abstract
In this paper, we introduce a new type of tree-based method, reinforcement learning trees (RLT), which exhibits significantly improved performance over traditional methods such as random forests (Breiman, 2001) under high-dimensional settings. The innovations are three-fold. First, the new method implements reinforcement learning at each selection of a splitting variable during the tree construction processes. By splitting on the variable that brings the greatest future improvement in later splits, rather than choosing the one with largest marginal effect from the immediate split, the constructed tree utilizes the available samples in a more efficient way. Moreover, such an approach enables linear combination cuts at little extra computational cost. Second, we propose a variable muting procedure that progressively eliminates noise variables during the construction of each individual tree. The muting procedure also takes advantage of reinforcement learning and prevents noise variables from being considered in the search for splitting rules, so that towards terminal nodes, where the sample size is small, the splitting rules are still constructed from only strong variables. Last, we investigate asymptotic properties of the proposed method under basic assumptions and discuss rationale in general settings.
Collapse
|
39
|
Bias and Stability of Single Variable Classifiers for Feature Ranking and Selection. EXPERT SYSTEMS WITH APPLICATIONS 2014; 14:6945-6958. [PMID: 25177107 PMCID: PMC4144463 DOI: 10.1016/j.eswa.2014.05.007] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/20/2023]
Abstract
Feature rankings are often used for supervised dimension reduction, especially when the discriminating power of each feature is of interest, the dimensionality of the dataset is extremely high, or computational power is insufficient for more complicated methods. In practice, it is recommended to start dimension reduction via simple methods such as feature rankings before applying more complex approaches. Single Variable Classifier (SVC) ranking is a feature ranking based on the predictive performance of a classifier built using only a single feature. While benefiting from the capabilities of classifiers, this ranking method is not as computationally intensive as wrappers. In this paper, we report the results of an extensive study on the bias and stability of this feature ranking method. We study whether the classifiers influence the SVC rankings or whether the discriminative power of the features themselves has the dominant impact on the final rankings. We show that the common intuition of using the same classifier for feature ranking and final classification does not always result in the best prediction performance. We then study whether heterogeneous classifier ensemble approaches provide less biased rankings and whether they improve final classification performance. Furthermore, we quantify the empirical loss in prediction performance, relative to the optimal choices, incurred by using the same classifier for both SVC feature ranking and final classification.
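An SVC ranking is straightforward to write down: score each feature by the cross-validated accuracy of a classifier trained on that feature alone, then sort. A minimal sketch with scikit-learn follows; the shallow decision tree is one arbitrary choice of base classifier, and the paper's point is precisely that this choice can bias the resulting ranking.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# With shuffle=False the 4 informative features occupy columns 0-3.
X, y = make_classification(n_samples=300, n_features=25, n_informative=4,
                           n_redundant=0, shuffle=False, random_state=0)

# SVC ranking: CV accuracy of a classifier trained on each feature alone,
# features then sorted by that score in descending order.
def svc_ranking(X, y, clf):
    scores = np.array([cross_val_score(clf, X[:, [j]], y, cv=5).mean()
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1], scores

rank, scores = svc_ranking(X, y, DecisionTreeClassifier(max_depth=2, random_state=0))
```

Swapping in a different base classifier (e.g. logistic regression or naïve Bayes) and comparing the resulting orderings is exactly the kind of stability experiment the paper reports.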
Collapse
|
40
|
Combining multiple HRT parameters using the 'Random Forests' method improves the diagnostic accuracy of glaucoma in emmetropic and highly myopic eyes. Invest Ophthalmol Vis Sci 2014; 55:2482-90. [PMID: 24609628 DOI: 10.1167/iovs.14-14009] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022] Open
Abstract
PURPOSE To combine multiple Heidelberg Retina Tomograph (HRT) parameters using the Random Forests classifier to diagnose glaucoma, both in highly and physiologically myopic (highly myopic) eyes and in emmetropic eyes. METHODS Subjects consisted of healthy subjects and age-matched patients with open-angle glaucoma in emmetropic (-1.0 to +1.0 diopters [D], 63 and 59 subjects, respectively) and highly myopic eyes (-10.0 to -5.0 D, 56 and 64 subjects, respectively). First, the area under the receiver operating characteristic curve (AUC) was derived using 84 HRT global and sectorial parameters, and the representative HRT raw parameter (largest AUC) was identified. Then, the Random Forests method was carried out using age, refractive error, and the 84 HRT parameters. AUCs were also derived using the following: (1) Frederick S. Mikelberg discriminant function (FSM) score, (2) Reinhard O.W. Burk discriminant function (RB) score, (3) Moorfields regression analysis (MRA) score, and (4) glaucoma probability score (GPS). RESULTS In the combined emmetropic and highly myopic population, the AUC with the Random Forests method (0.96) was significantly larger than the AUCs with the representative HRT raw parameter (vertical cup-to-disc ratio [global], 0.89), FSM (0.90), RB (0.83), MRA (0.87), and GPS (0.81) (P < 0.001). Similarly, the AUC with the Random Forests method was significantly (P < 0.05) larger than these other parameters, both in the emmetropic and highly myopic groups. Also, the Random Forests method achieved partial AUCs above 80%/90% significantly (P < 0.05) larger than any other HRT parameters in all populations. CONCLUSIONS Evaluating multiple HRT parameters using the Random Forests classifier provided accurate diagnosis of glaucoma, both in emmetropic and highly myopic eyes.
Collapse
|
41
|
Global localization of 3D anatomical structures by pre-filtered Hough forests and discrete optimization. Med Image Anal 2013; 17:1304-14. [PMID: 23664450 PMCID: PMC3807803 DOI: 10.1016/j.media.2013.02.004] [Citation(s) in RCA: 70] [Impact Index Per Article: 6.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/19/2012] [Revised: 01/28/2013] [Accepted: 02/11/2013] [Indexed: 02/04/2023]
Abstract
The accurate localization of anatomical landmarks is a challenging task, often solved by domain specific approaches. We propose a method for the automatic localization of landmarks in complex, repetitive anatomical structures. The key idea is to combine three steps: (1) a classifier for pre-filtering anatomical landmark positions that (2) are refined through a Hough regression model, together with (3) a parts-based model of the global landmark topology to select the final landmark positions. During training landmarks are annotated in a set of example volumes. A classifier learns local landmark appearance, and Hough regressors are trained to aggregate neighborhood information to a precise landmark coordinate position. A non-parametric geometric model encodes the spatial relationships between the landmarks and derives a topology which connects mutually predictive landmarks. During the global search we classify all voxels in the query volume, and perform regression-based agglomeration of landmark probabilities to highly accurate and specific candidate points at potential landmark locations. We encode the candidates' weights together with the conformity of the connecting edges to the learnt geometric model in a Markov Random Field (MRF). By solving the corresponding discrete optimization problem, the most probable location for each model landmark is found in the query volume. We show that this approach is able to consistently localize the model landmarks despite the complex and repetitive character of the anatomical structures on three challenging data sets (hand radiographs, hand CTs, and whole body CTs), with a median localization error of 0.80 mm, 1.19 mm and 2.71 mm, respectively.
Collapse
|
42
|
Abstract
Identifying genetic variants associated with complex disease in high-dimensional data is a challenging problem, and complicated etiologies such as gene-gene interactions are often ignored in analyses. The data-mining method Random Forests (RF) can handle high-dimensions; however, in high-dimensional data, RF is not an effective filter for identifying risk factors associated with the disease trait via complex genetic models such as gene-gene interactions without strong marginal components. Here we propose an extension called Weighted Random Forests (wRF), which incorporates tree-level weights to emphasize more accurate trees in prediction and calculation of variable importance. We demonstrate through simulation and application to data from a genetic study of addiction that wRF can outperform RF in high-dimensional data, although the improvements are modest and limited to situations with effect sizes that are larger than what is realistic in genetics of complex disease. Thus, the current implementation of wRF is unlikely to improve detection of relevant predictors in high-dimensional genetic data, but may be applicable in other situations where larger effect sizes are anticipated.
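The tree-level weighting idea can be sketched on top of an ordinary forest: estimate a weight per tree from its individual accuracy on held-out data, then combine the trees by a weighted majority vote. This illustration uses scikit-learn with a validation split for the weights; the paper derives its weights differently (from tree-level performance within training), so treat this as an assumption-laden sketch rather than the wRF algorithm itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=40, n_informative=5,
                           random_state=0)
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5,
                                            random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Tree-level weights: each tree's accuracy on the validation split, so more
# accurate trees get a louder voice in the vote.
w = np.array([t.score(X_val, y_val) for t in rf.estimators_])

# Weighted majority vote on the test split.
votes = np.array([t.predict(X_te) for t in rf.estimators_])   # (n_trees, n_test)
weighted = np.array([np.bincount(votes[:, i].astype(int), weights=w,
                                 minlength=2).argmax()
                     for i in range(votes.shape[1])])

plain_acc = rf.score(X_te, y_te)          # unweighted forest for comparison
weighted_acc = (weighted == y_te).mean()
```

Consistent with the abstract's caveat, on easy problems the weights are nearly uniform and the weighted vote differs little from the plain forest; gains appear only when tree accuracies vary substantially.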
Collapse
|
43
|
Are there pollination syndromes in the Australian epacrids (Ericaceae: Styphelioideae)? A novel statistical method to identify key floral traits per syndrome. ANNALS OF BOTANY 2013; 112:141-9. [PMID: 23681546 PMCID: PMC3690994 DOI: 10.1093/aob/mct105] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/09/2023]
Abstract
BACKGROUND AND AIMS Convergent floral traits hypothesized as attracting particular pollinators are known as pollination syndromes. Floral diversity suggests that the Australian epacrid flora may be adapted to pollinator type. Currently there are empirical data on the pollination systems for 87 species (approx. 15 % of Australian epacrids). This provides an opportunity to test for pollination syndromes and their important morphological traits in an iconic element of the Australian flora. METHODS Data on epacrid-pollinator relationships were obtained from published literature and field observation. A multivariate approach was used to test whether epacrid floral attributes related to pollinator profiles. Statistical classification was then used to rank floral attributes according to their predictive value. Data sets excluding mixed pollination systems were used to test the predictive power of statistical classification to identify pollination models. KEY RESULTS Floral attributes are correlated with bird, fly and bee pollination. Using floral attributes identified as correlating with pollinator type, bird pollination is classified with 86 % accuracy, red flowers being the most important predictor. Fly and bee pollination are classified with 78 and 69 % accuracy, but have a lack of individually important floral predictors. Excluding mixed pollination systems improved the accuracy of the prediction of both bee and fly pollination systems. CONCLUSIONS Although most epacrids have generalized pollination systems, a correlation between bird pollination and red, long-tubed epacrids is found. Statistical classification highlights the relative importance of each floral attribute in relation to pollinator type and proves useful in classifying epacrids to bird, fly and bee pollination systems.
Collapse
|
44
|
A Comparison of Logistic Regression, Logic Regression, Classification Tree, and Random Forests to Identify Effective Gene-Gene and Gene-Environmental Interactions. INTERNATIONAL JOURNAL OF APPLIED SCIENCE AND TECHNOLOGY 2012; 2:268. [PMID: 23795347 PMCID: PMC3686280] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/02/2023]
Abstract
Genome-wide association studies (GWAS) have identified numerous single nucleotide polymorphisms (SNPs) that are associated with a variety of common human diseases. Due to the weak marginal effect of most disease-associated SNPs, attention has recently turned to evaluating the combined effect of multiple disease-associated SNPs on the risk of disease. Several recent multigenic studies show potential evidence for applying multigenic approaches in association studies of various diseases, including lung cancer. But the question remains as to the best methodology for analyzing single nucleotide polymorphisms in multiple genes. In this work, we consider four methods, namely logistic regression, logic regression, classification trees, and random forests, and compare their results for identifying important genes or gene-gene and gene-environmental interactions. To evaluate the performance of the four methods, the cross-validation misclassification error and areas under the curve are provided. We performed a simulation study and applied the methods to data from a large-scale, population-based, case-control study.
Collapse
|
45
|
Abstract
In this study we used a Random Forest-based approach to assign small guanosine triphosphate-binding proteins (GTPases) to specific subgroups. Small GTPases represent an important functional group of proteins that serve as molecular switches in a wide range of fundamental cellular processes, including intracellular transport, movement, and signaling events. These proteins have gained special emphasis in cancer research because, within the last decades, a huge variety of small GTPases from different subgroups have been related to the development of all types of tumors. Using a random forest approach, we were able to identify the most important amino acid positions for the classification process within the small GTPases superfamily and its subgroups. These positions are in line with the results of earlier studies and have been shown to be the essential elements for the different functionalities of the GTPase families. Furthermore, we provide an accurate and reliable software tool (GTPasePred) to identify potential novel GTPases and demonstrate its application to genome sequences.
Collapse
|