1
Park SH, Song SH, Burton F, Arsan C, Jobst B, Feldman M. Machine learning characterization of a rare neurologic disease via electronic health records: a proof-of-principle study on stiff person syndrome. BMC Neurol 2024; 24:272. [PMID: 39097681 PMCID: PMC11297611 DOI: 10.1186/s12883-024-03760-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2024] [Accepted: 07/12/2024] [Indexed: 08/05/2024]
Abstract
BACKGROUND Despite the frequent diagnostic delays of rare neurologic diseases (RND), it remains difficult to study RNDs and their comorbidities due to their rarity and hence the statistical underpowering. Affecting one to two in a million annually, stiff person syndrome (SPS) is an RND characterized by painful muscle spasms and rigidity. Leveraging underutilized electronic health records (EHR), this study showcased a machine-learning-based framework to identify clinical features that optimally characterize the diagnosis of SPS. METHODS A machine-learning-based feature selection approach was employed on 319 items from the past medical histories of 48 individuals (23 with a diagnosis of SPS and 25 controls) with elevated serum autoantibodies against glutamic-acid-decarboxylase-65 (anti-GAD65) in Dartmouth Health's EHR to determine features with the highest discriminatory power. Each iteration of the algorithm implemented a Support Vector Machine (SVM) model, generating importance scores (SHapley Additive exPlanation, or SHAP, values) for each feature and removing the least salient one. Evaluation metrics were calculated through repeated stratified cross-validation. RESULTS Depression, hypothyroidism, GERD, and joint pain were the most characteristic features of SPS. Utilizing these features, the SVM model attained precision of 0.817 (95% CI 0.795-0.840), sensitivity of 0.766 (95% CI 0.743-0.790), F-score of 0.761 (95% CI 0.744-0.778), AUC of 0.808 (95% CI 0.791-0.825), and accuracy of 0.775 (95% CI 0.759-0.790). CONCLUSIONS This framework discerned features that, with further research, may help fully characterize the pathologic mechanism of SPS: depression, hypothyroidism, and GERD may respectively represent comorbidities through common inflammatory, genetic, and dysautonomic links. This methodology could address diagnostic challenges in neurology by uncovering latent associations and generating hypotheses for RNDs.
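The elimination loop described under METHODS can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the SVM and the SHAP library are replaced by a hypothetical additive value function (`toy_value`) and an exact Shapley computation, and the feature names and gains in `TOY_GAIN` are invented for the example.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley value of each feature for a coalition value function."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def backward_eliminate(features, value, keep):
    """Drop the least important feature per iteration until `keep` remain."""
    remaining = list(features)
    while len(remaining) > keep:
        phi = shapley_values(remaining, value)
        least = min(remaining, key=lambda f: abs(phi[f]))
        remaining.remove(least)
    return remaining


# Hypothetical per-feature gains; a real run would score a fitted model instead.
TOY_GAIN = {"a": 0.40, "b": 0.30, "c": 0.05, "d": 0.01}


def toy_value(subset):
    """Assumed additive value function: the sum of the gains in the subset."""
    return sum(TOY_GAIN[f] for f in subset)


selected = backward_eliminate(list(TOY_GAIN), toy_value, keep=2)
print(selected)
```

Exact Shapley computation is exponential in the number of features, so it is only practical for toy inputs like this one; with 319 candidate items, an approximation such as the one SHAP provides would be needed.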
Affiliation(s)
- Soo Hwan Park
  - Geisel School of Medicine at Dartmouth, Hanover, NH, USA
  - Department of Neurology, Dartmouth Health, Lebanon, NH, USA
- Seo Ho Song
  - Department of Psychiatry, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA
- Frederick Burton
  - Department of Psychiatry, University of California Los Angeles Health, Los Angeles, CA, USA
- Cybèle Arsan
  - Department of Psychiatry, Oakland Medical Center, Kaiser Permanente, Oakland, CA, USA
- Barbara Jobst
  - Geisel School of Medicine at Dartmouth, Hanover, NH, USA
  - Department of Neurology, Dartmouth Health, Lebanon, NH, USA
- Mary Feldman
  - Geisel School of Medicine at Dartmouth, Hanover, NH, USA
  - Department of Neurology, Dartmouth Health, Lebanon, NH, USA
2
Wang H, Doumard E, Soule-Dupuy C, Kemoun P, Aligon J, Monsarrat P. Explanations as a New Metric for Feature Selection: A Systematic Approach. IEEE J Biomed Health Inform 2023; 27:4131-4142. [PMID: 37220033 DOI: 10.1109/jbhi.2023.3279340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/25/2023]
Abstract
With the extensive use of Machine Learning (ML) in the biomedical field, there is an increasing need for Explainable Artificial Intelligence (XAI) to improve transparency and reveal complex hidden relationships between variables for medical practitioners, while meeting regulatory requirements. Feature Selection (FS) is widely used as a part of a biomedical ML pipeline to significantly reduce the number of variables while preserving as much information as possible. However, the choice of FS methods affects the entire pipeline including the final prediction explanations, whereas very few works investigate the relationship between FS and model explanations. Through a systematic workflow performed on 145 datasets and an illustration on medical data, the present work demonstrated the promising complementarity of two metrics based on explanations (using ranking and influence changes) in addition to accuracy and retention rate to select the most appropriate FS/ML models. Measuring how much explanations differ with/without FS is particularly promising for FS methods recommendation. While reliefF generally performs the best on average, the optimal choice may vary for each dataset. Positioning FS methods in a tridimensional space, integrating explanations-based metrics, accuracy and retention rate, would allow the user to choose the priorities to be given to each of the dimensions. In biomedical applications, where each medical condition may have its own preferences, this framework will make it possible to offer the healthcare professional the appropriate FS technique, to select the variables that have an important explainable impact, even if this comes at the expense of a limited drop of accuracy.
3
O'Sullivan CM, Ghahramani A, Deo RC, Pembleton KG. Pattern recognition describing spatio-temporal drivers of catchment classification for water quality. The Science of the Total Environment 2023; 861:160240. [PMID: 36403827 DOI: 10.1016/j.scitotenv.2022.160240] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 09/01/2022] [Revised: 11/12/2022] [Accepted: 11/13/2022] [Indexed: 06/16/2023]
Abstract
Classification using spatial data is foundational for hydrological modelling, particularly for ungauged areas. However, models developed from classified land use drivers deliver inconsistent water quality results for the same land uses and hinder decision-making guided by those models. This paper explores whether the temporal variation of water quality drivers, such as season and flow, influence inconsistency in the classification, and whether variability is captured in spatial datasets that include original vegetation to represent the variability of biotic responses in areas mapped with the same land use. An Artificial Neural Network Pattern Recognition (ANN-PR) method is used to match catchments by Dissolved Inorganic Nitrogen (DIN) patterns in water quality datasets partitioned into Wet vs Dry Seasons and Increasing vs Retreating flows. Explainable artificial intelligence approaches are then used to classify catchments via spatial feature datasets for each catchment. Catchments matched for sharing patterns in both spatial data and DIN datasets were corroborated and the benefit of partitioning the observed DIN dataset evaluated using Kruskal Wallis method. The highest corroboration rates for spatial data classification with DIN classification were achieved with seasonal partitioning of water quality datasets and significant independence (p < 0.001 to 0.026) from non-partitioned datasets was achieved. This study demonstrated that DIN patterns fall into three categories suited to classification under differing temporal scales with corresponding vegetation types as the indicators. Categories 1 and 3 included dominance of woodlands in their datasets and catchments suited to classify together change depending on temporal scale of the data. Category 2 catchments were dominated by vineforest and classified catchments did not change under different temporal scales. 
This demonstrates that including original vegetation as a proxy for differences in DIN patterns will help guide future classification where only spatially mapped data is available for ungauged catchments and will better inform data needs for water modelling.
Affiliation(s)
- Cherie M O'Sullivan
  - Centre for Sustainable Agricultural Systems, Institute for Life Sciences and the Environment, University of Southern Queensland, Toowoomba, QLD 4350, Australia
- Afshin Ghahramani
  - Centre for Sustainable Agricultural Systems, Institute for Life Sciences and the Environment, University of Southern Queensland, Toowoomba, QLD 4350, Australia
- Ravinesh C Deo
  - School of Mathematics, Physics and Computing, University of Southern Queensland, Springfield, QLD 4300, Australia
- Keith G Pembleton
  - Centre for Sustainable Agricultural Systems, Institute for Life Sciences and the Environment, University of Southern Queensland, Toowoomba, QLD 4350, Australia; School of Agriculture and Environmental Science, University of Southern Queensland, Toowoomba, QLD 4350, Australia
4
Balestra C, Maj C, Müller E, Mayr A. Redundancy-aware unsupervised ranking based on game theory: Ranking pathways in collections of gene sets. PLoS One 2023; 18:e0282699. [PMID: 36893181 PMCID: PMC9997904 DOI: 10.1371/journal.pone.0282699] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2022] [Accepted: 02/13/2023] [Indexed: 03/10/2023]
Abstract
In Genetics, gene sets are grouped in collections concerning their biological function. This often leads to high-dimensional, overlapping, and redundant families of sets, thus precluding a straightforward interpretation of their biological meaning. In Data Mining, it is often argued that techniques to reduce the dimensionality of data could increase the maneuverability and consequently the interpretability of large data. In the past years, moreover, we witnessed an increasing consciousness of the importance of understanding data and interpretable models in the machine learning and bioinformatics communities. On the one hand, there exist techniques aiming to aggregate overlapping gene sets to create larger pathways. While these methods could partly solve the problem of the collections' large size, modifying biological pathways is hardly justifiable in this biological context. On the other hand, the representation methods to increase interpretability of collections of gene sets that have been proposed so far have proved to be insufficient. Inspired by this Bioinformatics context, we propose a method to rank sets within a family of sets based on the distribution of the singletons and their size. We obtain sets' importance scores by computing Shapley values; making use of microarray games, we do not incur the typical exponential computational complexity. Moreover, we address the challenge of constructing redundancy-aware rankings where, in our case, redundancy is a quantity proportional to the size of intersections among the sets in the collections. We use the obtained rankings to reduce the dimension of the families, therefore showing lower redundancy among sets while still preserving a high coverage of their elements. We finally evaluate our approach for collections of gene sets and apply Gene Sets Enrichment Analysis techniques to the now smaller collections: as expected, the unsupervised nature of the proposed rankings allows for unremarkable differences in the number of significant gene sets for specific phenotypic traits. In contrast, the number of performed statistical tests can be drastically reduced. The proposed rankings show a practical utility in bioinformatics to increase interpretability of the collections of gene sets and a step forward to include redundancy-awareness into Shapley values computations.
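The claim that microarray games avoid exponential cost can be illustrated with a small sketch. It assumes the usual microarray-style game, in which the value of a coalition is the fraction of support sets it fully contains; that game is a sum of unanimity games, so each player's Shapley value has a closed form. The abstract does not give the paper's exact game or its redundancy-aware extension, so the function names and the toy `supports` below are illustrative assumptions only.

```python
from itertools import permutations


def microarray_value(coalition, supports):
    """v(T): fraction of support sets that are fully contained in coalition T."""
    return sum(1 for s in supports if s <= coalition) / len(supports)


def shapley_closed_form(players, supports):
    """Each support set S shares 1/|S| equally among its members."""
    m = len(supports)
    return {p: sum(1 / len(s) for s in supports if p in s) / m for p in players}


def shapley_brute_force(players, supports):
    """Average marginal contribution over all orderings (factorial cost)."""
    phi = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        seen = set()
        for p in order:
            before = microarray_value(frozenset(seen), supports)
            seen.add(p)
            after = microarray_value(frozenset(seen), supports)
            phi[p] += after - before
    return {p: total / len(orders) for p, total in phi.items()}


# Toy input: three players and three non-empty support sets (hypothetical).
players = ["g1", "g2", "g3"]
supports = [frozenset({"g1"}), frozenset({"g1", "g2"}), frozenset({"g2", "g3"})]

closed = shapley_closed_form(players, supports)
brute = shapley_brute_force(players, supports)
print(closed)
```

The closed form runs in time proportional to the total size of the support sets, whereas the brute-force check enumerates every ordering and is included only to confirm that the two agree on a small input.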
Affiliation(s)
- Chiara Balestra
  - Department of Computer Science, TU Dortmund, Dortmund, Germany
  - Department of Medical Biometry, Informatics and Epidemiology (IMBIE), University Hospital Bonn, Bonn, Germany
- Carlo Maj
  - Institute for Genomic Statistics and Bioinformatics IGSB, University Hospital Bonn, Bonn, Germany
  - Centre for Human Genetics, University of Marburg, Marburg, Germany
- Emmanuel Müller
  - Department of Computer Science, TU Dortmund, Dortmund, Germany
- Andreas Mayr
  - Department of Medical Biometry, Informatics and Epidemiology (IMBIE), University Hospital Bonn, Bonn, Germany
5
Saadat R, Syed-Mohamad SM, Azmi A, Keikhosrokiani P. Enhancing manufacturing process by predicting component failures using machine learning. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-07465-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
6
Hasan MK, Ghazal TM, Alkhalifah A, Abu Bakar KA, Omidvar A, Nafi NS, Agbinya JI. Fischer Linear Discrimination and Quadratic Discrimination Analysis-Based Data Mining Technique for Internet of Things Framework for Healthcare. Front Public Health 2021; 9:737149. [PMID: 34712639 PMCID: PMC8545792 DOI: 10.3389/fpubh.2021.737149] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 07/06/2021] [Accepted: 08/20/2021] [Indexed: 11/20/2022]
Abstract
The internet of reality or augmented reality has been considered a breakthrough and an outstanding critical mutation with an emphasis on data mining leading to dismantling of some of its assumptions among several of its stakeholders. In this work, we study the pillars of these technologies connected to web usage as the Internet of things (IoT) system's healthcare infrastructure. We used several data mining techniques to evaluate the online advertisement data set, which can be categorized as high dimensional with 1,553 attributes, and the imbalanced data set, which automatically simulates an IoT discrimination problem. The proposed methodology applies Fischer linear discrimination analysis (FLDA) and quadratic discrimination analysis (QDA) within random projection (RP) filters to compare our runtime and accuracy with support vector machine (SVM), K-nearest neighbor (KNN), and Multilayer perceptron (MLP) in IoT-based systems. Finally, the impact on number of projections was practically experimented, and the sensitivity of both FLDA and QDA with regard to precision and runtime was found to be challenging. The modeling results show not only improved accuracy, but also runtime improvements. When compared with SVM, KNN, and MLP in QDA and FLDA, runtime shortens by 20 times in our chosen data set simulated for a healthcare framework. The RP filtering in the preprocessing stage of the attribute selection, fulfilling the model's runtime, is a standpoint in the IoT industry. Index Terms: Data Mining, Random Projection, Fischer Linear Discriminant Analysis, Online Advertisement Dataset, Quadratic Discriminant Analysis, Feature Selection, Internet of Things.
Affiliation(s)
- Mohammad Kamrul Hasan
  - Center for Cyber Security, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia (UKM), Bangi, Malaysia
- Taher M Ghazal
  - Center for Cyber Security, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia (UKM), Bangi, Malaysia; Skyline University College, University City of Sharjah, Sharjah, United Arab Emirates
- Ali Alkhalifah
  - Department of Information Technology, College of Computer, Qassim University, Buraydah, Saudi Arabia
- Khairul Azmi Abu Bakar
  - Center for Cyber Security, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia (UKM), Bangi, Malaysia
- Alireza Omidvar
  - Engineering Department, Tehran Urban and Suburban Railway Co., Tehran, Iran
- Nazmus S Nafi
  - School of IT and Engineering (SITE), Melbourne Institute of Technology, Melbourne, VIC, Australia
- Johnson I Agbinya
  - School of IT and Engineering (SITE), Melbourne Institute of Technology, Melbourne, VIC, Australia
7
Tan K, Huang W, Liu X, Hu J, Dong S. A Hierarchical Graph Convolution Network for Representation Learning of Gene Expression Data. IEEE J Biomed Health Inform 2021; 25:3219-3229. [PMID: 33449889 DOI: 10.1109/jbhi.2021.3052008] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/09/2022]
Abstract
The curse of dimensionality, which is caused by high-dimensionality and low-sample-size, is a major challenge in gene expression data analysis. However, the real situation is even worse: labelling data is laborious and time-consuming, so only a small part of the limited samples will be labelled. Having so few labelled samples further increases the difficulty of training deep learning models. Interpretability is an important requirement in biomedicine. Many existing deep learning methods are trying to provide interpretability, but rarely apply to gene expression data. Recent semi-supervised graph convolution network methods try to address these problems by smoothing the label information over a graph. However, to the best of our knowledge, these methods only utilize graphs in either the feature space or sample space, which restricts their performance. We propose a transductive semi-supervised representation learning method called a hierarchical graph convolution network (HiGCN) to aggregate the information of gene expression data in both feature and sample spaces. HiGCN first utilizes external knowledge to construct a feature graph and a similarity kernel to construct a sample graph. Then, two spatial-based GCNs are used to aggregate information on these graphs. To validate the model's performance, synthetic and real datasets are provided to lend empirical support. Compared with two recent models and three traditional models, HiGCN learns better representations of gene expression data, and these representations improve the performance of downstream tasks, especially when the model is trained on a few labelled samples. Important features can be extracted from our model to provide reliable interpretability.
8
Benchmarking Analysis of the Accuracy of Classification Methods Related to Entropy. Entropy 2021; 23:e23070850. [PMID: 34356391 PMCID: PMC8306704 DOI: 10.3390/e23070850] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/28/2021] [Revised: 06/18/2021] [Accepted: 06/24/2021] [Indexed: 11/19/2022]
Abstract
In the machine learning literature we can find numerous methods to solve classification problems. We propose two new performance measures to analyze such methods. These measures are defined by using the concept of proportional reduction of classification error with respect to three benchmark classifiers, the random and two intuitive classifiers which are based on how a non-expert person could realize classification simply by applying a frequentist approach. We show that these three simple methods are closely related to different aspects of the entropy of the dataset. Therefore, these measures account somewhat for entropy in the dataset when evaluating the performance of classifiers. This allows us to measure the improvement in the classification results compared to simple methods, and at the same time how entropy affects classification capacity. To illustrate how these new performance measures can be used to analyze classifiers taking into account the entropy of the dataset, we carry out an intensive experiment in which we use the well-known J48 algorithm, and a UCI repository dataset on which we have previously selected a subset of the most relevant attributes. Then we carry out an extensive experiment in which we consider four heuristic classifiers, and 11 datasets.
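The idea of a proportional reduction of classification error can be made concrete with a short sketch. The abstract does not define its two intuitive classifiers, so the two benchmarks used here (a uniform random guesser and an always-predict-the-majority-class rule) are assumptions chosen for illustration, as are all function names and the toy labels.

```python
from collections import Counter


def error_rate(y_true, y_pred):
    """Fraction of examples whose predicted label differs from the true one."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def proportional_error_reduction(classifier_error, benchmark_error):
    """Share of the benchmark's error that the classifier removes."""
    if benchmark_error == 0:
        raise ValueError("benchmark error is zero; the reduction is undefined")
    return (benchmark_error - classifier_error) / benchmark_error


def uniform_random_benchmark_error(y_true):
    """Expected error of guessing uniformly among the k observed classes."""
    k = len(set(y_true))
    return 1 - 1 / k


def majority_benchmark_error(y_true):
    """Error of always predicting the most frequent class."""
    counts = Counter(y_true)
    return 1 - max(counts.values()) / len(y_true)


# Toy labels (hypothetical): 6 of class "a", 4 of class "b", 2 mistakes.
y_true = ["a"] * 6 + ["b"] * 4
y_pred = ["a"] * 5 + ["b"] + ["b"] * 3 + ["a"]

err = error_rate(y_true, y_pred)
vs_random = proportional_error_reduction(err, uniform_random_benchmark_error(y_true))
vs_majority = proportional_error_reduction(err, majority_benchmark_error(y_true))
print(err, vs_random, vs_majority)
```

A value of 1 means the classifier removed all of the benchmark's error, 0 means no improvement, and a negative value means the classifier did worse than the benchmark. The majority benchmark is harder to beat on skewed data, which is one way class-distribution entropy enters the comparison.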
9
Carrizosa E, Molero-Río C, Romero Morales D. Mathematical optimization in classification and regression trees. TOP (Berlin, Germany) 2021; 29:5-33. [PMID: 38624654 PMCID: PMC7967110 DOI: 10.1007/s11750-021-00594-1] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 12/10/2020] [Accepted: 01/27/2021] [Indexed: 06/02/2023]
Abstract
Classification and regression trees, as well as their variants, are off-the-shelf methods in Machine Learning. In this paper, we review recent contributions within the Continuous Optimization and the Mixed-Integer Linear Optimization paradigms to develop novel formulations in this research area. We compare those in terms of the nature of the decision variables and the constraints required, as well as the optimization algorithms proposed. We illustrate how these powerful formulations enhance the flexibility of tree models, being better suited to incorporate desirable properties such as cost-sensitivity, explainability, and fairness, and to deal with complex data, such as functional data.
Affiliation(s)
- Emilio Carrizosa
  - Instituto de Matemáticas de la Universidad de Sevilla, Seville, Spain
10
Alzubi OA, Alzubi JA, Alweshah M, Qiqieh I, Al-Shami S, Ramachandran M. An optimal pruning algorithm of classifier ensembles: dynamic programming approach. Neural Comput Appl 2020. [DOI: 10.1007/s00521-020-04761-6] [Citation(s) in RCA: 60] [Impact Index Per Article: 15.0] [Indexed: 11/28/2022]
11
Raimondi D, Orlando G, Vranken WF, Moreau Y. Exploring the limitations of biophysical propensity scales coupled with machine learning for protein sequence analysis. Sci Rep 2019; 9:16932. [PMID: 31729443 PMCID: PMC6858301 DOI: 10.1038/s41598-019-53324-w] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 07/05/2019] [Accepted: 10/25/2019] [Indexed: 11/21/2022]
Abstract
Machine learning (ML) is ubiquitous in bioinformatics, due to its versatility. One of the most crucial aspects to consider while training an ML model is to carefully select the optimal feature encoding for the problem at hand. Biophysical propensity scales are widely adopted in structural bioinformatics because they describe amino acid properties that are intuitively relevant for many structural and functional aspects of proteins, and are thus commonly used as input features for ML methods. In this paper we reproduce three classical structural bioinformatics prediction tasks to investigate the main assumptions about the use of propensity scales as input features for ML methods. We investigate their usefulness with different randomization experiments and we show that their effectiveness varies among the ML methods used and the tasks. We show that while linear methods are more dependent on the feature encoding, the specific biophysical meaning of the features is less relevant for non-linear methods. Moreover, we show that even among linear ML methods, the simpler one-hot encoding can surprisingly outperform the “biologically meaningful” scales. We also show that feature selection performed with non-linear ML methods may not be able to distinguish between randomized and “real” propensity scales by properly prioritizing the latter. Finally, we show that learning problem-specific embeddings could be a simple, assumptions-free and optimal way to perform feature learning/engineering for structural bioinformatics tasks.
Affiliation(s)
- Gabriele Orlando
  - Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, 1050, Brussels, Belgium
- Wim F Vranken
  - Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, 1050, Brussels, Belgium; Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, 1050, Belgium
- Yves Moreau
  - ESAT-STADIUS, KU Leuven, 3001, Leuven, Belgium
12
Zaeri-Amirani M, Afghah F, Mousavi S. A Feature Selection Method Based on Shapley Value to False Alarm Reduction in ICUs A Genetic-Algorithm Approach. Annual International Conference of the IEEE Engineering in Medicine and Biology Society 2019; 2018:319-323. [PMID: 30440402 DOI: 10.1109/embc.2018.8512266] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 11/06/2022]
Abstract
The high false alarm rate in intensive care units (ICUs) has been identified as one of the most critical medical challenges in recent years. This often results in overwhelming the clinical staff with numerous false or non-urgent alarms and decreasing the quality of care by increasing the probability of missing true alarms as well as causing delirium, stress, sleep deprivation and depressed immune systems for patients. One major cause of false alarms in clinical practice is that the collected signals from different devices are processed individually to trigger an alarm, while there exists a considerable chance that the signal collected from one device is corrupted by noise or motion artifacts. In this paper, we propose a low-computational complexity yet accurate game-theoretic feature selection method which is based on a genetic algorithm that identifies the most informative biomarkers across the signals collected from various monitoring devices and can considerably reduce the rate of false alarms.
13
Using game theory and decision decomposition to effectively discern and characterise bi-locus diseases. Artif Intell Med 2019; 99:101690. [PMID: 31606112 DOI: 10.1016/j.artmed.2019.06.006] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 09/11/2018] [Revised: 06/21/2019] [Accepted: 06/30/2019] [Indexed: 01/08/2023]
Abstract
In order to gain insight into oligogenic disorders, understanding those involving bi-locus variant combinations appears to be key. In prior work, we showed that features at multiple biological scales can already be used to discriminate among two types, i.e. disorders involving true digenic and modifier combinations. The current study expands this machine learning work towards dual molecular diagnosis cases, providing a classifier able to effectively distinguish between these three types. To reach this goal and gain an in-depth understanding of the decision process, game theory and tree decomposition techniques are applied to random forest predictors to investigate the relevance of feature combinations in the prediction. A machine learning model with high discrimination capabilities was developed, effectively differentiating the three classes in a biologically meaningful manner. Combining prediction interpretation and statistical analysis, we propose a biologically meaningful characterization of each class relying on specific feature strengths. Figuring out how biological characteristics shift samples towards one of three classes provides clinically relevant insight into the underlying biological processes as well as the disease itself.
14
Liénard JF, Achakulvisut T, Acuna DE, David SV. Intellectual synthesis in mentorship determines success in academic careers. Nat Commun 2018; 9:4840. [PMID: 30482900 PMCID: PMC6258699 DOI: 10.1038/s41467-018-07034-y] [Citation(s) in RCA: 41] [Impact Index Per Article: 6.8] [Received: 02/15/2018] [Accepted: 10/11/2018] [Indexed: 11/30/2022]
Abstract
As academic careers become more competitive, junior scientists need to understand the value that mentorship brings to their success in academia. Previous research has found that, unsurprisingly, successful mentors tend to train successful students. But what characteristics of this relationship predict success, and how? We analyzed an open-access database of 18,856 researchers who have undergone both graduate and postdoctoral training, compiled across several fields of biomedical science with an emphasis on neuroscience. Our results show that postdoctoral mentors were more instrumental to trainees' success compared to graduate mentors. Trainees' success in academia was also predicted by the degree of intellectual synthesis between their graduate and postdoctoral mentors. Researchers were more likely to succeed if they trained under mentors with disparate expertise and integrated that expertise into their own work. This pattern has held up over at least 40 years, despite fluctuations in the number of students and availability of independent research positions.
Affiliation(s)
- Jean F Liénard
  - Oregon Hearing Research Center, Oregon Health & Science University, Portland, Oregon, 97239-3098, USA
  - Okinawa Institute for Science and Technology, Onna-son, Okinawa, 904-0412, Japan
- Titipat Achakulvisut
  - Department of Bioengineering, University of Pennsylvania, Philadelphia, Pennsylvania, 19104, USA
- Daniel E Acuna
  - School of Information Studies, Syracuse University, Syracuse, NY, 13244, USA
- Stephen V David
  - Oregon Hearing Research Center, Oregon Health & Science University, Portland, Oregon, 97239-3098, USA
15
Afghah F, Razi A, Soroushmehr R, Ghanbari H, Najarian K. Game Theoretic Approach for Systematic Feature Selection; Application in False Alarm Detection in Intensive Care Units. Entropy (Basel, Switzerland) 2018; 20:E190. [PMID: 33265281 PMCID: PMC7512707 DOI: 10.3390/e20030190] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 01/10/2018] [Revised: 02/27/2018] [Accepted: 03/05/2018] [Indexed: 01/19/2023]
Abstract
Intensive Care Units (ICUs) are equipped with many sophisticated sensors and monitoring devices to provide the highest quality of care for critically ill patients. However, these devices might generate false alarms that reduce standard of care and result in desensitization of caregivers to alarms. Therefore, reducing the number of false alarms is of great importance. Many approaches such as signal processing and machine learning, and designing more accurate sensors have been developed for this purpose. However, the significant intrinsic correlation among the extracted features from different sensors has been mostly overlooked. A majority of current data mining techniques fail to capture such correlation among the collected signals from different sensors that limits their alarm recognition capabilities. Here, we propose a novel information-theoretic predictive modeling technique based on the idea of coalition game theory to enhance the accuracy of false alarm detection in ICUs by accounting for the synergistic power of signal attributes in the feature selection stage. This approach brings together techniques from information theory and game theory to account for inter-features mutual information in determining the most correlated predictors with respect to false alarm by calculating Banzhaf power of each feature. The numerical results show that the proposed method can enhance classification accuracy and improve the area under the ROC (receiver operating characteristic) curve compared to other feature selection techniques, when integrated in classifiers such as Bayes-Net that consider inter-features dependencies.
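The Banzhaf power mentioned above averages a feature's marginal contribution over all coalitions of the other features, with every coalition weighted equally. The sketch below computes it exactly for a toy game. The abstract describes a value function built from mutual information between features and the alarm label; that is not reproduced here, and `toy_value` with its synergy between the hypothetical features `x` and `y` is an invented stand-in.

```python
from itertools import combinations


def banzhaf_values(players, value):
    """Exact Banzhaf value: mean marginal contribution over all coalitions."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += value(s | {p}) - value(s)
        # There are 2**(n - 1) coalitions that exclude player p.
        result[p] = total / 2 ** (n - 1)
    return result


def toy_value(coalition):
    """Assumed toy game: x and y are useful only together, z helps on its own."""
    score = 0.0
    if {"x", "y"} <= coalition:
        score += 1.0
    if "z" in coalition:
        score += 0.2
    return score


power = banzhaf_values(["x", "y", "z"], toy_value)
print(power)
```

In this toy game neither `x` nor `y` has any value alone, yet each receives a Banzhaf value of 0.5 because the index credits them for what they add jointly. A filter that scores features one at a time would assign both a score of zero, which is the kind of synergy the abstract argues is overlooked.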
Affiliation(s)
- Fatemeh Afghah: School of Informatics, Computing and Cyber Systems, Northern Arizona University, Flagstaff, AZ 86011, USA
- Abolfazl Razi: School of Informatics, Computing and Cyber Systems, Northern Arizona University, Flagstaff, AZ 86011, USA
- Reza Soroushmehr: Department of Emergency Medicine, University of Michigan, Ann Arbor, MI 48109, USA
- Hamid Ghanbari: Department of Emergency Medicine, University of Michigan, Ann Arbor, MI 48109, USA
- Kayvan Najarian: Department of Emergency Medicine, University of Michigan, Ann Arbor, MI 48109, USA
|
16
|
CAFÉ-Map: Context Aware Feature Mapping for mining high dimensional biomedical data. Comput Biol Med 2016; 79:68-79. [PMID: 27764717 DOI: 10.1016/j.compbiomed.2016.10.006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 05/09/2016] [Revised: 10/05/2016] [Accepted: 10/10/2016] [Indexed: 12/18/2022]
Abstract
Feature selection and ranking are of great importance in the analysis of biomedical data. In addition to reducing the number of features used in classification or other machine learning tasks, they allow us to extract meaningful biological and medical information from a machine learning model. Most existing approaches in this domain do not directly model the fact that the relative importance of features can differ across regions of the feature space. In this work, we present a context-aware feature ranking algorithm called CAFÉ-Map. CAFÉ-Map is a locally linear feature ranking framework that allows recognition of important features in any given region of the feature space or for any individual example. This allows for simultaneous classification and feature ranking in an interpretable manner. We have benchmarked CAFÉ-Map on a number of toy and real-world biomedical data sets. Our comparative study with a number of published methods shows that CAFÉ-Map achieves better accuracies on these data sets. The top-ranking features obtained through CAFÉ-Map in a gene profiling study correlate very well with the importance of different genes reported in the literature. Furthermore, CAFÉ-Map provides a more in-depth analysis of feature ranking at the level of individual examples. AVAILABILITY CAFÉ-Map Python code is available at: http://faculty.pieas.edu.pk/fayyaz/software.html#cafemap . The CAFÉ-Map package supports parallelization and sparse data and provides example scripts for classification. This code can be used to reconstruct the results given in this paper.
|
17
|
Robust Feature Selection from Microarray Data Based on Cooperative Game Theory and Qualitative Mutual Information. Adv Bioinformatics 2016; 2016:1058305. [PMID: 27127506 PMCID: PMC4818815 DOI: 10.1155/2016/1058305] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 11/28/2015] [Revised: 02/20/2016] [Accepted: 02/22/2016] [Indexed: 11/17/2022] Open
Abstract
High dimensionality of microarray data sets may lead to low efficiency and overfitting. In this paper, a multiphase cooperative game theoretic feature selection approach is proposed for microarray data classification. In the first phase, because of the high dimensionality of microarray data sets, the features are reduced using one of two filter-based feature selection methods, namely, mutual information and Fisher ratio. In the second phase, the Shapley index is used to evaluate the power of each feature. The main innovation of the proposed approach is to employ Qualitative Mutual Information (QMI) for this purpose. Qualitative Mutual Information makes the selected features more stable, and this stability helps to deal with the problem of data imbalance and scarcity. In the third phase, a forward selection scheme is applied which uses a scoring function to weight each feature. The performance of the proposed method is compared with other popular feature selection algorithms such as Fisher ratio, minimum redundancy maximum relevance, and previous works on cooperative-game-based feature selection. The average classification accuracy on eleven microarray data sets shows that the proposed method improves both average accuracy and average stability compared to other approaches.
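To make the second phase more concrete, the following sketch computes exact Shapley values for a small feature set. It is a minimal example under stated assumptions: `shapley_values` and `toy_accuracy` are hypothetical names, the payoffs are invented, and the paper's QMI-based payoff and its filtering and forward-selection phases are not implemented here.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley value of each player for a characteristic function `value`.

    Each marginal contribution v(S | {p}) - v(S) is weighted by
    |S|! * (n - |S| - 1)! / n!, the share of player orderings in which
    exactly the members of S precede p.
    """
    players = list(players)
    n = len(players)
    scores = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for r in range(n):
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for coalition in combinations(others, r):
                s = frozenset(coalition)
                total += weight * (value(s | {p}) - value(s))
        scores[p] = total
    return scores


def toy_accuracy(s):
    """Hypothetical payoff: g1 and g2 are redundant, g3 adds a small independent gain."""
    acc = 0.5                      # chance level with no features
    if "g1" in s or "g2" in s:
        acc += 0.25                # either gene alone gives the full gain
    if "g3" in s:
        acc += 0.125
    return acc


genes = ["g1", "g2", "g3"]
scores = shapley_values(genes, toy_accuracy)
print(scores)
```

The redundant features `g1` and `g2` split the credit for their shared gain, and the scores sum to the gain of the full feature set over the empty set, which is the efficiency property of the Shapley value. As with any exact computation over coalitions, the cost grows exponentially with the number of features, which is one reason a filter phase comes first.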
|
18
|
Zeng K, She K, Niu X. Feature selection with neighborhood entropy-based cooperative game theory. Comput Intell Neurosci 2014; 2014:479289. [PMID: 25276120 PMCID: PMC4158261 DOI: 10.1155/2014/479289] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 05/15/2014] [Revised: 07/27/2014] [Accepted: 08/10/2014] [Indexed: 11/18/2022]
Abstract
Feature selection plays an important role in machine learning and data mining. In recent years, various feature measurements have been proposed to select significant features from high-dimensional datasets. However, most traditional feature selection methods ignore features that have strong classification ability as a group but are weak as individuals. To deal with this problem, we redefine the redundancy, interdependence, and independence of features by using neighborhood entropy. Then the neighborhood entropy-based feature contribution is proposed within a cooperative game framework. The evaluation criteria for features can be formalized as the product of this contribution and other classical feature measures. Finally, the proposed method is tested on several UCI datasets. The results show that the neighborhood entropy-based cooperative game theory model (NECGT) yields better performance than classical ones.
Affiliation(s)
- Kai Zeng: School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
- Kun She: School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
- Xinzheng Niu: School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
|
21
|
Roy K, Bhattacharya P, Suen CY. Iris recognition using shape-guided approach and game theory. Pattern Anal Appl 2011. [DOI: 10.1007/s10044-011-0229-7] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Indexed: 10/17/2022]
|
22
|
Schuster S, Kreft JU, Brenner N, Wessely F, Theissen G, Ruppin E, Schroeter A. Cooperation and cheating in microbial exoenzyme production--theoretical analysis for biotechnological applications. Biotechnol J 2010; 5:751-8. [PMID: 20540107 DOI: 10.1002/biot.200900303] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Indexed: 11/07/2022]
Abstract
The engineering of microorganisms to produce a variety of extracellular enzymes (exoenzymes), for example for producing renewable fuels and in biodegradation of xenobiotics, has recently attracted increasing interest. Productivity is often reduced by "cheater" mutants, which are deficient in exoenzyme production and benefit from the product provided by the "cooperating" cells. We present a game-theoretical model to analyze population structure and exoenzyme productivity in terms of biotechnologically relevant parameters. For any given population density, three distinct regimes are predicted: when the metabolic effort for exoenzyme production and secretion is low, all cells cooperate; at intermediate metabolic costs, cooperators and cheaters coexist; while at high costs, all cells use the cheating strategy. These regimes correspond to the harmony game, snowdrift game, and Prisoner's Dilemma, respectively. Thus, our results indicate that microbial strains engineered for exoenzyme production will not, under appropriate conditions, be outcompeted by cheater mutants. We also analyze the dependence of the population structure on cell density. At low costs, the fraction of cooperating cells increases with decreasing cell density and reaches unity at a critical threshold. Our model provides an estimate of the cell density maximizing exoenzyme production.
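The three regimes named in this abstract can be told apart by the ordering of payoffs in a symmetric two-strategy game. The sketch below is a generic textbook classification under replicator dynamics in a well-mixed population, not the authors' model, which ties payoffs to metabolic cost and cell density. The function name `classify_game` and the payoff numbers are assumptions for illustration.

```python
def classify_game(R, S, T, P):
    """Classify a symmetric cooperator/cheater game and give the stable cooperator fraction.

    R: payoff to a cooperator meeting a cooperator
    S: payoff to a cooperator meeting a cheater
    T: payoff to a cheater meeting a cooperator
    P: payoff to a cheater meeting a cheater
    """
    cheater_invades = T > R      # a rare cheater outperforms resident cooperators
    cooperator_invades = S > P   # a rare cooperator outperforms resident cheaters
    if cooperator_invades and not cheater_invades:
        return "harmony game", 1.0
    if cooperator_invades and cheater_invades:
        # Coexistence: both strategies earn equal payoff at this cooperator fraction.
        return "snowdrift game", (S - P) / ((S - P) + (T - R))
    if cheater_invades and not cooperator_invades:
        return "Prisoner's Dilemma", 0.0
    # Neither strategy can invade the other (bistable); outside the three regimes above.
    return "other", None


# Hypothetical payoffs chosen only to land in each regime.
print(classify_game(R=3, S=2, T=1, P=0))
print(classify_game(R=3, S=1, T=4, P=0))
print(classify_game(R=3, S=0, T=5, P=1))
```

The snowdrift fraction follows from setting the expected payoffs of the two strategies equal, x·R + (1−x)·S = x·T + (1−x)·P, and solving for the cooperator fraction x.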
Affiliation(s)
- Stefan Schuster: Department of Bioinformatics, School of Biology and Pharmaceutics, Friedrich Schiller University of Jena, Ernst-Abbe-Platz 2, Jena, Germany
|